Thursday, 22 October 2015

Excluding scraped results that aren't of a specific format using string operations or regex

Hi, I am developing a program that scrapes songs from a website and puts them into a list. This is my code so far:

from bs4 import BeautifulSoup
import urllib2
from collections import namedtuple

url='http://ift.tt/1GkOjg0'
page = urllib2.urlopen(url)


soup = BeautifulSoup(page.read())

songs=[]
Song = namedtuple("Song", "artist name album")
for link in soup.find_all("li", class_="song"):
    song = Song._make(link.text.strip()[12:].split(" - "))
    songs.append(song)

for song in songs:
    print(song.artist, song.name, song.album)

It works well when a result has this format:

<li class="song"> <a href="/default.htm" onclick="return clickreturnvalue()" onmouseover="dropdownmenu(this, event, menu1, '100px','Jason &amp; The Scorchers','I Really Don\'t Want To Know','Lost &amp; Found')" onmouseout="delayhidemenu()">Buy</a>  Jason &amp; The Scorchers - I Really Don't Want To Know - Lost &amp; Found</li>

But it doesn't work when a result has a format like this:

<li class="song">|World Cafe| - Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye</li>

I get an error because splitting on " - " does not produce the three fields that the namedtuple expects:

TypeError                                 Traceback (most recent call last)
<ipython-input-28-1a0a99934b5c> in <module>()
     12 Song = namedtuple("Song", "artist name album")
     13 for link in soup.find_all("li", class_="song"):
---> 14     song = Song._make(link.text.strip()[12:].split(" - "))
     15     songs.append(song)
     16 

<string> in _make(cls, iterable, new, len)

TypeError: Expected 3 arguments, got 2

How do I modify this to exclude any results that aren't of the correct format?
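
One possible approach, sketched here under assumptions rather than as a definitive answer: split the text first, check that it yields exactly three non-empty fields, and only then build the Song. The helper name parse_song and the sample strings are hypothetical; the samples stand in for the text that remains after the link.text.strip()[12:] step in the code above, and the two-field entry is made up for illustration.

```python
from collections import namedtuple

Song = namedtuple("Song", "artist name album")


def parse_song(text):
    """Return a Song, or None if text is not 'artist - name - album'.

    Hypothetical helper, not part of the original code.
    """
    parts = [part.strip() for part in text.split(" - ")]
    # Require exactly three fields, none of them empty.
    if len(parts) != 3 or not all(parts):
        return None
    return Song._make(parts)


# Illustrative inputs, assumed to be the text left after the slicing step.
candidates = [
    "Jason & The Scorchers - I Really Don't Want To Know - Lost & Found",
    # The World Cafe line with its first 12 characters sliced off:
    " - Thursday 10-22-2015 Hour 2, Part 7 - Host: David Dye",
    # Made-up entry with only two fields:
    "Some Artist - Some Title",
]

# Keep only the entries that parsed into a complete Song.
songs = [song for song in map(parse_song, candidates) if song is not None]

for song in songs:
    print("{} | {} | {}".format(song.artist, song.name, song.album))
```

In the scraping loop, this would mean calling parse_song(link.text.strip()[12:]) and appending the result only when it is not None. Checking for empty fields as well as the field count matters because a line can split into three parts while its first part is an empty string.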
