I have a spider that crawls a given URL and extracts the links for all of the location pages. I run the command "scrapy crawl lkqlist -o urls.csv -t csv" from the command prompt and it saves all of these URLs in a .csv file.
However, some of the location pages I don't want to yield into my list of saved URLs, specifically the URLs ending in ".aspx". I am trying to filter them out with an if statement, but the spider is still yielding URLs that contain ".aspx". Should I place this string comparison somewhere else in the loop to get my desired result, or should I use a different method altogether?
My code:
import scrapy

lkqlist = 'http://ift.tt/22W8mbf'

class JunkYardSites(scrapy.Item):
    Sites = scrapy.Field()

class LkqLocationList(scrapy.Spider):
    name = "lkqlist"
    allowed_domains = ["lkqcorp.com"]
    start_urls = (
        lkqlist,
    )

    def parse(self, response):
        sites = response.xpath("//td[@class='basicviewbold']/script/text()").re(r'(?:localizeStoreUrlToCulture\(\")(.*)(?:\", local.culture)')
        for element in range(0, len(sites), 1):
            if ".aspx" not in sites.pop(0):
                urls = JunkYardSites()
                urls["Sites"] = sites.pop(0)
                yield urls
Current file output to urls.csv:
Sites
http://ift.tt/1UiQPrw
http://ift.tt/1GQq6Zu
http://ift.tt/1UiRsBm
http://ift.tt/22W7BPs
http://ift.tt/1UiQRj8
http://ift.tt/22W8iIq
http://ift.tt/1UiQ8ys
http://ift.tt/22W8lUL
http://ift.tt/1UiQa9C
http://ift.tt/22W8s2o
http://ift.tt/1UiQaq4
http://ift.tt/22W7VNZ
http://ift.tt/1UiR0mL
http://ift.tt/22W8tU2
http://ift.tt/1UiQHIH
http://ift.tt/22W8lnA
http://ift.tt/1UiQVPM
http://ift.tt/22W8qHK
http://ift.tt/1UiRj0I
http://ift.tt/22W7ymI
http://ift.tt/1UiQwwL
http://ift.tt/22W7Ycx
http://ift.tt/1UiQDIV
http://ift.tt/22W7Z09
http://ift.tt/1UiR3yJ
http://ift.tt/22W8pDK
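For what it's worth, here is a minimal sketch of the restructuring I have been considering: iterating over the extracted list directly instead of calling pop(0) on it inside the loop. I am not sure whether this is the right approach, which is why I am asking. It would replace only the parse method; the item and spider definitions above would stay the same.

    def parse(self, response):
        sites = response.xpath("//td[@class='basicviewbold']/script/text()").re(r'(?:localizeStoreUrlToCulture\(\")(.*)(?:\", local.culture)')
        # Loop over each extracted URL once, instead of popping items
        # off the list while iterating over its indices.
        for site in sites:
            if ".aspx" not in site:
                urls = JunkYardSites()
                urls["Sites"] = site
                yield urls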