Thursday, June 9, 2016

Yield only desired URLs with if statement/string comparison

I have a spider that crawls a given URL and extracts the links to all the location pages. I run the command "scrapy crawl lkqlist -o urls.csv -t csv" from the command prompt, and it saves all of these URLs to a .csv file.

However, I don't want some of the location pages to appear in my list of saved URLs, specifically all the URLs ending in ".aspx". I am trying to filter them out with an if statement, but the spider still yields URLs containing ".aspx". Should I place this string comparison somewhere else in the loop to get the desired result, or should I use a different method altogether?

My code:

import scrapy

lkqlist = 'http://ift.tt/22W8mbf'

class JunkYardSites(scrapy.Item):
    Sites = scrapy.Field()

class LkqLocationList(scrapy.Spider):
    name = "lkqlist"
    allowed_domains = ["lkqcorp.com"]
    start_urls = (
        lkqlist,
    )

    def parse(self, response):
        sites = response.xpath("//td[@class='basicviewbold']/script/text()").re(r'(?:localizeStoreUrlToCulture\(\")(.*)(?:\", local.culture)')

        for element in range(0, len(sites), 1):
            if ".aspx" not in sites.pop(0):
                urls = JunkYardSites()
                urls["Sites"] = sites.pop(0) 
                yield urls
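
A note on the loop above: it calls sites.pop(0) twice per passing iteration. The first call removes the URL that is tested for ".aspx", and the second call removes and yields the next URL, which was never tested. That would explain why ".aspx" URLs still reach the output, and the extra pops can also raise an IndexError once the list runs out. The following is a minimal sketch of one possible fix, written as a plain function so it can be run without Scrapy. The function name keep_non_aspx and the example.com sample URLs are hypothetical and are not taken from the original spider or the real page.

```python
def keep_non_aspx(sites):
    """Return the URLs from `sites` that do not end in '.aspx'."""
    kept = []
    # Iterate over the list directly. Nothing is popped, so the URL that is
    # tested is always the same URL that is kept.
    for url in sites:
        if not url.endswith(".aspx"):
            kept.append(url)
    return kept


# Hypothetical sample data, only for illustration.
sample = [
    "http://example.com/locations/a",
    "http://example.com/locations/b.aspx",
    "http://example.com/locations/c",
]
print(keep_non_aspx(sample))
```

Inside parse, the same idea would mean looping with "for url in sites:", testing url.endswith(".aspx"), and assigning that same url to urls["Sites"] before yielding. Using endswith rather than "in" matches the stated goal of dropping URLs that end in ".aspx"; if URLs that merely contain ".aspx" should also be dropped, the original "not in" test would be the one to keep.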

Current file output in urls.csv:

Sites
http://ift.tt/1UiQPrw
http://ift.tt/1GQq6Zu
http://ift.tt/1UiRsBm
http://ift.tt/22W7BPs
http://ift.tt/1UiQRj8
http://ift.tt/22W8iIq
http://ift.tt/1UiQ8ys
http://ift.tt/22W8lUL
http://ift.tt/1UiQa9C
http://ift.tt/22W8s2o
http://ift.tt/1UiQaq4
http://ift.tt/22W7VNZ
http://ift.tt/1UiR0mL
http://ift.tt/22W8tU2
http://ift.tt/1UiQHIH
http://ift.tt/22W8lnA
http://ift.tt/1UiQVPM
http://ift.tt/22W8qHK
http://ift.tt/1UiRj0I
http://ift.tt/22W7ymI
http://ift.tt/1UiQwwL
http://ift.tt/22W7Ycx
http://ift.tt/1UiQDIV
http://ift.tt/22W7Z09
http://ift.tt/1UiR3yJ
http://ift.tt/22W8pDK
