Wednesday, 17 November 2021

Scrapy "if not in" statement working opposite

I want Scrapy to scrape only the URLs that are not present in the doneUrls variable. However, Scrapy scrapes only the URLs that are already present in doneUrls.

import pandas as pd
import scrapy
keysbase = pd.read_csv('KeysDB.csv', encoding='unicode_escape')
doneUrls = keysbase['_2title_url']

......................

def parse(self, response):
    titleLinks = response.xpath('//*[@class="lister-item-content"]')

    for link in titleLinks:
        title_url = response.urljoin(link.xpath('.//h3/a/@href').get())
        print(doneUrls)
        if title_url not in doneUrls:
            print(title_url + ' is not present')
            yield scrapy.Request(
                title_url,
                callback=self.parse_title,
                meta={'title_url': title_url},
            )

Everything works, but in the opposite direction. For example, if I want to scrape URLs 1, 3 and 5 out of 5, the code extracts only URLs 2 and 4.

I also tried if title_url in doneUrls: instead, but that doesn't help either and gives a completely empty result.
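
One likely cause, assuming doneUrls is a pandas Series (as the read_csv snippet suggests): the `in` operator on a Series tests the index labels (0, 1, 2, ...), not the stored values. The check therefore never looks at the URLs themselves, which is consistent with the `in` variant giving an empty result. The sketch below illustrates this with a made-up in-memory table standing in for KeysDB.csv; the example.com URLs and the names done_series, done_urls, candidates and to_scrape are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for KeysDB.csv; the URLs are made up for illustration.
keysbase = pd.DataFrame({
    "_2title_url": [
        "https://example.com/title/1/",
        "https://example.com/title/3/",
    ]
})
done_series = keysbase["_2title_url"]

# `in` on a Series checks the index labels (0, 1, ...), not the values.
url_in_series = "https://example.com/title/1/" in done_series  # index lookup
label_in_series = 0 in done_series                             # index lookup

# Converting the column to a set makes `in` test the URL strings themselves.
done_urls = set(done_series.dropna())

# Keep only the candidate URLs that have not been scraped yet.
candidates = [f"https://example.com/title/{n}/" for n in range(1, 6)]
to_scrape = [url for url in candidates if url not in done_urls]
print(to_scrape)
```

In the spider, this would amount to building doneUrls once with set(keysbase['_2title_url']) and keeping the if title_url not in doneUrls: test as it is. The stored URLs also need to match the exact form that response.urljoin produces (scheme, trailing slash, query string), otherwise the membership test will still miss them.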
