I want Scrapy to scrape only those URLs which are not present in the doneUrls variable. However, Scrapy scrapes only those URLs which are already present in doneUrls.
keysbase = pd.read_csv('KeysDB.csv', encoding='unicode_escape')
doneUrls = keysbase['_2title_url']
......................
def parse(self, response):
    titleLinks = response.xpath('//*[@class="lister-item-content"]')
    for link in titleLinks:
        title_url = response.urljoin(link.xpath('.//h3/a/@href').get())
        print(doneUrls)
        if title_url not in doneUrls:
            print(title_url + 'is not present')
            yield scrapy.Request(title_url, callback=self.parse_title,
                                 meta={
                                     'title_url': title_url
                                 })
        else:
            pass
Everything works perfectly, but in the opposite direction. For example, if I want to scrape URLs 1, 3 and 5 out of 5, the code only extracts URLs 2 and 4.
I also tried if title_url in doneUrls: instead, but that doesn't help either and gives a completely empty result.
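One thing that may be relevant here: doneUrls is a pandas Series, and the in operator on a Series tests the index labels (0, 1, 2, ...), not the stored values. That would be consistent with the in variant never matching. Below is a minimal sketch of that behavior, assuming a small hand-made Series in place of keysbase['_2title_url']; the example.com URLs and the names done_urls, candidate and done_set are made up for illustration and are not from my real data.

```python
import pandas as pd

# Hypothetical stand-in for keysbase['_2title_url'] (URLs are invented).
done_urls = pd.Series([
    "https://example.com/title/1/",
    "https://example.com/title/2/",
])

candidate = "https://example.com/title/1/"

# `in` on a Series looks at the index labels (here 0 and 1), not the values,
# so a URL string is not found even though it is stored in the Series.
in_index = candidate in done_urls

# An index label, on the other hand, is found.
label_hit = 0 in done_urls

# Converting the values to a set gives a membership test on the URLs themselves.
done_set = set(done_urls)
in_set = candidate in done_set

print(in_index, label_hit, in_set)
```

If this is what is happening, building the set once (for example right after reading the CSV) and testing title_url against that set inside parse might be one way to get the intended filter.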