So I have the following issue:
I have a list of domains, written like this:
cat lista-linkow
http://wp.pl
http://sport.org
http://interia.pl
http://mistrzowie.org
http://detektywprawdy.pl
http://onet.pl
My script takes each of these links and attaches it to the http://archive.org domain, creating a list of links similar to this:
cat output
https://web.archive.org/web/20180101033804/http://wp.pl
https://web.archive.org/web/20181121004239/http://wp.pl
[...]
https://web.archive.org/web/20180120205652/http://sport.org
https://web.archive.org/web/20180220185027/http://sport.org
[...]
https://web.archive.org/web/20180101003433/http://interia.pl
https://web.archive.org/web/20181119000201/http://interia.pl
etc...
(of course, [...] stands in for plenty of similar links)
Right now my script goes through this entire list and finds the phrase I pass as an argument. The problem is that I need it to take only one unique link per domain (so interia.pl/... or wp.pl/... or mistrzowie.org/... etc.), and if the phrase matches there, move on to the next item from the list.
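To treat all captures of one domain as a single group, it helps to recover the original address from each archive link first. A minimal sketch (the regex and the helper name original_url are my own, not part of the script below):

```python
import re

# Each line in "output" looks like
#   https://web.archive.org/web/TIMESTAMP/http://domain
# so the original address is everything after the timestamp segment.
ARCHIVE_RE = re.compile(r'^https?://web\.archive\.org/web/\d+/(.+)$')

def original_url(capture):
    """Return the archived page's original address, or None if the
    line is not a Wayback Machine capture URL."""
    m = ARCHIVE_RE.match(capture.strip())
    return m.group(1) if m else None

print(original_url('https://web.archive.org/web/20180101033804/http://wp.pl'))
# -> http://wp.pl
```

With that, every capture line from "output" can be mapped back to its entry in "lista-linkow".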
So, if it finds the phrase "Katastrofa" in the list of links in the "output" file:
python vulture-for-loop-links.py Katastrofa
SUCCESS!! phrase is here: https://web.archive.org/web/20180101033804/http://wp.pl
http://wp.pl
SUCCESS!! phrase is here: https://web.archive.org/web/20180113000926/http://wp.pl
http://wp.pl
...I would like it to move on to the next item in "lista-linkow", i.e. "http://sport.org", and search there... Here is what I have so far:
from __future__ import print_function
from sys import argv
from selenium import webdriver
import re
import requests
from bs4 import BeautifulSoup

# Truncate the "output" links file:
skrypt, fraza = argv
open("output", "w").close()

# Open the list of links and feed each one to archive.org.
with open("lista-linkow", "r") as lista:
    txt = [line.strip() for line in lista]

driver = webdriver.Chrome()
for i in txt:
    url = 'https://web.archive.org/web/*/{}'.format(i)
    driver.get(url)
    driver.refresh()
    driver.implicitly_wait(1000)  # seconds
    captures = driver.find_elements_by_xpath('//*[@id="wb-calendar"]/div/div/div/div/div/div/div/a')
    # Grab the capture links and append them to the freshly truncated "output" file
    with open('output', 'a') as plik:
        for capture in captures:
            stronka = capture.get_attribute("href")
            plik.write(stronka + "\n")
            # print(stronka, sep='\n')

# For each single page from "lista-linkow", read in as "txt":
for elem in txt:
    with open('output', 'r') as f:
        f2 = [line.strip() for line in f]
    # Search every line read from "output" for the phrase:
    for i in f2:
        r = requests.get(i)
        soup = BeautifulSoup(r.text, 'html.parser')
        boxes = soup.find_all(True, text=re.compile(fraza, re.I))
        # If the phrase was found, print the matching link (i) from "output"
        if boxes:
            print("SUCCESS!! phrase is here:", i)
            print(elem)
        else:
            print("Fail, phrase not found", i)
driver.quit()