jeudi 29 novembre 2018

Python 2: Check an HTTP link and, once a match is found, move on to the next one

So I have the following issue:

I have a list of domains written like this:

cat lista-linkow 
http://wp.pl
http://sport.org
http://interia.pl
http://mistrzowie.org
http://detektywprawdy.pl
http://onet.pl

My script takes each of these links, attaches it to the http://archive.org domain, and creates a list of links similar to this:

cat output
https://web.archive.org/web/20180101033804/http://wp.pl
https://web.archive.org/web/20181121004239/http://wp.pl
[...]
https://web.archive.org/web/20180120205652/http://sport.org
https://web.archive.org/web/20180220185027/http://sport.org
[...]
https://web.archive.org/web/20180101003433/http://interia.pl
https://web.archive.org/web/20181119000201/http://interia.pl
etc...

(of course, [...] stands in for plenty of similar links)
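The capture URLs above follow a fixed pattern, so building one from a timestamp and the original link is a one-line format call. A minimal sketch (the helper name is mine, and the timestamp is taken from the sample output above):

```python
def archive_url(timestamp, original):
    # a Wayback Machine capture URL is the timestamp plus the original link;
    # strip() guards against the trailing newline that readlines() keeps
    return 'https://web.archive.org/web/{}/{}'.format(timestamp, original.strip())

print(archive_url('20180101033804', 'http://wp.pl\n'))
# https://web.archive.org/web/20180101033804/http://wp.pl
```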

Right now my script goes through this entire list and finds the phrase that I pass as an argument. The problem is that I need it to take only one unique link per domain (so interia.pl/..., wp.pl/..., mistrzowie.org/..., etc.), and if the phrase matches there, move on to the next item from the list.
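The phrase test itself can be exercised offline. Here is a minimal sketch using `re` directly on a page's HTML (the sample markup and the `found` name are mine), matching case-insensitively the way the script's `re.I` flag does:

```python
import re

# made-up sample page; in the real script this would be the fetched HTML
html = '<html><body><p>Wielka KATASTROFA kolejowa</p></body></html>'
fraza = 'Katastrofa'

# re.I makes the search case-insensitive, so "KATASTROFA" still matches
found = re.search(fraza, html, re.I) is not None
print(found)  # True
```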

So, if it finds the phrase "Katastrofa" in the list of links in the "output" file:

python vulture-for-loop-links.py Katastrofa
SUKCESS!! phrase is here:  https://web.archive.org/web/20180101033804/http://wp.pl

http://wp.pl

SUKCESS!! phrase is here:  https://web.archive.org/web/20180113000926/http://wp.pl

http://wp.pl

...I would like it to go to the next item in "lista-linkow", i.e. "http://sport.org", and search there... Here is what I got so far:

from __future__ import print_function
from sys import argv
from selenium import webdriver
import re
import requests
from bs4 import BeautifulSoup


# Clear the "output" link file (opening in "w" mode truncates it):

skrypt, fraza = argv
plik = open("output", "w")
plik.close()

# Open the list of links and feed each one into archive.org.

lista = open("lista-linkow", "r")
txt = lista.readlines()
driver = webdriver.Chrome()
for i in txt:
    # readlines() keeps the trailing newline, so strip it before formatting
    url = 'https://web.archive.org/web/*/{}'.format(i.strip())
    driver.get(url)
    driver.refresh()


    driver.implicitly_wait(1000) # seconds

    captures = driver.find_elements_by_xpath("""//*[@id="wb-calendar"]/div/div/div/div/div/div/div/a""")

# Grab the links (captures) and append them to the freshly cleared "output" file
    for capture in captures:
        stronka = capture.get_attribute("href")
        with open('output', 'a') as plik:
            plik.write(stronka + "\n")
#        print(stronka, sep='\n')

# For each single page from the "lista-linkow" file, read in as "txt":

for elem in txt:
    with open('output', 'r') as f:
        f2 = f.readlines()

    # For every line read from the "output" file, look for the phrase:

    for i in f2:
        r = requests.get(i.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        boxes = soup.find_all(True, text=re.compile(fraza, re.I))


        # For every match (d), print the success message and the source link (i)

        for d in boxes:
            if len(i) == 0:
                print("Fail, phrase not found", i)
            else:
                print("SUKCESS!! phrase is here: ", i)
                print(elem)
driver.quit()
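The "take only the first match per domain, then move on" behaviour being asked about can be sketched offline with a `break`. This is a sketch under assumptions, not the full script: the `first_hits` helper and its `matches` predicate are mine, standing in for the real requests/BeautifulSoup check, and the sample data mirrors "lista-linkow" and "output":

```python
def first_hits(domains, captures, matches):
    """Return at most one matching capture per domain."""
    hits = {}
    for domain in domains:
        for capture in captures:
            # only consider captures of this particular domain
            if not capture.endswith(domain):
                continue
            if matches(capture):
                hits[domain] = capture
                break  # first hit found: move on to the next domain
    return hits

domains = ['http://wp.pl', 'http://sport.org']
captures = [
    'https://web.archive.org/web/20180101033804/http://wp.pl',
    'https://web.archive.org/web/20181121004239/http://wp.pl',
    'https://web.archive.org/web/20180120205652/http://sport.org',
]

# hypothetical predicate: pretend only wp.pl pages contain the phrase
print(first_hits(domains, captures, lambda url: 'wp.pl' in url))
# {'http://wp.pl': 'https://web.archive.org/web/20180101033804/http://wp.pl'}
```

The `break` is the key piece: without it the inner loop keeps printing a success line for every later capture of the same domain, which is exactly the duplicated "SUKCESS!!" output shown above.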
