I've written a script that parses a link from each webpage whose visible text contains "contact" or "about". However, when I run it, the scraper always picks the link for "about" and only falls back to "contact" when "about" is not available. How can I make the script do the opposite, i.e. look for the link connected to "contact" first and parse "about" only if "contact" is not available? I tried the approach below, but it behaves the way I described.
This is my try:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

links = (
    "http://www.mount-zion.biz/",
    "http://www.latamcham.org/",
    "http://www.innovaprint.com.sg/",
    "http://www.cityscape.com.sg/"
)

def Get_Link(site):
    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("a[href]"):
        if "contact" in item.text.lower():
            abslink = urljoin(site, item['href'])  # I thought the script prioritizes the first condition but I am wrong
            print(abslink)
            break
        else:
            if "about" in item.text.lower():
                abslink = urljoin(site, item['href'])
                print(abslink)
                break

if __name__ == '__main__':
    for link in links:
        Get_Link(link)
Is there any way to prioritize a condition based on its availability? The bottom line is that I want to get the link connected to contact. If it is not available, the script should then look for the link connected to about.
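For what it's worth, the loop above tests both conditions on each anchor in page order and breaks on the first hit, so whichever link appears earlier in the HTML wins; on these sites that is presumably the "about" link. One possible way to get the priority I'm after is to scan all anchors for "contact" first, and only scan again for "about" if nothing matched. Below is a minimal sketch of that idea, not a tested solution for the sites listed. The names pick_link, get_link and the keywords tuple are made up for illustration, and the requests/bs4 imports sit inside get_link only so that the selection logic can be tried without them.

```python
from urllib.parse import urljoin


def pick_link(anchors, base_url, keywords=("contact", "about")):
    """Return the absolute URL of the first anchor matching the
    highest-priority keyword, or None if no keyword matches.

    anchors: list of (visible_text, href) pairs in page order.
    keywords: hypothetical priority list, highest priority first.
    """
    for keyword in keywords:            # outer loop: priority order
        for text, href in anchors:      # inner loop: page order
            if keyword in text.lower():
                return urljoin(base_url, href)
    return None


def get_link(site):
    # Imported here so pick_link can be exercised without these libraries.
    import requests
    from bs4 import BeautifulSoup

    res = requests.get(site)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect (text, href) pairs once, then let pick_link decide.
    anchors = [(a.get_text(), a["href"]) for a in soup.select("a[href]")]
    return pick_link(anchors, site)
```

With a hand-made list such as [("About Us", "/about"), ("Contact", "/contact.html")], pick_link should return the contact URL even though "about" comes first in page order, and it falls back to the about URL when no anchor text contains "contact".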