I am trying to identify strings in a for loop that contain 'https://'. These strings are fine as they are, and I just want to write them to a .csv file. For the strings that do NOT contain 'https://', I want to prepend the site URL to the string before writing it to the .csv file.
I'm not sure if I'm placing the different parts of the code in the right order, and at this point I've moved things around so much, I feel like I've tried it all.
from bs4 import BeautifulSoup
import requests
import csv
import re

web_page = 'https://www.commerce.gov/data-and-reports'
source = requests.get(web_page).text
soup = BeautifulSoup(source, "html.parser")
locator = 'https://www.commerce.gov'
link_list = []

csv_file = open('testing.csv', 'w')
csv = csv.writer(csv_file)
csv.writerow(['URIs'])

for links in soup.find_all('a'):
    href = links.get('href')
    if href != None and href != '' and href != '#main-content' and href != '#':
        x = re.compile('ht+ps?')
        y = x.match('https')
        if not y:
            href = locator + href
            csv.writerow([href])
        else:
            csv.writerow([href])

csv_file.close()
What I want to do is to add 'https://www.commerce.gov' to the beginning of the href strings that don't contain 'https' and then write them to the csv file.
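For reference, here is a minimal sketch of one way the check could be written. In the code above, `x.match('https')` tests the literal string 'https' rather than `href`, so it always matches and the `if not y` branch never runs. The sketch below tests each href itself with `str.startswith`, and it swaps plain concatenation for `urllib.parse.urljoin`, which is my substitution and not part of the original code. The names `BASE_URL`, `SKIP`, `normalize_href` and `write_links` are made up for illustration, and the sample hrefs are invented so the snippet runs without fetching the page.

```python
import csv
import io
from urllib.parse import urljoin

BASE_URL = 'https://www.commerce.gov'   # same value as `locator` in the question
SKIP = {'', '#', '#main-content'}       # hrefs the question already filters out


def normalize_href(href, base=BASE_URL):
    """Return an absolute URL for href, or None if the link should be skipped."""
    if href is None or href in SKIP:
        return None
    # Test the href itself, not a fixed string.
    if href.startswith(('http://', 'https://')):
        return href
    # Relative link: resolve it against the site root.
    return urljoin(base, href)


def write_links(hrefs, out_file):
    """Write a 'URIs' header followed by one normalized URL per row."""
    writer = csv.writer(out_file)   # named `writer` so the csv module is not shadowed
    writer.writerow(['URIs'])
    for href in hrefs:
        url = normalize_href(href)
        if url is not None:
            writer.writerow([url])


# Invented sample values standing in for links.get('href') results.
sample = ['/data-and-reports', 'https://www.census.gov/', '#', None]
buffer = io.StringIO()
write_links(sample, buffer)
print(buffer.getvalue())
```

With the real page, the list passed to `write_links` would presumably be built from `soup.find_all('a')`, and the output file opened with `open('testing.csv', 'w', newline='')` inside a `with` block so it is closed automatically.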