Wednesday, January 2, 2019

Using RE in a for loop to identify strings of interest and then adding text to the strings that don't match the RE

I am trying to identify strings in a for loop that contain 'https://'. These strings are fine, and I just want to write them to a .csv file. For the strings that do not contain 'https://', I want to prepend the site URL to the string before writing it to the .csv file.

I'm not sure if I'm placing the different parts of the code in the right order, and at this point I've moved things around so much, I feel like I've tried it all.

from bs4 import BeautifulSoup
import requests
import csv
import re

web_page = 'https://www.commerce.gov/data-and-reports'
source = requests.get(web_page).text
soup = BeautifulSoup(source, "html.parser")
locator = 'https://www.commerce.gov'
link_list = []

csv_file = open('testing.csv', 'w')
csv = csv.writer(csv_file)
csv.writerow(['URIs'])

for links in soup.find_all('a'):
    href = links.get('href')
    if href != None and href != '' and href != '#main-content' and href != '#': 
        x = re.compile('ht+ps?')
        y = x.match('https')
        if not y:
            href = locator + href
            csv.writerow([href])
        else:
            csv.writerow([href])

csv_file.close()

What I want to do is to add 'https://www.commerce.gov' to the beginning of the href strings that don't contain 'https' and then write them to the csv file.
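
For reference, here is a minimal sketch of one way the check could be written. In the code above, x.match('https') tests the literal string 'https' rather than href, so it always matches and the prefix is never added. Separately, csv = csv.writer(csv_file) rebinds the name csv and hides the module. The names normalize_href, BASE, ABSOLUTE and SKIP below are made up for illustration, and the sample list stands in for the hrefs that soup.find_all('a') would yield, so the sketch runs without a network request.

```python
import re

BASE = 'https://www.commerce.gov'      # same value as `locator` in the question
ABSOLUTE = re.compile(r'https?://')    # compiled once, outside the loop
SKIP = {'', '#', '#main-content'}      # hrefs the question already filters out


def normalize_href(href, base=BASE):
    """Return href unchanged if it is already absolute, base + href otherwise.

    Returns None for missing or skipped hrefs. Hypothetical helper, and it
    assumes relative hrefs start with '/'.
    """
    if href is None or href in SKIP:
        return None
    if ABSOLUTE.match(href):           # match against href itself, not a literal
        return href
    return base + href


# Stand-in for the values of links.get('href') in the original loop.
sample = ['/data-and-reports', 'https://www.census.gov/', '#', None]
rows = [url for url in map(normalize_href, sample) if url is not None]
print(rows)
# ['https://www.commerce.gov/data-and-reports', 'https://www.census.gov/']
```

In the original loop, each non-None result would be passed to writerow([url]) on a writer stored under a name other than csv. If the page also has relative links that do not start with '/', urllib.parse.urljoin(web_page, href) from the standard library is probably a safer way to build the full URL than string concatenation.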
