Wednesday, August 11, 2021

How to deal with standardized HTML that has an abnormal entry

Someone was kind enough to help me put together a web scraper for a government website.

The code:

import urllib.request
from pywebcopy import save_webpage
import requests
from bs4 import BeautifulSoup as Soup


url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear="

year = 2018  # This variable can be changed to whatever year you want to parse
url = url + str(year)  # combine the government URL with the chosen year

response = requests.get(url)
response.raise_for_status()

soup = Soup(response.content, "html.parser")


# This class contains all 4 fields in the NHANES table
class Chemical:

    def __init__(self, chemical_name, doc_file, data_file, last_updated):
        self.chemical_name = chemical_name
        self.doc_file = doc_file
        self.data_file = data_file
        self.last_updated = last_updated


chemicalArray = []  # initializing the array


for row in soup.find("tbody").find_all("tr"):
    name, *files, date = row.find_all("td")
    hrefs = [file.a["href"] for file in files]  # this is where I run into an error
    chemical = Chemical(name.get_text(strip=True), hrefs[0], hrefs[1], date.get_text(strip=True))
    chemicalArray.append(chemical)

However, for some years there are entries that do not follow this layout (the screenshot from the original post is not available).

In certain years a row has no href because the data file has been withdrawn, and I am not sure how to handle this case. Basically, I need to figure out how to deal with the case when there is no href in the "a" tag.
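One possible way to handle this, as a minimal sketch rather than a definitive fix: look up the link defensively and store None when it is missing. The sample HTML below is made up to imitate the table (the real markup for withdrawn files may differ), and href_or_none and parse_rows are hypothetical helper names, not part of the original code. The sketch relies on Tag.get() returning None for a missing attribute, and it also covers a cell with no "a" tag at all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the NHANES table. The second row has an <a>
# tag without an href (assumed shape of a withdrawn file); the third row has
# no <a> tag at all.
SAMPLE_HTML = """
<table><tbody>
<tr><td>Chem A</td><td><a href="/a_doc.htm">A Doc</a></td>
    <td><a href="/a.xpt">A Data</a></td><td>June 2020</td></tr>
<tr><td>Chem B</td><td><a href="/b_doc.htm">B Doc</a></td>
    <td><a>Withdrawn</a></td><td>Withdrawn</td></tr>
<tr><td>Chem C</td><td>No link</td><td>No link</td><td>May 2021</td></tr>
</tbody></table>
"""


def href_or_none(cell):
    """Return the href of the first <a> in a cell, or None if it is absent."""
    link = cell.find("a")
    if link is None:
        return None  # the cell contains no <a> tag
    # Tag.get() returns None instead of raising KeyError when href is missing.
    return link.get("href")


def parse_rows(soup):
    """Collect (name, doc_href, data_href, date) tuples from the table body."""
    rows = []
    for row in soup.find("tbody").find_all("tr"):
        name, *files, date = row.find_all("td")
        hrefs = [href_or_none(cell) for cell in files]
        rows.append(
            (name.get_text(strip=True), *hrefs, date.get_text(strip=True))
        )
    return rows


rows = parse_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
for parsed in rows:
    print(parsed)
```

In the original loop, this would amount to replacing file.a["href"] with href_or_none(file), so that a withdrawn file ends up as None in the Chemical object instead of raising an error.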
