Someone was kind enough to help me put together a web scraper for a government website.
The code:
import requests
from bs4 import BeautifulSoup as Soup
url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear="
year = 2018  # This variable can be changed to whatever year you want to parse
url = url + str(year)  # combine the government URL with the chosen year
response = requests.get(url)
response.raise_for_status()
soup = Soup(response.content, "html.parser")
# This class contains all 4 fields in the NHANES table
class Chemical:
    def __init__(self, chemical_name, doc_file, data_file, last_updated):
        self.chemical_name = chemical_name
        self.doc_file = doc_file
        self.data_file = data_file
        self.last_updated = last_updated

chemicalArray = []  # initiating array
for row in soup.find("tbody").find_all("tr"):
    name, *files, date = row.find_all("td")  # first cell is the name, last is the date
    hrefs = [file.a["href"] for file in files]  # this is where I run into an error
    chemical = Chemical(name.get_text(strip=True), hrefs[0], hrefs[1], date.get_text(strip=True))
    chemicalArray.append(chemical)
However, for some years there are entries where a file cell contains no link. Sometimes there is no href because the data file has been withdrawn, and I am not sure how to handle this case. Basically, I need to figure out how to deal with the case where there is no href in the "a" tag.