Someone was kind enough to help me put together a web scraper for a government website.
The code:
import requests
from bs4 import BeautifulSoup as Soup
url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Laboratory&CycleBeginYear="
year = 2018  # This variable can be changed to whatever year you want to parse
url = url + str(year)  # combine the government URL with the chosen year
response = requests.get(url)
response.raise_for_status()
soup = Soup(response.content, "html.parser")
# This class contains all 4 fields in the NHANES table
class Chemical:
    def __init__(self, chemical_name, doc_file, data_file, last_updated):
        self.chemical_name = chemical_name
        self.doc_file = doc_file
        self.data_file = data_file
        self.last_updated = last_updated

chemicalArray = []  # initiating array
for row in soup.find("tbody").find_all("tr"):
    name, *files, date = row.find_all("td")  # first cell is the name, last is the date
    hrefs = [file.a["href"] for file in files]  # this is where I run into an error
    chemical = Chemical(name.get_text(strip=True), hrefs[0], hrefs[1], date.get_text(strip=True))
    chemicalArray.append(chemical)
However, for some years there are entries where a file cell contains no link. Sometimes there is no href because the data file has been withdrawn, and I am not sure how to handle this case. Basically, I need to figure out how to deal with the case where there is no href in the "a" tag.