Tuesday, July 31, 2018

How to Nest an If Statement Within a For Loop When Scraping Div Class HTML

Below is a scraper that uses Beautiful Soup to scrape physician information from this webpage (https://sportmedbc.com/practitioners). As the HTML directly below shows, each physician has an individual profile on the page that displays the physician's name, clinic, profession, taxonomy, and city.

<div class="views-field views-field-title practitioner__name" ><a href="/practitioners/41824">Marilyn Adams</a></div>
              <div class="views-field views-field-field-pract-clinic practitioner__clinic" ><a href="/clinic/fortius-sport-health">Fortius Sport &amp; Health</a></div>
              <div class="views-field views-field-field-pract-profession practitioner__profession" >Physiotherapist</div>
              <div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region" >Fraser River Delta</div>
              <div class="views-field views-field-city practitioner__city" ></div>

As the sample HTML shows, a profile is sometimes missing information (here, the city div is empty). When that happens, I would like the scraper to print 'N/A'. I need this because I eventually want to put each div class category (name, clinic, profession, etc.) into an array in which every column has exactly the same length, so that I can export the data to a CSV file properly. I have tried nesting an if statement inside each for loop, but the code does not seem to loop correctly: 'N/A' shows up only once for each div class section. Does anyone know how to nest an if statement within a for loop so that I get the right number of 'N/A's in each column? Thanks in advance!

import requests
import re
from bs4 import BeautifulSoup

page=requests.get('https://sportmedbc.com/practitioners')
soup=BeautifulSoup(page.text, 'html.parser')

#Find Doctor Info

for doctor in soup.find_all('div',attrs={'class':'views-field views-field-title practitioner__name'}):
    for a in doctor.find_all('a'):
        print(a.text)

for clinic_name in soup.find_all('div',attrs={'class':'views-field views-field-field-pract-clinic practitioner__clinic'}):
    for b in clinic_name.find_all('a'):
        if b==(''):
            print('N/A')

profession_links=soup.findAll('div',attrs={'class':'views-field views-field-field-pract-profession practitioner__profession'})
for profession in profession_links:
    if profession.text==(''):
        print('N/A')
    print(profession.text)

taxonomy_links=soup.findAll('div',attrs={'class':'views-field views-field-taxonomy-vocabulary-5 practitioner__region'})
for taxonomy in taxonomy_links:
    if taxonomy.text==(''):
        print('N/A')
    print(taxonomy.text)

city_links=soup.findAll('div',attrs={'class':'views-field views-field-city practitioner__city'})
for city in city_links:
    if city.text==(''):
        print('N/A')
    print(city.text)
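
One possible way to get equal-length columns is sketched below. Two things in the code above likely get in the way: `b==('')` compares a Beautiful Soup tag with a string, which is never true, and the loops that do test `.text` have no `else`, so the empty text is still printed after 'N/A'. The sketch instead turns every div into exactly one value, falling back to 'N/A' when the div has no text. It assumes that each profile always contains all five divs (empty when the data is missing), as in the sample HTML. The names `SAMPLE_HTML`, `FIELDS`, `extract_columns`, and `columns_to_csv` are hypothetical, and the second profile in the sample data is made up for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# First profile is taken from the question; the second is invented for
# illustration and has an empty clinic div.
SAMPLE_HTML = """
<div class="views-field views-field-title practitioner__name"><a href="/practitioners/41824">Marilyn Adams</a></div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic"><a href="/clinic/fortius-sport-health">Fortius Sport &amp; Health</a></div>
<div class="views-field views-field-field-pract-profession practitioner__profession">Physiotherapist</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region">Fraser River Delta</div>
<div class="views-field views-field-city practitioner__city"></div>

<div class="views-field views-field-title practitioner__name"><a href="/practitioners/0">Example Person</a></div>
<div class="views-field views-field-field-pract-clinic practitioner__clinic"></div>
<div class="views-field views-field-field-pract-profession practitioner__profession">Physician</div>
<div class="views-field views-field-taxonomy-vocabulary-5 practitioner__region">Vancouver</div>
<div class="views-field views-field-city practitioner__city">Vancouver</div>
"""

# Column name -> one CSS class that identifies the div for that column.
FIELDS = {
    "name": "practitioner__name",
    "clinic": "practitioner__clinic",
    "profession": "practitioner__profession",
    "region": "practitioner__region",
    "city": "practitioner__city",
}


def extract_columns(html):
    """Return {column: [values]}, using 'N/A' for divs with no text."""
    soup = BeautifulSoup(html, "html.parser")
    columns = {}
    for field, css_class in FIELDS.items():
        # class_ matches any div whose class list contains css_class.
        divs = soup.find_all("div", class_=css_class)
        # get_text(strip=True) is '' for an empty div, so `or` supplies 'N/A'.
        columns[field] = [div.get_text(strip=True) or "N/A" for div in divs]
    return columns


def columns_to_csv(columns):
    """Write the columns as CSV text, one row per profile."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns.keys())
    # zip(*...) turns the per-column lists into per-profile rows.
    writer.writerows(zip(*columns.values()))
    return buffer.getvalue()


if __name__ == "__main__":
    print(columns_to_csv(extract_columns(SAMPLE_HTML)))
```

To run this against the live page, `SAMPLE_HTML` could be replaced with `requests.get('https://sportmedbc.com/practitioners').text`. If the real page omits a div entirely for some profiles, this per-column approach would misalign the rows, and it would be safer to loop over each profile's container element instead.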
