Sunday, June 16, 2019

If-condition is not executed in a for-loop when scraping data from kworb.net

I am pretty new to Python and have started using it for web scraping. More specifically, I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source, which contains a list of 10,000 artists.

The aim of my code is to create a table with two columns: 1. artist name; 2. country where the artist is streamed the most. I wrote some code (see below) that gets this information from each artist's personal page (here is an example for Drake). The artist's name is taken from the page title, and the country code comes from the table column heading that immediately follows the column titled "Global". For some artists there is no column titled "Global", and I need to account for that case. This is where my problem comes in. I am using the following if-condition:

if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
    Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
    Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)

But only the first branch is ever executed, so the code always extracts the text from the 4th column. I also tried the reverse condition:

if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
    Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
    Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)

But the code still extracts the text from the 4th column, even though I want it to extract the text from the 5th column when the 4th column is titled "Global".
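One likely explanation, sketched here with made-up HTML rather than the real kworb.net page: find_all('th') returns Tag objects, not strings, so the string "<th>Global</th>" never compares equal to any element and the membership test is False in both versions. Comparing against the header text instead behaves as intended. The header labels A to D below are placeholders, not the site's actual column names.

```python
from bs4 import BeautifulSoup

# Hypothetical header row; only "Global" and the country code mirror the question.
html = (
    "<table><tr>"
    "<th>A</th><th>B</th><th>C</th><th>D</th><th>Global</th><th>US</th>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
headers = soup.find_all("table")[0].find_all("th")

# A str is compared against Tag objects, so no element ever matches.
tag_match = "<th>Global</th>" in headers

# Comparing plain text against plain text works.
text_match = "Global" in [th.get_text(strip=True) for th in headers]

print(tag_match)   # False
print(text_match)  # True
```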

The reproducible code below runs on a subset of artists (#391 to #395 as of June 16, 2019), some of whom have a column titled "Global" (e.g. LANY) and some of whom do not (e.g. Henrique & Diego):

from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd

response1 = get('https://kworb.net/spotify/artists.html')

soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396]

artist = []
country = []

for row in rows:
    artist_url = row.find('a')['href']

    response2 = get('https://kworb.net/spotify/' + artist_url)

    sleep(randint(8,15))

    soup2 = bs(response2.text, 'html.parser')

    Artist = soup2.find('title').text[:-24]
    artist.append(Artist)

    if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
        Country = soup2.find_all('table')[0].find_all('th')[4].text
    else:
        Country = soup2.find_all('table')[0].find_all('th')[5].text
    country.append(Country)

df = pd.DataFrame({'Artist': artist,
                   'Country': country
})

print(df)

As a result, I get the following:

             Artist Country
0         YNW Melly  Global
1  Henrique & Diego      BR
2              LANY  Global
3      Parson James  Global
4        ANAVITÓRIA      BR

The expected output, as of June 16, 2019, is:

             Artist Country
0         YNW Melly      US
1  Henrique & Diego      BR
2              LANY      PH
3      Parson James      US
4        ANAVITÓRIA      BR

I suspect that the if-condition that sets the variable Country is wrong. I would appreciate any help with that.
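For illustration, the lookup I am after could be sketched as a small helper that works on header text rather than on tags. The function name top_country and the sample HTML are hypothetical, and the sketch assumes that the country heading sits directly after "Global" when that column exists and at index 4 otherwise, as described above.

```python
from bs4 import BeautifulSoup


def top_country(soup):
    """Return the heading that follows "Global", or the heading at index 4."""
    ths = soup.find_all("table")[0].find_all("th")
    labels = [th.get_text(strip=True) for th in ths]
    if "Global" in labels:
        # Assumption: the top country is the column right after "Global".
        return labels[labels.index("Global") + 1]
    # Assumption: without a "Global" column, the top country is at index 4.
    return labels[4]


def make_soup(labels):
    """Build a one-row header table from labels (test scaffolding only)."""
    cells = "".join(f"<th>{label}</th>" for label in labels)
    return BeautifulSoup(f"<table><tr>{cells}</tr></table>", "html.parser")


# Placeholder headings A to D stand in for the real column names.
with_global = make_soup(["A", "B", "C", "D", "Global", "PH", "US"])
without_global = make_soup(["A", "B", "C", "D", "BR", "US"])

print(top_country(with_global))     # PH
print(top_country(without_global))  # BR
```

In the loop, this would replace the if/else block with country.append(top_country(soup2)).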

Thanks!
