I want to scrape a web domain to collect all possible extensions of a particular path and query string. Here is one original URL (host + path + query string) from this domain: http://ift.tt/2tQbpKo.
url_passing1 = 'http://ift.tt/2wvIivR'
This URL is the result of a query for all players in the National Football League (NFL) who have attempted at least 25 passes in their careers; according to this website, there are 881 such players in the history of the NFL.
But this URL contains the statistical table data for only the first 100 players on this list, so there are 8 additional URLs: 7 with 100 players each and one with the final 81 players.
The second URL from this query, whose table contains players 101-200, is here:
url_passing2 = 'http://ift.tt/2vW2H9H'
Notice that the two URLs are identical except that the second one appends the extension string '&offset=100'. Each additional page has the same host/path/query string plus '&offset=200', '&offset=300', '&offset=400', and so on, up to '&offset=800'.
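The offset pattern described above can be reproduced mechanically. As a minimal sketch for this particular query (using the shortened URL from the question, and taking the 8 extra pages to run from '&offset=100' through '&offset=800'):

```python
base = 'http://ift.tt/2wvIivR'

# offsets 100, 200, ..., 800 give the 8 additional pages for 881 players
extra_urls = [base + '&offset=%d' % n for n in range(100, 900, 100)]
```

Here `range(100, 900, 100)` yields exactly the offsets 100 through 800 in steps of 100.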
I want to create a Python function that will start with some primary host/path/query string from this web domain and collect a list of all of the possible iterations of this host/path/query string.
Importantly, I'd like this to be a generalizable function, such that my original starting query can be different and there will be some unknown number of possible extensions (for example, perhaps a query will produce a list of 1,457 players, in which case I would need to scrape for the list of 15 URLs that contain all of the data for this query).
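When the total number of results happens to be known in advance, the full URL list can be computed directly instead of discovered by probing. A hedged sketch under that assumption (`build_offset_urls` and `page_size` are my own names, not anything from the site; the '&offset=N' convention is the one observed above):

```python
import math

def build_offset_urls(base_url, total_results, page_size=100):
    """One URL per page of results, using the '&offset=N' convention."""
    num_pages = math.ceil(total_results / page_size)
    urls = [base_url]  # the first page carries no offset parameter
    urls += [base_url + '&offset=%d' % (page * page_size)
             for page in range(1, num_pages)]
    return urls
```

For 881 results this yields 9 URLs (offsets 0 through 800), and for 1,457 results it yields the 15 URLs mentioned above.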
The following is my attempt at the function so far. Note that you may have a better idea for how to get a complete list of all of the URLs from a query. My approach is to iterate through the URLs and check whether each page contains a table: if it does, append the URL to my output list; if it does not, exit the function.
import pandas as pd

def get_url_offsets(frontpage_url):
    """Probe successive '&offset=N' pages and collect every URL that has a table."""
    output_list = []
    offset = 0
    while True:
        url = frontpage_url if offset == 0 else frontpage_url + '&offset=%d' % offset
        try:
            # read_html returns a list of DataFrames, one per table on the page
            tables = pd.read_html(url)
        except ValueError:
            # pandas raises ValueError when the page contains no tables
            break
        if not tables or tables[0].empty:
            break
        output_list.append(url)
        offset += 100
    return output_list