Monday, September 4, 2017

Scraping a URL host and primary path plus query string to produce a list of all possible additional extensions for that query

I am working on a project to scrape table data from a url.

The main web domain is http://ift.tt/2tQbpKo. I have already written the code to scrape the table data from this domain.

A search for statistical data on this site might start with a query with set parameters. For example, here is the url with table data for all players in the National Football League who have thrown at least 25 passes in their careers:

url_passing1 = 'http://ift.tt/2wvIivR'

But this url only contains the statistical table data for players 1 - 100 on this list. So there are 7 additional urls with 100 players each and one additional url with 81 players.

The 2nd url from this query, which contains a table with players 101-200, is here:

url_passing2 = 'http://ift.tt/2vW2H9H'

Notice that these are exactly the same until the very last part, where there is the additional extension string '&offset=100'. Each additional page has the same host/path/query string plus '&offset=200', '&offset=300', '&offset=400', and so on up to '&offset=800'.

My question is this: starting with a url like this, how can I create a Python function that will collect a list of all of the possible url iterations from this host/path/query string, so that I can get the entire list of players who match this query?

My desired output would be a list that looks something like this:

list_of_urls = ['http://ift.tt/2wvIivR', 'http://ift.tt/2vW2H9H', 'http://ift.tt/2x59lQg', 'http://ift.tt/2xJ7xJy', 'http://ift.tt/2x57CdI', 'http://ift.tt/2xJabz7', 'http://ift.tt/2x59nrm', 'http://ift.tt/2xJvD6T', 'http://ift.tt/2x59ovq']

Or, more concisely:

list_of_urls = ['&offset=0', '&offset=100', '&offset=200', '&offset=300', '&offset=400', '&offset=500', '&offset=600', '&offset=700', '&offset=800']
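
If the number of pages is known up front, a list like this can be built directly from the base url. Here is a minimal sketch of that idea; the function name, the page count of 9, and the page size of 100 are assumptions used for illustration:

    def build_offset_urls(base_url, num_pages, page_size=100):
        """Build the list of paginated urls by appending '&offset=N' for each page."""
        urls = [base_url]  # the first page has no offset parameter
        for page in range(1, num_pages):
            urls.append(base_url + '&offset=' + str(page_size * page))
        return urls

    # e.g. 9 pages of 100 rows each -> base url plus '&offset=100' ... '&offset=800'
    url_passing_list = build_offset_urls('http://ift.tt/2wvIivR', 9)
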

The following is my attempt at creating the function so far. My approach is to iterate through the urls and check whether there is a table on each one. The idea is that if there is a table on the page, append the url to my output list, and if there is not a table on the page, exit the function. But this only produces a list of the first two urls -- it is not looping back to append the last 7 urls to the list.

import pandas as pd

input_url = 'http://ift.tt/2wvIivR'

def get_url_list(frontpage_url):
    url_offset = ''
    output_list = [frontpage_url]
    x = 0

    # Try to read a table from each url in the list; if a table is found,
    # build the next '&offset=' url and append it to the list.
    for output_list[x] in output_list:
        results_table = pd.read_html(output_list[x])
        table_results = pd.DataFrame(results_table)

        if table_results.empty == False:
            output_list.append(frontpage_url + url_offset)
            x += 1
            url_offset = '&offset=' + '%d' % (100 * x)
            output_list.append(frontpage_url + url_offset)
        else:
            exit
        return output_list[1:-1]
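
One way the loop could be restructured is to probe each '&offset=' page in turn and stop when a page no longer returns a table. This is only a sketch, not tested against the site: it assumes pd.read_html raises a ValueError when a page contains no table (its usual behavior), and that pages past the last one either have no table or return an empty one.

    import pandas as pd

    def get_url_list(frontpage_url, page_size=100):
        """Collect paginated urls by probing each '&offset=' page until no table is found."""
        output_list = []
        offset = 0
        while True:
            url = frontpage_url if offset == 0 else frontpage_url + '&offset=%d' % offset
            try:
                tables = pd.read_html(url)    # raises ValueError if the page has no table
            except ValueError:
                break
            if not tables or tables[0].empty:  # stop once the page comes back empty
                break
            output_list.append(url)
            offset += page_size
        return output_list

    url_list = get_url_list(input_url)

Note that this version makes one request per page, including one extra request for the first page past the end of the results, which is the cost of not knowing the number of pages in advance.
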
