Wednesday, 4 September 2019

Web scraping from a list of URLs in a dataframe in R

I have a somewhat complex task that requires looking up a series of URLs contained in a dataframe, scraping some data from each URL, and then adding this data back into the original dataframe. Somehow I seem to have solved the most difficult part (the scraping), but I'm struggling to automate the task (which I suspect is actually quite straightforward).

Here's the situation: I have a data.frame comprising 12 variables and 44,000 rows. One of these variables, Programme_Synopsis_url, contains the URL to a programme on the BBC iPlayer.

I need to go to that URL, extract one piece of data (details of the channel), and then add it to a new column called Channel.

Here is some sample data (I apologise for the size / complexity of this sample, but I think it's necessary to share this to get the right solution):

df <- structure(list(Title = structure(c(3L, 7L, 5L, 2L, 6L, 6L, 1L, 
4L, 9L, 8L), .Label = c("Asian Provocateur", "Cuckoo", "Dragons' Den", 
"In The Flesh", "Keeping Faith", "Lost Boys? What's Going Wrong For Asian Men", 
"One Hot Summer", "Travels in Trumpland with Ed Balls", "Two Pints of Lager and a Packet of Crisps"
), class = "factor"), Series = structure(c(1L, 1L, 1L, 3L, 1L, 
1L, 2L, 2L, 1L, 1L), .Label = c("", "Series 1-2", "Series 4"), class = "factor"), 
    Programme_Synopsis = structure(c(2L, 5L, 4L, 6L, 1L, 1L, 
    8L, 7L, 9L, 3L), .Label = c("", "1. The Dragons are back - with big money on the table.", 
    "1/3 Proud. Meeting rednecks", "1/8 Faith questions everything when her husband goes missing", 
    "4/6 What Happens in Ibiza... Is Megan really a party animal?", 
    "Box Set. Dale plans to propose – but what does Ken think?", 
    "Box Set. For the undead... life begins again", "Box Set. Romesh... and mum", 
    "Series 1-9. Box Set"), class = "factor"), Programme_Synopsis_url = structure(c(6L, 
    9L, 4L, 8L, 1L, 1L, 3L, 7L, 2L, 5L), .Label = c("", "https://www.bbc.co.uk/iplayer/episode/b00747zt/two-pints-of-lager-and-a-packet-of-crisps-series-1-1-fags-shags-and-kebabs", 
    "https://www.bbc.co.uk/iplayer/episode/b06fq3x4/asian-provocateur-series-1-1-uncle-thiru", 
    "https://www.bbc.co.uk/iplayer/episode/b09rjsq5/keeping-faith-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/b0bdpvhf/travels-in-trumpland-with-ed-balls-series-1-1-proud", 
    "https://www.bbc.co.uk/iplayer/episode/b0bfq7y2/dragons-den-series-16-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p00szzcp/in-the-flesh-series-1-episode-1", 
    "https://www.bbc.co.uk/iplayer/episode/p06f52g1/cuckoo-series-4-1-lawyer-of-the-year", 
    "https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza"
    ), class = "factor"), Programme_Duration = structure(c(6L, 
    4L, 6L, 1L, 6L, 6L, 2L, 5L, 3L, 6L), .Label = c("25 mins", 
    "28 mins", "29 mins", "40 mins", "56 mins", "59 mins"), class = "factor"), 
    Programme_Availability = structure(c(4L, 2L, 1L, 6L, 4L, 
    4L, 5L, 6L, 5L, 3L), .Label = c("Available for 1 month", 
    "Available for 11 months", "Available for 17 days", "Available for 28 days", 
    "Available for 3 months", "Available for 5 months"), class = "factor"), 
    Programme_Category = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 
    1L, 1L, 1L, 1L), .Label = c("Box Sets", "Featured", "Most Popular"
    ), class = "factor"), Programme_Genre = structure(c(4L, 2L, 
    3L, 5L, 2L, 2L, 1L, 3L, 1L, 2L), .Label = c("Comedy", "Documentary", 
    "Drama", "Entertainment", "New SeriesComedy"), class = "factor"), 
    date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
    ), .Label = "13/08/2018", class = "factor"), rank = c(1L, 
    2L, 3L, 4L, 5L, 12L, 1L, 2L, 3L, 4L), row = c(1L, 1L, 1L, 
    1L, 1L, 3L, 4L, 4L, 4L, 4L), Box_Set = structure(c(1L, 1L, 
    1L, 2L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), class = "data.frame", row.names = c(NA, 
-10L))

Just to make things even more complicated(!), there are two different types of URLs. Some point to the episode page for a programme and some point to the main programme page (there is no difference in the URL syntax to distinguish between the two). This matters because the data I want to scrape (the name of the channel) is found in different places depending on whether the page is for an episode or for the main programme. I have written a script that gets this data for each of these types of pages:

library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports the %>% pipe

### Get Channel for main programme page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/programmes/b04d72n6'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
    html_text()

### Confirm Channel details ###
print(channel)


### Get Channel for episode page ###
### First, set URL ###
url <- 'https://www.bbc.co.uk/iplayer/episode/p06fvww2/one-hot-summer-series-1-4-what-happens-in-ibiza'
### Then, locate details of Channel via xpath ###
channel <- url %>%
    read_html() %>%
    html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
    html_text()

### Confirm Channel details ###
print(channel)

The question is: how do I automate this, loop through every URL (some 44,000), extract this data, and then add it to a new column called Channel?
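Here's a rough, untested sketch of the kind of helper I have in mind (the name get_channel() is my own invention): it tries the episode xpath first, falls back to the programme-page xpath, and returns NA if neither matches.

library(rvest)

## Hypothetical helper: returns the channel name for a URL, or NA if
## neither xpath matches (or the page fails to load)
get_channel <- function(url) {
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (is.null(page)) return(NA_character_)

    ## xpath for episode pages (from the script above)
    channel <- page %>%
        html_nodes(xpath = '//*[@id="main"]/nav/div/ul/li[1]/div/div/div/nav/ul/li[3]/a/span/span') %>%
        html_text()

    ## fall back to the xpath for main programme pages
    if (length(channel) == 0) {
        channel <- page %>%
            html_nodes(xpath = '//*[@id="br-masthead"]/div/div[1]/a/text()') %>%
            html_text()
    }

    if (length(channel) == 0) NA_character_ else channel[1]
}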

A few final concerns / caveats / questions:

  1. Will looking up and scraping data from 44,000 URLs cause any technical problems? I don't want to kill the BBC's servers or get my IP blocked for doing this! I have checked the terms and conditions of their website(s) and there is no mention of scraping that I can find.
  2. It might help to point out that although there are around 44,000 rows (of URLs) to check, many of these are duplicates. As such, I wonder if it would be better to begin by creating a new dataframe that removes the duplicates (e.g. based on the Programme_Synopsis_url or Title columns). That would mean scraping a much smaller number of URLs, after which I could merge the data back into the original dataframe, i.e. if the Title matches, add the value from the Channel column of the streamlined dataframe into a Channel column in the original dataframe (see the sketch after this list).
  3. I imagine I'm going to have to use some kind of loop with an if/else statement to do this, i.e. IF the URL contains a certain xpath then copy that data into the Channel column for that row, ELSE copy the data from the other xpath into the Channel column for that row. IF a page contains neither xpath (which could be possible) then do nothing. (The get_channel() sketch above follows this pattern.)
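Putting points 1 to 3 together, I imagine something roughly like the following (untested; it assumes dplyr and the hypothetical get_channel() helper sketched above, and merges on the URL rather than the Title so that different series of the same programme don't clash):

library(dplyr)

## Deduplicate first, so each URL is only requested once
lookup <- df %>%
    distinct(Programme_Synopsis_url) %>%
    filter(Programme_Synopsis_url != "")

## Scrape each unique URL, pausing between requests to be polite
urls <- as.character(lookup$Programme_Synopsis_url)
lookup$Channel <- vapply(urls, function(u) {
    Sys.sleep(1)  # rate-limit: roughly one request per second
    get_channel(u)
}, character(1))

## Merge the channel names back into the original dataframe by URL
df <- left_join(df, lookup, by = "Programme_Synopsis_url")

Even deduplicated, with a one-second pause per request this would take a while to run, so it might be worth writing the lookup table to disk as it goes.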

Hope that's all clear. Happy to elaborate if necessary.
