Wednesday, August 8, 2018

How to scrape page headers (i.e., html_nodes("h1")) from a list of HTML files using rvest?

The end goal of this question is to create a data frame with URLs in one column and the header of each page in another. I'll explain my approach, but feel free to answer using a list of websites instead of HTML files.

The reason for creating a list of HTML files is that some of the URLs throw an error when I call read_html on them directly; wrapping the download in try gets around that.

for (i in 1:nrow(uniques)) {
  # try() keeps one failed download from stopping the whole loop
  try(download.file(uniques$URL.Found.On[i],
                    destfile = paste("scrapedpage", i, "html", sep = "."),
                    quiet = TRUE))
}

This produces roughly 11,000 HTML files. However, perhaps because I used try, some of the downloads failed and left behind files that the reading loop below can't parse.
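(Side note: to flag the failed downloads before parsing, I think something like this would work; just a minimal sketch, assuming the naming scheme above and using base R's file.size(), which reports 0 bytes for an empty file:)

files <- paste("scrapedpage", 1:nrow(uniques), "html", sep = ".")
# Missing or zero-byte files are downloads that try() swallowed
bad <- !file.exists(files) | file.size(files) == 0
which(bad)

Anyway, here is the reading loop: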

for (i in 1:nrow(uniques)) {
  content[i] <- read_html(paste("scrapedpage", i, "html", sep = ".")) %>%
    html_nodes("h1") %>%
    html_text()
}

This works for the first three items in my list, so I know I'm on the right track, but it doesn't get through the entire list. I get the following error:

Error in content[i] <- read_html(paste("scrapedpage", i, "html", sep = ".")) %>% : replacement has length zero

Could it be that the 4th HTML file in the list has no "h1" header, or is there some other factor breaking the loop?
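To test that theory, I suppose I could count the h1 nodes in that one file (a minimal sketch; scrapedpage.4.html follows the naming scheme above):

library(rvest)

page <- read_html("scrapedpage.4.html")
# html_text() on zero matched nodes returns character(0),
# and assigning a length-zero value into content[i] is exactly
# what produces "replacement has length zero"
length(html_nodes(page, "h1"))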

Is there a way to just leave an NA when no "h1" is found, so that it doesn't break the for loop? Maybe with an if/else statement, along the lines of the sketch below? Any ideas are appreciated.
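Something like this is what I have in mind (a sketch only; it assumes the file naming scheme above, and joins multiple h1s into one string):

library(rvest)

# Pre-fill with NA so skipped pages stay NA
content <- rep(NA_character_, nrow(uniques))

for (i in 1:nrow(uniques)) {
  file <- paste("scrapedpage", i, "html", sep = ".")
  h1 <- tryCatch(
    read_html(file) %>% html_nodes("h1") %>% html_text(),
    error = function(e) character(0)  # unreadable file -> treat as no h1
  )
  if (length(h1) > 0) {
    content[i] <- paste(h1, collapse = " | ")  # keep all h1s, joined
  }                                            # else leave the NA
}

result <- data.frame(URL = uniques$URL.Found.On, header = content)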

Thanks in advance.
