I've got a simple Map function that scrapes text files from a blog site. It's pretty easy to write a scraper that grabs all of the text files and downloads them to my working directory. My goal: use ifelse() or a plain if statement to scrape a file only if it was posted on a certain date.
E.g., if four files were posted on 1/31/19 and I pointed my ifelse at that date, the function would return those four files. Code:
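library(tidyverse)
library(rvest)
library(httr)       # config() comes from httr
library(lubridate)  # parse_date_time() comes from lubridate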
# URL set up
url <- "https://www.example-blog/posts.aspx"
page <- html_session(url, config(ssl_verifypeer = FALSE))
# Picking elements
links <- page %>%
  html_nodes("td") %>%
  html_nodes("a") %>%
  html_attr("href")
# Getting date elements
dates <- page %>%
  html_nodes("node.dates") %>%
  html_text()
dates <- parse_date_time(dates, "%m/%d/%Y", tz = "EST",
                         locale = Sys.getlocale("LC_TIME"))
# Function
out <- Map(function(ln) {
  fun1 <- html_session(URLencode(
    paste0("https://www.example-blog", ln)),
    config(ssl_verifypeer = FALSE))
  write <- writeBin(fun1$response$content)
  ifelse(dates == '2019-01-31', write, "He's dead, Jim")
}, links)
I've tried various ways to get that if statement in there, and I've also tried moving the writeBin call around. (Normally the writeBin call would not be vectorized; I wrote it that way for easy viewing in my ifelse.)
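For context on why the ifelse placement is awkward, here is a small base R sketch of how ifelse() and if differ. The dates and target vectors are made-up stand-ins, not the scraped values. Note also that in the code above the writeBin line is evaluated before the ifelse is reached, so the test cannot prevent the write.

```r
# Hypothetical stand-in for the parsed dates (not the real scraped data).
dates  <- as.Date(c("2019-01-31", "2019-01-30", "2019-01-31"))
target <- as.Date("2019-01-31")

# ifelse() tests every element at once and returns one result per date,
# so inside Map() each link would get the whole vector back.
flags <- ifelse(dates == target, "write", "skip")
print(flags)

# if () expects a single TRUE/FALSE, so it needs the date of one link only.
i <- 2
one <- if (dates[i] == target) "write" else "skip"
print(one)
```

So a per-file decision needs each link paired with its own date, rather than a test against the full dates vector.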
If I leave out the if code, everything works great; it just returns many text files when I only want the ones from the specified date.
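One possible approach, sketched under assumptions: subset links by date before calling Map, so no if is needed inside the function at all. This assumes links and dates have the same length and line up one-to-one, which the post does not confirm. The sample vectors, download_one(), and fake_fetch() are hypothetical names for illustration; fake_fetch() stands in for the real request so the sketch runs offline.

```r
# Hypothetical stand-ins for the scraped vectors (assumed to line up 1:1).
links <- c("/files/a.txt", "/files/b.txt", "/files/c.txt", "/files/d.txt")
dates <- as.Date(c("2019-01-31", "2019-01-30", "2019-01-31", "2019-02-01"))
target <- as.Date("2019-01-31")

# Subset first, so Map() only ever sees links posted on the target date.
# The is.na() guard drops dates that failed to parse.
wanted <- links[!is.na(dates) & dates == target]

# Hypothetical helper: fetch is any function that takes a URL and
# returns the file contents as a raw vector.
download_one <- function(ln, fetch,
                         base = "https://www.example-blog",
                         dir = tempdir()) {
  dest <- file.path(dir, basename(ln))
  writeBin(fetch(utils::URLencode(paste0(base, ln))), dest)
  dest  # return the path that was written
}

# Offline stand-in for the real request; with the code from the post this
# would return html_session(u, config(ssl_verifypeer = FALSE))$response$content.
fake_fetch <- function(u) charToRaw(u)

out <- Map(function(ln) download_one(ln, fetch = fake_fetch), wanted)
```

With the POSIXct values that parse_date_time() returns, comparing as.Date(dates) against as.Date("2019-01-31") is probably safer than comparing against a bare string, since it avoids time zone surprises.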