Tuesday, 23 March 2021

Page number function is wrong in webscraper in R

library(rvest)
library(RCurl)
library(XML)
library(stringr)


# Get the highest page number shown in the pager of a results page

getPageNumber <- function(URL) {
  print(URL)
  parsedDocument <- read_html(URL)
  pageNumber <- parsedDocument %>%
    html_nodes(".al-pageNumber") %>%
    html_text() %>%
    as.integer()
  return(ifelse(length(pageNumber) == 0, 0, max(pageNumber)))
}

findURL <- function(year_chosen){
  if (year_chosen >= 1994) {
    noYearURL <- glue::glue("https://academic.oup.com/dnaresearch/search-results?rg_IssuePublicationDate=01%2F01%2F{year_chosen}%20TO%2012%2F31%2F{year_chosen}")
    pagesURl <- "&fl_SiteID=5275&page="
    URL <- paste(noYearURL, pagesURl, sep = "")
    # URL is working with parameter year_chosen
    firstPage <- getPageNumber(URL)
    paste(firstPage)
    
    if (firstPage == 5) {
      nextPage <- 0
      while (firstPage < nextPage | firstPage != nextPage) {
        firstPage <- nextPage
        URLwithPageNum <- paste(URL, firstPage-1, sep = "")
        nextPage <- getPageNumber(URLwithPageNum)
      }
    }
  } else {
    # The range message belongs to the year check, not to the page-count test
    print("The year you provided is out of range; this journal only contains articles from 1994 to present")
  }
}
findURL(2018) 

The code above is part of my web scraper. What I want is the number of search-result pages for the journal in a given year. I believe my getPageNumber is wrong: it only returns the page numbers visible in the pager on the first results page, not the total number of pages for that year.

My main function then builds the wrong URLs from those page numbers.

I should add that I want to grab at most 5 pages per year.

I would really appreciate any help! Thank you in advance.
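
To make the goal concrete, here is a minimal sketch of the behaviour I am after. It assumes the pager only lists a window of page links at a time, so the loop jumps to the highest page seen so far until no higher number appears, capped at 5. The names findLastPage, pageCounter, maxPages and fakeCounter are hypothetical and not part of my scraper, the example.org URL is a placeholder, and I am assuming the page parameter starts at 1. The stand-in counter lets the logic be checked offline; in the real scraper getPageNumber would be passed instead.

```r
# Hypothetical helper: follow the pager until the highest visible page number
# stops growing, never going past `maxPages`.
# `pageCounter` is any function(url) that returns the largest page number
# visible on that results page (for example getPageNumber from the code above).
findLastPage <- function(baseURL, pageCounter, maxPages = 5) {
  # Assumption: appending 1 to the base URL requests the first results page
  lastPage <- pageCounter(paste0(baseURL, 1))
  if (is.na(lastPage)) return(0)

  while (lastPage > 0 && lastPage < maxPages) {
    # Jump to the last page seen so far and look at its pager
    nextPage <- pageCounter(paste0(baseURL, lastPage))
    # Stop as soon as no higher page number shows up
    if (is.na(nextPage) || nextPage <= lastPage) break
    lastPage <- nextPage
  }
  min(lastPage, maxPages)
}

# Offline stand-in for the site's pager (made up for illustration):
# 8 pages in total, each page links at most two pages past the current one.
fakeCounter <- function(url) {
  current <- as.integer(sub(".*page=", "", url))
  min(current + 2L, 8L)
}

baseURL <- "https://example.org/search?page="  # placeholder URL
lastPage <- findLastPage(baseURL, fakeCounter)

# One URL per results page, from 1 up to the last page found
pageURLs <- paste0(baseURL, seq_len(lastPage))
print(pageURLs)
```

With the real site, the call would look like findLastPage(URL, getPageNumber), where URL is the string built inside findURL. Whether the site's pager really behaves like the stand-in is something I have not confirmed.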
