Sunday, 24 September 2017

Error in writing data frame in R

I'm trying to search for a word in text that I extract from OCR'd PDF files. Each PDF has multiple pages, so I search each page for the word. If the word is found, I want to break out of the page loop, write the file name and status (Present or Not Present) to a data frame, and move on to the next file. But the data frame shows the status "Present" for every file. I just want output like this:

file_name       Status
test1.pdf    "Not Present"
test2.pdf    "Not Present"
test3.pdf    "Present"

What am I missing in this code?

Here is the code:

library(pdftools)   # pdf_info(), pdf_convert()
library(tesseract)  # ocr()
library(stringi)    # stri_detect_fixed()

All_files <- Sys.glob("*.pdf")
df <- data.frame()
Status <- "Present"
for (i in 1:length(All_files))
{
  file_name <- All_files[i]

  cnt <- pdf_info(All_files[i])$pages
  for (j in 1:cnt)
  {
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check <- sapply(ocr_text, paste0, collapse = "")
    junk <- dir(path = "D:/All_PDF_Files/", pattern = "tiff")
    file.remove(junk)
    br <- if (length(which(stri_detect_fixed(tolower(check), tolower("school")))) <= 0) { print("Not Present") } else { print("Present") }
    if (br == "Present")
      df <- rbind(df, cbind(file_name, Status))
      break
  }
}

Any suggestions would be appreciated.

Thanks
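
A possible explanation, offered as a sketch rather than a definitive fix: Status is set to "Present" once and never changed, so any row that gets written can only say "Present". The if has no braces, so only the rbind line belongs to it and break runs unconditionally after the first page of every file. And no row is ever written for a file where the word is not found. The version below resets the status for each file, breaks only on a match, and writes exactly one row per file. The names word_in_pages and check_pdfs are hypothetical helpers introduced here, and the code assumes the pdftools and tesseract packages are installed.

```r
# Pure helper (hypothetical name): TRUE if `word` occurs, ignoring case,
# in any element of `page_texts` (a character vector, one element per page).
word_in_pages <- function(page_texts, word) {
  any(grepl(tolower(word), tolower(page_texts), fixed = TRUE))
}

# Hypothetical wrapper around the OCR loop from the question.
# Assumes pdftools and tesseract are installed; they are only needed
# when this function is actually called.
check_pdfs <- function(files, word = "school", dpi = 400) {
  rows <- lapply(files, function(f) {
    status <- "Not Present"                 # reset for every file
    for (j in seq_len(pdftools::pdf_info(f)$pages)) {
      img <- pdftools::pdf_convert(f, format = "tiff", pages = j, dpi = dpi)
      text <- tesseract::ocr(img)
      file.remove(img)                      # delete only the image just created
      if (word_in_pages(text, word)) {
        status <- "Present"
        break                               # stop scanning this file only
      }
    }
    # Exactly one row per file, whether or not the word was found
    data.frame(file_name = f, Status = status, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}
```

It could then be called as df <- check_pdfs(Sys.glob("*.pdf")). Building one row per file and binding them at the end also avoids growing the data frame inside the inner loop.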
