I'm trying to search for a word in text that I extract from OCR'd PDF files. Each PDF has multiple pages, so I search every page for the word. If the word is found, I break out of the page loop, write the file name and a status (Present or Not Present) to a data frame, and go on to the next file. But the data frame shows the status "Present" for all files. I want output like this:
file_name Status
test1.pdf "Not Present"
test2.pdf "Not Present"
test3.pdf "Present"
What am I missing in this code?
Here is the code:
All_files=Sys.glob("*.pdf")
df <- data.frame()
Status="Present"
for (i in 1:length(All_files))
{
  file_name <- All_files[i]
  cnt <- pdf_info(All_files[i])$pages
  for(j in 1:cnt)
  {
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check=sapply(ocr_text, paste0, collapse="")
    junk <- dir(path="D:/All_PDF_Files/", pattern="tiff")
    file.remove(junk)
    br=if(length(which(stri_detect_fixed(tolower(check),tolower("school")))) <= 0){ print("Not Present") } else {print("Present")}
    if(br=="Present")
      df <- rbind(df, cbind(file_name, Status))
    break
  }
}
Any suggestions are appreciated.
Thanks
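One possible reading of the problem, offered as a sketch rather than a definitive fix: `Status` is assigned "Present" once and never changes, a row is appended only when the word is found, and `break` sits outside the brace-less `if`, so the page loop always stops after the first page. The version below starts every file as "Not Present" and flips it on the first matching page. It assumes the `pdftools` and `tesseract` packages are installed. The helper names `page_has_word` and `scan_pdfs` are hypothetical, and base R's `grepl(fixed = TRUE)` is swapped in for `stri_detect_fixed` so that the matching helper has no package dependency.

```r
# Hypothetical helper: TRUE if `word` occurs anywhere in `text`, ignoring case.
page_has_word <- function(text, word) {
  any(grepl(tolower(word), tolower(text), fixed = TRUE))
}

# Hypothetical helper: returns one row per PDF with "Present" or "Not Present".
scan_pdfs <- function(files, word, dpi = 400) {
  # Every file starts as "Not Present" and is only changed on a match.
  status <- rep("Not Present", length(files))
  for (i in seq_along(files)) {
    n_pages <- pdftools::pdf_info(files[i])$pages
    for (j in seq_len(n_pages)) {
      # Render one page to an image, OCR it, then delete that image only.
      img_file <- pdftools::pdf_convert(files[i], format = "tiff",
                                        pages = j, dpi = dpi)
      text <- tesseract::ocr(img_file)
      file.remove(img_file)
      if (page_has_word(text, word)) {
        status[i] <- "Present"
        break  # stop at the first matching page of this file
      }
    }
  }
  data.frame(file_name = files, Status = status, stringsAsFactors = FALSE)
}
```

Calling `scan_pdfs(Sys.glob("*.pdf"), "school")` should then give one row per file, including the files in which the word never appears. Building the status vector up front also avoids growing the data frame with `rbind` inside the loop.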