Friday, 17 August 2018

Find a word on the first page of each list element in a data frame

I have multiple text documents stored in a list column of a data frame. The first page of each document states its title. I want to check, for each document, whether a given word appears on that first page (the title page) and return "yes" or "no", probably with an if_else statement.

My data frame looks like this (about 300 rows in total):

# A tibble: 294 x 3
   document                                text       pages
   <chr>                                   <list>     <int>
 1 's-Gravenhage_coalitieakkoord.pdf       <chr [88]>    88
 2 Aa en Hunze_coalitieakkoord.pdf         <chr [10]>    10
 3 Achtkarspelen_coalitieakkoord.pdf       <chr [26]>    26
 4 Alblasserdam_coalitieakkoord.pdf        <chr [13]>    13
 5 Albrandswaard_coalitieakkoord.pdf       <chr [16]>    16
 6 Alkmaar_coalitieakkoord.pdf             <chr [16]>    16
 7 Almelo_coalitieakkoord.pdf              <chr [32]>    32
 8 Almere_coalitieakkoord.pdf              <chr [18]>    18
 9 Alphen aan den Rijn_coalitieakkoord.pdf <chr [32]>    32
10 Alphen-Chaam_raadsakkoord.pdf           <chr [7]>      7
# ... with 284 more rows

The following code checks whether the word appears anywhere in the document. How can I restrict the check to the first page of each document?

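library(tibble)  # data_frame()
library(dplyr)   # if_else()

# names: file names; pdfs_text: list with one character vector (one string per page) per document
data_frame(document = names,
           text = pdfs_text,
           pages = lengths(text),   # number of pages per document
           # grepl() here searches every page of each document, not just the first
           word_in_title = if_else(grepl("duurzaam|Duurzaam", text),
                                   "yes", "no"))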

   document                                text       pages word_in_title
   <chr>                                   <list>     <int> <chr>        
 1 's-Gravenhage_coalitieakkoord.pdf       <chr [88]>    88 yes          
 2 Aa en Hunze_coalitieakkoord.pdf         <chr [10]>    10 yes          
 3 Achtkarspelen_coalitieakkoord.pdf       <chr [26]>    26 yes          
 4 Alblasserdam_coalitieakkoord.pdf        <chr [13]>    13 yes          
 5 Albrandswaard_coalitieakkoord.pdf       <chr [16]>    16 no           
 6 Alkmaar_coalitieakkoord.pdf             <chr [16]>    16 yes 
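
To make the goal concrete, here is a minimal sketch of restricting the search to the first page. It uses base R only and a small made-up list in place of the real `pdfs_text`; it assumes, as the `pages` column suggests, that each list element is a character vector with one string per page. The variable `first_pages` is a name introduced here for illustration.

```r
# Toy stand-in for the real data (made-up content, one string per page).
pdfs_text <- list(
  c("Duurzaam vooruit", "page two"),
  c("Coalitieakkoord", "duurzaam beleid on page two")
)

# Keep only the first page of every document; an empty document yields NA.
first_pages <- vapply(pdfs_text, function(pages) pages[1], character(1))

# Search the first pages only; ignore.case covers "duurzaam" and "Duurzaam".
word_in_title <- ifelse(grepl("duurzaam", first_pages, ignore.case = TRUE),
                        "yes", "no")
word_in_title
```

The second toy document contains the word only on page two, so a whole-document search would flag it while this first-page search does not.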

It would probably also be best to wrap this in a function rather than hard-coding it in the data frame construction, so that I can check multiple words.
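
The kind of function meant here might look like the following sketch. The name `word_on_first_page` and its arguments are hypothetical, the toy `docs` list is made up, and the words are treated as regular expressions joined with `|`, so special characters would need escaping.

```r
# Hypothetical helper: for each document, report "yes" if any of `words`
# occurs on its first page, otherwise "no".
# text:  list of character vectors, one string per page (assumed layout)
# words: character vector of words, interpreted as regular expressions
word_on_first_page <- function(text, words) {
  pattern <- paste(words, collapse = "|")
  first_pages <- vapply(text, function(pages) pages[1], character(1))
  ifelse(grepl(pattern, first_pages, ignore.case = TRUE), "yes", "no")
}

# Made-up example data.
docs <- list(
  c("Duurzaam vooruit", "x"),
  c("Klimaat en energie", "duurzaam"),
  c("Coalitieakkoord", "duurzaam")
)

single <- word_on_first_page(docs, "duurzaam")
multiple <- word_on_first_page(docs, c("duurzaam", "klimaat"))
```

Because the function is vectorised over the list, it could be called inside the data frame construction as `word_in_title = word_on_first_page(text, c("duurzaam", "klimaat"))`.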

All help much appreciated!
