I have multiple text documents stored in a list column of a data frame: one character vector per document, with one element per page. The first page of each document states its title. I want to check, for each document, whether a word appears on that first page (the title page) and return "yes" or "no", probably with an if_else statement.
My data frame looks like this (the real one has about 300 rows):
# A tibble: 294 x 3
   document                                text       pages
   <chr>                                   <list>     <int>
 1 's-Gravenhage_coalitieakkoord.pdf       <chr [88]>    88
 2 Aa en Hunze_coalitieakkoord.pdf         <chr [10]>    10
 3 Achtkarspelen_coalitieakkoord.pdf       <chr [26]>    26
 4 Alblasserdam_coalitieakkoord.pdf        <chr [13]>    13
 5 Albrandswaard_coalitieakkoord.pdf       <chr [16]>    16
 6 Alkmaar_coalitieakkoord.pdf             <chr [16]>    16
 7 Almelo_coalitieakkoord.pdf              <chr [32]>    32
 8 Almere_coalitieakkoord.pdf              <chr [18]>    18
 9 Alphen aan den Rijn_coalitieakkoord.pdf <chr [32]>    32
10 Alphen-Chaam_raadsakkoord.pdf           <chr [7]>      7
# ... with 284 more rows
The following code checks whether the word occurs anywhere in the document. How can I restrict the check to the first page of each document?
data_frame(document = names,
           text = pdfs_text,
           pages = lengths(text),
           word_in_title = if_else(grepl("duurzaam|Duurzaam", text),
                                   "yes", "no"))
  document                          text       pages word_in_title
  <chr>                             <list>     <int> <chr>
1 's-Gravenhage_coalitieakkoord.pdf <chr [88]>    88 yes
2 Aa en Hunze_coalitieakkoord.pdf   <chr [10]>    10 yes
3 Achtkarspelen_coalitieakkoord.pdf <chr [26]>    26 yes
4 Alblasserdam_coalitieakkoord.pdf  <chr [13]>    13 yes
5 Albrandswaard_coalitieakkoord.pdf <chr [16]>    16 no
6 Alkmaar_coalitieakkoord.pdf       <chr [16]>    16 yes
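One way to restrict the check to the title page is to pull the first element out of each character vector before calling grepl(). The following is a minimal base-R sketch, assuming each element of the list holds one string per page; the toy `pdfs_text` below is made-up data for illustration, not the real documents.

```r
# Toy stand-in for the real list column: one character vector per document,
# one string per page (hypothetical content, for illustration only).
pdfs_text <- list(
  c("Duurzaam vooruit: coalitieakkoord", "Inhoud ...", "Wij werken duurzaam."),
  c("Coalitieakkoord 2018-2022", "Een duurzaam beleid ..."),
  character(0)  # a document with no extracted pages
)

# Take page 1 of every document. p[1] is NA for an empty document,
# so vapply() always receives a length-one character value.
first_pages <- vapply(pdfs_text, function(p) p[1], character(1))

# grepl() returns FALSE for NA input, so empty documents end up as "no".
word_in_title <- ifelse(grepl("duurzaam", first_pages, ignore.case = TRUE),
                        "yes", "no")
word_in_title
```

Only the first document mentions the word on its first page; the second mentions it on page 2, so it is reported as "no". Using `ignore.case = TRUE` is a design choice here that replaces the "duurzaam|Duurzaam" alternation and also matches other capitalisations.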
It would probably also be better to wrap this in a function rather than hard-coding it in the data frame construction, so that I can check multiple words.
All help much appreciated!
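A possible shape for such a function, sketched in base R. The name `words_on_first_page`, its arguments, and the toy input are all hypothetical; the sketch assumes the list holds one character vector per document with one string per page, and it treats each search word as a regular expression.

```r
# Hypothetical helper: for each word, report "yes"/"no" depending on whether
# it occurs on the first page of each document.
#   text  : list of character vectors (one string per page)
#   words : character vector of search terms, treated as regular expressions
words_on_first_page <- function(text, words, ignore_case = TRUE) {
  # First page of every document (NA when a document has no pages).
  first_pages <- vapply(text, function(p) p[1], character(1))
  # One "yes"/"no" vector per search word.
  hits <- lapply(words, function(w) {
    ifelse(grepl(w, first_pages, ignore.case = ignore_case), "yes", "no")
  })
  names(hits) <- paste0(words, "_in_title")
  as.data.frame(hits, stringsAsFactors = FALSE)
}

# Toy input, for illustration only.
pdfs_text <- list(
  c("Duurzaam vooruit: coalitieakkoord", "Inhoud"),
  c("Samen sterk", "Een duurzaam beleid")
)
result <- words_on_first_page(pdfs_text, c("duurzaam", "samen"))
result
```

The result has one column per word, so it can be attached to the existing data frame with cbind() or dplyr::bind_cols().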