I have a very large data frame with fish species captured as one of the columns. Here is a very shortened example:
ID = seq(1,50,1)
fishes = c("bass", "jack", "snapper")
common = sample(fishes, size = 50, replace = TRUE)
dat = data.frame(ID, common) # data.frame() keeps ID numeric; cbind() would coerce both columns to character
I want to remove any species that makes up less than a certain percentage of the data. In this example, say I want to remove all species that make up less than 30% of the data:
library(dplyr)
nrow(filter(dat, common == "bass")) #22 rows -> 22/50 -> 44%
nrow(filter(dat, common == "jack")) #12 rows -> 12/50 -> 24%
nrow(filter(dat, common == "snapper")) #16 rows -> 16/50 -> 32%
Here, jacks make up less than 30% of the rows, so I want to remove all the rows with jacks (that is, any species with fewer than 15 rows). This is easy to do here, but in reality I have over 700 fish species in my data frame, and I want to throw out all species that make up less than 1% of the data (which in my case means fewer than 18,003 rows). Is there a streamlined way to do this without having to filter out each species individually?
I imagine perhaps some kind of loop that says if the number of rows for common name = "x" is less than 18003, remove those rows...
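For reference, the rule described above (drop every species whose share of all rows is below a cutoff) can be sketched without a loop. This is only a minimal sketch under some assumptions: the names `min_share`, `dat_kept` and `dat_kept_base` are made up for illustration, `set.seed()` is added only so the toy data is repeatable (the counts will differ from the ones quoted above), and a species sitting exactly at the cutoff is kept.

```r
library(dplyr)

# Toy data mirroring the example; set.seed() is an addition for repeatability
set.seed(1)
fishes <- c("bass", "jack", "snapper")
dat <- data.frame(ID = 1:50,
                  common = sample(fishes, size = 50, replace = TRUE))

# Hypothetical threshold name; 0.30 matches the example, 0.01 would be the 1% case
min_share <- 0.30

# dplyr: n() is the number of rows in the current species group,
# nrow(dat) is the total number of rows in the full data frame
dat_kept <- dat %>%
  group_by(common) %>%
  filter(n() / nrow(dat) >= min_share) %>%
  ungroup()

# Base R equivalent: compute each species' share once, then keep matching rows
share <- prop.table(table(dat$common))
dat_kept_base <- dat[dat$common %in% names(share)[share >= min_share], ]
```

Both versions compute the share per species in a single pass, so the same lines should apply to 700 species by changing only `min_share`.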