Thursday, June 1, 2017

R - Replace observations with dummy if in top x% of var

I have some data in a large data frame (about 80x300) that looks something like this:

dum <- data.frame(id=c("a", "b", "c", "d", "e"),
                 v1=c(2, 7, 8, 5, 0),
                 v2=c(9, 2, 4, 6, 1),
                 v3=c(2, 2, 6, 1, 7))

I would like to turn each variable into a dichotomous variable indicating whether each observation is in the top 20% of that variable. (I'll later merge the dummy data set with the raw data set, which is unimportant for now, but if anyone wants to get creative that's the full plan.) The output data frame should look something like this:

id     v1     v2     v3
a      0      1      0
b      0      0      0
c      1      0      0
d      0      0      0
e      0      0      1

My attempt at this looks like the following:

top <- 20  # set percentage
for(i in 2:ncol(dum)) {
  for(j in 1:nrow(dum)) {
    ifelse(dum[j,i]>=unname(quantile(dum[,i],probs=((100-top)/100))), dum[j,i]<-1, dum[j,i]<-0)
  }
}

However, when I run this loop, some columns end up with more ones than desired while others have exactly the number I want. Instead of the desired output above, I get this:

id     v1     v2     v3
a      0      1      0
b      0      0      0
c      1      0      0
d      1      1      0
e      0      1      1

Can anyone help identify where I am going wrong? A few notes: 1) I am prepared to get hated on for using loops, especially nested loops, but it's something I'm familiar with and computational time is not a concern here. 2) Based on my googling, it seems the apply family of functions could be useful, but I'm not very familiar with them so I wouldn't know where to begin. 3) I included the unname() call as an attempted fix, but the loop runs the same with or without it. 4) The YES/NO part of the ifelse() statement looks funny to me, but when I tried ifelse(cond, 1, 0) it didn't make any changes to the data frame, and I didn't understand why.

Thanks!
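
One reading of the loop that fits the output shown above: quantile() is called again for every cell, on a column that earlier iterations have already partly overwritten with 0s and 1s, so the cutoff drifts downward as the loop proceeds. Tracing v1 by hand with R's default quantile type, the cutoff starts at 7.2, falls to 5.6 by row c and to 1.8 by row d, which gives 0, 0, 1, 1, 0 as in the undesired table. Separately, ifelse(cond, 1, 0) only returns a value; nothing changes unless that value is assigned back to the cell. The sketch below shows two possible fixes under those assumptions: the same nested loop with the cutoff computed once per column from the untouched data, and a vectorized variant using lapply(). The names fixed, vec, cutoff and top_flag are made up for illustration.

```r
# Sample data from the question.
dum <- data.frame(id = c("a", "b", "c", "d", "e"),
                  v1 = c(2, 7, 8, 5, 0),
                  v2 = c(9, 2, 4, 6, 1),
                  v3 = c(2, 2, 6, 1, 7))
top <- 20  # percentage, as in the question

# Loop version: compute each column's cutoff once, from the original data,
# before any cell in the working copy is overwritten.
fixed <- dum
for (i in 2:ncol(fixed)) {
  cutoff <- unname(quantile(dum[, i], probs = (100 - top) / 100))
  for (j in 1:nrow(fixed)) {
    # ifelse() returns 1 or 0; the assignment is what changes the data frame.
    fixed[j, i] <- ifelse(dum[j, i] >= cutoff, 1, 0)
  }
}

# Vectorized version: the comparison works on a whole column at once.
# top_flag is a hypothetical helper, not a base R function.
top_flag <- function(x, pct) {
  as.numeric(x >= unname(quantile(x, probs = (100 - pct) / 100)))
}
vec <- dum
vec[-1] <- lapply(dum[-1], top_flag, pct = top)  # every column except id

print(fixed)
print(vec)
```

On the sample data both versions give the desired table from the question (a single 1 per column, at rows c, a and e for v1, v2 and v3). Note that with ties at the cutoff, >= can still flag more than 20% of the rows.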
