Friday, 3 May 2019

In R, how do I classify each row of a data frame based on the bin its values fall into?

In R, I want to classify each row of a data frame by binning its values and using the number of values in each bin to assign the row to one of 2 groups (classes) with if-else logic.

Within an R for-loop, I used the R cut and split commands to bin the values by row.
The bins (ranges) are: 1..9, 10..19, 20..29, 30..39, 40..49.
If a row contains 1 pair of values falling in the same bin (range), say 10..19, then it should be classified as "P". If it contains 2 pairs falling into 2 different bins (ranges), then it should be classified as "PP".
Then I created 2 logical statements that use the sum of the values in each bin to create 2 new variables named p and pp returning TRUE or FALSE. Finally, I used p and pp as conditions in the if-else statement to assign each row to either class P (1st row), or class PP (2nd row).

First, I created a data frame x:

n1 <- c(1, 7); n2 <- c(2, 11); n3 <- c(10, 14); n4 <- c(23, 32); n5 <- c(37, 37); n6 <- c(45, 41)
x <- data.frame(n1, n2, n3, n4, n5, n6)
x
  n1 n2 n3 n4 n5 n6
1  1  2 10 23 37 45
2  7 11 14 32 37 41

The 1st row should be classified as "P", because it has 1 pair of values (1, 2) falling in the same bin, 1..9.
The 2nd row should be classified as "PP", because it has 2 pairs of values (11, 14 and 32, 37) falling in 2 bins: 10..19 and 30..39, respectively.
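
The expected classes can be checked by counting how many values of each row land in each bin. This is only a minimal sketch: the helper name count_bins and the variables breaks and bin_counts are made up for illustration, and the break points are the same ones passed to cut in the loop further down.

```r
# Same data as in the question.
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11), n3 = c(10, 14),
                n4 = c(23, 32), n5 = c(37, 37), n6 = c(45, 41))

# Break points for the bins 1..9, 10..19, 20..29, 30..39, 40..49.
breaks <- c(0, 9, 19, 29, 39, 49)

# Hypothetical helper: number of values falling into each bin.
# table() keeps empty factor levels, so empty bins are reported as 0.
count_bins <- function(values, breaks) {
  as.vector(table(cut(values, breaks)))
}

# One row of counts per row of x, one column per bin.
bin_counts <- t(apply(x, 1, count_bins, breaks = breaks))
print(bin_counts)
# row 1: 2 1 1 1 1  (one pair              -> "P")
# row 2: 1 2 0 2 1  (two pairs, one empty  -> "PP")
```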

So, after creating the data frame x, I created a for-loop:

for(i in nrow(x)){

# binning the data:
  bins <- split(as.numeric(x[i, ]), cut(as.numeric(x[i, ]), c(0, 9, 19, 29, 39, 49)))

  p <- (sum(lengths(bins) == 2) == 1 & sum(lengths(bins) == 1) == 4) # P - pair of one color
  pp <- (sum(lengths(bins) == 2) == 2 & sum(lengths(bins) == 1) == 2 & sum(lengths(bins) == 0) == 1) # PP - pair of two colors

  if(p){
    x$types <- "P"
  } else if(pp){
    x$types <- "PP"
  } else{
    stop("error")
  }
  }

print(x)

I want to create a new column named types, holding the class P or PP:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 P
2  7 11 14 32 37 41 PP

Instead the code returned only PP:

  n1 n2 n3 n4 n5 n6 types
1  1  2 10 23 37 45 PP
2  7 11 14 32 37 41 PP

I suspect this is because the loop runs twice over the rows. But if it runs only once, all the rows are classified as "P" instead of "PP". I expect it's something very simple, but I have not been able to figure it out so far.
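
Looking at the code as posted, two things seem to explain the result. for(i in nrow(x)) iterates over the single value nrow(x), which is 2, so only the last row is ever examined. And x$types <- "PP" assigns to the whole column, so the one value is recycled across all rows. A possible fix, sketched under the assumption that every row must be either P or PP, is to loop over seq_len(nrow(x)) and store one result per row. The names breaks, types, values and n are introduced here for illustration only.

```r
x <- data.frame(n1 = c(1, 7), n2 = c(2, 11), n3 = c(10, 14),
                n4 = c(23, 32), n5 = c(37, 37), n6 = c(45, 41))
breaks <- c(0, 9, 19, 29, 39, 49)

# Collect one class per row in a separate vector, so the numeric
# columns of x stay untouched while the loop is running.
types <- character(nrow(x))

for (i in seq_len(nrow(x))) {            # visits 1, 2, ..., nrow(x)
  values <- unlist(x[i, ], use.names = FALSE)
  # Number of values in each bin; empty bins are kept with length 0.
  n <- lengths(split(values, cut(values, breaks)))

  if (sum(n == 2) == 1 && sum(n == 1) == 4) {
    types[i] <- "P"                      # exactly one pair
  } else if (sum(n == 2) == 2 && sum(n == 1) == 2 && sum(n == 0) == 1) {
    types[i] <- "PP"                     # two pairs in two different bins
  } else {
    stop("row ", i, " matches neither P nor PP")
  }
}

x$types <- types                         # add the column once, after the loop
print(x)
```

With the two example rows this gives "P" for the 1st row and "PP" for the 2nd. The scalar && is used instead of & because each condition is a single TRUE or FALSE.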
