Thursday, June 8, 2017

Splitting a dataset into training, cross-validation and test sets in R: ifelse returns an unexpected result

I want to write a function that splits a data frame into training, cross-validation and test sets.

My code so far, illustrated with a small dataset, is the following:

    library(ISLR)
    library(data.table)
    data <- Auto

    seed <- 12
    train <- 0.7
    test <- 0.6

    # Function_split_test_train_regression <- function(data, train, test, seed){

      set.seed(seed)
      setDT(data)
      data[, index := row.names(data)]
      train_index <- sample(data$index, train * nrow(data))
      test_index <- ifelse(test == 1, setdiff(data$index, train_index), 
                                      sample(setdiff(data$index, train_index),  test * length(setdiff(data$index, train_index))))  
    # etc
    #}

At this point I ran some checks and got a result that surprised me:

    > test == 1
    [1] FALSE
    > sample(setdiff(data$index, train_index),
             test * length(setdiff(data$index, train_index)))
     [1] "225" "186" "41"  "381" "356" "178" "147" "158" "21"  "259" "207" "159" "250" "167" "128" "218" "271" "197" "376" "19"  "77"
    [22] "205" "46"  "3"   "212" "238" "61"  "11"  "68"  "130" "200" "274" "127" "305" "201" "32"  "48"  "184" "290" "349" "155" "370"
    [43] "366" "333" "243" "161" "108" "65"  "125" "306" "357" "189" "337" "118" "364" "6"   "149" "87"  "252" "194" "362" "383" "93"
    [64] "38"  "18"  "322" "220" "307" "60"  "353"
    > test_index <- ifelse(test == 1, setdiff(data$index, train_index),
                           sample(setdiff(data$index, train_index),
                                  test * length(setdiff(data$index, train_index))))
    > test_index
    [1] "219"

Why does ifelse return only "219" rather than the whole vector passed as its last argument (the value for a FALSE condition), given that test == 1 evaluates to FALSE?

Your advice will be appreciated.
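
For reference, the behaviour can be reproduced without the dataset. The sketch below rests on two assumptions that are not in the original post: `remaining` is a made-up stand-in for the non-training row indices, and `split_indices` is a hypothetical helper that reads `test` as the fraction of non-training rows assigned to the test set, with the remainder going to cross-validation. It shows that `ifelse()` returns a result shaped like its condition (length 1 here), whereas a plain `if`/`else` returns the whole branch value.

```r
# Toy stand-ins for the leftover row indices in the question (made-up values).
remaining <- c("3", "7", "11", "19", "42")
test <- 0.6

# ifelse() is vectorised over its condition: the result has the same length
# as `test == 1`, which is 1, so only the first sampled element is kept.
set.seed(12)
via_ifelse <- ifelse(test == 1, remaining, sample(remaining, 3))
length(via_ifelse)   # 1

# if / else tests a single condition and returns the chosen branch in full.
set.seed(12)
via_if <- if (test == 1) remaining else sample(remaining, 3)
length(via_if)       # 3

# Hypothetical helper (an assumption, not the original function):
# splits row numbers 1..n into train, test and cross-validation indices.
split_indices <- function(n, train, test, seed) {
  set.seed(seed)
  all_index <- seq_len(n)
  train_index <- sample(all_index, floor(train * n))
  rest <- setdiff(all_index, train_index)
  test_index <- if (test == 1) {
    rest
  } else {
    # Indexing via sample.int() avoids sample()'s special handling
    # of a single number as input.
    rest[sample.int(length(rest), floor(test * length(rest)))]
  }
  list(train = train_index,
       test  = test_index,
       cv    = setdiff(rest, test_index))
}

# 392 is used here as an example row count.
parts <- split_indices(n = 392, train = 0.7, test = 0.6, seed = 12)
lengths(parts)
```

With the same seed, `via_ifelse` is simply the first element of `via_if`, which matches the single value seen in the output above. The helper works on row numbers rather than a character `index` column, which is a design choice of this sketch and not something the original code requires.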
