mardi 24 novembre 2020

Apply a model and extract residuals in loop on data frame and make a function out of it in R

I have a dataset of 21 variables and 182 observations. First I need to subset my data by the column "Substrate". Then, for each response variable in the dataset (14 out of the 21), I want to compute a gls() model, extract the residuals and make a new data frame with the residuals of each response variable. I managed to do each of these steps seperately with for-loops and if-statements and it gives me the right output, but when I try to turn it into a function, it doesn't compute all the steps.

This is what my data looks like :

> head(data_soil)
  Position Table Treatment Rep Genotype        Origin     Substrate   TotalC  TotalN TotalS pHH2O pHCaCl2         P          K       Ca         Mg           Mn        Al       Fe         Na      CEC
1        1     1      25-H   1   ABI-25 La Corne Mine NA Waste Rock  0.20885 0.00000   4.32  3.43    3.38  14.14522 0.02588793 17.60550 0.09964484 0.0002030385 0.4408331 2.953925 0.04492462 21.17092
2        2     1    MTN4-L   3     MTN4       Eastern  A Waste Rock  3.04490 0.03996   4.79  3.56    3.44  17.59083 0.14421424 11.34370 0.34310013 0.0018679479 1.1641636 3.775913 0.14227578 16.91523
3        3     1    MIN3-H   3     MIN3       Central NA Waste Rock  0.12163 0.00000   4.46  3.30    3.23  14.33419 0.02193277 16.10658 0.09913785 0.0000000000 0.4265272 3.526153 0.04765741 20.22765
4        4     1    WOL7-H   3     WOL7       Western NA Waste Rock  0.38140 0.00000   3.85  3.29    3.22  16.94811 0.02768036 10.51024 0.11555323 0.0000000000 0.4307781 3.293853 0.03481761 14.41262
5        5     1      13-C   2   ABI-13      Westwood       Control 29.24300 0.53669   0.17  4.13    3.65 142.29403 1.47436677 20.20521 6.68630402 0.2529050181 4.4632055 2.533394 4.26399368 39.87938
6        6     1    BOY3-H   1     BOY3       Western NA Waste Rock  0.14188 0.00000   4.38  3.22    3.18  15.39028 0.01738542 16.72458 0.10505116 0.0000000000 0.4080436 3.454034 0.03663752 20.74553

And this is the code I wrote :

get_residuals <- function(dataset,    # name of the data.frame
                          substrate,  # name of the substrate by which the data is to be subseted
                          colnum) {   # first column in the data.frame containing response variable
  
  # SUBSET
  if (substrate == "Tailings") {
    data_subset <- subset(dataset, Substrate == "Tailings")
  } else if (substrate == "Control") {
    data_subset <- subset(dataset, Substrate == "Control")
  } else if (substrate == "NAWR") {
    data_subset <- subset(dataset, Substrate == "NA Waste Rock")
  } else if (substrate == "AWR") {
    data_subset <- subset(dataset, Substrate == "A Waste Rock")
  } 

  # CREATE A LIST OF DATA FRAMES FOR EACH VARIABLE
  DF.ls <- list()
  for (i in colnum:length(data_subset)){
    DF.ls[[i]] <- data.frame(Genotype = data_subset$Genotype,
                             y = data_subset[[i]])
  }
  DF.ls <- DF.ls[-c(1:colnum-1)]
  
  # COMPUTE MODEL
  model.ls <- list()
  for (i in 1:length(DF.ls)){
    model.ls[[i]] <- gls(y ~ Genotype,
                         data = DF.ls[[i]],
                         na.action = "na.exclude")
  }
  
  # EXTRACT RESIDUALS
  resid.ls <- list()
  for (i in 1:length(model.ls)){
    resid.ls[[i]] <- rbind(residuals(model.ls[[i]], type = "pearson"))
  }
  
  # MAKE DATA FRAME
  data_frame = data.frame(t(as.data.frame(do.call(rbind, resid.ls))))
  colnames(data_frame) <- rbind(colnames(data_subset[colnum:length(data_subset)]))
  
}

I tried running the code like this :

data_tailings_resid <- get_residuals(dataset = data_soil, 
                                     substrate = "Tailings",
                                     colnum = 8)

The output seems to only be the last step of my code :

> data_tailings_resid
     [,1]     [,2]     [,3]     [,4]    [,5]      [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
[1,] "TotalC" "TotalN" "TotalS" "pHH2O" "pHCaCl2" "P"  "K"  "Ca" "Mg" "Mn"  "Al"  "Fe"  "Na"  "CEC"

This is output I was expecting :

> head(data_frame)
      TotalC        TotalN      TotalS      pHH2O       pHCaCl2             P          K         Ca          Mg         Mn         Al          Fe         Na        CEC
1 -1.1793447 -1.542095e-15  1.29958891 -1.5895147 -1.447781e+00 -5.398460e-16  1.6150879 -0.6963919  1.22550246  1.2664670  0.4755890  0.89620191 0.86626534 -0.1737194
2 -0.2379789 -1.927619e-16  0.13563549  0.3725425  5.161653e-01 -2.024422e-16  0.8591796  1.2267315  1.39357556 -0.6936503  0.5313932 -0.35961939 1.48506380  1.3485134
3 -0.3113096 -1.349333e-15 -0.42197707 -0.5712318 -5.287547e-01 -1.494043e+00 -1.1051493 -0.1830617 -0.28885422  0.3832047 -0.2589761 -1.58296163 0.16422915 -0.4488072
4  0.8955341 -1.349333e-15  0.43704768 -0.2731978 -2.643774e-01  2.988086e+00  1.9158605  0.3972578  1.26406058 -0.5093381  1.5471558  1.30145762 0.93252397  1.1286771
5 -0.6525396 -1.527525e+00  0.04973301 -0.0745085 -3.354490e-15 -6.003841e-01 -0.2686305 -0.1455130 -0.06653803  0.5009465 -0.4714063  0.69187950 0.02845169 -0.1988274
6  0.1082666  0.000000e+00  0.59127026  0.6209042  6.420593e-01  1.349615e-16  0.2639568  0.8615335  0.85692038  1.5660618  0.5566046  0.06234642 0.51111909  1.0147509

I tried removing the steps one by one and the first two steps (#SUBSET and #CREATE A LIST OF DATA FRAMES FOR EACH VARIABLE) seem to work, but I after that I get an empty objet (or the previous output).

This is my first time trying to write a function, so I don't know what is wrong since there is no error. Is there an easier way to do this? I am simply missing a link between the steps?

Aucun commentaire:

Enregistrer un commentaire