Monday, 13 May 2019

Standardize columns in a dataframe by subsets obtained with breakpoints

This is quite hard to reproduce, but here is the setup:

I have a dataframe with 107 columns of monthly wind speed at weather stations (monthly data starting in 1961). I want to standardize each column with respect to the breakpoints in its time series. For example, if a column has its first BP in 1971-04, the standardization should use the mean and standard deviation from the first record (1961-01) up to the first BP (1971-04). If the second BP is in 1989-05, the mean and sd have to be computed from the first BP up to the second one. Then I replace the original data with the standardized values.
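Since the real data is hard to share, a small synthetic stand-in along these lines could be used to check any solution. Everything in it is invented for illustration: the station names `st_a` and `st_b`, the values, the level shift after 60 months, and the two missing readings.

```r
# Toy stand-in for the real data (all values are made up).
set.seed(1)

# 120 monthly dates starting in 1961-01
dates <- seq(as.Date("1961-01-01"), by = "month", length.out = 120)

df <- data.frame(
  date = dates,
  # st_a: level shift halfway through, so a breakpoint is expected there
  st_a = c(rnorm(60, mean = 3), rnorm(60, mean = 6)),
  # st_b: no shift, so no breakpoint is expected
  st_b = rnorm(120, mean = 4)
)

# A couple of missing readings, as real station records usually have
df$st_a[c(5, 70)] <- NA
```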

My code looks like this:

for (a in names(df[,2:ncol(df)])){
  print(a)
  stat <- df[,c('date',a)]
  bp <- breakpoints(stat[,2] ~ 1)
  bp <- bp$breakpoints  
  dates <- stat[bp,] # create a df with the breakpoints
  if(nrow(dates==0)){ # condition if a column does not have any BP
    stat[,2] <- (stat[,2] - mean(stat[,2], na.rm = T))/sd(stat[,2], na.rm = T)
    df[,a] <- stat[,2]
  } else {
    for (b in 1:nrow(dates)){
      print(b)
      if(b==1){
        substr <- stat[stat$date >= min(stat$date) & stat$date < dates$date[b],]
        substr[,2] <- (substr[,2] - mean(substr[,2], na.rm = T))/sd(substr[,2], na.rm = T)
        df[,a][df$date >= min(df$date) & df$date < dates$date[b]] <- substr[,2]
      } else if (b == nrow(dates)){
        substr <- stat[stat$date >= dates$date[b-1] & stat$date <= max(stat$date),]
        substr[,2] <- (substr[,2] - mean(substr[,2], na.rm = T))/sd(substr[,2], na.rm = T)
        df[,a][df$date >= dates$date[b-1] & df$date < max(stat$date)] <- substr[,2]
      } else if (b > 1) {
        substr <- stat[stat$date >= dates$date[b-1] & stat$date < dates$date[b],]
        substr[,2] <- (substr[,2] - mean(substr[,2], na.rm = T))/sd(substr[,2], na.rm = T)
        df[,a][df$date >= dates$date[b-1] & df$date < dates$date[b]] <- substr[,2]
      }
    }
  }
}

However, when I validate the results manually, the values are wrong. Does anyone have any tips to simplify this code (and make it work, of course)? Thanks
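
Some likely causes and a simpler approach, offered as a sketch rather than a verified fix. `nrow(dates==0)` returns the row count of a logical matrix, so the `if` is true whenever `dates` has at least one row and the whole-column branch runs even when breakpoints exist; `nrow(dates) == 0` was probably intended. The loop also visits only as many segments as there are breakpoints, although n breakpoints define n + 1 segments, and the last branch selects rows with `<= max(stat$date)` but assigns them with `< max(stat$date)`. The sketch below gives every row a segment id and standardizes within each id. It assumes `breakpoints()` is `strucchange::breakpoints()`, whose `$breakpoints` holds the index of the last observation of each segment (and `NA` when there are none). The names `standardize_by_segment`, `standardize_columns`, and `find_bp` are hypothetical.

```r
# Standardize x within segments. `bp` holds the index of the LAST observation
# of every segment except the final one (assumed strucchange convention);
# an empty or NA `bp` means the whole series is one segment.
standardize_by_segment <- function(x, bp = integer(0)) {
  bp <- bp[!is.na(bp)]
  # A new segment starts at row 1 and at the row after each breakpoint.
  seg <- cumsum(seq_along(x) %in% c(1, bp + 1))
  # ave() applies FUN per segment and returns a vector aligned with x.
  ave(x, seg, FUN = function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE))
}

# Apply to every column except `date`. `find_bp` is a user-supplied function
# that takes a numeric vector without NAs and returns breakpoint indices.
standardize_columns <- function(df, find_bp) {
  for (col in setdiff(names(df), "date")) {
    x  <- df[[col]]
    ok <- which(!is.na(x))     # rows that actually have a value
    bp <- ok[find_bp(x[ok])]   # map indices back to row numbers of df
    df[[col]] <- standardize_by_segment(x, bp)
  }
  df
}

# Hand check: one breakpoint after the third value
standardize_by_segment(c(1, 2, 3, 10, 20, 30), bp = 3)  # -1 0 1 -1 0 1
```

With strucchange installed, this could be called as `df <- standardize_columns(df, function(v) strucchange::breakpoints(v ~ 1)$breakpoints)`. The breakpoints are searched on the non-missing values only and then mapped back to row numbers, on the assumption that the formula interface drops `NA` rows before indexing. Note that under this convention the breakpoint month belongs to the earlier segment, whereas the original loop puts it in the later one.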
