Tuesday, 10 March 2020

Problem with counting null values in 'if statement' in R

I am passing some data to a simple code block in R which counts the null values and then performs an ARIMA time series imputation. I have written a very simple 'if' statement which counts the null values in each time series and, if the whole series is null, skips that column and moves on to the next one (the ARIMA imputation needs a certain amount of non-null data to work, otherwise it returns an error). Counting the nulls seems to work fine, but the 'if' statement is behaving very strangely. I added print statements that count the nulls both outside and inside the 'if' statement, and the code inside the 'if' block runs even when the condition should not be met. Here is the code and the output:

stations <- c('BX1', 'BX2', 'BG3') # each station has a different data file
pollutants <- c('nox','no2','pm10','pm25') # each station contains data on a number of pollutants
for (s in stations) {
  print(paste('starting imputation for station ', s, sep=" "))
  s_result <- read.csv(paste("/path/to/file", s, "_rescaled.csv", sep=""))
  for (p in pollutants) {
    ts = c()
    pcol = paste0(p,"_iqr",sep="") # find the right column
    ts = s_result[[pcol]]  # get the time series from the column
    print(pcol) # check which pollutant we're working on
    print(length(ts)) # test the length of the time series
    print(sum(is.na(ts))) # test the number of nulls in the time series
    if (sum(is.na(ts) != length(ts))) {       # if the time series is not completely null
      print(sum(is.na(ts)))            # print the null count again, inside the branch, for testing
      usermodel <- arima(ts, order = c(10, 1, 0))$model      # fit the ARIMA model
      p_result <- na_kalman(ts, model = usermodel, maxgap = 24)    # impute the gaps with a Kalman filter (na_kalman is from the imputeTS package)
      s_result <- cbind(s_result,p_result) # add the computed column to the dataframe
      names(s_result)[names(s_result) == "p_result"] <- paste0(p,"_imputed")
    } else { # otherwise add a null column
      p_result <- rep(NA, length(ts))  # one NA per row of the time series
      s_result <- cbind(s_result,p_result) # enter a null column
      names(s_result)[names(s_result) == "p_result"] <- paste0(p,"_imputed")
    }
  }
  filename = paste0("/path/to/file", s, "_imputed_test.csv", sep="")
  write.csv(s_result, filename, row.names = TRUE) 
  print(paste('completed imputation for station ', s, sep=" "))
}

The problem is that the 'if' statement is not working correctly: it passes data to the ARIMA imputation inside the 'if' block even when the number of nulls is equal to the length of the time series. Here's the output:

[1] "starting imputation for station  BG1"
[1] "nox_iqr"
[1] 17520
[1] 4660
[1] 4660
[1] "no2_iqr"
[1] 17520
[1] 4664
[1] 4664
[1] "pm10_iqr"
[1] 17520
[1] 17520
[1] 17520
Error in arima(ts, order = c(10, 1, 0)) : 'x' must be numeric

Clearly something is wrong: for the pm10 pollutant there are 17520 nulls, the same as the length of the time series. The 'if' block should therefore be skipped, and the line inside it that prints the null count a second time should not run. That is, for the pm10_iqr column the number of nulls is 17520 and the length of the time series is 17520, which would cause arima to fail, so the 'if' statement should bypass that code. But it does not.

Where am I going wrong, please? This should be very simple, but it does not make any sense! I don't write a lot of R code; I usually use Python. Thanks for your help!
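
One possible explanation, offered as a sketch rather than a confirmed diagnosis: in the condition `sum(is.na(ts) != length(ts))`, the closing parenthesis of `sum()` sits after the comparison, so R compares every TRUE/FALSE value from `is.na(ts)` against the length and then sums the results. The short vector below is a hypothetical stand-in for an all-null column such as pm10_iqr, and the variable names `as_written` and `as_intended` are made up for illustration.

```r
# Hypothetical stand-in for an all-NA column (length 5 instead of 17520)
ts <- c(NA, NA, NA, NA, NA)

# As written in the question: the comparison happens inside sum().
# Each TRUE is coerced to 1 and compared with 5, so every element is TRUE.
as_written <- sum(is.na(ts) != length(ts))

# With the parenthesis moved: count the NAs first, then compare with the length.
as_intended <- sum(is.na(ts)) != length(ts)

print(as_written)    # [1] 5, a non-zero number, which if() treats as TRUE
print(as_intended)   # [1] FALSE, so the 'if' block would be skipped
```

If this reading is right, the original condition is non-zero for any series longer than one element, which would match the output above, where the 'if' block runs for all three columns.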
