I am passing some data to a simple code block in R that counts the null values in each time series and then performs an ARIMA time series imputation. I have written a very simple 'if' statement that checks the number of nulls and, if there are too many, skips that column and moves on to the next one (the ARIMA imputation needs a certain amount of non-null data to work, otherwise it returns an error). Counting the nulls seems to work fine, but the if statement is behaving very strangely. I print the null count both outside and inside the if statement, and the body of the if statement runs even when the condition should not be satisfied. Here is the code and the output:
library(imputeTS) # provides na_kalman
stations <- c('BX1', 'BX2', 'BG3') # each station has a different data file
pollutants <- c('nox','no2','pm10','pm25') # each station contains data on a number of pollutants
for (s in stations) {
  print(paste('starting imputation for station ', s, sep=" "))
  s_result <- read.csv(paste("/path/to/file", s, "_rescaled.csv", sep=""))
  for (p in pollutants) {
    ts = c()
    pcol = paste0(p,"_iqr") # find the right column
    ts = s_result[[pcol]] # get the time series from the column
    print(pcol) # check which pollutant we're working on
    print(length(ts)) # test the length of the time series
    print(sum(is.na(ts))) # test the number of nulls in the time series
    if (sum(is.na(ts) != length(ts))) { # if the time series is not completely null
      print(sum(is.na(ts))) # print the number of nulls again, inside the if, for testing
      usermodel <- arima(ts, order = c(10, 1, 0))$model # fit the arima model
      p_result <- na_kalman(ts, model = usermodel, maxgap = 24) # impute the gaps using the arima model
      s_result <- cbind(s_result,p_result) # add the computed column to the dataframe
      names(s_result)[names(s_result) == "p_result"] <- paste0(p,"_imputed")
    } else { # otherwise add a null column
      p_result <- rep(NA, length(ts)) # a column of NAs, same length as the time series
      s_result <- cbind(s_result,p_result) # enter a null column
      names(s_result)[names(s_result) == "p_result"] <- paste0(p,"_imputed")
    }
  }
  filename = paste0("/path/to/file", s, "_imputed_test.csv")
  write.csv(s_result, filename, row.names = TRUE)
  print(paste('completed imputation for station ', s, sep=" "))
}
The problem is that the if statement passes the data to the arima imputation inside the if block even when the number of nulls equals the length of the time series. Here's the output:
[1] "starting imputation for station BG1"
[1] "nox_iqr"
[1] 17520
[1] 4660
[1] 4660
[1] "no2_iqr"
[1] 17520
[1] 4664
[1] 4664
[1] "pm10_iqr"
[1] 17520
[1] 17520
[1] 17520
Error in arima(ts, order = c(10, 1, 0)) : 'x' must be numeric
Clearly something is wrong: for the pm10 pollutant there are 17520 nulls, the same as the length of the time series. The body of the if statement, including the line that prints the number of nulls again, should therefore be bypassed. That is, for column pm10_iqr the number of nulls is 17520 and the length of the time series is 17520, which causes the arima to fail, so the if statement should skip that block. But it does not.
Where am I going wrong, please? This should be very simple but it does not make any sense! I don't write a lot of R code, usually Python. Thanks for your help!
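For reference, here is a minimal sketch of how the condition on the if line evaluates, using a short all-NA vector as a stand-in for the pm10_iqr column. The vector and the variable names are illustrative assumptions, not taken from the data above. In the condition as written, the closing parenthesis of sum() comes after length(ts), so each element of is.na(ts) is compared with length(ts) before anything is summed. The second form, with the parenthesis moved, is presumably what was intended.

```r
# Hypothetical stand-in for a completely null column (not real station data)
ts_all_na <- rep(NA, 5)

# As written in the loop: is.na() gives TRUE for every element, each TRUE (i.e. 1)
# is compared with length(ts) (5), and the resulting logical vector is summed
as_written <- sum(is.na(ts_all_na) != length(ts_all_na))

# Presumably intended: count the nulls first, then compare the count with the length
as_intended <- sum(is.na(ts_all_na)) != length(ts_all_na)

print(as_written)   # a non-zero number, which if() treats as TRUE
print(as_intended)  # a single logical value
```

With the first form, a non-zero sum is treated as TRUE by if(), which would be consistent with the arima line being reached for the all-null pm10_iqr column.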