dimanche 13 octobre 2019

R replace_na values conditionally by column with multiple conditions

My question is similar to other replace_na posts but I can't find the right combination of answers.

I have a dataframe with inflation rates for all countries over 8 years (wide format - countries as rows and years as columns).

  1. Some countries have NAs for all 8 years (columns 3:10), and in that case I want to replace all NAs with the column mean
library(tidyverse)

sample %>% 
  mutate_if((rowSums(is.na[,3:10]))!=8, replace_na = colMeans(na.rm=T)) 

This is close but something is wrong.

  1. Other countries only have NAs in some columns, in which case I want to replace NA with the previous year's value
library(zoo)
sample %>% 
  mutate_if((rowSums(is.na[,3:10]))!=8, replace_na = colMeans(na.rm=T)),
         is.na[,4:10], na.locf(fromLast = TRUE))) 

Tried using na.locf from the zoo package but can't get it right with the other conditions

  1. The final condition is that, if the NA is in the first year (2007), I want to replace it with the 2007 column mean instead of the next year (2008 was the financial crisis so all the inflation rates went nuts).
  mutate_if((rowSums(is.na[,3:10]))!=8, replace_na = colMeans(na.rm=T)),
         is.na[,4:10], na.locf(fromLast = TRUE)),
          is.na("2007"), replace = colMeans("2007", na.rm = TRUE)) 

But this is full of errors and I'm stuck trying to link all these conditions together - pretty new to ifelse statements. I'm trying to find a dplyr solution as that's the syntax I'm most familiar with, but maybe it's easier in base R or data.table

running R 3.6.1

sample <- structure(list(`Country Name` = c("Aruba", "Afghanistan", "Angola", 
"Albania", "Andorra", "Arab World", "United Arab Emirates", "Argentina", 
"Armenia", "American Samoa", "Antigua and Barbuda", "Australia"
), `Country Code` = c("ABW", "AFG", "AGO", "ALB", "AND", "ARB", 
"ARE", "ARG", "ARM", "ASM", "ATG", "AUS"), `2007` = c(5.39162036843645, 
8.68057078513406, 12.2514974459487, 2.93268248162318, NA, 4.74356585295154, 
NA, NA, 4.40736089644519, NA, 1.41605259409743, 2.32761128891476
), `2008` = c(8.95722105296535, 26.4186641547444, 12.4758291326398, 
3.36313757366391, NA, 11.2706652380848, 12.2504202448139, NA, 
8.94995335353386, NA, 5.33380639820232, 4.35029854990047), `2009` = c(-2.13630037272305, 
-6.81116108898995, 13.7302839288409, 2.23139683475865, NA, 2.92089711805365, 
1.55980098148558, NA, 3.40676682683799, NA, -0.550159995508869, 
1.77111716621252), `2010` = c(2.07773902027782, 2.1785375238942, 
14.4696564932574, 3.61538461538463, NA, 3.91106195534027, 0.879216764156813, 
NA, 8.17636138473956, NA, 3.3700254022015, 2.91834002677376), 
    `2011` = c(4.31633194082721, 11.8041858089129, 13.4824679218511, 
    3.44283593170005, NA, 4.75316388885632, 0.877346595685083, 
    NA, 7.6500080785929, NA, 3.45674967234599, 3.30385015608744
    ), `2012` = c(0.627927921638161, 6.44121280934118, 10.2779049218839, 
    2.03642235579081, NA, 4.61184432206646, 0.662268900269082, 
    NA, 2.55802007757907, NA, 3.37688044338879, 1.76278015613193
    ), `2013` = c(-2.37226328015073, 7.38577178397857, 8.77781429332619, 
    1.92544399507649, NA, 3.23423783752364, 1.10111836375706, 
    NA, 5.78966778544654, NA, 1.05949782356168, 2.44988864142539
    ), `2014` = c(0.421637771012246, 4.67399603536339, 7.28038730361125, 
    1.61304235314414, NA, 2.77261158414198, 2.34626865671643, 
    NA, 2.98130868933673, NA, 1.08944157435363, 2.48792270531403
    )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-12L))

Aucun commentaire:

Enregistrer un commentaire