Saturday, 28 March 2020

How do you replace missing values with 0 for cases meeting specific conditions in R?

Background:

I am working with a large dataset that contains longitudinal data on the gambling behavior of 195,318 participants. The data is based on complete tracking of electronic gambling behavior at a single gambling operator. Gambling behavior is aggregated at the monthly level, over a total of 70 months. I have an ID variable separating participants, a time variable (months), and numerous gambling behavior variables such as active days played in a given month, bets placed in a given month, total losses in a given month, etc. Participants vary in when they have been actively gambling. One participant may have gambled in months 2, 3, 4, and 7, another in months 3, 5, and 7, and a third in months 23, 24, 48, 65, etc. As such, there are considerable amounts of "missing values". However, because every instance of gambling is tracked, a missing value in this data set means that the person did not gamble. In other words, missing equals 0.

Problem/goal:

I want to impute 0 for missing values ("NA"), but only under specific circumstances. Specifically, I want to impute 0 for missing values within what I define as a participant's "active period" and leave everything else as is.

A participant's active period is every month between their first active month of gambling and their last active month of gambling. For example, for a participant who gambled in months 2, 3, 4, and 7, I want to impute 0 for months 5 and 6. Every other month, i.e. 1 and 8 to 70, should stay as NA. I am struggling to write code that achieves this. I'm new to R.

Example data frame and code

Below is example code that produces a data frame illustrating the key characteristics described in my problem. In this code there are only 2 participants, 1 gambling behavior variable, and 10 time points ("waves"). I've included a data frame in "long format" and one in "wide format" because I'm unsure which would be most helpful/informative. A time variable is included in the "long format". My actual data set is in long format, but I am familiar with how to switch between the two.

# Example variables and data frame in long form
  # Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)

Created on 2020-03-28 by the reprex package (v0.3.0)
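For the long format, one possible approach is sketched below in base R. It is only an illustration under the assumptions stated in the question (a missing row inside the active period means "did not gamble"). The names activeGrid and dfFilled are hypothetical and not part of the original data. The idea is to build, for each id, one row per month from the first to the last observed month, join the observed values onto that grid, and set the resulting NA values to 0.

```r
# Example long-format data from the question
dfLong <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  time = c(2, 3, 4, 7, 3, 5, 7),
  daysPlayed = c(2, 2, 3, 3, 2, 2, 2)
)

# activeGrid (hypothetical name): one row per month between each
# participant's first and last observed month
activeGrid <- do.call(rbind, lapply(split(dfLong, dfLong$id), function(d) {
  data.frame(id = d$id[1], time = seq(min(d$time), max(d$time)))
}))

# Left join: months in the grid without an observed row come back as NA
dfFilled <- merge(activeGrid, dfLong, by = c("id", "time"), all.x = TRUE)

# Inside the active period, NA is assumed to mean "did not gamble"
dfFilled$daysPlayed[is.na(dfFilled$daysPlayed)] <- 0

# Tidy up row order and row names
dfFilled <- dfFilled[order(dfFilled$id, dfFilled$time), ]
rownames(dfFilled) <- NULL
```

Months outside a participant's active period have no row in dfFilled, so they become NA again if the result is reshaped to wide format. With several behavior variables, the zero-fill step would need to be repeated for each of them. For a data set of the size described, the split/rbind step may be slow, and a grouped approach with a package such as data.table or tidyr might scale better.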

# Example variables and data frame in wide form
  # Includes id variable, days played in given month 
id <- c(1, 2)
daysPlayed.1 <- c(NA, NA)
daysPlayed.2 <- c(2, NA)
daysPlayed.3 <- c(2, 2)
daysPlayed.4 <- c(3, NA)
daysPlayed.5 <- c(NA, 2)
daysPlayed.6 <- c(NA, NA)
daysPlayed.7 <- c(3, 2)
daysPlayed.8 <- c(NA, NA)
daysPlayed.9 <- c(NA, NA)
daysPlayed.10 <- c(NA, NA)
dfWide <- data.frame(id=id, daysPlayed.1 = daysPlayed.1, daysPlayed.2 = daysPlayed.2,
                 daysPlayed.3 = daysPlayed.3, daysPlayed.4 = daysPlayed.4,
                 daysPlayed.5 = daysPlayed.5, daysPlayed.6 = daysPlayed.6,
                 daysPlayed.7 = daysPlayed.7, daysPlayed.8 = daysPlayed.8,
                 daysPlayed.9 = daysPlayed.9, daysPlayed.10 = daysPlayed.10)

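For the wide format, the same rule could be applied row by row. The sketch below is a base R illustration, not a definitive solution. The helper fillActive and the variables cols and m are hypothetical names introduced here. For each participant it finds the first and last non-missing month and replaces NA with 0 only between those two positions.

```r
# Rebuild the wide-format example from the question in compact form
cols <- paste0("daysPlayed.", 1:10)
m <- rbind(
  c(NA, 2, 2, 3, NA, NA, 3, NA, NA, NA),
  c(NA, NA, 2, NA, 2, NA, 2, NA, NA, NA)
)
colnames(m) <- cols
dfWide <- data.frame(id = c(1, 2), m)

# fillActive (hypothetical helper): set NA to 0 between the first and
# last non-missing month of one participant, leave the rest untouched
fillActive <- function(x) {
  active <- which(!is.na(x))
  if (length(active) > 0) {
    span <- seq(min(active), max(active))
    x[span][is.na(x[span])] <- 0
  }
  x
}

# apply() over rows returns the results as columns, hence the transpose
filled <- t(apply(as.matrix(dfWide[cols]), 1, fillActive))
dfWide[cols] <- as.data.frame(filled)
```

After this step, participant 1 has 0 in months 5 and 6 and participant 2 has 0 in months 4 and 6, while months before the first and after the last active month remain NA. The sketch assumes the month columns are ordered in time; with several behavior variables, each set of columns would be processed in the same way.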
