Wednesday, 7 September 2016

Creating a new variable with a conditional mutate returns NAs for a subset of rows

I'm trying to create a new factor variable based on logical conditions on the date variables:

  'data.frame': 364458 obs. of  5 variables:
   $ user_id         : int  63662 67784 49526 68792 72991 83737 62359 56148 43988 73759 ...
   $ order_id        : int  506302 583495 440168 443783 779331 781477 782951 883492 957769 504401 ...
   $ first_order_date: Date, format: "2015-11-24" "2015-12-15" "2015-06-10" "2015-12-22" ...
   $ order_date2     : Date, format: "2016-02-09" "2016-03-15" "2015-12-22" "2015-12-28" ...
   $ category        : Factor w/ 17 levels "Beauty & Health",..: 13 13 1 16 13 13 13 13 13 13 ...



 bb <- df %>%
   mutate(days_since_first = as.integer(order_date2 - first_order_date),
          time_after_first = derivedFactor(
            "<3months" = order_date2 <= first_order_date + months(3),
            "3-6months" = (order_date2 <= first_order_date + months(6) & order_date2 > first_order_date + months(3)),

 ...

            "15-18months" = (order_date2 <= first_order_date + months(18) & order_date2 > first_order_date + months(15)),
            "18-21months" = (order_date2 <= first_order_date + months(21) & order_date2 > first_order_date + months(18)),
            .default = "21month+"))

After running it, I received a warning:

 Warning messages:
 1: In base::max(x, ..., na.rm = na.rm) :
 no non-missing arguments to max; returning -Inf
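
For context, this is the message base R's max() emits when nothing is left to compare once NAs are dropped. The snippet below is only a minimal sketch of that behaviour in isolation; it makes no claim about where inside derivedFactor() the call happens, and the names x, msg and res are illustrative.

```r
# Base R only: max() over a vector that is empty once NAs are dropped.
x <- c(NA_real_, NA_real_)

# Capture the warning text while still obtaining the returned value.
msg <- NULL
res <- withCallingHandlers(
  max(x, na.rm = TRUE),
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

res  # -Inf
msg  # "no non-missing arguments to max; returning -Inf"
```

So the warning suggests that, somewhere, every value being compared was NA.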

It still worked well in most cases, but not all:

 sum(is.na(bb$time_after_first))
 [1] 7174

I can't see any pattern explaining why these particular entries fail:

summary(bb[is.na(bb$time_after_first), ])

     user_id          order_id      first_order_date      order_date2                    category    days_since_first
 Min.   : 26481   Min.   : 59269   Min.   :2015-01-31   Min.   :2015-01-31   Restaurants    :5060   Min.   :  0.0
 1st Qu.: 54253   1st Qu.:500834   1st Qu.:2015-08-31   1st Qu.:2016-02-07   Groceries      : 774   1st Qu.: 41.5
 Median : 62129   Median :617945   Median :2015-11-30   Median :2016-03-31   Drinks         : 325   Median :106.0
 Mean   : 62402   Mean   :600803   Mean   :2015-11-05   Mean   :2016-03-12   Sweet          : 215   Mean   :128.3
 3rd Qu.: 74726   3rd Qu.:727110   3rd Qu.:2016-01-31   3rd Qu.:2016-05-13   Beauty & Health:  46   3rd Qu.:175.0
 Max.   :106433   Max.   :957931   Max.   :2016-05-31   Max.   :2016-09-04   Fashion        :  23   Max.   :546.0
                                                                              (Other)        :  12
  time_after_first
 <3months   :   0   
 3-6months  :   0   
 6-9months  :   0   
 9-12months :   0   
 12-15months:   0   
 (Other)    :   0   
 NA's       :6455

I also tried ordinary nested ifelse() calls to achieve this:

 bb2 <-
   all_orders3 %>%
   select(user_id, order_id, first_order_date, order_date2, category) %>%
   mutate(days_since_first = as.integer(order_date2 - first_order_date),
          time_after_first = as.factor(
            ifelse(order_date2 <= first_order_date + months(3), "<3months",
            ifelse(order_date2 <= first_order_date + months(6) & order_date2 > first_order_date + months(3), "3-6months",


 ....

            ifelse(order_date2 <= first_order_date + months(24) & order_date2 > first_order_date + months(21), "21-24months",
                   "24months+"))))))))))

This produced no warnings but even more NAs, and there is still no clear pattern explaining why:

 summary(bb2[is.na(bb2$time_after_first), ])
    user_id          order_id      first_order_date      order_date2                    category    days_since_first
   Min.   : 26481   Min.   : 59269   Min.   :2015-01-31   Min.   :2015-01-31   Restaurants    :5784   Min.   :  0.0   
   1st Qu.: 54152   1st Qu.:507272   1st Qu.:2015-08-31   1st Qu.:2016-02-10   Groceries      : 950   1st Qu.: 52.0   
   Median : 60788   Median :634594   Median :2015-11-30   Median :2016-04-08   Drinks         : 338   Median :123.0   
   Mean   : 61417   Mean   :628492   Mean   :2015-10-27   Mean   :2016-03-25   Sweet          : 237   Mean   :150.2   
   3rd Qu.: 74700   3rd Qu.:780150   3rd Qu.:2016-01-31   3rd Qu.:2016-05-29   Beauty & Health:  55   3rd Qu.:211.0   
   Max.   :106433   Max.   :958410   Max.   :2016-05-31   Max.   :2016-09-04   Fashion        :  28   Max.   :582.0   
                                                                           (Other)        :  15                   
     time_after_first
     <3months   :   0   
     12-15months:   0   
     15-18months:   0   
     18-21months:   0   
     3-6months  :   0   
     (Other)    :   0   
     NA's       :7407   
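
One detail in both summaries may be worth checking: the quartiles of first_order_date among the NA rows (2015-01-31, 2015-08-31, 2015-11-30, 2016-01-31, 2016-05-31) all fall on the last days of a month. The snippet below is a minimal sketch for testing that idea, assuming months() in the code above comes from lubridate (the question does not say which package is loaded). The vector names plain and rolled are illustrative; the dates are taken from the output above.

```r
library(lubridate)  # assumed source of months() and %m+%

# Two month-end dates from the NA summaries, plus the first row of the data
first_order_date <- as.Date(c("2015-11-30", "2015-08-31", "2015-11-24"))

# Plain `+` yields NA when the target day does not exist
# (30 February and 31 November here); the third date is unaffected.
plain <- first_order_date + months(3)
plain

# %m+% rolls back to the last valid day of the target month instead
rolled <- first_order_date %m+% months(3)
rolled

# An NA comparison propagates through ifelse() silently, with no warning
ifelse(as.Date("2016-02-09") <= plain, "<3months", "later")
```

If the first two elements of plain come back as NA on your machine too, replacing `+ months(n)` with `%m+% months(n)` in the conditions would be one thing to try; whether that accounts for every NA row in the real data is not something this sketch can confirm.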

Any suggestions on how to overcome this would be welcome, thanks!
