I'm trying to create a new factor variable from logical conditions on two date variables. The data frame looks like this:
'data.frame': 364458 obs. of 5 variables:
$ user_id : int 63662 67784 49526 68792 72991 83737 62359 56148 43988 73759 ...
$ order_id : int 506302 583495 440168 443783 779331 781477 782951 883492 957769 504401 ...
$ first_order_date: Date, format: "2015-11-24" "2015-12-15" "2015-06-10" "2015-12-22" ...
$ order_date2 : Date, format: "2016-02-09" "2016-03-15" "2015-12-22" "2015-12-28" ...
$ category : Factor w/ 17 levels "Beauty & Health",..: 13 13 1 16 13 13 13 13 13 13 ...
library(dplyr)      # %>% and mutate()
library(lubridate)  # months() periods (assumed; months(3) needs lubridate loaded)
library(mosaic)     # derivedFactor()

bb <- df %>%
  mutate(days_since_first = as.integer(order_date2 - first_order_date),
         time_after_first = derivedFactor(
           "<3months" = order_date2 <= first_order_date + months(3),
           "3-6months" = (order_date2 <= first_order_date + months(6) & order_date2 > first_order_date + months(3)),
           ...
           "15-18months" = (order_date2 <= first_order_date + months(18) & order_date2 > first_order_date + months(15)),
           "18-21months" = (order_date2 <= first_order_date + months(21) & order_date2 > first_order_date + months(18)),
           .default = "21month+"))
After running it, I received this warning:
Warning messages:
1: In base::max(x, ..., na.rm = na.rm) :
no non-missing arguments to max; returning -Inf
It still worked in most cases, but not all:
sum(is.na(bb$time_after_first))
[1] 7174
I can't see any pattern that explains why these particular entries fail:
summary(bb[is.na(bb$time_after_first), ])
user_id order_id first_order_date order_date2 category days_since_first
Min. : 26481 Min. : 59269 Min. :2015-01-31 Min. :2015-01-31 Restaurants :5060 Min. : 0.0
1st Qu.: 54253 1st Qu.:500834 1st Qu.:2015-08-31 1st Qu.:2016-02-07 Groceries : 774 1st Qu.: 41.5
Median : 62129 Median :617945 Median :2015-11-30 Median :2016-03-31 Drinks : 325 Median :106.0
Mean : 62402 Mean :600803 Mean :2015-11-05 Mean :2016-03-12 Sweet : 215 Mean :128.3
3rd Qu.: 74726 3rd Qu.:727110 3rd Qu.:2016-01-31 3rd Qu.:2016-05-13 Beauty & Health: 46 3rd Qu.:175.0
Max. :106433 Max. :957931 Max. :2016-05-31 Max. :2016-09-04 Fashion : 23 Max. :546.0
(Other) : 12
time_after_first
<3months : 0
3-6months : 0
6-9months : 0
9-12months : 0
12-15months: 0
(Other) : 0
NA's :6455
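One thing that may be worth checking: the quartiles of first_order_date in the summary above (2015-01-31, 2015-08-31, 2015-11-30, 2016-01-31, 2016-05-31) all fall on the last day of a month. Assuming months() here is lubridate's period constructor, adding it to such a date can land on a day that does not exist (for example February 30), and lubridate returns NA in that case. The sketch below uses made-up sample dates, not the real data, to illustrate the behaviour:

```r
library(lubridate)

# Hypothetical sample dates; the last three fall on the final day of a month.
first_order <- as.Date(c("2015-11-24", "2015-08-31", "2015-11-30", "2016-01-31"))

# Adding a months() period keeps the day-of-month. When the target day does
# not exist (e.g. November 31 or February 30), the result is NA.
shifted <- first_order + months(3)

# Put inputs and results side by side to see which rows turned into NA.
check <- data.frame(first_order,
                    shifted,
                    day_of_month = day(first_order),
                    is_na = is.na(shifted))
print(check)
```

If the NA rows in bb line up with first_order_date values on day 29, 30, or 31, this is a likely explanation; any comparison against an NA date is itself NA, which would then propagate into time_after_first.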
I also tried nested ifelse() calls to achieve the same thing:
bb2 <- all_orders3 %>%
  select(user_id, order_id, first_order_date, order_date2, category) %>%
  mutate(days_since_first = as.integer(order_date2 - first_order_date),
         time_after_first = as.factor(
           ifelse(order_date2 <= first_order_date + months(3), "<3months",
           ifelse(order_date2 <= first_order_date + months(6) & order_date2 > first_order_date + months(3), "3-6months",
           ....
           ifelse(order_date2 <= first_order_date + months(24) & order_date2 > first_order_date + months(21), "21-24months",
           "24months+"))))))))))
This produced no warnings but even more NAs, and still no clear pattern explaining why:
summary(bb2[is.na(bb2$time_after_first), ])
user_id order_id first_order_date order_date2 category days_since_first
Min. : 26481 Min. : 59269 Min. :2015-01-31 Min. :2015-01-31 Restaurants :5784 Min. : 0.0
1st Qu.: 54152 1st Qu.:507272 1st Qu.:2015-08-31 1st Qu.:2016-02-10 Groceries : 950 1st Qu.: 52.0
Median : 60788 Median :634594 Median :2015-11-30 Median :2016-04-08 Drinks : 338 Median :123.0
Mean : 61417 Mean :628492 Mean :2015-10-27 Mean :2016-03-25 Sweet : 237 Mean :150.2
3rd Qu.: 74700 3rd Qu.:780150 3rd Qu.:2016-01-31 3rd Qu.:2016-05-29 Beauty & Health: 55 3rd Qu.:211.0
Max. :106433 Max. :958410 Max. :2016-05-31 Max. :2016-09-04 Fashion : 28 Max. :582.0
(Other) : 15
time_after_first
<3months : 0
12-15months: 0
15-18months: 0
18-21months: 0
3-6months : 0
(Other) : 0
NA's :7407
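A possible workaround, assuming the NAs come from month arithmetic on end-of-month dates, is lubridate's %m+% operator, which rolls back to the last valid day of the target month instead of returning NA. The following is a minimal sketch with only three buckets and a few sample dates; the variable names (first_order, order_date, cut3, cut6, bucket) are illustrative and not from the original code:

```r
library(lubridate)

# Small sample: two pairs taken from the str() output, two end-of-month cases.
first_order <- as.Date(c("2015-11-30", "2016-01-31", "2015-06-10", "2015-12-22"))
order_date  <- as.Date(c("2016-03-31", "2016-05-13", "2015-12-22", "2015-12-28"))

# %m+% adds months and rolls back to the last day of the month when the
# exact day does not exist, so the cut-off dates are never NA.
cut3 <- first_order %m+% months(3)
cut6 <- first_order %m+% months(6)

# Nested ifelse() with only upper bounds: each later branch is reached only
# when the earlier condition was FALSE, so lower bounds are implied.
bucket <- ifelse(order_date <= cut3, "<3months",
          ifelse(order_date <= cut6, "3-6months", "6months+"))

# Fix the level order explicitly instead of relying on alphabetical sorting.
bucket <- factor(bucket, levels = c("<3months", "3-6months", "6months+"))
print(table(bucket, useNA = "ifany"))
```

The same cut-off idea extends to the full set of three-month bands. Setting the levels explicitly also avoids the alphabetical ordering visible in the summary above, where 12-15months sorts before 3-6months.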
Any suggestions on how to overcome this would be welcome, thanks!