mardi 7 mai 2019

r merge and create a dataframe based on time constraint

I have a dataset with two columns, ID and Start_Date as shown below

  ID        Start_Date
  19        2016-11-24
  19        2016-11-26
  3C        2016-01-16
  3C        2016-03-18
  14        2018-03-03
  14        2018-01-19

A second dataset that contains some random purchase data for each ID at various dates

  ID      Transaction_Date     Item
  19      2015-10-24           Pop
  19      2015-12-11           Crackers
  19      2017-11-25           Honey  
  19      2018-03-14           PBJ
  19      2018-11-24           Roku_Stick
  19      2019-01-10           Pop
  19      2019-02-15           LipBalm  
  19      2019-03-25           Pop
  3C      2015-04-16           Honey
  3C      2016-02-20           PBJ
  3C      2016-08-04           Crackers
  3C      2019-05-12           Roku_Stick          
  14      2017-07-11           Pop   
  14      2018-09-26           Pop          

My intentions are

1) Merge the two dataset by ID, which is easy I know we can do this using the merge function, df_result <- merge(df1, df2, by = "ID", all = TRUE)

2) Retain only rows from the 2nd dataset that are within 2 years of Start_Date in 1st dataset for each Id.

What I mean by this is, consider 1st observation in datataset 1 for example ID 19 the StartDate is 2016-10-24 . So these rows from the 2nd dataset are included and these rows are excluded

  ID      Transaction_Date   Item         Status
  19      2015-10-24          Pop         Exclude, because earlier than start date 2016-11-24
  19      2015-10-24          Crackers    Exclude, because earlier than start date 2016-11-24      
  19      2017-11-25         Honey        Include, because transaction occurs after the start date 2016-11-24  and within 2 years of 2016-10-24 
  19      2018-03-14         PBJ          Include, because transaction occurs after the start date 2016-11-24  and within 2 years of 2016-10-24 
  19      2018-11-24         Roku_Stick   Include, because transaction occurs after the start date 2016-11-24  and within 2 years of 2016-10-24 
  19      2019-01-10         Pop          Exclude, because transaction is after 2 years of start date 2016-11-24
  19      2019-02-15         Lip Balm     Exclude, because transaction is after 2 years of start date 2016-11-24 
  19      2019-03-25         Pop          Exclude, because transaction is after 2 years of start date 2016-11-24 


   ID      Start_Date   Pop   Crackers  Honey  PBJ  Roku_Stick  Lip Balm
   19      2017-11-24   No    Yes       Yes    Yes  Yes         No

Similarly

   ID      Start_Date   Pop   Crackers  Honey  PBJ  Roku_Stick  LipBalm
   19      2016-11-26   No    Yes       Yes    Yes  Yes         No
   3C      2016-01-16   No    Yes       No     Yes  No          No
   14      2018-03-03   Yes   No        No     No   No          No 
   14      2018-01-19   Yes   No        No     No   No          No 

I know a very lengthy way of doing this using

merge,

if-else Start_Date +2 <= Transaction_Date, Include, Exclude,

df <- df[ subset(Include),]

df <- long to wide.

I am interested in exploring a very efficient approach to transforming this dataset. Any assistance is much appreciated. Thanks in advance.

Aucun commentaire:

Enregistrer un commentaire