vendredi 21 juillet 2017

Python Pandas Calculate average days between dates

Working with the following python pandas dataframe df:

Customer_ID | Transaction_ID
ABC            2016-05-06-1234
ABC            2017-06-08-3456
ABC            2017-07-12-5678
ABC            2017-12-20-6789
BCD            2016-08-23-7891
BCD            2016-09-21-2345
BCD            2017-10-23-4567

The date is unfortunately hidden in the transaction_id string. I edited the dataframe this way.

#year of transaction
df['year'] = df['Transaction_ID'].astype(str).str[:4]

#date of transaction
df['date'] = df['Transaction_ID'].astype(str).str[:10]

#format date
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d')

#calculate visit number per year
df['visit_nr_yr'] = df.groupby(['Customer_ID', 'year']).cumcount()+1

Now the df looks like this:

Customer_ID | Transaction_ID    | year  | date        |visit_nr_yr 
ABC            2016-05-06-1234    2016    2016-05-06    1            
ABC            2017-06-08-3456    2017    2017-06-08    1            
ABC            2017-07-12-5678    2017    2017-07-12    2            
ABC            2017-12-20-6789    2017    2017-12-20    3            
BCD            2016-08-23-7891    2016    2016-08-23    1            
BCD            2016-09-21-2345    2016    2016-09-21    2            
BCD            2017-10-23-4567    2017    2017-10-23    1            

I need to calculate the following:

  • What's the average days between visits by visit (so between 1&2 and between 2&3)
  • What's the average days between visits in general

First I would like to include the following column "days_between_visits_by year" (math to be done by Customer_ID):

Customer_ID|Transaction_ID  |year| date       |visit_nr_yr|days_bw_visits_yr 
ABC         2016-05-06-1234  2016  2016-05-06   1             NaN
ABC         2017-06-08-3456  2017  2017-06-08   1             NaN
ABC         2017-07-12-5678  2017  2017-07-12   2             34
ABC         2017-12-20-6789  2017  2017-12-20   3             161
BCD         2016-08-23-7891  2016  2016-08-23   1             NaN
BCD         2016-09-21-2345  2016  2016-09-21   2             29
BCD         2017-10-23-4567  2017  2017-10-23   1             NaN

Please note that I avoided 0s on purpose and kept the Nans, in case somebody had two visits on the same day.

Next I want to calculate the average days between visits by visit (so between 1&2 and between 2&3 within a year). Looking for this output:

avg_days_bw_visits_1_2 | avg_days_bw_visits_2_3
31.5                     161

Finally, I want to calculate the average days between visits in general:

output: 203.8 
#the days between visits are 398,34,161,29,397 and the average of those 
 numbers is 203.8

I'm stuck with at the how to create the column "days_bw_visits_yr". Nans have to be excluded from the math.

Aucun commentaire:

Enregistrer un commentaire