Working with the following python pandas dataframe df:
Customer_ID | Transaction_ID
ABC 2016-05-06-1234
ABC 2017-06-08-3456
ABC 2017-07-12-5678
ABC 2017-12-20-6789
BCD 2016-08-23-7891
BCD 2016-09-21-2345
BCD 2017-10-23-4567
The date is unfortunately hidden in the transaction_id string. I edited the dataframe this way.
#year of transaction
df['year'] = df['Transaction_ID'].astype(str).str[:4]
#date of transaction
df['date'] = df['Transaction_ID'].astype(str).str[:10]
#format date
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d')
#calculate visit number per year
df['visit_nr_yr'] = df.groupby(['Customer_ID', 'year']).cumcount()+1
Now the df looks like this:
Customer_ID | Transaction_ID | year | date |visit_nr_yr
ABC 2016-05-06-1234 2016 2016-05-06 1
ABC 2017-06-08-3456 2017 2017-06-08 1
ABC 2017-07-12-5678 2017 2017-07-12 2
ABC 2017-12-20-6789 2017 2017-12-20 3
BCD 2016-08-23-7891 2016 2016-08-23 1
BCD 2016-09-21-2345 2016 2016-09-21 2
BCD 2017-10-23-4567 2017 2017-10-23 1
I need to calculate the following:
- What's the average days between visits by visit (so between 1&2 and between 2&3)
- What's the average days between visits in general
First I would like to include the following column "days_between_visits_by year" (math to be done by Customer_ID):
Customer_ID|Transaction_ID |year| date |visit_nr_yr|days_bw_visits_yr
ABC 2016-05-06-1234 2016 2016-05-06 1 NaN
ABC 2017-06-08-3456 2017 2017-06-08 1 NaN
ABC 2017-07-12-5678 2017 2017-07-12 2 34
ABC 2017-12-20-6789 2017 2017-12-20 3 161
BCD 2016-08-23-7891 2016 2016-08-23 1 NaN
BCD 2016-09-21-2345 2016 2016-09-21 2 29
BCD 2017-10-23-4567 2017 2017-10-23 1 NaN
Please note that I avoided 0s on purpose and kept the Nans, in case somebody had two visits on the same day.
Next I want to calculate the average days between visits by visit (so between 1&2 and between 2&3 within a year). Looking for this output:
avg_days_bw_visits_1_2 | avg_days_bw_visits_2_3
31.5 161
Finally, I want to calculate the average days between visits in general:
output: 203.8
#the days between visits are 398,34,161,29,397 and the average of those
numbers is 203.8
I'm stuck with at the how to create the column "days_bw_visits_yr". Nans have to be excluded from the math.
Aucun commentaire:
Enregistrer un commentaire