Piggy backing off this question python pandas new column categorization based on conditions in other columns
#code to re-create my example date
df = pd.DataFrame({'customer_id': ['abc','abc','xyz','xyz','xyz','xyz','thr','thr','abc','abc'],
'transaction_id': ['A123','A123','B345','B345','C567','C567','D678','D678','E789','E789'],
'product_id': [255472, 251235, 253764,257344,221577,209809,223551,290678,908354,909238],
'product_category': ['X','X','Y','Y','X','Y','Y','X','Y','Z']})
#example data
customer_id| transaction_id | product_id | product_category
abc A123 255472 X
abc A123 251235 X
xyz B345 253764 Y
xyz B345 257344 Y
xyz C567 221577 X
xyz C567 209809 Y
thr D678 223551 Y
thr D678 290678 X
abc E789 908354 Y
abc E789 909238 Z
I want to flag all customer_ids that have X and Y in different transactions (not in the same transaction).
#expected output
customer_id| transaction_id | product_id | product_category | flag
abc A123 255472 X 1
abc A123 251235 X 1
xyz B345 253764 Y 0
xyz B345 257344 Y 0
xyz C567 221577 X 0
xyz C567 209809 Y 0
thr D678 223551 Y 0
thr D678 290678 X 0
abc E789 908354 Y 1
abc E789 909238 Z 1
I couldn't come up with a clean solution. In the example above, we have customer abc who has a transaction with only product category X and then a transaction with product category Y and Z. This is the customer I want to flag, they had X and Y, but in different transaction_ids.
One way of doing it I thought of would be to use the code from my previous answer:
df['pre_flag']=df.groupby('transaction_id')['product_category'].transform(lambda x: x + ' only' if len(set(x)) < 2 else ' & '.join(set(x)))
And then split the dataset into two:
df_1 = df.loc[df['pre_flag'] == 'X&Y'].copy()
df_2 = df.loc[df['pre_flag'] != 'X&Y'].copy()
... and use an isin statement; but this is very messy; there must be a better approach. Thank You!
Aucun commentaire:
Enregistrer un commentaire