lundi 16 avril 2018

Python pandas if statement based on two conditions

Piggy backing off this question python pandas new column categorization based on conditions in other columns

#code to re-create my example date
df = pd.DataFrame({'customer_id': ['abc','abc','xyz','xyz','xyz','xyz','thr','thr','abc','abc'],
                   'transaction_id': ['A123','A123','B345','B345','C567','C567','D678','D678','E789','E789'], 
                   'product_id': [255472, 251235, 253764,257344,221577,209809,223551,290678,908354,909238],
                   'product_category': ['X','X','Y','Y','X','Y','Y','X','Y','Z']})

#example data
customer_id|    transaction_id | product_id | product_category
abc             A123              255472             X
abc             A123              251235             X
xyz             B345              253764             Y
xyz             B345              257344             Y
xyz             C567              221577             X
xyz             C567              209809             Y
thr             D678              223551             Y
thr             D678              290678             X
abc             E789              908354             Y
abc             E789              909238             Z

I want to flag all customer_ids that have X and Y in different transactions (not in the same transaction).

#expected output
customer_id|    transaction_id | product_id | product_category | flag 
abc             A123              255472             X            1
abc             A123              251235             X            1
xyz             B345              253764             Y            0
xyz             B345              257344             Y            0
xyz             C567              221577             X            0
xyz             C567              209809             Y            0
thr             D678              223551             Y            0
thr             D678              290678             X            0
abc             E789              908354             Y            1
abc             E789              909238             Z            1

I couldn't come up with a clean solution. In the example above, we have customer abc who has a transaction with only product category X and then a transaction with product category Y and Z. This is the customer I want to flag, they had X and Y, but in different transaction_ids.

One way of doing it I thought of would be to use the code from my previous answer:

df['pre_flag']=df.groupby('transaction_id')['product_category'].transform(lambda x: x + ' only' if len(set(x)) < 2 else ' & '.join(set(x)))

And then split the dataset into two:

df_1 = df.loc[df['pre_flag'] == 'X&Y'].copy()
df_2 = df.loc[df['pre_flag'] != 'X&Y'].copy()

... and use an isin statement; but this is very messy; there must be a better approach. Thank You!

Aucun commentaire:

Enregistrer un commentaire