jeudi 31 janvier 2019

Efficient way to create a DataFrame of custom summary measures by group

I want to create a new dataframe grouped by store that creates several new columns of summaries for each store. In the example below, for each store, I want to create the four variables. The variables are similar to sumif and countif in excel but calculated for each group (store).

  1. A variable that sums all product revenues above say $50.
  2. A variable that sums all product revenues below $50
  3. A variable that counts the number of products with revenues above $50
  4. A variable that counts the number of products with revenues below $50.

I have tried various forms of groupby and agg function. I perceive that I should use lambda but I am unsure how I would strutcure that syntax in order to create multiple variables at one time.

import pandas as pd
import numpy as np

n = 20

df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
            'Revenue': (np.random.random(n) * 50 + 10).round(2)
              })
df

So I can easily calculate multiple summary measures on Revenue. I don't know how to create custom summary measures like sumif or countif:

df2 = df.groupby('Store')['Revenue'].agg({'Rev_sum': 'sum', 'Rev_max': 'max'})

For each store (i.e. store_1, store_2) I want four variables in the new dataframe as described above (i.e. Rev_sum_great_50 , etc...)

Aucun commentaire:

Enregistrer un commentaire