lundi 31 août 2020

Can I apply vectorization here? Or should I think about this differently?

To put it simply, I have rows of activity that happen in a given month of the year. I want to append additional rows of inactivity in between this activity, while resetting the month values into a sequence. For example, if I have months 2, 5, 7, I need to map these to 1, 4, 7, while my inactive months happen in 2, 3, 5, and 6. So, I would have to add four rows with this inactivity. I've done this with dictionaries and for-loops, but I know this is not efficient, especially when I move this to thousands of rows of data to process. Any suggestions on how to optimize this? Do I need to think about the data format differently? I've had a suggestion to make lists and then move that to the dataframe at the end, but I don't see a huge gain there. I don't know enough of NumPy to figure out how to do this with vectorization since that's super fast and it would be awesome to learn something new. Below is my code with the steps I took:

df = pd.DataFrame({'col1': ['A','A', 'B','B','B','C','C'], 'col2': ['X','Y','X','Y','Z','Y','Y'], 'col3': [1, 8, 2, 5, 7, 6, 7]})

Output:

  col1 col2  col3
0    A    X     1
1    A    Y     8
2    B    X     2
3    B    Y     5
4    B    Z     7
5    C    Y     6
6    C    Y     7

I'm creating a dictionary to handle this in for loops:

df1 = df.groupby('col1')['col3'].apply(list).to_dict()
df2 = df.groupby('col1')['col2'].apply(list).to_dict()
max_num = max(df.col3)

Output:

{'A': [1, 8], 'B': [2, 5, 7], 'C': [6, 7]}
{'A': ['X', 'Y'], 'B': ['X', 'Y', 'Z'], 'C': ['Y', 'Y']}
8

And now I'm adding those rows using my dictionaries by creating a new data frame:

df_new = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})
for key in df1.keys():
    k = 1
    if list(df1[key])[-1] - list(df1[key])[0] + 1 < max_num:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)],'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N' ,'col3': str(k)}, ignore_index=True)
            k += 1
        df_new = df_new.append({'col1': key, 'col2': 'E', 'col3': str(k)}, ignore_index=True)
    else:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)],'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N' ,'col3': str(k)}, ignore_index=True)
            k += 1

Output:

   col1 col2 col3
0     A    X    1
1     A    N    2
2     A    N    3
3     A    N    4
4     A    N    5
5     A    N    6
6     A    N    7
7     A    Y    8
8     B    X    1
9     B    N    2
10    B    N    3
11    B    Y    4
12    B    N    5
13    B    Z    6
14    B    E    7
15    C    Y    1
16    C    Y    2
17    C    E    3

And then I pivot to the form I want it:

df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')

Output:

col3    1   2   3   4   5   6   7   8
col1                                
A   X   N   N   N   N   N   N   Y
B   X   N   N   Y   N   Z   E   NaN
C   Y   Y   E   NaN NaN NaN NaN NaN

Thanks for the help.

Aucun commentaire:

Enregistrer un commentaire