jeudi 18 février 2021

Filter rows with condition and drop duplicates

I have a big data set and I need to filter it with a condition but the problem that the condition can be true for multiple days for the same ID as you can see in the example below.

Date            ID             Col1            Col2   
2-18-2019       1            ap sd n1           23
2-18-2019       2            dh au n1           88
2-18-2019       3            ap hre n1          92
2-19-2019       1            sd n1 df           23
2-19-2019       2            sf ap n1           78
2-19-2019       3            ap sdv n1          78
2-20-2019       1            ap d3 fg           37
2-20-2019       2            sd ap n1           29
2-20-2019       3            djd dh n1          34
2-21-2019       1            eds ed sd          44
2-21-2019       2            u4r rg ap          34
2-21-2019       3            ufif ew            23
2-22-2019       1            eds sd             44
2-22-2019       2            u4r rg             34
2-22-2019       3            ap ew n1           23
~~~
3-10-2019       2            ap gh n1           78
3-10-2019       3            sd dv n1           78
3-10-2019       1            ap d3 n1           37
3-11-2019       2            sd ap n1           29
3-11-2019       3            djt rh n1          34
~~~
3-21-2019       1            n1 ed sd           44
3-21-2019       2            fg td df           34
3-21-2019       3            ufg yt ew          23
3-22-2019       1            bvs sd n1          44
3-22-2019       2            u4r ap g           34
3-22-2019       3            ap ew n1           23

And the code look like this

import pandas as pd
import openpyxl
import numpy as np

df = pd.read_csv(r"C:\~\SallesDB.csv")
df = df.astype(str)
df = df[(df.Col1.str.contains('ap') & df.Col1.str.contains('n1')) | (df.Col1.str.contains('sd') & df.Col1.str.contains('n1'))]

and the result looks like this:

Date            ID             Col1            Col2   
2-18-2019       1            ap sd n1           23
2-18-2019       3            ap hre n1          92
2-19-2019       1            sd n1 df           23
2-19-2019       2            sf ap n1           78
2-19-2019       3            ap sdv n1          78
2-20-2019       2            sd ap n1           29
2-22-2019       3            ap ew n1           23
~~~
3-10-2019       2            ap gh n1           78
3-10-2019       3            sd dv n1           78
3-10-2019       1            ap d3 n1           37
3-11-2019       2            sd ap n1           29
~~~
3-21-2019       1            n1 ed sd           44
3-22-2019       1            bvs sd n1          44
3-22-2019       3            ap ew n1           23+

And I have tried the solution below but it didn't work since Col1 is not exactly the same

df = df[(df.Col1.str.contains('ap') & df.Col1.str.contains('n1')) | (df.Col1.str.contains('sd') & df.Col1.str.contains('n1'))].drop_duplicates('Col1', keep='last')

And the result I wanted should look for ap and n1 OR sd and n1 , but make sure its not repeated in the Next day, the condition should be: if [ID] == [ID] & [Col2] == [Col2] AND [Col1] contains ('ap'&'n1 OR 'sd'&'n1' is true for 2 days (for example 2-18-2019 And 2-19-2019) keep first. And the result I need should look like this:

Date            ID             Col1            Col2   
2-18-2019       1            ap sd n1           23
2-18-2019       3            ap hre n1          92
2-19-2019       2            sf ap n1           78
2-22-2019       3            ap ew n1           23
~~~
3-10-2019       2            ap gh n1           78
3-10-2019       3            sd dv n1           78
3-10-2019       1            ap d3 n1           37
~~~
3-21-2019       1            n1 ed sd           44
3-22-2019       3            ap ew n1           23+

I have looking for a way to do it for too long and couldn't find any solution

Aucun commentaire:

Enregistrer un commentaire