I have a big data set and I need to filter it with a condition but the problem that the condition can be true for multiple days for the same ID as you can see in the example below.
Date ID Col1 Col2
2-18-2019 1 ap sd n1 23
2-18-2019 2 dh au n1 88
2-18-2019 3 ap hre n1 92
2-19-2019 1 sd n1 df 23
2-19-2019 2 sf ap n1 78
2-19-2019 3 ap sdv n1 78
2-20-2019 1 ap d3 fg 37
2-20-2019 2 sd ap n1 29
2-20-2019 3 djd dh n1 34
2-21-2019 1 eds ed sd 44
2-21-2019 2 u4r rg ap 34
2-21-2019 3 ufif ew 23
2-22-2019 1 eds sd 44
2-22-2019 2 u4r rg 34
2-22-2019 3 ap ew n1 23
~~~
3-10-2019 2 ap gh n1 78
3-10-2019 3 sd dv n1 78
3-10-2019 1 ap d3 n1 37
3-11-2019 2 sd ap n1 29
3-11-2019 3 djt rh n1 34
~~~
3-21-2019 1 n1 ed sd 44
3-21-2019 2 fg td df 34
3-21-2019 3 ufg yt ew 23
3-22-2019 1 bvs sd n1 44
3-22-2019 2 u4r ap g 34
3-22-2019 3 ap ew n1 23
And the code look like this
import pandas as pd
import openpyxl
import numpy as np
df = pd.read_csv(r"C:\~\SallesDB.csv")
df = df.astype(str)
df = df[(df.Col1.str.contains('ap') & df.Col1.str.contains('n1')) | (df.Col1.str.contains('sd') & df.Col1.str.contains('n1'))]
and the result looks like this:
Date ID Col1 Col2
2-18-2019 1 ap sd n1 23
2-18-2019 3 ap hre n1 92
2-19-2019 1 sd n1 df 23
2-19-2019 2 sf ap n1 78
2-19-2019 3 ap sdv n1 78
2-20-2019 2 sd ap n1 29
2-22-2019 3 ap ew n1 23
~~~
3-10-2019 2 ap gh n1 78
3-10-2019 3 sd dv n1 78
3-10-2019 1 ap d3 n1 37
3-11-2019 2 sd ap n1 29
~~~
3-21-2019 1 n1 ed sd 44
3-22-2019 1 bvs sd n1 44
3-22-2019 3 ap ew n1 23+
And I have tried the solution below but it didn't work since Col1 is not exactly the same
df = df[(df.Col1.str.contains('ap') & df.Col1.str.contains('n1')) | (df.Col1.str.contains('sd') & df.Col1.str.contains('n1'))].drop_duplicates('Col1', keep='last')
And the result I wanted should look for ap and n1 OR sd and n1 , but make sure its not repeated in the Next day, the condition should be: if [ID] == [ID] & [Col2] == [Col2] AND [Col1] contains ('ap'&'n1 OR 'sd'&'n1' is true for 2 days (for example 2-18-2019 And 2-19-2019) keep first. And the result I need should look like this:
Date ID Col1 Col2
2-18-2019 1 ap sd n1 23
2-18-2019 3 ap hre n1 92
2-19-2019 2 sf ap n1 78
2-22-2019 3 ap ew n1 23
~~~
3-10-2019 2 ap gh n1 78
3-10-2019 3 sd dv n1 78
3-10-2019 1 ap d3 n1 37
~~~
3-21-2019 1 n1 ed sd 44
3-22-2019 3 ap ew n1 23+
I have looking for a way to do it for too long and couldn't find any solution
Aucun commentaire:
Enregistrer un commentaire