Sunday, 2 February 2020

Efficient selection of rows in Pandas dataframe based on multiple conditions across columns

I am trying to distribute rows of a pandas data frame into a bucket based on conditions.

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name4    4      4
name5    2      4

In bucket 1, I need 2 rows where topic1 is 4 and 3 rows where topic2 is 4. Once the bucket is filled, I want to stop the code. Hence, my bucket variables look like this:

bucket1_topic1 = 2
bucket1_topic2 = 3

I wrote this fairly convoluted starter that is almost working, but I am having trouble with rows that fulfil the conditions for both topic1 and topic2. What is a more efficient and correct way to do this?

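import pandas as pd

# The data frame shown above
data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)

rows_list = []

counter1 = 0  # rows taken so far with topic1 == 4
counter2 = 0  # rows taken so far with topic2 == 4

for index, row in data.iterrows():
    if counter1 < bucket1_topic1:
        if row.topic1 == 4:
            counter1 += 1
            rows_list.append([index, row.topic1, row.topic2])

    if counter2 < bucket1_topic2:
        # Rows with 4 in both topics are skipped here, so they never count toward topic2
        if row.topic2 == 4 and row.topic1 != 4:
            counter2 += 1
            if [index, row.topic1, row.topic2] not in rows_list:
                rows_list.append([index, row.topic1, row.topic2])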

Desired result:

        topic1 topic2 
name1    1      4
name2    4      4
name3    4      3
name5    2      4
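
One way to read the desired result is as a greedy pass: take a row when it has a 4 in at least one topic and none of the topics it matches has already reached its quota, then stop once every quota is met. The sketch below follows that reading, which is an assumption rather than something the question states outright. The names `fill_bucket`, `quotas` and `target` are hypothetical, and the index labels are assumed to be unique.

```python
import pandas as pd

data = pd.DataFrame(
    {"topic1": [1, 4, 4, 4, 2], "topic2": [4, 4, 3, 4, 4]},
    index=["name1", "name2", "name3", "name4", "name5"],
)

# Quotas from the question: how many rows with the value 4 each topic contributes.
quotas = {"topic1": 2, "topic2": 3}
target = 4


def fill_bucket(frame, quotas, target):
    """Greedy single pass; returns the selected rows in their original order."""
    counts = dict.fromkeys(quotas, 0)
    # Boolean mask per row and topic, computed once up front.
    hits = frame[list(quotas)].eq(target)
    selected = []
    for label, row_hits in zip(hits.index, hits.to_numpy()):
        matched = [col for col, hit in zip(hits.columns, row_hits) if hit]
        # Assumed rule: skip rows that match nothing or would overfill a quota.
        if not matched or any(counts[col] >= quotas[col] for col in matched):
            continue
        for col in matched:
            counts[col] += 1
        selected.append(label)
        # Stop as soon as every quota is met.
        if all(counts[col] == quotas[col] for col in quotas):
            break
    return frame.loc[selected]


result = fill_bucket(data, quotas, target)
print(result)
```

On the sample data this skips name4, because topic1 is already full when it is reached, and it counts name2 toward both topics. The loop still visits rows one at a time, but the comparisons are vectorized and the pass ends early once the bucket is full.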
