Thursday, 4 February 2021

Remove word in a list from two columns if the word is in both columns in Python

import re
import pandas as pd

data = {'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye'],
        'B': ['foo word1', 'word2 hello', 'word2 bye']
}

df = pd.DataFrame(data, columns=['A', 'B'])

I'm trying to remove words from two strings only if the same word (e.g. word1) is in the same row of column A and column B:

mywordslist = ["word1", "word2", "word3"]

for word in mywordslist:
    if ((word in df['A']) and (word in df['B'])):
        df['word_removed'] = 1 ## indicator if both A and B contained the word.
        df['A_new'] = df['A'].apply(lambda x: re.sub(r'\b{}\b'.format(re.escape(word)), ' ', x))
        df['A_new'] = df['A_new'].apply(lambda x: re.sub(r'\b{}$'.format(re.escape(word)), ' ', x))
        df['A_new'] = df['A_new'].apply(lambda x: re.sub(r'^{}\b'.format(re.escape(word)), ' ', x))
    else:
        print('word not in both A and B')

So in the example, word1 should be removed from the first row since word1 is in the first row of both A and B.
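To make the expected result concrete, here is a minimal sketch of the row-wise rule in plain Python, without pandas. The helper name shared_words and the lists rows_a and rows_b are only illustrative (they are not part of my real code), and I'm assuming a "word" means a whole-word match, the same as the \b...\b pattern above.

```python
import re

rows_a = ['word1 other stuff', 'otherstuff word1', 'hello word3 bye']
rows_b = ['foo word1', 'word2 hello', 'word2 bye']
mywordslist = ["word1", "word2", "word3"]

def shared_words(a, b, words):
    """Return the words from `words` that occur as whole words in both a and b."""
    found = []
    for w in words:
        pattern = r'\b{}\b'.format(re.escape(w))
        if re.search(pattern, a) and re.search(pattern, b):
            found.append(w)
    return found

# One list of shared words per row; only the first row has a word in both columns.
shared_per_row = [shared_words(a, b, mywordslist) for a, b in zip(rows_a, rows_b)]
print(shared_per_row)  # [['word1'], [], []]
```

So only the first row qualifies: word1 appears in row 2 of A but not in row 2 of B, and word2 only ever appears in B.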

The code runs, but the words are not removed even in the many rows where they should be, and the indicator never shows that a word is in both strings. How should I write the if statement correctly?
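For reference, one possible direction, as a sketch rather than a definitive answer: `word in df['A']` tests the index labels of the Series (0, 1, 2), not the strings stored in it, so the condition is never true here. A per-row boolean mask built with Series.str.contains avoids the if statement entirely. The column name B_new, the cleanup of leftover whitespace, and the choice to also remove the word from column B are assumptions on top of the original code.

```python
import re
import pandas as pd

data = {'A': ['word1 other stuff', 'otherstuff word1', 'hello word3 bye'],
        'B': ['foo word1', 'word2 hello', 'word2 bye']}
df = pd.DataFrame(data, columns=['A', 'B'])
mywordslist = ["word1", "word2", "word3"]

# `in` on a Series checks the index labels, not the cell contents.
print("word1" in df['A'])  # False
print(0 in df['A'])        # True

df['A_new'] = df['A']
df['B_new'] = df['B']      # assumption: the word is removed from B as well
df['word_removed'] = 0     # indicator: 1 if any listed word was in both A and B

for word in mywordslist:
    pattern = r'\b{}\b'.format(re.escape(word))
    # Row-wise mask: True only where the word is in A and in B of the same row.
    in_both = (df['A'].str.contains(pattern, regex=True)
               & df['B'].str.contains(pattern, regex=True))
    df.loc[in_both, 'word_removed'] = 1
    for col in ['A_new', 'B_new']:
        replaced = df[col].str.replace(pattern, ' ', regex=True)
        # Keep the replaced text only in rows where the mask is True.
        df[col] = replaced.where(in_both, df[col])

# Assumption: collapse the extra spaces left behind by the removal.
for col in ['A_new', 'B_new']:
    df[col] = df[col].str.replace(r'\s+', ' ', regex=True).str.strip()

print(df)
```

With the sample data this flags only the first row, where A_new becomes 'other stuff' and B_new becomes 'foo'. The other rows are left unchanged. Since \b already matches at the start and end of a string, the separate ^ and $ patterns should not be needed.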
