Saturday, 14 August 2021

How to use pandas apply to replace iterrows?

I am calculating a sentiment score for every row in my dataset, based on the news headline. I used iterrows to do this:

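import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

field = 'headline'
dfp = pd.DataFrame(columns=('pos', 'neg', 'neu'))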

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

for index, row in df.iterrows():
    text = row[field]
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                     }, ignore_index=True)

However, processing takes too long (it has been running for more than 30 minutes and has not finished). My dataset has 16.6k rows.

This is a small section of the dataset:

    datetime            headline
0   2020-03-17 16:57:07 12 best noise-cancelling headphones: In-ear an...
1   2020-06-08 14:00:55 5G Stocks To Buy And Watch: Pricing of 5G Smar...
2   2020-06-19 10:00:00 10 best wireless printers that will make your ...
3   2020-08-19 00:00:00 Apple Confirms Solid New iOS 14 Security Move ...
4   2020-08-19 00:00:00 Apple Becomes First U.S. Company Worth More Th...

I have read that iterrows is not recommended in most situations unless the dataset is small and performance is not a concern. The alternative, it seems, is apply, since apply goes through each pandas row and is optimized.

Some of the SO topics I read suggested creating a function and passing it to apply. This is what I attempted:

def calPred(text):
    encoded_input = tokenizer(text, return_tensors='pt')
    output = model(**encoded_input)
    probs = torch.nn.functional.softmax(output[0], dim=-1)
    probs_arr = probs.cpu().detach().numpy()
    dfp = dfp.append({'pos': probs_arr[0][0],
                      'neg': probs_arr[0][1],
                      'neu': probs_arr[0][2]
                     }, ignore_index=True)

df['headline'].apply(lambda x: calPred(x))

It raised this error: UnboundLocalError: local variable 'dfp' referenced before assignment.
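For context, this error most likely arises because the assignment to dfp inside calPred makes dfp a local name, so the right-hand side reads it before it has a value. A minimal sketch of one way around it, where the function returns its scores and apply assembles the result. The names fake_probs and cal_pred are hypothetical, and fake_probs only stands in for the tokenizer and model call so that the example runs offline:

```python
import pandas as pd

def fake_probs(text):
    # Hypothetical stand-in for the tokenizer + model + softmax step.
    # Returns three numbers that sum to 1 so the example runs offline.
    n = len(text)
    return [n / (n + 2), 1 / (n + 2), 1 / (n + 2)]

def cal_pred(text):
    # Return the scores instead of assigning to an outer DataFrame:
    # no outer name is rebound here, so there is no UnboundLocalError.
    pos, neg, neu = fake_probs(text)
    return pd.Series({'pos': pos, 'neg': neg, 'neu': neu})

df = pd.DataFrame({'headline': ['Apple confirms new move', 'Stocks to buy']})

# When the applied function returns a Series, Series.apply yields a
# DataFrame with one column per key.
dfp = df['headline'].apply(cal_pred)
```

Here dfp has one row per headline, aligned on the index of df, so it can be joined back with df.join(dfp) if needed.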

I would appreciate it if someone could guide me on how to optimize this and use apply correctly. Thanks in advance.
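On the optimization side, apply still calls the function once per row, so the per-row model call is probably the dominant cost rather than iterrows itself. A sketch of scoring headlines in batches instead, assuming a hypothetical helper score_in_batches and a stand-in fake_score_batch. The commented lines show roughly where the FinBERT call would go and are an untested assumption about this setup, not a measured fix:

```python
import numpy as np
import pandas as pd

def score_in_batches(texts, score_batch, batch_size=32):
    """Call score_batch on consecutive slices of texts and stack the rows."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(score_batch(batch)))
    return np.vstack(chunks)

def fake_score_batch(batch):
    # Hypothetical stand-in for the model. With FinBERT this could be roughly:
    #   with torch.no_grad():
    #       enc = tokenizer(batch, return_tensors='pt',
    #                       padding=True, truncation=True)
    #       logits = model(**enc).logits
    #       return torch.nn.functional.softmax(logits, dim=-1).numpy()
    lengths = np.array([len(t) for t in batch], dtype=float)
    ones = np.ones_like(lengths)
    raw = np.stack([lengths, ones, ones], axis=1)
    return raw / raw.sum(axis=1, keepdims=True)

df = pd.DataFrame({'headline': ['a', 'bbb', 'cc', 'dddd', 'e']})

probs = score_in_batches(df['headline'].tolist(), fake_score_batch, batch_size=2)

# Build the result once instead of appending row by row.
dfp = pd.DataFrame(probs, columns=['pos', 'neg', 'neu'], index=df.index)
```

Building dfp once from the stacked array also avoids the repeated DataFrame.append calls, each of which copies the whole frame.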
