Friday, 6 December 2019

Code optimization: better way than an if/else loop

I am quite new to Python (and to Stack Overflow), so please be a little patient with me if I have missed anything!

My current problem is as follows:

I have two datasets. The first dataset, “insideMarke1”, contains the Best Bid / Best Ask (BBA) data for a given day, starting at 8:00 am. Every time the BBA is updated, a new row with a new timestamp is added.

The second dataset, “orders1”, contains order data, one order per row. The order entry timestamps (‘Entry Date and Time’) in this dataset can be older than the oldest timestamp in the “insideMarke1” dataset. There are more orders in “orders1” than BBA updates in “insideMarke1” (because not every order entry leads to a BBA update).

What I want to achieve is a new dataset, “orders1Complete”, which combines both datasets: for every order in “orders1”, I want the Best Bid / Best Ask price at the moment the order was entered, i.e. the most recent BBA row whose timestamp precedes the order entry. If the order entry timestamp is older than the oldest timestamp in “insideMarke1”, I just want the oldest BBA for that order. For example, with the sample data below, the order entered at 08:00:04.000000 should get the BBA row from 08:00:03.123456, and the order from 2019-11-30 17:29:50 should get the first BBA row from 08:00:00.123456.
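Stated as code, the lookup rule for a single order is the following (a minimal sketch; bba_time_for is a hypothetical helper used only to state the rule, not part of my actual code below):

def bba_time_for(entry_time, bba_times):
    # bba_times: pandas Series of BBA update timestamps
    earlier = bba_times[bba_times < entry_time]
    # most recent BBA strictly before the order entry, falling back
    # to the oldest BBA if the order predates all updates
    return earlier.max() if not earlier.empty else bba_times.min()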

Therefore, I first added a new column called ‘BBA_Timestamp’ to the “orders1” dataset, and then merged both datasets on this column. As I want the BBA at the moment just before the order was entered, the BBA timestamp has to be older than the order’s ‘Entry Date and Time’ timestamp.

Here is my code:

import pandas as pd


insideMarke1 = pd.DataFrame({'Timestamp':['2019-12-01 08:00:00.123456', '2019-12-01 08:00:01.123456', '2019-12-01 08:00:02.123456', '2019-12-01 08:00:03.123456', '2019-12-01 08:00:05.123456'],
                             'bestBidQnt':[100, 100, 50, 50, 100],
                             'bestBidPrice':[50.01, 50.01, 50.02, 50.02, 50.01],
                             'bestAskPrice':[51.00, 50.99, 50.99, 50.50, 50.50],
                             'bestAskQnt':[200, 100, 100, 200, 200]})

orders1 = pd.DataFrame({'Entry Date and Time':['2019-11-30 17:29:50.000000','2019-12-01 07:30:01.112233', '2019-12-01 08:00:00.123456', '2019-12-01 08:00:00.512341', '2019-12-01 08:00:01.123456', '2019-12-01 08:00:02.123456', '2019-12-01 08:00:02.987654', '2019-12-01 08:00:03.123456', '2019-12-01 08:00:04.000000', '2019-12-01 08:00:05.123456'],
                       'Bid':['True', 'True', 'False', 'False', 'False', 'True', 'True', 'False', 'True', 'True'],
                       'Price':[49.00, 49.50, 51.00, 51.50, 50.99, 50.02, 48.00, 50.50, 49.00, 50.01 ],
                       'Qnt':[50, 100, 200, 150, 100, 50, 10, 200, 80, 100 ]})


# Parse the timestamp strings into proper datetime64 columns
# (the unit argument of pd.to_datetime is only for numeric epochs, not strings).
insideMarke1['Timestamp'] = pd.to_datetime(insideMarke1['Timestamp'])
orders1['Entry Date and Time'] = pd.to_datetime(orders1['Entry Date and Time'])


# Initialise the lookup column as datetimes (NaT) so the assignments below keep the dtype.
orders1['BBA_Timestamp'] = pd.NaT
minInsideMarke1 = insideMarke1['Timestamp'].min()

# This row-by-row loop is the slow part.
for i in range(len(orders1)):
    if orders1['Entry Date and Time'][i] <= minInsideMarke1:
        # Orders entered before the first BBA update get the oldest BBA.
        orders1.loc[i, 'BBA_Timestamp'] = minInsideMarke1
    else:
        # Keep only the BBA rows strictly older than the order entry time...
        insideMarke1_temp = insideMarke1.loc[insideMarke1['Timestamp'] < orders1['Entry Date and Time'][i]]
        # ...and take the most recent of them.
        orders1.loc[i, 'BBA_Timestamp'] = insideMarke1_temp['Timestamp'].max()

# If several BBA updates share a timestamp, keep the last one,
# then rename the key column so both frames can be merged on it.
insideMarke1 = insideMarke1.drop_duplicates('Timestamp', keep='last')
insideMarke1 = insideMarke1.rename(columns={'Timestamp': 'BBA_Timestamp'})

# merge_asof requires both frames to be sorted on the merge key.
insideMarke1 = insideMarke1.sort_values('BBA_Timestamp').reset_index(drop=True)
orders1 = orders1.sort_values('BBA_Timestamp').reset_index(drop=True)

# Each order's BBA_Timestamp is an exact value taken from insideMarke1,
# so this backward as-of merge attaches the matching BBA row.
orders1Complete = pd.merge_asof(orders1, insideMarke1, on='BBA_Timestamp')

The code works fine, but it is painfully slow. My data sets contain several thousand rows each, and I have to do this for several of those data sets...
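For comparison, a fully vectorized version could presumably look like the sketch below. It replaces the whole loop with a single backward as-of join, starting again from the frames as first constructed (both must be sorted by their timestamps; note the exact-match caveat in the comments):

# Clamp each order's lookup time to the first BBA timestamp, so orders
# entered before any BBA update fall back to the oldest BBA row.
orders1['BBA_Timestamp'] = orders1['Entry Date and Time'].clip(
    lower=insideMarke1['Timestamp'].min())

# One backward as-of join: for each order, take the last BBA row whose
# timestamp is <= the (clamped) lookup time. Unlike the loop above, this
# also matches a BBA carrying exactly the same timestamp as the order
# (merge_asof has an allow_exact_matches flag to control this).
orders1Complete = pd.merge_asof(
    orders1.sort_values('BBA_Timestamp'),
    insideMarke1.rename(columns={'Timestamp': 'BBA_Timestamp'})
                .sort_values('BBA_Timestamp'),
    on='BBA_Timestamp')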

I would be very glad for any help, tips, or advice...
