mercredi 13 juillet 2016

Python - Itterows through datetime column and use time difference criteria to update subset of data

I'm a pure Python novice so please bear with me.

I have a pandas data frame, and I'd like to iterate/loop through each row and perform the following: 1) Check if the current row has the most up-to-date time, and if it is set the time as the most up-to-date time 2) Compare the current row with the previously iterated rows and remove the rows that are older than 2 hours 3) With the remaining rows, obtain and report the median number of connections of each location, and store the result and the row valuees into a dictionary that is updated after each iteration.

Let's assume this is my data frame

             Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 02:00:00
2             Boston                Hawaii 2016-03-28 04:00:00
3           New York                Boston 2016-03-28 06:00:00
4        Los Angeles                Boston 2016-03-28 04:00:00
5        Los Angeles                Hawaii 2016-03-28 08:00:00
6        Los Angeles                Boston 2016-03-28 08:00:00
7           New York           Los Angeles 2016-03-28 10:00:00
8             Boston                Hawaii 2016-03-28 06:00:00
9           New York                Boston 2016-03-28 12:00:00

Edit: I've provide code to illustrate the logic

RecentTime = [] # Store list of most recent time for each row
RecentData = [] # Store data frame of subset of rows
priortime = None
# Loop through 
for index, row in data.iterrows():
    currenttime = row['Time']
    if priortime is None:
        priortime = currenttime

    if priortime > currenttime: # If prior time is greater than current row
        currenttime = priortime

    priortime = currenttime    
    RecentTime.append(currenttime)

    RecentData.append([row]) # Append current row into RecentData subset
    if abs(row['Time'] - next_row['Time']) < timedelta(hours=2):
    RecentData[i] = data[(data['Time'] > row['Time']-pd.Timedelta('2 hours')) & (data['Time'] < row['Time'] + pd.Timedelta('2 hours'))]
    else:
            break

Using my above logic, the first iteration should look like:

most_up_to_date_time = 2016-03-28 02:00:00
median number of connections = {1,1} = 1 # Where the median connection 
                                         # for New York and Boston is 1.

             Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00

and the second, third, and fourth iterations are:

most_up_to_date_time = 2016-03-28 02:00:00 # Remains the same
median number of connections = {2,1,1} = 1 # New York has 2 connections,
# Boston/LA each have 1 connection, so the median is 1.

             Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 02:00:00                 

most_up_to_date_time = 2016-03-28 04:00:00 # Updated
median number of connections = {2,2,1,1} = 1.5 # NY/Boston have 2 
# connections, while LA/Hawaii have 1 connection, so the median is 1.5.

             Origin           Destination                Time
0           New York                Boston 2016-03-28 02:00:00
1           New York           Los Angeles 2016-03-28 02:00:00
2             Boston                Hawaii 2016-03-28 04:00:00

most_up_to_date_time = 2016-03-28 06:00:00 # Updated
median number of connections = {1,2,1} = 1 # Boston has 2 connections,
# while LA/NY have 1 connection each, so the median is 1.

             Origin           Destination                Time
2             Boston                Hawaii 2016-03-28 04:00:00
3           New York                Boston 2016-03-28 06:00:00

Notice that for the second and third iterations, the subset of data grew since the time difference between the most up-to-date row fell within the 2 hour time frame, but for the fourth iteration, the subset of data was adjusted to delete the rows that are older than 2 hours from the most up-to-date row (i.e. we kept rows 2 and 3).

Continuing onto the fifth, six, and seven iterations:

most_up_to_date_time = 2016-03-28 06:00:00 # Updated
median number of connections = {1,1,3,1} = 1

             Origin           Destination                Time
2             Boston                Hawaii 2016-03-28 04:00:00
3           New York                Boston 2016-03-28 06:00:00
4        Los Angeles                Boston 2016-03-28 04:00:00

most_up_to_date_time = 2016-03-28 08:00:00 # Updated
median number of connections = {1,1,1,1} = 1

             Origin           Destination                Time
3           New York                Boston 2016-03-28 06:00:00
5        Los Angeles                Hawaii 2016-03-28 08:00:00

most_up_to_date_time = 2016-03-28 08:00:00 # Remains the same
median number of connections = {1,2,2,1} = 1.5

             Origin           Destination                Time
3           New York                Boston 2016-03-28 06:00:00
5        Los Angeles                Hawaii 2016-03-28 08:00:00
6        Los Angeles                Boston 2016-03-28 08:00:00

Notice here that in the fifth iteration, even though the newest row is out of order, it falls within the 2 hours time frame and is used to build onto the subset of data from the fourth iteration (rows 3, 4, and 5). But in the sixth iteration, the subset of data adjusts to keep only the rows within the 2 hours time frame (rows 3 and 5), and in the seventh iteration, the subset of data builds onto the subset of data kept from the sixth iteration (rows 3, 5, and 6).

Finally, the eighth, ninth, and tenth iterations:

most_up_to_date_time = 2016-03-28 10:00:00 # Updated
median number of connections = {3,1,1,1} = 1

             Origin           Destination                Time
5        Los Angeles                Hawaii 2016-03-28 08:00:00
6        Los Angeles                Boston 2016-03-28 08:00:00
7           New York           Los Angeles 2016-03-28 10:00:00

most_up_to_date_time = 2016-03-28 10:00:00 # Remains the same
median number of connections = {3,1,1,1} = 1

             Origin           Destination                Time
5        Los Angeles                Hawaii 2016-03-28 08:00:00
6        Los Angeles                Boston 2016-03-28 08:00:00
7           New York           Los Angeles 2016-03-28 10:00:00

most_up_to_date_time = 2016-03-28 12:00:00 # Updated
median number of connections = {2,1,1} = 1

             Origin           Destination                Time
7           New York           Los Angeles 2016-03-28 10:00:00
9           New York                Boston 2016-03-28 12:00:00

Notice that in the eighth iteration, the subset of data now contains rows 5,6, and 7, and the most recent time is updated, while in the ninth iteration, row 8 is ignored because it is older than 2 hours compared to the most up-to-date time. Finally, in the tenth iteration, the subset of data only contains rows 7 and 9.

This is a little dense and detailed, so please let me know if there's any confusion.

Aucun commentaire:

Enregistrer un commentaire