Tags: pandas, python

Aggregating events. Reducing df size. In search of more elegant way

Published on 2020-03-27 15:42:13

I have a timeseries dataframe with the following structure:

Loc | Event |       Start        |         End       |  
 A  |  aaa  |2018-08-30 00:26:29 |2018-08-30 00:26:59|  
 A  |  aaa  |2018-08-30 00:26:30 |2018-08-30 00:26:47|  
 A  |  aaa  |2018-08-30 00:38:05 |2018-08-30 00:39:09|  
 A  |  aaa  |2018-08-30 00:40:31 |2018-08-30 00:40:41|  
 A  |  aaa  |2018-08-30 00:57:17 |2018-08-30 00:57:28|
 A  |  aaa  |2018-08-30 00:57:36 |2018-08-30 00:58:07|  
 A  |  aaa  |2018-08-30 00:57:53 |2018-08-30 00:59:23|  
 A  |  aaa  |2018-08-30 00:57:58 |2018-08-30 00:58:11|  
 A  |  aaa  |2018-08-30 00:58:00 |2018-08-30 00:58:08|  
 A  |  aaa  |2018-08-30 01:27:58 |2018-08-30 01:28:58|  

My goal is to aggregate a group of events into one event based on the interval between them. As you can see from the example above, there are many records that last only a few seconds and are also generated within a few seconds of each other. Such events must be combined into one, taking the start time of the first event in the group and the end time of the last event in that group.


As a result, the following output must be achieved:

Loc | Event |       Start        |         End       |  
 A  |  aaa  |2018-08-30 00:26:29 |2018-08-30 00:26:59|  
 A  |  aaa  |2018-08-30 00:26:30 |2018-08-30 00:26:47|  
 A  |  aaa  |2018-08-30 00:38:05 |2018-08-30 00:39:09|  
 A  |  aaa  |2018-08-30 00:40:31 |2018-08-30 00:40:41|  
 A  |  aaa  |2018-08-30 00:57:17 |2018-08-30 00:58:08|
 A  |  aaa  |2018-08-30 01:27:58 |2018-08-30 01:28:58|  

For now I have managed to achieve this with many additional variables and for loops, but it is quite slow. So any ideas on how to achieve this with pandas methods are very welcome.

Questioner: Ison
Viewed: 14

Answer by Zaraki Kenpachi, 2020-01-31 17:37

To solve this you need to:

  1. create a while loop to step through each datetime range
  2. slice your df with the new datetime range
  3. get the results from the sliced df
  4. calculate a new exit variable for the while loop
  5. collect the result dfs into a list and then concat them

Here you go (some comments in the code):

import pandas as pd
from io import StringIO
from dateutil.relativedelta import relativedelta


data = StringIO("""
Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")

# load data into data frame
df = pd.read_csv(data, sep='|')

# convert string to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])


def get_result(df_filtered):
    # one aggregated row per slice: earliest Start, latest End
    df2 = pd.DataFrame({'Loc': df_filtered['Loc'].unique(),
                        'Event': df_filtered['Event'].unique(),
                        'Start': df_filtered['Start'].min(),
                        'End': df_filtered['End'].max()})
    start_end = df_filtered['Start'].max()

    return df2, start_end


# get datetime group range
min_datetime = df['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# define variables for while loop
end_end = df['Start'].max()
start_end = min_datetime

results_list = []
while end_end > start_end:
    # filter rows by dates
    df_filtered = df[(df['Start'] >= min_datetime) & (df['End'] < max_datetime)]
    # get result and new start datetime
    df2, start_end = get_result(df_filtered)
    # get new values for max and min datetime
    df_start = df[df['Start'] > start_end]
    min_datetime = df_start['Start'].min()
    max_datetime = min_datetime + relativedelta(minutes=6)
    # append df with results
    results_list.append(df2)


df = pd.concat(results_list)

Output:

  Loc Event               Start                 End
0   A   aaa 2018-08-30 00:26:29 2018-08-30 00:26:59
0   A   aaa 2018-08-30 00:38:05 2018-08-30 00:40:41
0   A   aaa 2018-08-30 00:57:17 2018-08-30 00:59:23
0   A   aaa 2018-08-30 01:27:58 2018-08-30 01:28:58