I have a time series dataframe with the following structure:
Loc | Event | Start | End |
A | aaa |2018-08-30 00:26:29 |2018-08-30 00:26:59|
A | aaa |2018-08-30 00:26:30 |2018-08-30 00:26:47|
A | aaa |2018-08-30 00:38:05 |2018-08-30 00:39:09|
A | aaa |2018-08-30 00:40:31 |2018-08-30 00:40:41|
A | aaa |2018-08-30 00:57:17 |2018-08-30 00:57:28|
A | aaa |2018-08-30 00:57:36 |2018-08-30 00:58:07|
A | aaa |2018-08-30 00:57:53 |2018-08-30 00:59:23|
A | aaa |2018-08-30 00:57:58 |2018-08-30 00:58:11|
A | aaa |2018-08-30 00:58:00 |2018-08-30 00:58:08|
A | aaa |2018-08-30 01:27:58 |2018-08-30 01:28:58|
My goal is to aggregate groups of events into one event based on the interval between events. As you can see from the example above, many records last only a few seconds and are generated within a few seconds of each other. Such events must be combined into one, taking the start time of the first event in the group and the end time of the last event in that group.
The expected output is:
Loc | Event | Start | End |
A | aaa |2018-08-30 00:26:29 |2018-08-30 00:26:59|
A | aaa |2018-08-30 00:26:30 |2018-08-30 00:26:47|
A | aaa |2018-08-30 00:38:05 |2018-08-30 00:39:09|
A | aaa |2018-08-30 00:40:31 |2018-08-30 00:40:41|
A | aaa |2018-08-30 00:57:17 |2018-08-30 00:58:08|
A | aaa |2018-08-30 01:27:58 |2018-08-30 01:28:58|
So far I have managed to achieve this with many additional variables and for loops, but it is quite slow. Any ideas on how to achieve this with pandas methods are very welcome.
To solve this you need:
Here you go (with some comments in the code):
import pandas as pd
from io import StringIO
from dateutil.relativedelta import relativedelta
data = StringIO("""
Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")
# load data into data frame
df = pd.read_csv(data, sep='|')
# convert string to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
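def get_result(df_filtered):
    # collapse the filtered rows into one row: earliest Start, latest End
    df2 = pd.DataFrame({
        'Loc': df_filtered['Loc'].unique(),
        'Event': df_filtered['Event'].unique(),
        'Start': df_filtered['Start'].min(),
        'End': df_filtered['End'].max(),
    })
    # latest Start in this group; the next group begins after it
    start_end = df_filtered['Start'].max()
    return df2, start_end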
# get datetime group range
min_datetime = df['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# define variables for while loop
end_end = df['Start'].max()
start_end = min_datetime
results_list = []
while end_end > start_end:
    # filter rows by dates
    df_filtered = df[(df['Start'] >= min_datetime) & (df['End'] < max_datetime)]
    # get result and new start datetime
    df2, start_end = get_result(df_filtered)
    # get new values for max and min datetime
    df_start = df[df['Start'] > start_end]
    min_datetime = df_start['Start'].min()
    max_datetime = min_datetime + relativedelta(minutes=6)
    # append df with results
    results_list.append(df2)
df = pd.concat(results_list)
Output:
  Loc Event               Start                 End
0   A   aaa 2018-08-30 00:26:29 2018-08-30 00:26:59
0   A   aaa 2018-08-30 00:38:05 2018-08-30 00:40:41
0   A   aaa 2018-08-30 00:57:17 2018-08-30 00:59:23
0   A   aaa 2018-08-30 01:27:58 2018-08-30 01:28:58
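For what it's worth, the same grouping can probably be done without an explicit loop. The sketch below is not the approach used in the code above: it sorts the events, tracks the latest End seen so far within each Loc/Event pair, and starts a new group whenever the next Start is more than a chosen gap after it. The function name merge_close_events and the max_gap parameter are made up for illustration, and the 6-minute default is only borrowed from the window used above; the right threshold depends on your data.

```python
from io import StringIO

import pandas as pd

data = StringIO("""Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")
df = pd.read_csv(data, sep="|", parse_dates=["Start", "End"])


def merge_close_events(df, max_gap=pd.Timedelta(minutes=6)):
    """Merge events whose Start is within max_gap of the latest earlier End.

    Hypothetical helper: the name and the max_gap default are assumptions.
    """
    df = df.sort_values(["Loc", "Event", "Start"]).reset_index(drop=True)

    # Latest End among the previous rows of the same (Loc, Event);
    # NaT for the first row of each pair.
    prev_end = df.groupby(["Loc", "Event"])["End"].transform(
        lambda s: s.cummax().shift()
    )

    # A new group starts at the first row of each pair, or when the gap
    # to the previous events exceeds max_gap.
    new_group = prev_end.isna() | ((df["Start"] - prev_end) > max_gap)
    group_id = new_group.cumsum()

    return (
        df.groupby(group_id)
        .agg(
            Loc=("Loc", "first"),
            Event=("Event", "first"),
            Start=("Start", "min"),
            End=("End", "max"),
        )
        .reset_index(drop=True)
    )


result = merge_close_events(df)
print(result)
```

On the sample data this yields the same four rows as the output above. Named aggregation in agg needs pandas 0.25 or newer.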
Funny: it works great with the example data, but when I try to reproduce it with "real" data it returns empty arrays or throws an 'arrays must all be same length' error, although they are the same length... Anyway, thanks for the great job! That really helped me.
@Ison You're welcome! For this error, check whether your data contains any hidden whitespace characters.
lol, unfortunately I don't know pandas well enough to say why it helped, but avoiding the unique() method for Loc and Event solved the problem, at least for now.
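One plausible explanation for why dropping unique() helped (a guess, since the real data isn't shown): if a filtered window contains more than one distinct Loc or Event, unique() returns arrays of different lengths, and the DataFrame constructor refuses to combine them. The rows below are made up purely to reproduce that situation; the groupby at the end is one possible way to sidestep it, not necessarily what was actually done.

```python
import pandas as pd

# Made-up window with two different Loc values but a single Event value.
window = pd.DataFrame({
    "Loc": ["A", "B", "B"],
    "Event": ["aaa", "aaa", "aaa"],
    "Start": pd.to_datetime([
        "2018-08-30 00:26:29", "2018-08-30 00:26:30", "2018-08-30 00:27:00",
    ]),
    "End": pd.to_datetime([
        "2018-08-30 00:26:59", "2018-08-30 00:26:47", "2018-08-30 00:27:30",
    ]),
})

try:
    pd.DataFrame({
        "Loc": window["Loc"].unique(),      # length 2
        "Event": window["Event"].unique(),  # length 1
        "Start": window["Start"].min(),
        "End": window["End"].max(),
    })
    raised = False
except ValueError as exc:
    # Arrays of different lengths cannot form the columns of one frame.
    raised = True
    print("ValueError:", exc)

# Possible alternative: aggregate per (Loc, Event) so every column
# always ends up with one value per group.
summary = window.groupby(["Loc", "Event"], as_index=False).agg(
    Start=("Start", "min"),
    End=("End", "max"),
)
print(summary)
```

An empty window would be the other case to watch for: unique() then returns empty arrays, which may be where the empty results came from.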