Aggregating events. Reducing df size. In search of a more elegant way

Zaraki Kenpachi 2020-01-31 17:37

To solve this you need to:

create a while loop that goes through each datetime range
slice your df with the current datetime range
get the results from the sliced df
calculate the new range variables for the next pass of the while loop
collect the result dfs into a list and then concat them

Here you go (some comments in the code):

import pandas as pd
from io import StringIO
from dateutil.relativedelta import relativedelta


data = StringIO("""
Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")

# load data into data frame
df = pd.read_csv(data, sep='|')

# convert string to datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])


def get_result(df_filtered):
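    # one summary row: earliest Start and latest End of the slice
    df2 = pd.DataFrame({
        'Loc': df_filtered['Loc'].unique(),
        'Event': df_filtered['Event'].unique(),
        'Start': df_filtered['Start'].min(),
        'End': df_filtered['End'].max(),
    })
    # latest Start in the slice; the next range begins after it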
    start_end = df_filtered['Start'].max()

    return df2, start_end


# get datetime group range
min_datetime = df['Start'].min()
max_datetime = min_datetime + relativedelta(minutes=6)
# define variables for while loop
end_end = df['Start'].max()
start_end = min_datetime

results_list = []
while end_end > start_end:
    # filter rows by dates
    df_filtered = df[(df['Start'] >= min_datetime) & (df['End'] < max_datetime)]
    # get result and new start datetime
    df2, start_end = get_result(df_filtered)
    # get new values for max and min datetime
    df_start = df[df['Start'] > start_end]
    min_datetime = df_start['Start'].min()
    max_datetime = min_datetime + relativedelta(minutes=6)
    # append df with results
    results_list.append(df2)


df = pd.concat(results_list)

Output:

  Loc Event               Start                 End
0   A   aaa 2018-08-30 00:26:29 2018-08-30 00:26:59
0   A   aaa 2018-08-30 00:38:05 2018-08-30 00:40:41
0   A   aaa 2018-08-30 00:57:17 2018-08-30 00:59:23
0   A   aaa 2018-08-30 01:27:58 2018-08-30 01:28:58
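
Since the question asks for something more elegant, here is a loop-free sketch for comparison. It is not the method used above: it swaps the fixed 6-minute window for a gap rule, assuming a new group starts whenever a row's Start is more than 6 minutes after the previous Start within the same Loc and Event. The names merge_events, gap and group are made up for illustration. On the sample data both approaches give the same four rows, but on other data they can differ.

```python
import pandas as pd
from io import StringIO

raw = StringIO("""Loc|Event|Start|End
A|aaa|2018-08-30 00:26:29|2018-08-30 00:26:59
A|aaa|2018-08-30 00:26:30|2018-08-30 00:26:47
A|aaa|2018-08-30 00:38:05|2018-08-30 00:39:09
A|aaa|2018-08-30 00:40:31|2018-08-30 00:40:41
A|aaa|2018-08-30 00:57:17|2018-08-30 00:57:28
A|aaa|2018-08-30 00:57:36|2018-08-30 00:58:07
A|aaa|2018-08-30 00:57:53|2018-08-30 00:59:23
A|aaa|2018-08-30 00:57:58|2018-08-30 00:58:11
A|aaa|2018-08-30 00:58:00|2018-08-30 00:58:08
A|aaa|2018-08-30 01:27:58|2018-08-30 01:28:58
""")
events = pd.read_csv(raw, sep="|", parse_dates=["Start", "End"])


def merge_events(events, gap=pd.Timedelta(minutes=6)):
    """Collapse events whose Start times are chained by gaps <= `gap` (assumed rule)."""
    events = events.sort_values(["Loc", "Event", "Start"])
    # True where the distance to the previous Start (same Loc/Event) exceeds the gap
    new_group = events.groupby(["Loc", "Event"])["Start"].diff() > gap
    # running count of group starts gives every row a group number
    group_id = new_group.cumsum().rename("group")
    return (
        events.groupby(["Loc", "Event", group_id])
        .agg(Start=("Start", "min"), End=("End", "max"))
        .reset_index()
        .drop(columns="group")
    )


merged = merge_events(events)
print(merged)
```

Because Loc and Event are part of the group key, this version also copes with more than one location or event type in the same frame.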

Ison 2020-01-31 18:12:15

Funny. It works great with the example data, but when I try to reproduce it with "real" data it returns empty arrays or throws an 'arrays must all be same length' error, although they are of the same length... Anyway, thanks for the great job! That really helped me.

Zaraki Kenpachi 2020-01-31 18:20:14

@Ison You're welcome! For this error, check whether your data has any hidden whitespace characters.
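
A minimal sketch of that check, assuming the key columns are plain strings. The sample values below are invented to show the effect: stray spaces or tabs make otherwise identical labels count as different values, and str.strip() removes them.

```python
import pandas as pd

# invented sample: the same labels with hidden leading/trailing whitespace
df = pd.DataFrame({
    "Loc": ["A", "A ", " A"],
    "Event": ["aaa", "aaa\t", "aaa"],
})

distinct_before = df["Loc"].nunique()  # whitespace variants count separately

# strip leading/trailing whitespace from the string columns
for col in ["Loc", "Event"]:
    df[col] = df[col].str.strip()

distinct_after = df["Loc"].nunique()
print(distinct_before, distinct_after)
```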

Ison 2020-01-31 18:28:06

lol, unfortunately I do not know pandas well enough to say why it helped, but avoiding the unique() method for Loc and Event solved the problem, at least for now
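
A likely explanation, offered as a guess rather than something confirmed in this thread: unique() returns an array with one entry per distinct value. If a slice contains two Loc values but one Event value, the arrays passed to pd.DataFrame have different lengths and pandas raises a ValueError; if the slice is empty, the arrays are empty too. The sketch below reproduces the length mismatch on invented data and shows one way to build the summary row without unique(). The name summarize_slice is hypothetical, and taking the first row's Loc and Event assumes the slice belongs to a single location and event.

```python
import pandas as pd

# invented slice with two distinct Loc values but one Event value
mixed = pd.DataFrame({
    "Loc": ["A", "B"],
    "Event": ["aaa", "aaa"],
    "Start": pd.to_datetime(["2018-08-30 00:26:29", "2018-08-30 00:26:30"]),
    "End": pd.to_datetime(["2018-08-30 00:26:59", "2018-08-30 00:26:47"]),
})

error = None
try:
    # arrays of length 2 (Loc) and 1 (Event) cannot form one frame
    pd.DataFrame({
        "Loc": mixed["Loc"].unique(),
        "Event": mixed["Event"].unique(),
        "Start": mixed["Start"].min(),
        "End": mixed["End"].max(),
    })
except ValueError as exc:
    error = str(exc)


def summarize_slice(df_filtered):
    """Build one summary row without unique(); returns None for an empty slice."""
    if df_filtered.empty:
        return None
    first = df_filtered.iloc[0]
    return pd.DataFrame({
        "Loc": [first["Loc"]],      # assumption: one Loc per slice
        "Event": [first["Event"]],  # assumption: one Event per slice
        "Start": [df_filtered["Start"].min()],
        "End": [df_filtered["End"].max()],
    })


summary = summarize_slice(mixed)
print(error)
print(summary)
```

With several locations or event types in the real data, filtering or grouping by Loc and Event before slicing by time would keep each slice homogeneous.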
