I have Precipitation Data (1800 rows and 15k columns) for each cell and date.
486335 486336 486337
2019-07-03 13:35:54.445 0 2 22
2019-07-04 13:35:54.445 0 1 1
2019-07-05 13:35:54.445 16 8 22
2019-07-06 13:35:54.445 0 0 0
2019-07-07 13:35:54.445 0 11 0
I want to find dates where a specific amount of rain (>15mm) was reached and count the days after this event were less rain (<1,1mm) occurred. Together with the amount of rain, start and end period, cell and other information stored in a new DataFrame.
I write a for loop that does this job but it took several days to finish ;(. I am a beginner of python so maybe there are some tips for other methods.
from datetime import datetime, timedelta, date
import datetime
import pandas as pd
#Existing Data
index_dates = pd.date_range(pd.datetime.today(), periods=10).tolist()
df = pd.DataFrame({'486335':[0,0,16,0,0,0,2,1,8,2],'486336':[2,1,8,0,11,16,0,1,6,8],'486337':[22,1,22,0,0,0,5,3,6,1]},index=index_dates)
columns = df.columns
counter_columns = 0
iteration = -1 #Iterations Steps
counter = 10 #10 precipitation values per column
duration = 0 #days with no or less than pp_max_1 rain
count = False
index_list = df.index #Index for updating df / Integear
period_range = 0 #Amount of days after Event without much rain Integear
period_amount = 0 #Amount of PP in dry days except event Integear
event_amount = 0.0 #Amount of heavy rainfall on the event date Float
pp = 0 #actual precipitation
pp_sum = 0.0 #mm
pp_min = 15.0 #mm min pp for start to count dry days until duration_min_after
pp_max_1 = 0.11 #max pp for 1 day while counting dry days
dry_days = 0 #dry days after event
for x in df:
for y in df[x]:
iteration = iteration + 1
if iteration == counter:
iteration = 0
counter_columns = counter_columns + 1
print("column :",counter_columns, "finished")
if y >= pp_min and count == False:
duration = duration + 1
count = True
start_period = index_list[iteration]
event_amount = y
index = iteration
pp_sum = pp_sum + y
elif y >= pp_min and count == True or y >= pp_max_1 and count == True:
end_period = index_list[iteration]
dry_periods = dry_periods.append({"start_period":start_period ,"end_period":end_period,"period_range":duration,"period_amount":pp_sum ,"event_amount":event_amount, "cell":columns[counter_columns]},ignore_index=True).sort_values('period_range',ascending=False)
duration = 0
count = False
pp_sum = 0
elif pp <= pp_max_1 and count == True:
duration = duration + 1
pp_sum = pp_sum + y
else:
continue
print(dry_periods)
The output looks like this
start_period end_period period_range \
0 2019-07-05 13:15:05.545 2019-07-09 13:15:05.545 4
1 2019-07-05 13:15:05.545 2019-07-09 13:15:05.545 4
2 2019-07-05 13:15:36.569 2019-07-09 13:15:36.569 4
3 2019-07-05 13:15:36.569 2019-07-09 13:15:36.569 4
4 2019-07-05 13:16:16.372 2019-07-09 13:16:16.372 4
5 2019-07-05 13:16:16.372 2019-07-09 13:16:16.372 4
period_amount event_amount cell
0 16.0 16 486335
1 22.0 22 486337
2 16.0 16 486335
3 22.0 22 486337
4 16.0 16 486335
5 22.0 22 486337
You could avoid iteration on rows, because it does not scale well for large dataframes.
This is a different approach, unsure whether it will be more efficient for your complete dataframe:
periods=[]
for cell in df.columns:
sub = pd.DataFrame({'amount': df[cell].values}, index=df.index)
sub['flag'] = pd.cut(sub['amount'], [0.11, 15, np.inf],
labels=[0, 1]).astype(np.float)
sub.loc[sub.flag>0, 'flag']=sub.loc[sub.flag>0, 'flag'].cumsum()
sub.flag.ffill(inplace=True)
x = sub[sub.flag>0].reset_index().groupby('flag').agg(
{'index':['min', 'max'], 'amount': 'sum'})
x.columns = ['start', 'end', 'amount']
x['period_range'] = (x.end - x.start).dt.days + 1
x['cell'] = cell
x.reindex(columns=['start', 'end', 'period_range', 'cell'])
periods.append(x)
resul = pd.concat(periods).reset_index(drop=True)
Nice! do you actually need the ffill above? wouldn't it give the same result, if you skip the loc[sub.flag>0 instead and also sum over the zeros?
The period length is the length measured from the beginning of the first period to the end of the last period, right?
@jottbe: the problem is that any value between 0.11 and 15 will interrupt a current dry period without starting a new group.
The period lenght is the number of days between the start of an event and the last day of that event + 1.
Nice solution. I haven't come across pd.cut so far. I'm sure it will make my life simpler. But how can you handle the case when several events occur in the same column? or does it already do that?