python - Pandas DataFrame: cannot loop through the columns

Posted on 2020-03-27 10:26:30

I have precipitation data for every cell and date (1,800 rows and 15k columns).

                          486335  486336  486337
2019-07-03 13:35:54.445       0       2      22
2019-07-04 13:35:54.445       0       1       1
2019-07-05 13:35:54.445      16       8      22
2019-07-06 13:35:54.445       0       0       0
2019-07-07 13:35:54.445       0      11       0

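I want to find the dates on which a given amount of rain was reached (> 15 mm) and count the days with little rain (< 1.1 mm) that follow each such event. The result should be stored in a new DataFrame together with the rainfall amount, the start and end time, the cell, and other information.

I wrote a for loop that does the job, but it takes several days to finish ;(. I am a Python beginner, so maybe there are some tricks for doing this another way.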

import pandas as pd

# Existing data
index_dates = pd.date_range(pd.Timestamp.today(), periods=10)
df = pd.DataFrame({'486335': [0, 0, 16, 0, 0, 0, 2, 1, 8, 2],
                   '486336': [2, 1, 8, 0, 11, 16, 0, 1, 6, 8],
                   '486337': [22, 1, 22, 0, 0, 0, 5, 3, 6, 1]},
                  index=index_dates)
columns = df.columns
counter_columns = 0  # position of the column currently being processed

iteration = -1  # row position within the current column
counter = 10  # 10 precipitation values per column
duration = 0  # days with no rain or less than pp_max_1
count = False  # True while a dry period is being counted

index_list = df.index  # dates used for start_period and end_period
event_amount = 0.0  # mm, heavy rainfall on the event date
pp_sum = 0.0  # mm, precipitation accumulated over the period
pp_min = 15.0  # mm, minimum precipitation that starts the dry-day count
pp_max_1 = 0.11  # mm, maximum precipitation for a day to count as dry
rows = []  # one dict per finished dry period

for x in df:
    for y in df[x]:
        iteration = iteration + 1
        if iteration == counter:
            # The previous column is done: restart the row position
            iteration = 0
            counter_columns = counter_columns + 1
            print("column :", counter_columns, "finished")
        if y >= pp_min and not count:
            # Heavy rain starts a new period
            duration = duration + 1
            count = True
            start_period = index_list[iteration]
            event_amount = y
            pp_sum = pp_sum + y
        elif y >= pp_max_1 and count:
            # Any further rain ends the running period
            end_period = index_list[iteration]
            rows.append({"start_period": start_period, "end_period": end_period,
                         "period_range": duration, "period_amount": pp_sum,
                         "event_amount": event_amount,
                         "cell": columns[counter_columns]})
            duration = 0
            count = False
            pp_sum = 0
        elif y <= pp_max_1 and count:
            # Dry day inside a running period
            duration = duration + 1
            pp_sum = pp_sum + y
        else:
            continue

dry_periods = pd.DataFrame(
    rows,
    columns=["start_period", "end_period", "period_range",
             "period_amount", "event_amount", "cell"],
).sort_values('period_range', ascending=False)
print(dry_periods)

The output looks like this:

start_period              end_period period_range  \
0  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545            4   
1  2019-07-05 13:15:05.545 2019-07-09 13:15:05.545            4   
2  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569            4   
3  2019-07-05 13:15:36.569 2019-07-09 13:15:36.569            4   
4  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372            4   
5  2019-07-05 13:16:16.372 2019-07-09 13:16:16.372            4   


    period_amount event_amount    cell  
0            16.0           16  486335  
1            22.0           22  486337  
2            16.0           16  486335  
3            22.0           22  486337  
4            16.0           16  486335  
5            22.0           22  486337

Asked by

till Kadabra


import numpy as np
import pandas as pd

periods = []
for cell in df.columns:
    sub = pd.DataFrame({'amount': df[cell].values}, index=df.index)
    # NaN: dry day (<= 0.11), 0: rain in (0.11, 15], 1: event (> 15)
    sub['flag'] = pd.cut(sub['amount'], [0.11, 15, np.inf], labels=[0, 1]).astype(float)
    # Number the event days 1, 2, 3, ...
    sub.loc[sub.flag > 0, 'flag'] = sub.loc[sub.flag > 0, 'flag'].cumsum()
    # Dry days inherit the flag of the last rainy day before them
    sub['flag'] = sub['flag'].ffill()
    x = sub[sub.flag > 0].reset_index().groupby('flag').agg(
        {'index': ['min', 'max'], 'amount': 'sum'})
    x.columns = ['start', 'end', 'amount']
    x['period_range'] = (x.end - x.start).dt.days + 1
    x['cell'] = cell
    periods.append(x)
result = pd.concat(periods).reset_index(drop=True)

jottbe 2019-07-03 21:43:41

Nice! Do you really need the forward fill above? Wouldn't it give the same result if you skipped the loc[sub.flag > 0 step and summed up to zero?

jottbe 2019-07-03 21:45:20

The period length is the length from the start of the first period to the end of the last one, right?

Serge Ballesta 2019-07-03 21:45:42

@jottbe: The problem is that any value between 0.11 and 15 interrupts the current dry period without starting a new group.
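To make the point about the forward fill concrete, here is a minimal sketch on a single made-up series (the values and dates are assumptions for illustration, not data from the question). It applies the same `pd.cut`, `cumsum` and `ffill` steps as the answer and shows how a value between 0.11 and 15 pushes the following dry days into group 0.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values for one cell (not taken from the question).
amount = pd.Series(
    [0, 16, 0, 0, 2, 0, 0],
    index=pd.date_range("2019-07-01", periods=7),
)

# NaN: dry day (<= 0.11), 0: rain in (0.11, 15], 1: event (> 15).
flag = pd.cut(amount, [0.11, 15, np.inf], labels=[0, 1]).astype(float)

# Give each event day its own running number; 0 and NaN stay untouched.
is_event = flag > 0
flag.loc[is_event] = flag.loc[is_event].cumsum()

# Dry days inherit the label of the last rainy day before them.
labelled = flag.ffill()
print(labelled)
```

In this sketch the two dry days after the event carry label 1, while the day with 2 mm and the dry days after it carry label 0, so the later `flag > 0` filter drops them. Without the forward fill the dry days would stay NaN and would not be attached to any event.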

Serge Ballesta 2019-07-03 21:47:25

The period length is the number of days between the start and the end of the event, plus 1.
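As a small worked example of that definition, the sketch below computes the length for two made-up periods (the dates are assumptions for illustration). The `+ 1` makes the event day itself count, so a period that starts and ends on the same day has length 1.

```python
import pandas as pd

# Hypothetical start and end dates of two periods (illustrative only).
x = pd.DataFrame({
    "start": pd.to_datetime(["2019-07-05", "2019-07-12"]),
    "end": pd.to_datetime(["2019-07-08", "2019-07-12"]),
})

# Difference in whole days, plus 1 so that the event day is included.
x["period_range"] = (x["end"] - x["start"]).dt.days + 1
print(x)
```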

jottbe 2019-07-03 21:49:55

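Nice solution. I had not come across pd.cut until now. I am sure it will make my life easier. But how do you handle several events occurring in the same column? Or does it already do that?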
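One way to check the multiple-event case is to run the same labelling on a made-up column that contains three heavy-rain days. This is only a sketch: the helper name `label_periods`, the sample values and the named-aggregation form of `agg` are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def label_periods(amount):
    """Label each day with the number of the event whose period it belongs to."""
    flag = pd.cut(amount, [0.11, 15, np.inf], labels=[0, 1]).astype(float)
    is_event = flag > 0
    # The running count gives every event day a distinct label: 1, 2, 3, ...
    flag.loc[is_event] = flag.loc[is_event].cumsum()
    return flag.ffill()

# Hypothetical column with three events (16, 20 and 22 mm).
amount = pd.Series(
    [16, 0, 0, 3, 20, 0, 22, 0],
    index=pd.date_range("2019-07-01", periods=8),
    dtype=float,
)
sub = pd.DataFrame({"amount": amount, "flag": label_periods(amount)})

# One row per event: first day, last day and total rain of its period.
periods = (
    sub[sub["flag"] > 0]
    .reset_index()
    .groupby("flag")
    .agg(start=("index", "min"), end=("index", "max"), amount=("amount", "sum"))
)
print(periods)
```

Under these assumptions each event ends up in its own group, because the cumulative sum assigns a new label to every day above 15 mm. Note that in this sketch a new event also closes the previous period and immediately opens the next one.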
