Having trouble retrieving max values in a pyspark dataframe

Posted on 2020-06-18 16:41:00

I calculate, for each row in a pyspark dataframe, the average of 'quantity' over a 5-row window (the current row and the 4 rows that follow), partitioned by a group of columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window  # needed for Window below

prep_df = ...

# frame spanning the current row and the next 4 rows within each partition
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))
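
For concreteness, here is a minimal sketch of the same pattern on made-up data (the 'store', 'day', and 'quantity' columns and values are assumptions, not from the original; I also add 'day' as an ordering column, which the snippet above does not show, so that the 5-row frame is deterministic):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# made-up data: a single group 'a' with six ordered rows
toy_df = spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('a', 3, 30),
     ('a', 4, 40), ('a', 5, 50), ('a', 6, 60)],
    ['store', 'day', 'quantity'])

# current row plus the next 4 rows, ordered by 'day' within each group
w = (Window.partitionBy('store')
     .orderBy('day')
     .rowsBetween(Window.currentRow, Window.currentRow + 4))

toy_df.withColumn('aveg', F.avg('quantity').over(w)).show()
# day=1 averages days 1-5 -> 30.0; day=6 has only itself left -> 60.0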

Then I try to group by the same columns and select the maximum of those average values, like this:

grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))

However, when I debug, I see that the calculated maximum values are wrong. For specific instances, the retrieved max values do not even appear in the 'aveg' column.
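
One way to confirm that symptom is a left-anti join that keeps only aggregated rows whose (group, aveg) pair never occurs in the source data (a sketch; it assumes group_column_list is a list of column-name strings, and exact float comparison may flag rows that differ only by rounding):

# any row that survives this join has a max that is not present
# in the 'aveg' column of its own group
suspect = grouped_consecutive_df.join(
    consecutive_df.select(group_column_list + ['aveg']),
    on=group_column_list + ['aveg'],
    how='left_anti')
suspect.show()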

Am I taking a wrong approach here, or missing something trivial? Any comments are appreciated.

Questioner: berkin
berkin, 2020-12-04 07:25:21:

I could solve this with a workaround: before the aggregation, I mapped the maximum of the quantity averages onto a new column over the same grouping, and then selected one of the rows in each group.
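
A minimal sketch of that workaround, assuming group_column_list matches the columns used to build the window (the column name 'max_aveg' is mine):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# whole-group window (no frame bounds), so every row sees the maximum
# rolling average of its group
group_window = Window.partitionBy([F.col(x) for x in group_column_list])

grouped_consecutive_df = (
    consecutive_df
    .withColumn('max_aveg', F.max('aveg').over(group_window))
    # keep the rows whose rolling average equals the group maximum...
    .filter(F.col('aveg') == F.col('max_aveg'))
    # ...then select a single row per group
    .dropDuplicates(group_column_list)
)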