I calculate, for each row in a PySpark dataframe, the average of the quantities in a forward-looking window of 5 rows (the current row plus the next 4), partitioning over a group of columns:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

prep_df = ...
window = Window.partitionBy([F.col(x) for x in group_column_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))
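To make the intended window semantics concrete, here is a plain-Python sanity check of the same computation on toy data (the group/quantity values and the helper name are illustrative, not from the original dataframe; rows are assumed already ordered within each group):

```python
from collections import defaultdict

def forward_rolling_avg(rows, width=5):
    """For each (group, quantity) row, average that row's quantity with the
    quantities of up to the next `width - 1` rows of the same group."""
    by_group = defaultdict(list)
    for i, (g, _) in enumerate(rows):
        by_group[g].append(i)
    out = []
    for g, idxs in by_group.items():
        for pos, i in enumerate(idxs):
            window = [rows[j][1] for j in idxs[pos:pos + width]]
            out.append((g, rows[i][1], sum(window) / len(window)))
    return out

rows = [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)]
print(forward_rolling_avg(rows))
# first "a" row averages (1 + 2 + 3) / 3 = 2.0; first "b" row averages (10 + 20) / 2 = 15.0
```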
I then group by the same columns and select the maximum of the averaged values like this:
grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))
However, when I debug, I see that the calculated maximum values are wrong. For some groups, the retrieved max numbers do not even appear in the 'aveg' column.
I'd like to ask whether my approach is flawed or I am missing something trivial. Any comments are appreciated.
I could solve this with a workaround: before aggregating, I mapped the per-group maximum of the quantity averages into a new column, then selected one of the rows in each group.
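The workaround above can be sketched in plain Python on toy data to make the two steps explicit (the values are illustrative; in PySpark the same idea would use F.max('aveg') over a window partitioned by the group columns, followed by keeping one row per group):

```python
# Toy (group, aveg) rows standing in for consecutive_df.
rows = [
    ("a", 2.0), ("a", 2.5), ("a", 3.0),
    ("b", 15.0), ("b", 20.0),
]

# Step 1: compute the per-group max and "map" it onto each row as a new column.
group_max = {}
for g, aveg in rows:
    group_max[g] = max(group_max.get(g, float("-inf")), aveg)
with_max = [(g, aveg, group_max[g]) for g, aveg in rows]

# Step 2: select one row per group, here the first row whose aveg equals the max.
picked = {}
for g, aveg, mx in with_max:
    if g not in picked and aveg == mx:
        picked[g] = (g, aveg)

print(sorted(picked.values()))
```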