Having trouble retrieving max values in a pyspark dataframe

Posted on 2020-06-18 16:41:00

I calculate, for each row in a pyspark dataframe, the average of 'quantity' over a 5-row window (the current row and the 4 rows that follow), partitioned by a group of columns:

from pyspark.sql import functions as F
from pyspark.sql.window import Window  # needed for Window below

prep_df = ...

# frame spanning the current row and the next 4 rows within each partition
window = Window.partitionBy([F.col(x) for x in group_list]).rowsBetween(Window.currentRow, Window.currentRow + 4)
consecutive_df = prep_df.withColumn('aveg', F.avg(prep_df['quantity']).over(window))
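
For concreteness, here is a minimal sketch of the same pattern on made-up data (the 'store', 'day', and 'quantity' columns and values are assumptions, not from the original; I also add 'day' as an ordering column, which the snippet above does not show, so that the 5-row frame is deterministic):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# made-up data: a single group 'a' with six ordered rows
toy_df = spark.createDataFrame(
    [('a', 1, 10), ('a', 2, 20), ('a', 3, 30),
     ('a', 4, 40), ('a', 5, 50), ('a', 6, 60)],
    ['store', 'day', 'quantity'])

# current row plus the next 4 rows, ordered by 'day' within each group
w = (Window.partitionBy('store')
     .orderBy('day')
     .rowsBetween(Window.currentRow, Window.currentRow + 4))

toy_df.withColumn('aveg', F.avg('quantity').over(w)).show()
# day=1 averages days 1-5 -> 30.0; day=6 has only itself left -> 60.0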

Then I try to group by the same columns and select the maximum of those average values, like this:

grouped_consecutive_df = consecutive_df.groupBy(group_column_list).agg(F.max(consecutive_df['aveg']).alias('aveg'))

However, when I debug, I see that the calculated maximum values are wrong. For specific instances, the retrieved max values do not even appear in the 'aveg' column.
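
One way to confirm that symptom is a left-anti join that keeps only aggregated rows whose (group, aveg) pair never occurs in the source data (a sketch; it assumes group_column_list is a list of column-name strings, and exact float comparison may flag rows that differ only by rounding):

# any row that survives this join has a max that is not present
# in the 'aveg' column of its own group
suspect = grouped_consecutive_df.join(
    consecutive_df.select(group_column_list + ['aveg']),
    on=group_column_list + ['aveg'],
    how='left_anti')
suspect.show()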

Am I taking a wrong approach here, or missing something trivial? Any comments are appreciated.

Questioner: berkin
berkin, 2020-12-04 07:25:21:

I could solve this with a workaround: before the aggregation, I mapped the maximum of the quantity averages onto a new column over the same grouping, and then selected one of the rows in each group.
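
A minimal sketch of that workaround, assuming group_column_list matches the columns used to build the window (the column name 'max_aveg' is mine):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# whole-group window (no frame bounds), so every row sees the maximum
# rolling average of its group
group_window = Window.partitionBy([F.col(x) for x in group_column_list])

grouped_consecutive_df = (
    consecutive_df
    .withColumn('max_aveg', F.max('aveg').over(group_window))
    # keep the rows whose rolling average equals the group maximum...
    .filter(F.col('aveg') == F.col('max_aveg'))
    # ...then select a single row per group
    .dropDuplicates(group_column_list)
)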