apache spark - PySpark data skewness with Window Functions

jxc 2020-12-10 05:08:09

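To process a large partition, you can try splitting it based on the orderBy column (most likely a numeric column, or a date/timestamp column that can be converted to numeric) so that all new sub-partitions keep the correct row order. Process the rows with the new partitioner; for calculations that use the lag and lead functions, only the rows around the boundaries between sub-partitions need post-processing. (How to merge small partitions is also discussed below, in Task 2.)

Using your example sdf, and assuming we have the following WinSpec and a simple aggregate function: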

from pyspark.sql import functions as F, Window

w = Window.partitionBy('id').orderBy('timestamp')
sdf.withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w))

Task 1: split large partitions:

Try the following:

Select an N to split timestamp on, and set up an additional partitionBy column pid (using ceil, int, floor, etc.):

# N to cover 35-days' intervals
N = 24*3600*35
df1 = sdf.withColumn('pid', F.ceil(F.unix_timestamp('timestamp')/N))
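
To get a feel for how this bucketing behaves before running it on Spark, here is a minimal plain-Python sketch of the same arithmetic. The helper name `pid_for` and the sample dates are made up for illustration; it assumes UTC timestamps, whereas `F.unix_timestamp` follows the Spark session time zone.

```python
from datetime import datetime, timedelta, timezone

N = 24 * 3600 * 35  # same 35-day width as in the answer

def pid_for(ts: datetime, n: int = N) -> int:
    """Plain-Python analogue of F.ceil(F.unix_timestamp('timestamp') / N)."""
    secs = int(ts.timestamp())      # whole seconds since the epoch
    return -(-secs // n)            # integer ceiling division

# hypothetical sample timestamps, spaced to show bucket changes
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
for days in (0, 1, 35, 70):
    ts = start + timedelta(days=days)
    print(ts.date(), pid_for(ts))
```

Rows exactly N seconds apart always land in adjacent pid values, so the sub-partitions keep the timestamp order of the original partition.
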

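Add pid to partitionBy (see w1), then calculate row_number(), lag() and lead() over w1. Find the number of rows (cnt) in each new partition to help identify the end of each partition (rn == cnt). The resulting new_amt is correct for most rows, except those on the boundaries of each partition.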
```
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')
w2 = Window.partitionBy('id', 'pid')

df2 = df1.select(
    '*',
    F.count('*').over(w2).alias('cnt'),
    F.row_number().over(w1).alias('rn'),
    (F.lag('amt',1).over(w1) + F.lead('amt',1).over(w1)).alias('new_amt')
)
```
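Below is an example of df2 showing the boundary rows.
Process the boundaries: select the rows on the boundaries, rn in (1, cnt), plus the rows that hold values used in the calculation, rn in (2, cnt-1). Run the same new_amt calculation over w and keep the result for the boundary rows only.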
```
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
    .withColumn('new_amt', F.lag('amt',1).over(w) + F.lead('amt',1).over(w)) \
    .filter('rn in (1,cnt)')
```
Below is the df3 generated from the df2 above.

Merge df3 back into df2 to update the boundary rows rn in (1,cnt):

df_new = df2.filter('rn not in (1,cnt)').union(df3)

The screenshot below shows the final df_new around the boundary rows:

# drop columns which are used to implement logic only
df_new = df_new.drop('cnt', 'rn')
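
The steps above can be checked on a toy example without Spark. The following is only a sketch in plain Python, with lists standing in for DataFrames and integer timestamps; `lag_lead_sum` and `split_and_fix` are hypothetical helper names, not part of the answer. It recomputes lag(1) + lead(1) per sub-partition, repairs the boundary rows from the rows with rn in (1, 2, cnt-1, cnt), and compares the result with the unsplit calculation.

```python
def lag_lead_sum(values):
    """values[i-1] + values[i+1]; None when a neighbour is missing
    (mirrors Spark, where NULL + x is NULL)."""
    n = len(values)
    return [values[i - 1] + values[i + 1] if 0 < i < n - 1 else None
            for i in range(n)]

def split_and_fix(rows, n):
    """rows: list of (timestamp, amt) for ONE id, already sorted by timestamp."""
    # Step 1: bucket rows by pid = ceil(timestamp / n)
    parts = {}
    for ts, amt in rows:
        parts.setdefault(-(-ts // n), []).append(amt)

    # Step 2 (df2): per-sub-partition result together with rn and cnt
    df2 = []
    for pid in sorted(parts):
        amts = parts[pid]
        for rn, (amt, new_amt) in enumerate(zip(amts, lag_lead_sum(amts)), start=1):
            df2.append({'pid': pid, 'rn': rn, 'cnt': len(amts),
                        'amt': amt, 'new_amt': new_amt})

    # Step 3 (df3): recompute over the rows near the boundaries only
    near = [r for r in df2 if r['rn'] in (1, 2, r['cnt'] - 1, r['cnt'])]
    fixed = lag_lead_sum([r['amt'] for r in near])

    # Step 4: write back only the boundary rows (the union step in the answer)
    for r, new_amt in zip(near, fixed):
        if r['rn'] in (1, r['cnt']):
            r['new_amt'] = new_amt
    return [r['new_amt'] for r in df2]

rows = [(ts, ts * ts) for ts in range(1, 13)]   # made-up amounts
assert split_and_fix(rows, 5) == lag_lead_sum([amt for _, amt in rows])
```

The filtered set in step 3 is tiny compared with the full data (at most four rows per sub-partition), which is why recomputing over the original window w becomes affordable there.
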

Some notes:

The following 3 WindowSpecs are defined:

w = Window.partitionBy('id').orderBy('timestamp')          <-- fix boundary rows
w1 = Window.partitionBy('id', 'pid').orderBy('timestamp')  <-- calculate internal rows
w2 = Window.partitionBy('id', 'pid')                       <-- find #rows in a partition

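Note: strictly speaking, it is better to use the following w to fix the boundary rows, which avoids problems with tied timestamp values around the boundaries.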

w = Window.partitionBy('id').orderBy('pid', 'rn')          <-- fix boundary rows

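If you know which partitions are skewed, just split those and skip the others. If the rows are sparsely distributed, the existing method might split a small partition into 2 or even more partitions.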
```
df1 = sdf.withColumn('pid', F.when(F.col('id').isin('a','b'), F.ceil(F.unix_timestamp('timestamp')/N)).otherwise(1))
```
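If, for each partition, you can retrieve count (number of rows) and min_ts = min(timestamp), then try something more dynamic for pid (M below is the threshold number of rows above which a partition is split):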
```
F.expr(f"IF(count>{M}, ceil((unix_timestamp(timestamp)-unix_timestamp(min_ts))/{N}), 1)")
```
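
As a rough illustration of what that expression evaluates to, here is a plain-Python mirror of it. The function name `dynamic_pid` and the sample numbers are hypothetical, and timestamps are taken as integer seconds.

```python
def dynamic_pid(count, ts, min_ts, M, N):
    """Mirror of: IF(count > M, ceil((ts - min_ts) / N), 1)."""
    if count > M:
        return -(-(ts - min_ts) // N)   # integer ceiling division
    return 1

# made-up numbers: a partition of 5 rows with threshold M=3 and width N=10
for ts in (100, 101, 110, 111):
    print(ts, dynamic_pid(count=5, ts=ts, min_ts=100, M=3, N=10))
```

One detail this makes visible: a row whose timestamp equals min_ts gets ceil(0) = 0, so it sits in a sub-partition of its own kind (pid 0) ahead of the first full-width bucket. The boundary post-processing still covers it, but it is worth knowing when inspecting pid values.
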
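Note: for skewness inside a partition, a more complex function will be needed to generate pid.
If only the lag(1) function is used, just post-process the left boundaries: filter by rn in (1, cnt) and update only rn == 1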
```
df3 = df2.filter('rn in (1, cnt)') \
    .withColumn('new_amt', F.lag('amt',1).over(w)) \
    .filter('rn = 1')
```
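The same applies to the lead function, where we only need to fix the right boundaries and update rn == cnt.
If only lag(2) is used, then filter and update more rows with df3: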
```
df3 = df2.filter('rn in (1, 2, cnt-1, cnt)') \
    .withColumn('new_amt', F.lag('amt',2).over(w)) \
    .filter('rn in (1,2)')
```
You can extend the same method to mixed cases with both lag and lead having different offsets.
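
One way to read the three cases above as a single rule is sketched below. This generalization is my own inference from those cases and is not stated in the answer, so treat it as an assumption to verify on your data; the function names are hypothetical. For lag(a) combined with lead(b), keep the first and last a+b rows of each sub-partition for the recomputation, and write back only the first a and last b rows.

```python
def keep_for_recompute(rn, cnt, lag_n, lead_n):
    """Rows to keep in the df3 filter (assumed rule: head and tail of width lag_n + lead_n)."""
    k = lag_n + lead_n
    return rn <= k or rn > cnt - k

def needs_update(rn, cnt, lag_n, lead_n):
    """Rows whose per-sub-partition result is wrong and must be overwritten."""
    return rn <= lag_n or rn > cnt - lead_n

cnt = 10  # made-up sub-partition size
for lag_n, lead_n in [(1, 1), (1, 0), (2, 0)]:
    kept = [rn for rn in range(1, cnt + 1) if keep_for_recompute(rn, cnt, lag_n, lead_n)]
    upd = [rn for rn in range(1, cnt + 1) if needs_update(rn, cnt, lag_n, lead_n)]
    print((lag_n, lead_n), kept, upd)
```

For the offsets used in the answer this reproduces the same filters: (1, 2, cnt-1, cnt) with updates on (1, cnt) for lag(1) + lead(1), (1, cnt) with an update on rn = 1 for lag(1) alone, and (1, 2, cnt-1, cnt) with updates on (1, 2) for lag(2) alone.
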

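Task 2: merge small partitions:

Based on the number of records in a partition (count), we can set up a threshold M so that if count>M, the id holds its own partition; otherwise we merge partitions so that the # of total records is less than M (the method below has an edge case of up to 2*M-2).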

import pandas as pd

M = 20000

# create pandas df with columns `id`, `count` and `f`, sort rows so that rows with count>=M are located on top
d2 = pd.DataFrame([ e.asDict() for e in sdf.groupby('id').count().collect() ]) \
    .assign(f=lambda x: x['count'].lt(M)) \
    .sort_values('f')    

# add pid column to merge smaller partitions but the total row-count in partition should be less than or around M 
# potentially there could be at most `2*M-2` records for the same pid, to make sure strictly count<M, use a for-loop to iterate d2 and set pid:
d2['pid'] = (d2.mask(d2['count'].gt(M),M)['count'].shift(fill_value=0).cumsum()/M).astype(int)

# add pid to sdf. In case join is too heavy, try using Map
sdf_1 = sdf.join(spark.createDataFrame(d2).alias('d2'), ["id"]) \
    .select(sdf["*"], F.col("d2.pid"))

# check pid: # of records and # of distinct ids
sdf_1.groupby('pid').agg(F.count('*').alias('count'), F.countDistinct('id').alias('cnt_ids')).orderBy('pid').show()
+---+-----+-------+                                                             
|pid|count|cnt_ids|
+---+-----+-------+
|  0|74837|      1|
|  1|20036|    133|
|  2|20052|    134|
|  3|20010|    133|
|  4|15065|    100|
+---+-----+-------+
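
The code comment above mentions using a for-loop when the merged partitions must stay strictly below M. A minimal sketch of such a loop is shown here in plain Python on a dict of made-up counts; `assign_pids` is a hypothetical helper, and in practice the input would come from the collected `groupby('id').count()` result.

```python
def assign_pids(counts, M):
    """counts: dict id -> row count. Returns dict id -> pid.

    ids with count >= M get a pid of their own; smaller ids are packed
    greedily so that each merged pid holds strictly fewer than M rows.
    """
    pids = {}
    next_pid = 0
    open_pid = None   # merged pid currently being filled
    running = 0       # rows already placed in open_pid
    # large ids first, mirroring the sort in the answer
    for key, cnt in sorted(counts.items(), key=lambda kv: -kv[1]):
        if cnt >= M:
            pids[key] = next_pid
            next_pid += 1
            continue
        if open_pid is None or running + cnt >= M:
            open_pid = next_pid      # start a new merged partition
            next_pid += 1
            running = 0
        pids[key] = open_pid
        running += cnt
    return pids

# hypothetical counts per id
print(assign_pids({'a': 50, 'b': 4, 'c': 4, 'd': 3, 'e': 3, 'f': 2}, M=10))
```

Compared with the vectorized cumsum version, this trades a little speed on the driver for a hard upper bound on the merged partition size; since it only touches one row per id, that cost is usually small.
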

Now the new Window should be partitioned by pid alone, with id moved to orderBy, as shown below:

w3 = Window.partitionBy('pid').orderBy('id','timestamp')

Customize the lag/lead functions based on the above w3 WinSpec, and then calculate new_val:

lag_w3  = lambda col,n=1: F.when(F.lag('id',n).over(w3) == F.col('id'), F.lag(col,n).over(w3))
lead_w3 = lambda col,n=1: F.when(F.lead('id',n).over(w3) == F.col('id'), F.lead(col,n).over(w3))

sdf_new = sdf_1.withColumn('new_val', lag_w3('amt',1) + lead_w3('amt',1))
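
To see why the id check inside lag_w3 and lead_w3 matters, here is a small plain-Python sketch of the same guard on a merged partition. The helper name `guarded_lag_lead` and the sample rows are made up; the point is only that a neighbour belonging to a different id must be treated as missing.

```python
def guarded_lag_lead(rows):
    """rows: list of (id, timestamp, amt) that all share one pid.

    Returns (id, timestamp, new_val) where new_val = lag(amt) + lead(amt),
    taking a neighbour only when it has the same id as the current row.
    """
    rows = sorted(rows, key=lambda r: (r[0], r[1]))   # orderBy('id', 'timestamp')
    out = []
    for i, (rid, ts, _amt) in enumerate(rows):
        prev = rows[i - 1][2] if i > 0 and rows[i - 1][0] == rid else None
        nxt = rows[i + 1][2] if i + 1 < len(rows) and rows[i + 1][0] == rid else None
        new_val = prev + nxt if prev is not None and nxt is not None else None
        out.append((rid, ts, new_val))
    return out

# two small ids merged into the same pid (hypothetical data)
merged = [('x', 1, 10), ('x', 2, 20), ('x', 3, 30), ('y', 1, 5), ('y', 2, 6)]
for row in guarded_lag_lead(merged):
    print(row)
```

Without the guard, the last row of x would pick up the first amt of y as its lead value, which is exactly the cross-id leak the customized functions prevent.
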

Sreeram TP 2020-11-29 17:49:31

Thanks for the detailed answer. I will read through it and experiment on my side to make sure I understand it completely.

Sreeram TP 2020-11-29 17:50:45

I am wondering whether something similar can be done for other operations. For example max, min, or when (sql functions) applied over a windowspec?

jxc 2020-12-02 12:56:39

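@SreeramTP, the method I suggested is for handling very large partitions that cannot be loaded into any executor's memory and therefore cause OOM issues. If all the data is small except for 3-4 relatively large partitions, this is definitely overkill. As I mentioned in Notes item (2), we can add the pid variable only to those larger partitions, for example by setting a threshold on number_of_rows_in_partition and then finding min(timestamp), so that the sub-partitions can be cut in a way that reduces a potentially small first sub-partition, etc.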

jxc 2020-12-02 13:10:04

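Again, this answer offers more of an approach than a solution for all potential cases. You might have to make some manual adjustments to fit your real data. Any automatic setup comes at some cost, for example if you want to find N automatically based on the date range and the number of rows per partition, etc. When I think about skewness problems I tend to think in terms of big data and avoiding joins, groupby, etc., but your case may be different. I will check this around noon or in the evening.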

Sreeram TP 2020-12-03 07:34:13

This kind of solved my problem. Thanks.
