Tags: apache-spark, apache-spark-sql, scala

How to compute cumulative sum on multiple float columns?

Posted on 2020-04-03 23:43:48

I have a DataFrame with 100 float columns that I need to process in date order.

ID   Date         C1       C2 ....... C100
1     02/06/2019   32.09  45.06         99
1     02/04/2019   32.09  45.06         99
2     02/03/2019   32.09  45.06         99
2     05/07/2019   32.09  45.06         99

I need to compute the cumulative sum of C1 to C100 for each ID, ordered by date.

Target dataframe should look like this:

ID   Date         C1       C2 ....... C100
1     02/04/2019   32.09  45.06         99
1     02/06/2019   64.18  90.12         198
2     02/03/2019   32.09  45.06         99
2     05/07/2019   64.18  90.12         198

I want to achieve this without looping over C1 to C100 one column at a time.

Initial code for one column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// cumulative sum of a single column, per ID, ordered by date
val df1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID").orderBy(col("Date").asc)))

I found a similar question here, but it only does this manually for two columns: Cumulative sum in Spark
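For reference, here is a minimal sketch of a DataFrame matching the sample above, so the snippets below can be run as-is. The SparkSession name spark is an assumption, and only C1, C2 and C100 of the 100 columns are shown:

import spark.implicits._

// Sample data; in a real run Date would usually be parsed with to_date
// so that the ordering is chronological rather than lexical.
val df = Seq(
  (1, "02/06/2019", 32.09, 45.06, 99.0),
  (1, "02/04/2019", 32.09, 45.06, 99.0),
  (2, "02/03/2019", 32.09, 45.06, 99.0),
  (2, "05/07/2019", 32.09, 45.06, 99.0)
).toDF("ID", "Date", "C1", "C2", "C100")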

Questioner: Vikrant
Answer from blackbishop (2020-01-31 22:38):

Here is another way, using a simple select expression:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// running-sum frame: from the start of each ID partition up to the current row
val w = Window.partitionBy($"ID").orderBy($"Date".asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns

// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq

df.select(selectExpr:_*).show()

Gives:

+---+----------+-----+-----+----+                                               
| ID|      Date|   C1|   C2|C100|
+---+----------+-----+-----+----+
|  1|02/04/2019|32.09|45.06|  99|
|  1|02/06/2019|64.18|90.12| 198|
|  2|02/03/2019|32.09|45.06|  99|
|  2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
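As a design note, the same running sums could also be built by folding over the column list with withColumn. This is only a sketch, reusing the df, w and columnsToSum defined above:

// Alternative sketch: accumulate the cumulative-sum columns with foldLeft,
// replacing each column with its running sum over the window w.
val cumulative = columnsToSum.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, sum(col(c)).over(w))
}
cumulative.show()

Both versions produce the same result; with many columns, the single select expression tends to keep the query plan smaller than chaining a hundred withColumn calls.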