Warm tip: This article is reproduced from stackoverflow.com, please click
apache-spark apache-spark-sql scala

How to create multiples columns from a MapType columns efficiently (without foldleft)

发布于 2020-04-05 23:39:17

My goal is to create columns from another MapType column. The names of the columns being the keys of the Map and their associated values.

Below my starting dataframe:

+-----------+---------------------------+
|id         |         mapColumn         |
+-----------+---------------------------+
| 1         |Map(keyA -> 0, keyB -> 1)  |
| 2         |Map(keyA -> 4, keyB -> 2)  |
+-----------+---------------------------+

Below the desired output:

+-----------+----+----+
|id         |keyA|keyB|
+-----------+----+----+
| 1         |   0|   1|
| 2         |   4|   2|
+-----------+----+----+

I found a solution whith a Foldleft with accumulators (work but extremely slow):

val colsToAdd = startDF.collect()(0)(1).asInstanceOf[Map[String,Integer]].map(x => x._1).toSeq
res1: Seq[String] = List(keyA, keyB)

val endDF = colsToAdd.foldLeft(startDF)((startDF, key) => startDF.withColumn(key, lit(0)))

//(lit(0) for testing)

The real starting dataframe being enormous, I need optimization.

Questioner
Jalil Mankouri
Viewed
65
blackbishop 2020-02-02 20:31

You could simply use explode function to explode the map type column and then use pivot to get each key as new column. Something like this:

val df = Seq((1,Map("keyA" -> 0, "keyB" -> 1)), (2,Map("keyA" -> 4, "keyB" -> 2))
).toDF("id", "mapColumn")

df.select($"id", explode($"mapColumn"))
  .groupBy($"id")
  .pivot($"key")
  .agg(first($"value"))
  .show()

Gives:

+---+----+----+
| id|keyA|keyB|
+---+----+----+
|  1|   0|   1|
|  2|   4|   2|
+---+----+----+