Note: this post is translated from stackoverflow.com. Original: scala - How to create multiples columns from a MapType columns efficiently (without foldleft)
apache-spark apache-spark-sql scala

scala - How to efficiently create multiple columns from a MapType column (without foldLeft)

Posted on 2020-04-06 00:15:15

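My goal is to create columns from a MapType column: each new column is named after a key of the Map and holds the value associated with that key.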

Below is my starting DataFrame:

+-----------+---------------------------+
|id         |         mapColumn         |
+-----------+---------------------------+
| 1         |Map(keyA -> 0, keyB -> 1)  |
| 2         |Map(keyA -> 4, keyB -> 2)  |
+-----------+---------------------------+

Below is the desired output:

+-----------+----+----+
|id         |keyA|keyB|
+-----------+----+----+
| 1         |   0|   1|
| 2         |   4|   2|
+-----------+----+----+

I found a solution that uses foldLeft with an accumulator (it works, but it is very slow):

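import org.apache.spark.sql.functions.lit

// Read the map keys from the first row
val colsToAdd = startDF.collect()(0)(1).asInstanceOf[Map[String,Integer]].map(x => x._1).toSeq
// colsToAdd: Seq[String] = List(keyA, keyB)

// Add one column per key (lit(0) is a placeholder for testing)
val endDF = colsToAdd.foldLeft(startDF)((startDF, key) => startDF.withColumn(key, lit(0)))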

The real starting DataFrame is large, so I need to optimize this.


Asked by: Jalil Mankouri
Views: 94
Answer by blackbishop, 2020-02-02 20:31

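You can simply use the explode function to expand the map-type column into one row per key/value pair, then use pivot to turn each key into a new column. Like this: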

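import org.apache.spark.sql.functions.{explode, first}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  (1, Map("keyA" -> 0, "keyB" -> 1)),
  (2, Map("keyA" -> 4, "keyB" -> 2))
).toDF("id", "mapColumn")

// explode yields one (key, value) row per map entry; pivot turns each key into a column
df.select($"id", explode($"mapColumn"))
  .groupBy($"id")
  .pivot($"key")
  .agg(first($"value"))
  .show()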

This gives:

+---+----+----+
| id|keyA|keyB|
+---+----+----+
|  1|   0|   1|
|  2|   4|   2|
+---+----+----+
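
As a complement to the explode/pivot answer above, here is a minimal sketch of another common pattern: gather the distinct map keys once, then build all the key columns in a single select with getItem, instead of one withColumn call per key. This is not from the original answer. The names MapToColumns and expandMap are hypothetical, the local SparkSession is only for illustration, and map_keys assumes Spark 2.3 or later. Whether it is faster than pivot on a given dataset is something to measure rather than assume.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, map_keys}

// Hypothetical helper object, for illustration only.
object MapToColumns {

  // Expands the map column `mapCol` into one column per distinct key.
  // Rows whose map lacks a given key get null in that column.
  def expandMap(df: DataFrame, mapCol: String): DataFrame = {
    // Collect only the distinct keys to the driver, not the rows themselves.
    val keys: Seq[String] = df
      .select(explode(map_keys(col(mapCol))).as("key"))
      .distinct()
      .collect()
      .map(_.getString(0))
      .sorted
      .toSeq

    // One projection for all keys, instead of chaining withColumn per key.
    val keyCols = keys.map(k => col(mapCol).getItem(k).as(k))
    val otherCols = df.columns.filter(_ != mapCol).map(col).toSeq
    df.select(otherCols ++ keyCols: _*)
  }

  def main(args: Array[String]): Unit = {
    // Local session for the example; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-to-columns")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, Map("keyA" -> 0, "keyB" -> 1)),
      (2, Map("keyA" -> 4, "keyB" -> 2))
    ).toDF("id", "mapColumn")

    expandMap(df, "mapColumn").show()
    spark.stop()
  }
}
```

If the keys are already known, the key-collection step can be skipped entirely by passing them in directly. Likewise, for the pivot approach, pivot accepts an explicit list of values as a second argument, which spares Spark the extra pass it otherwise runs to find the distinct pivot values.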