Source: stackoverflow.com. Original question: apache spark - How to read from a csv file to create a scala Map object?

apache-spark scala rdd

apache spark - How to read from a csv file to create a Scala Map object?

Posted on 2020-03-27 11:47:12

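I have a path to a CSV file that I'd like to read. The CSV has three columns: "Topic, Key, Value". I am using Spark to read this file as a CSV file. The file looks like the following (lookupFile.csv):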

Topic,Key,Value
fruit,aaa,apple
fruit,bbb,orange
animal,ccc,cat
animal,ddd,dog

// I'm reading the file as follows (spark is an existing SparkSession)
val lookup = spark.read.option("delimiter", ",").option("header", "true").csv(lookupFile)

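I'd like to take what I just read and return a map with the following properties:

The map uses the topic as its key
The value for each topic is a map built from the "Key" and "Value" columns

I'm hoping to get a map that looks like the following: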

val result = Map("fruit" -> Map("aaa" -> "apple", "bbb" -> "orange"),
                 "animal" -> Map("ccc" -> "cat", "ddd" -> "dog"))

Any ideas on how I can do this?

Asked by: jjaguirre394

Views: 169

mikeL 2019-07-03 23:21

Read in your data

val df = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(path)
import spark.implicits._ // enables the $"col" syntax and .as[...] used in the next steps

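First put "Key, Value" into an array and group by Topic, so that each topic ends up with a list of two-element arrays holding a key part and a value part.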

import org.apache.spark.sql.functions.{array, collect_list}
val df2 = df.groupBy("Topic").agg(collect_list(array($"Key", $"Value")).as("arr"))
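To make the shape of the "arr" column concrete, here is a minimal plain-Scala sketch of the same grouping on in-memory rows. It does not use Spark. The object name `GroupingSketch` and the hard-coded rows are illustrative assumptions taken from the sample file in the question, not part of the original answer.

```scala
// Plain-Scala model of groupBy("Topic").agg(collect_list(array(Key, Value))).
// No Spark involved; the rows mirror the sample lines of lookupFile.csv.
object GroupingSketch {
  // (Topic, Key, Value) rows, as they appear in the question's CSV.
  val rows: Seq[(String, String, String)] = Seq(
    ("fruit", "aaa", "apple"),
    ("fruit", "bbb", "orange"),
    ("animal", "ccc", "cat"),
    ("animal", "ddd", "dog")
  )

  // Group by topic, then keep each row as a two-element sequence,
  // which mirrors the array(Key, Value) values collected into "arr".
  val grouped: Map[String, Seq[Seq[String]]] =
    rows.groupBy(_._1).map { case (topic, rs) =>
      topic -> rs.map { case (_, k, v) => Seq(k, v) }
    }

  def main(args: Array[String]): Unit =
    grouped.foreach { case (topic, arr) => println(s"$topic -> $arr") }
}
```

One difference worth keeping in mind: Scala's `groupBy` preserves the input order inside each group, whereas Spark documents `collect_list` as non-deterministic in the order of collected elements.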

Now convert it to a Dataset

val ds = df2.as[(String, Seq[Seq[String]])]

Apply the logic to the fields to build the inner map, then collect

val ds1 = ds.map(x => (x._1, x._2.map(y => (y(0), y(1))).toMap)).collect
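The inner step, turning each two-element sequence into a (key, value) pair and calling `toMap`, can be tried in isolation without Spark. The sketch below is only an illustration; the names `PairsToMapSketch` and `toLookup` are hypothetical helpers, not part of the original answer.

```scala
// Plain-Scala sketch of the per-topic conversion from Seq[Seq[String]] to a Map.
object PairsToMapSketch {
  // Turn each two-element sequence into a (key, value) tuple, then build a Map.
  // Assumes every inner sequence has at least two elements.
  def toLookup(arr: Seq[Seq[String]]): Map[String, String] =
    arr.map(p => (p(0), p(1))).toMap

  // The "fruit" group from the sample file.
  val fruit: Map[String, String] =
    toLookup(Seq(Seq("aaa", "apple"), Seq("bbb", "orange")))

  // When a key repeats, toMap keeps the pair that comes last in the sequence.
  val withDuplicate: Map[String, String] =
    toLookup(Seq(Seq("aaa", "apple"), Seq("aaa", "apricot")))

  def main(args: Array[String]): Unit = {
    println(fruit)
    println(withDuplicate)
  }
}
```

If a topic can contain the same key more than once, only one value survives in the resulting map, so it may be worth deduplicating the lookup file beforehand.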

Now your data is set up with Topic as the key and the "Key, Value" map as the value, so apply toMap to the collected array to get your result

ds1.toMap

Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))
