This article is reproduced from serverfault.com.

Other - Scala project export files

Published 2020-11-28 23:50:06

I have the following query:

val CubeData = spark.sql("""SELECT gender, department, count(bibno) AS count FROM borrowersTable, loansTable WHERE borrowersTable.bid = loansTable.bid GROUP BY gender, department WITH CUBE ORDER BY gender, department""")

I want to export four files, each with specific data and a specific name:

File1: gender and department; file name gender_departments
File2: gender only; file name gender_null
File3: department only; file name department_null
File4: neither (null); file name null_null

These files are the result of the SQL query (with CUBE).

I tried the following:

val df1 = CubeData.withColumn("combination",concat(col("gender") ,lit(","), col("department")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("final")

But I got more than four files - one per gender-department combination - and the file names are random. Is there a way to choose the names of these files?
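(Editorial note on the naming part of the question: Spark's CSV writer always emits files with generated `part-*` names. A common workaround, not mentioned in this thread, is to write each DataFrame to its own temporary directory with `coalesce(1)` and then move the single part file to the name you want. A minimal, Spark-free sketch of that rename step, where `renamePartFile` is a hypothetical helper:)

```scala
import java.nio.file.{Files, Paths, StandardCopyOption}
import scala.jdk.CollectionConverters._

// Hypothetical helper: after df.coalesce(1).write.csv(tmpDir) has produced a
// single part-*.csv file in tmpDir, move it to the file name you actually want.
def renamePartFile(tmpDir: String, target: String): Unit = {
  val part = Files.list(Paths.get(tmpDir)).iterator().asScala
    .find(_.getFileName.toString.startsWith("part-"))
    .getOrElse(sys.error(s"no part file found in $tmpDir"))
  Files.move(part, Paths.get(target), StandardCopyOption.REPLACE_EXISTING)
}
```

For example, `renamePartFile("final_tmp", "gender_departments.csv")` after the `save` call. For files on HDFS rather than a local disk, the same idea applies using Hadoop's `FileSystem.rename`.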

Questioner: nubie
Viewed: 11
mck 2020-11-29 15:46:07

Maybe it's a bug in Spark - there's nothing wrong with your query, but the query below seems to work. You don't need to qualify columns with table names if the column names are unique.

val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")

There also seem to be some problems with how the files are parsed; try the following:

val borrowersDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("BORROWERS.txt")
borrowersDF.createOrReplaceTempView("borrowersTable")
val loansDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("LOANS.txt")
loansDF.createOrReplaceTempView("loansTable")

val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")