I have the following query:
val CubeData = spark.sql("""
  SELECT gender, department, count(bibno) AS count
  FROM borrowersTable, loansTable
  WHERE borrowersTable.bid = loansTable.bid
  GROUP BY gender, department WITH CUBE
  ORDER BY gender, department
""")
I want to export 4 files, each with specific data and a specific name:
- File1: gender and department; file name gender_departments
- File2: gender only; file name gender_null
- File3: department only; file name department_null
- File4: the grand total (null); file name null_null
These files are the results of the SQL query (WITH CUBE).
I tried the following:
val df1 = CubeData.withColumn("combination", concat(col("gender"), lit(","), col("department")))
df1.coalesce(1).write.partitionBy("combination").format("csv").option("header", "true").mode("overwrite").save("final")
But I got more than 4 files, one per gender/department combination, and the names of these files are random. Is it possible to choose the names of these files?
Maybe it's a bug in Spark, and there's nothing wrong with your query, but the query below seems to work fine. There is no need to qualify the columns with table names if the column names are unique.
val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")
But there seems to be a problem with how the files are parsed; try the following:
val borrowersDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("BORROWERS.txt")
borrowersDF.createOrReplaceTempView("borrowersTable")
val loansDF = spark.read.format("csv").option("delimiter", "|").option("header", "true").option("inferSchema", "true").load("LOANS.txt")
loansDF.createOrReplaceTempView("loansTable")
val CubeData = spark.sql ("""
SELECT gender, department, count(bibno) AS count
FROM borrowersTable
JOIN loansTable USING(bid)
GROUP BY gender, department WITH CUBE
ORDER BY gender, department
""")