Scala exception: features must be of type org.apache.spark.ml.linalg.VectorUDT

Oli 2020-02-01 00:12

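Basically, you are trying to split your dataset into training and test sets, assemble the features, run a PCA, and then feed the result to a classifier to make predictions. The overall logic is correct, but there are a few problems in your code.

PCA in Spark needs assembled features, that is, a single vector column. You created an assembler but never use it in your code.
You named the assembler's output features, and there is already a column with that name. Since you never use the assembler, you do not see an error, but if you did use it you would get this exception: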

java.lang.IllegalArgumentException: Output column features already exists.

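When running a classifier, you need to specify at least the input features with setFeaturesCol and the label you are trying to learn with setLabelCol. You did not specify the label, and by default the label column is "label". You have no column with that name, hence the exception thrown by Spark.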
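To make the point about default column names concrete, here is a minimal sketch. The object name `DefaultColumns` is only illustrative; the getters and setters are part of Spark ML's `LogisticRegression`. It compares an estimator left at its defaults with one configured for the column names used in this answer.

```scala
import org.apache.spark.ml.classification.LogisticRegression

object DefaultColumns {
  // A freshly constructed estimator carries Spark's default column names.
  val defaults = new LogisticRegression()

  // Explicit names, needed whenever the DataFrame uses other column names.
  val configured = new LogisticRegression()
    .setFeaturesCol("pcaFeatures")
    .setLabelCol("y")

  def main(args: Array[String]): Unit = {
    println(s"default features column:    ${defaults.getFeaturesCol}")
    println(s"default label column:       ${defaults.getLabelCol}")
    println(s"configured features column: ${configured.getFeaturesCol}")
    println(s"configured label column:    ${configured.getLabelCol}")
  }
}
```

If the training DataFrame has no column matching the names the estimator reports, fitting fails, which is the situation described above.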

Here is a working example of what you are trying to do.

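// Imports needed by this example. `spark` is assumed to be the SparkSession
// provided by spark-shell; its implicits enable the 'column syntax.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.functions._
import spark.implicits._

// a funky dataset with 3 features (`x1`, `x2`, `x3`) and a label `y`,
// the class we are trying to predict.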
val dataset = spark.range(10)
    .select('id as "x1", rand() as "x2", ('id * 'id) as "x3")
    .withColumn("y", (('x2 * 3 + 'x1) cast "int").mod(2))
    .cache()

// splitting the dataset, that part was ok ;-)
val Array(train, test) = dataset
      .randomSplit(Array(0.7, 0.3), seed = 1234L)
      .map(_.cache())

// An assembler, the output name cannot be one of the inputs.
val assembler = new VectorAssembler()
    .setInputCols(Array("x1", "x2", "x3"))
    .setOutputCol("features")

// A pca, that part was ok as well
val pca = new PCA()
    .setInputCol("features")
    .setK(2)
    .setOutputCol("pcaFeatures")

// A LogisticRegression classifier. (KNN is not part of spark's standard API, but
// requires the same minimum information: features and label)
val classifier = new LogisticRegression()
    .setFeaturesCol("pcaFeatures")
    .setLabelCol("y")

// And the full pipeline
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
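
The fitted pipeline can then be applied to the held-out split. The following is a minimal sketch rather than part of the original answer: the object name `PipelineSketch`, the `run` method, the row count of 100, the fixed `rand` seed and the local master setting are all illustrative assumptions. It rebuilds the same three stages and returns the predictions for the test rows.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, rand}

object PipelineSketch {
  // Fits assembler -> PCA -> classifier on the training split and
  // returns the predictions for the held-out split.
  def run(spark: SparkSession): DataFrame = {
    // Same synthetic columns as in the example above (x1, x2, x3, y).
    val dataset = spark.range(100)
      .select(
        col("id").as("x1"),
        rand(42L).as("x2"),
        (col("id") * col("id")).as("x3"))
      .withColumn("y", (col("x2") * 3 + col("x1")).cast("int").mod(2))

    val Array(train, test) = dataset.randomSplit(Array(0.7, 0.3), seed = 1234L)

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))
      .setOutputCol("features")
    val pca = new PCA()
      .setInputCol("features")
      .setK(2)
      .setOutputCol("pcaFeatures")
    val classifier = new LogisticRegression()
      .setFeaturesCol("pcaFeatures")
      .setLabelCol("y")

    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](assembler, pca, classifier))

    // transform() re-applies the assembler and the fitted PCA to the test
    // rows before the classifier adds its "prediction" column.
    pipeline.fit(train).transform(test)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, adjust for a cluster
      .appName("pca-pipeline-sketch")
      .getOrCreate()
    run(spark).select("y", "pcaFeatures", "prediction").show(5, truncate = false)
    spark.stop()
  }
}
```

Because the whole chain lives in one pipeline, the test data goes through exactly the same assembling and projection steps as the training data.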

salma R 2020-02-01 00:09:57

OK, I tried it, but I get the error "not found: value rand".

Oli 2020-02-01 00:13:49

Add import org.apache.spark.sql.functions._

salma R 2020-02-01 00:27:49

Thank you. But now I get: Field "x1" does not exist.

salma R 2020-02-01 03:16:03

Is it necessary to convert the file to LibSVM format?

Oli 2020-02-01 18:10:34

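About x1: I know it does not exist in your data. It is just an example. You have to replace x1, x2, x3 with your own feature columns, and y with your label column.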
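
One way to make that substitution explicit is to pass the column names in as parameters. This is only a sketch: `OwnColumns`, `buildPipeline` and the intermediate column name `assembledFeatures` are hypothetical names chosen for illustration, not something from your code or from Spark.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{PCA, VectorAssembler}

object OwnColumns {
  // Hypothetical helper: builds the assembler -> PCA -> classifier pipeline
  // for whatever numeric feature columns and label column the data has.
  def buildPipeline(featureCols: Array[String], labelCol: String, k: Int): Pipeline = {
    // The output name must not collide with any existing column.
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("assembledFeatures")
    val pca = new PCA()
      .setInputCol("assembledFeatures")
      .setK(k)
      .setOutputCol("pcaFeatures")
    val classifier = new LogisticRegression()
      .setFeaturesCol("pcaFeatures")
      .setLabelCol(labelCol)
    new Pipeline().setStages(Array[PipelineStage](assembler, pca, classifier))
  }
}
```

With your own DataFrame you would then call something like OwnColumns.buildPipeline(yourFeatureColumns, yourLabelColumn, 2).fit(train), where the two column arguments are placeholders for the names that actually appear in your schema.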
