Exception: features must be of type org.apache.spark.ml.linalg.VectorUDT

Oli 2020-02-01 00:12

Basically, you are trying to split the dataset into training and test, assemble features, run a PCA and then a classifier to predict something. The overall logic is correct but there are several problems with your code.

A PCA in Spark needs assembled features. You created an assembler but you never use it in the code.
You named the output of the assembler features, and you already have a column with that name. Since the assembler is never used, you don't see an error, but if you did use it you would get this exception:

java.lang.IllegalArgumentException: Output column features already exists.

When running a classification, you need to specify at the very least the input features with setFeaturesCol and the label you are trying to learn with setLabelCol. You did not specify the label, and by default the label column is "label". You don't have any column with that name, hence the exception Spark throws at you.
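
If you want to check those defaults yourself, here is a minimal sketch. It only constructs the estimator and reads its parameters, so it does not touch your data; the variable name lr is just for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// A classifier with no columns configured falls back to the default names.
val lr = new LogisticRegression()

println(lr.getFeaturesCol) // prints "features"
println(lr.getLabelCol)    // prints "label"
```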

Here is a working example of what you are trying to do.

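// Imports needed by this example. In spark-shell, `spark` and
// spark.implicits._ (used for the 'col syntax) are already in scope.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.functions._

// a funky dataset with 3 features (`x1`, `x2`, `x3`) and a label `y`,
// the class we are trying to predict.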
val dataset = spark.range(10)
    .select('id as "x1", rand() as "x2", ('id * 'id) as "x3")
    .withColumn("y", (('x2 * 3 + 'x1) cast "int").mod(2))
    .cache()

// splitting the dataset, that part was ok ;-)
val Array(train, test) = dataset
      .randomSplit(Array(0.7, 0.3), seed = 1234L)
      .map(_.cache())

// An assembler, the output name cannot be one of the inputs.
val assembler = new VectorAssembler()
    .setInputCols(Array("x1", "x2", "x3"))
    .setOutputCol("features")

// A pca, that part was ok as well
val pca = new PCA()
    .setInputCol("features")
    .setK(2)
    .setOutputCol("pcaFeatures")

// A LogisticRegression classifier. (KNN is not part of spark's standard API, but
// requires the same minimum information: features and label)
val classifier = new LogisticRegression()
    .setFeaturesCol("pcaFeatures")
    .setLabelCol("y")

// And the full pipeline
val pipeline = new Pipeline().setStages(Array(assembler, pca, classifier))
val model = pipeline.fit(train)
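
Once the pipeline is fitted, you would typically score the held-out split. The following is only a sketch of one way to do that: the helper name accuracyOn is hypothetical, the choice of accuracy as the metric is my assumption, and it relies on the label column being "y" and on the classifier writing its default "prediction" column, as in the pipeline above.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.DataFrame

// Hypothetical helper: applies the fitted pipeline to a held-out set
// and returns the fraction of rows whose prediction matches the label.
// Assumes the label column is "y" and the prediction column is "prediction".
def accuracyOn(model: PipelineModel, test: DataFrame): Double = {
  // transform runs every stage: assembler, PCA, then the classifier
  val predictions = model.transform(test)
  new MulticlassClassificationEvaluator()
    .setLabelCol("y")
    .setPredictionCol("prediction")
    .setMetricName("accuracy")
    .evaluate(predictions)
}
```

With the values defined above you would call it as accuracyOn(model, test). With only 10 rows in the toy dataset the test split is tiny, so do not read much into the number.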

salma R 2020-02-01 00:09:57

OK, I tried it but got the error "not found: value rand".

Oli 2020-02-01 00:13:49

Add import org.apache.spark.sql.functions._

salma R 2020-02-01 00:27:49

Thanks, but now I get: Field "x1" does not exist.

salma R 2020-02-01 03:16:03

Is it necessary to convert the file to LibSVM?

Oli 2020-02-01 18:10:34

About x1, I know it does not exist in your data. It was just an example. You have to replace x1, x2, x3 with your features, and y with your label.
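
A minimal sketch of that substitution, in case it helps. The helper name assemblerFor and the output name "assembled" are hypothetical, and it assumes every column other than the label is numeric. VectorAssembler works on ordinary DataFrame columns, so the data can come from CSV, Parquet or any other source Spark can read into such columns.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Hypothetical helper: builds an assembler over every column of `df`
// except the label. Assumes all remaining columns are numeric.
def assemblerFor(df: DataFrame,
                 labelCol: String,
                 outputCol: String = "assembled"): VectorAssembler = {
  val featureCols = df.columns.filter(_ != labelCol)
  // The output name must not clash with an existing column, otherwise
  // Spark fails with "Output column ... already exists."
  require(!df.columns.contains(outputCol),
    s"Column '$outputCol' already exists, pick another output name")
  new VectorAssembler()
    .setInputCols(featureCols)
    .setOutputCol(outputCol)
}
```

You would then use assemblerFor(train, "y") in place of the assembler in the pipeline and point the PCA at it with setInputCol("assembled"). If your DataFrame still contains a non-numeric column (such as your existing features column), drop it first or filter it out of the input list.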
