
How can I normalize full-width characters in a column in Spark (Scala)?

Posted on 2020-12-02 05:52:46

I have a column in a dataframe that has full-width and half-width characters. I want to normalize the column to half-width characters but I'm not sure how it's done.

I'm trying to do this:

var normalized = df.withColumn("DomainNormalized",col(Normalizer.normalize($"Domain".toString(), Normalizer.Form.NFKC)))

I'm hoping that this would change this domain: @nlｂ.com (note that the b is a full-width character) to @nlb.com, but the column created is not normalized.

How can I change the column content, or derive a new column on the dataframe, with the Java Normalizer?

Questioner
Samuel Lopez
Wu.dior 2020-12-02 18:40:49

Use a UDF like this:

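....
import java.text.Normalizer
import org.apache.spark.sql.functions.udf
import sparkSession.implicits._

val rdd = sc.makeRDD(List("@nl 1.com"))
val df = rdd.toDF("domain")
// NFKC folds full-width forms into their regular equivalents; nulls pass through unchanged
val norm = (arg: String) =>
  if (arg == null) null else Normalizer.normalize(arg, Normalizer.Form.NFKC)
val normalizer = udf(norm)
// Apply the UDF row by row to derive the normalized column
val df2 = df.withColumn("domain2", normalizer(df.col("domain")))
df2.select("domain2").show()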
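The attempt in the question calls Normalizer.normalize once on the driver, on the string form of the column expression, so it never sees the row values. A UDF moves the call into per-row execution. To check what NFKC does to a full-width character without starting Spark, a minimal plain-Scala sketch might look like the following. The object name WidthNormalization and the helper normalizeWidth are made up for illustration, and the sample string spells the full-width b as the escape \uFF42.

```scala
import java.text.Normalizer

object WidthNormalization {
  // Hypothetical helper: null-safe so the same function can later be wrapped in a UDF.
  def normalizeWidth(s: String): String =
    if (s == null) null else Normalizer.normalize(s, Normalizer.Form.NFKC)

  def main(args: Array[String]): Unit = {
    // \uFF42 is FULLWIDTH LATIN SMALL LETTER B
    val fullWidth = "@nl\uFF42.com"

    println(normalizeWidth(fullWidth)) // prints @nlb.com
    // isNormalized reports whether a string is already in the given form
    println(Normalizer.isNormalized(fullWidth, Normalizer.Form.NFKC)) // prints false
  }
}
```

If this helper is on the classpath of the Spark job, it should be possible to register it with udf(WidthNormalization.normalizeWidth _) instead of the inline lambda. Note that NFKC also maps half-width katakana to their regular forms, so "normalize to half-width" only describes what happens to full-width ASCII-range characters.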