This article is reproduced from stackoverflow.com.
apache-spark apache-spark-mllib machine-learning performance scoring

Machine learning with Spark: data preparation performance problem, MLeap

Published on 2020-03-27 10:21:41

I found a lot of good answers about MLeap, a library that allows fast scoring. It works from a model that has been converted into an MLeap bundle.

But what about the data preparation stage before scoring?

Is there an effective way to convert a Spark ML data preparation pipeline (which works during training, but only inside the Spark framework) into robust, performant, optimized bytecode?

Questioner: Vladimir Nabokov
Viewed: 80
Marsellus Wallace 2019-08-06 23:43

You can easily serialize your entire PipelineModel (containing both the feature engineering stages and the trained model) with MLeap.
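
To make "entire PipelineModel" concrete, here is a minimal sketch of a Spark ML pipeline in which the data preparation steps and the model are stages of one fitted object. The column names (`category`, `amount`, `label`), the toy rows, and the `PipelineSketch` object are all hypothetical and not taken from the question; the stage classes are standard Spark ML ones.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Hypothetical toy training data, only for illustration.
  def toyData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("a", 1.0, 0.0),
      ("b", 2.0, 1.0),
      ("a", 3.0, 0.0),
      ("b", 4.0, 1.0)
    ).toDF("category", "amount", "label")
  }

  // Fit feature engineering and the model together as one PipelineModel.
  def fit(trainData: DataFrame): PipelineModel = {
    // Data preparation stage 1: encode the string column as a numeric index.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    // Data preparation stage 2: assemble the numeric columns into a feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("categoryIndex", "amount"))
      .setOutputCol("features")
    // Final stage: the estimator that produces the trained model.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    new Pipeline().setStages(Array(indexer, assembler, lr)).fit(trainData)
  }
}
```

Because the indexer and assembler are stages of the fitted `PipelineModel`, whatever serializes that object carries the data preparation along with the model, so there is no separate preparation step to port.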

NOTE: The following code is a bit old, and you probably have access to a cleaner API now.

// MLeap: serialize the PipelineModel into a single .zip file
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}

// MLeap: deserialize the pipeline from the local filesystem (no Spark dependency)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
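
Once the bundle is loaded, scoring means building a LeapFrame in memory and passing it through the pipeline. The sketch below is one way this might look, not a definitive implementation. It assumes a recent MLeap release where `DefaultLeapFrame`, `Row`, and `Transformer` live in `ml.combust.mleap.runtime.frame` (older releases used a different package), and the input columns `category` and `amount` and the `ScoringSketch` object are hypothetical.

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row, Transformer}
import resource._

object ScoringSketch {
  // Hypothetical input schema; it must match the raw columns the pipeline was fitted on.
  val schema: StructType = StructType(
    StructField("category", ScalarType.String),
    StructField("amount", ScalarType.Double)
  ).get

  // Load the bundle once and reuse the resulting transformer for every request.
  def load(modelPath: String): Transformer =
    (for (bf <- managed(BundleFile(s"jar:file:$modelPath"))) yield {
      bf.loadMleapBundle().get.root
    }).tried.get

  // Run data preparation and the model on in-memory rows, without a SparkSession.
  def score(pipeline: Transformer, rows: Seq[Row]): Seq[Row] =
    pipeline.transform(DefaultLeapFrame(schema, rows)).get.dataset
}
```

A call such as `ScoringSketch.score(ScoringSketch.load(modelPath), Seq(Row("a", 1.5)))` would then return the input rows extended with whatever output columns the pipeline stages add.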

Be aware that the tricky part comes if you define your own Estimators/Transformers in Spark: each one needs a corresponding MLeap implementation as well.