Machine learning with Spark, data preparation performance problem, MLeap

Marsellus Wallace 2019-08-06 23:43

You can easily serialize your entire PipelineModel (both the feature-engineering stages and the trained model) with MLeap.

NOTE: The following code is a bit old, and a cleaner API is probably available now.

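// MLeap: serialize the fitted PipelineModel into a single .zip file (runs inside Spark)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._  // scala-arm, provides managed(...)

val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}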

// MLeap: deserialize the model from the local filesystem (no Spark dependency)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
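
Once loaded, the pipeline can score data without a SparkSession by transforming a LeapFrame. The following is only a rough sketch: the model path, the column names (`text`, `amount`) and the sample rows are hypothetical placeholders, and the package layout (`ml.combust.mleap.runtime.frame`) assumes a reasonably recent MLeap release, so older versions may need different imports.

```scala
import ml.combust.bundle.BundleFile
import ml.combust.mleap.core.types._
import ml.combust.mleap.runtime.MleapSupport._
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
import resource._

// Hypothetical path; point this at the .zip written during serialization.
val modelPath = "/tmp/pipeline-model.zip"

val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:$modelPath"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get

// Hypothetical input schema; it must match the input columns of the Spark pipeline.
val schema = StructType(
  StructField("text", ScalarType.String),
  StructField("amount", ScalarType.Double)
).get

val frame = DefaultLeapFrame(schema, Seq(Row("hello", 0.6), Row("MLeap", 0.2)))

// transform returns a Try; .get rethrows on failure (for example, a missing input column).
val scored = mleapPipeline.transform(frame).get
scored.dataset.foreach(println)
```

Because the whole pipeline runs in-process on a small in-memory frame, the data preparation steps no longer pay Spark's per-job overhead at prediction time.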

Be aware that the tricky part is custom Estimators/Transformers: if you define your own in Spark, each one needs a corresponding MLeap implementation as well.
