I found a lot of good responses about MLeap, a library that allows fast scoring. It works on the basis of a model converted into an MLeap bundle.
But what about the data preparation stage before scoring?
Is there an effective approach to convert a Spark ML data preparation pipeline (which works during training, but inside the Spark framework) into robust, performant, optimized byte code?
You can easily serialize your entire PipelineModel (containing both the feature engineering stages and the trained model) with MLeap.
NOTE: The following code is a bit old; you probably have access to a cleaner API now.
// MLeap PipelineModel serialization into a single .zip file
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._

// Fitting the context on the transformed dataset captures the schema
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}
// MLeap code: deserialize the model from the local filesystem (without any Spark dependency)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._

val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
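Once deserialized, the pipeline can score plain in-memory rows through MLeap's LeapFrame, with no Spark involved. A rough sketch, assuming a pipeline whose input is a single double column named "feature" (the column name and schema are placeholders — adjust both to match what your own pipeline expects):

```scala
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}

// Input schema must match the pipeline's expected input columns
val schema = StructType(StructField("feature", ScalarType.Double)).get
val frame = DefaultLeapFrame(schema, Seq(Row(1.0)))

// transform runs the whole pipeline (feature prep + model) in-process
val scored = mleapPipeline.transform(frame).get
val predictions = scored.select("prediction").get.dataset.map(_.getDouble(0))
```

Because this runs entirely in the MLeap runtime, it is suitable for low-latency scoring inside a plain JVM service.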
Be aware that the tricky part is custom Estimators/Transformers: if you define your own in Spark, each one will need a corresponding MLeap implementation as well.