Source: stackoverflow.com, "Machine learning with spark, data preparation performance problem, MLeap"

apache-spark apache-spark-mllib machine-learning performance scoring

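Machine learning with Spark, data preparation performance problem, MLeap

Posted on 2020-03-27 11:00:45

I have found a lot of good feedback about MLeap, a library that allows fast scoring. It works on the basis of a model that has been converted into an MLeap bundle.

But what about the data preparation stage that runs before scoring?

Is there an effective way to convert a "Spark ML data preparation pipeline" (which works during training, but only inside the Spark framework) into robust, performant, optimized byte-code?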

Asked by Vladimir Nabokov

Viewed 134 times


Marsellus Wallace 2019-08-06 23:43

You can easily serialize the entire PipelineModel (feature engineering stages as well as the trained model) with MLeap.
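
To make "the entire PipelineModel" concrete, here is a minimal sketch of a Spark ML pipeline in which the data preparation stages and the model are fitted together, so that a single `PipelineModel` carries both. The column names (`category`, `amount`, `label`), the toy data, and the object name `PipelineSketch` are assumptions made up for illustration; they are not from the question.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Fits data preparation and model stages as one pipeline.
  def fit(spark: SparkSession): PipelineModel = {
    import spark.implicits._

    // Hypothetical toy training data: a categorical column, a numeric column, a label.
    val trainData = Seq(
      ("a", 1.0, 0.0),
      ("b", 2.0, 1.0),
      ("a", 3.0, 0.0),
      ("b", 4.0, 1.0)
    ).toDF("category", "amount", "label")

    // Data preparation stages.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val assembler = new VectorAssembler()
      .setInputCols(Array("categoryIndex", "amount"))
      .setOutputCol("features")

    // Model stage.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")

    // Fitting the pipeline yields one PipelineModel holding all three fitted stages.
    new Pipeline().setStages(Array(indexer, assembler, lr)).fit(trainData)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pipeline-sketch")
      .getOrCreate()
    val pipelineModel = fit(spark)
    println(pipelineModel.stages.map(_.getClass.getSimpleName).mkString(", "))
    spark.stop()
  }
}
```

Because the preparation stages are part of the fitted `PipelineModel`, whatever is exported from it includes the data preparation, not just the final estimator.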

Note: the code below is a bit old, and you probably have access to a cleaner API by now.

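// MLeap: serialize the fitted PipelineModel into a single .zip bundle (runs on the Spark side)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.spark.SparkSupport._
import org.apache.spark.ml.bundle.SparkBundleContext
import resource._
val sparkBundleContext = SparkBundleContext().withDataset(pipelineModel.transform(trainData))
for (bundleFile <- managed(BundleFile(s"jar:file:${mleapSerializedPipelineModel}"))) {
  pipelineModel.writeBundle.save(bundleFile)(sparkBundleContext).get
}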

// MLeap: deserialize the pipeline from the local filesystem (no Spark dependency needed)
import ml.combust.bundle.BundleFile
import ml.combust.mleap.runtime.MleapSupport._
import resource._
val mleapPipeline = (for (bf <- managed(BundleFile(s"jar:file:${modelPath}"))) yield {
  bf.loadMleapBundle().get.root
}).tried.get
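
Once loaded, the pipeline can score rows, including the data preparation steps, without Spark. The following is only a sketch: it assumes a reasonably recent MLeap release in which `DefaultLeapFrame`, `Row` and `Transformer` live in `ml.combust.mleap.runtime.frame` (older releases used different packages), and the input columns `category` and `amount` as well as the object name `ScoreSketch` are hypothetical.

```scala
import ml.combust.mleap.core.types.{ScalarType, StructField, StructType}
import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row, Transformer}

object ScoreSketch {
  // Builds a one-row LeapFrame with the raw (unprepared) input columns.
  // The column names are hypothetical and must match what the pipeline expects.
  def buildFrame(): DefaultLeapFrame = {
    val schema = StructType(
      StructField("category", ScalarType.String),
      StructField("amount", ScalarType.Double)
    ).get
    DefaultLeapFrame(schema, Seq(Row("a", 1.0)))
  }

  // Runs every stage of the deserialized pipeline (preparation + model) on the frame.
  // `mleapPipeline` is assumed to be the root transformer loaded from the bundle.
  def score(mleapPipeline: Transformer, frame: DefaultLeapFrame): DefaultLeapFrame =
    mleapPipeline.transform(frame).get
}
```

The frame passed in holds raw features only; indexing, assembling and prediction all happen inside the single `transform` call.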

Note that the tricky part comes if you define your own Estimators / Transformers in Spark, since each of them also needs a corresponding MLeap version.
