This article is reproduced from serverfault.com.

Spark with Scala and Pandas

Published 2020-11-29 09:31:04

I want to use pandas transformations like melt inside a Spark application. I am using Scala for Spark, and I need some functionality like melt from pandas. Is it possible to do that?

Specifically, I need `pd.melt()`. I have seen pandas and PySpark going hand in hand in notebooks.

Questioner
user14728672
Alex Ott 2020-11-30 01:08:22

(It's hard to provide an example without more details, so this answer just includes links to documentation, etc.)

In recent versions of Spark there is support for so-called Pandas UDFs, where you receive a pandas series or dataframe as an argument and return a series or dataframe, so you can apply pandas functions to compute the result. Pandas UDFs are much faster than traditional Python UDFs because of the optimized, Arrow-based data serialization. See the documentation and this blog post for more details.
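As a rough sketch of the pattern (not from the original answer; the function name, column names, and data are illustrative, and the Spark part assumes PySpark >= 2.3 with pyarrow installed): the core logic is a plain pandas function, which Spark then applies batch by batch.

```python
import pandas as pd

# Core logic is ordinary pandas and works on a plain Series:
def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    return c * 9.0 / 5.0 + 32.0

# Inside a Spark application you would wrap it as a Pandas UDF, e.g.:
#
#   from pyspark.sql.functions import pandas_udf
#   from pyspark.sql.types import DoubleType
#
#   to_f = pandas_udf(celsius_to_fahrenheit, returnType=DoubleType())
#   df = df.withColumn("temp_f", to_f(df["temp_c"]))
#
# Spark ships each partition to the Python worker as Arrow batches,
# calls the function once per batch, and reassembles the results
# into a new column.
```

Because the wrapped function is plain pandas, it can be developed and unit-tested without a Spark cluster at all.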

Another alternative is to use Koalas, a library that re-implements the pandas API on top of Spark. It includes an implementation of melt as well, but make sure to read the documentation to understand possible differences in behavior.
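A rough sketch of the Koalas route (illustrative, assuming the `databricks.koalas` package on Spark 3.x; since Spark 3.2 the same API ships as `pyspark.pandas`). Koalas mirrors the pandas signature, so the same `melt` call also runs on a plain pandas DataFrame, which is what the live lines below demonstrate:

```python
import pandas as pd

# On a Spark cluster you would write (sketch, requires a SparkSession):
#
#   import databricks.koalas as ks
#
#   kdf = ks.DataFrame({"id": [1, 2], "a": [10, 30], "b": [20, 40]})
#   long_kdf = kdf.melt(id_vars="id", value_vars=["a", "b"])
#
# The same call, pandas-side, shows the expected long-format result:
wide = pd.DataFrame({"id": [1, 2], "a": [10, 30], "b": [20, 40]})
long_df = wide.melt(id_vars="id", value_vars=["a", "b"])
# long_df rows (id, variable, value):
#   (1, a, 10), (2, a, 30), (1, b, 20), (2, b, 40)
```

Since Koalas distributes the work across the cluster, the behavior can diverge from pandas in corner cases (row ordering, index handling), which is why the answer recommends checking the documentation.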