How to union multiple DataFrames in PySpark within a Databricks notebook

Posted on 2020-03-30 21:15:36

The solutions I have found so far on SO and affiliated sites either do not work gracefully or do not work at all in my testing on Databricks. Maybe I missed the right one.

Here is the need again:

I have four DataFrames: Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year and Avg_Close_By_Year. All of them share a common column, 'Year'.

I want to join the four together to get a final DataFrame with the columns: Year, Open, High, Low, Close

At the moment I have to join them on the 'Year' column the ugly way:

finalDF = (
    Avg_Open_By_Year
    .join(Avg_High_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Low_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Close_By_Year, on=['Year'], how='left_outer')
)

I think there should be a more graceful way to accomplish this, like UNION ALL in SQL.
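Note that a union stacks rows, while the result wanted here (Year, Open, High, Low, Close) combines columns, so a join is still the operation needed. A minimal sketch of a shorter form, assuming every DataFrame carries the 'Year' column: fold the list with functools.reduce. The helper name join_all is hypothetical, not part of PySpark; it only relies on each object having a join(other, on=..., how=...) method, as PySpark DataFrames do.

```python
from functools import reduce


def join_all(dfs, on, how="left_outer"):
    """Join a sequence of DataFrames pairwise, left to right.

    join_all is a hypothetical helper, not a PySpark function.
    """
    # reduce keeps a running result and joins the next frame onto it:
    # ((df0 join df1) join df2) join df3 ...
    return reduce(lambda left, right: left.join(right, on=on, how=how), dfs)


# Usage in the notebook (assumes the four DataFrames from the question exist):
# finalDF = join_all(
#     [Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year, Avg_Close_By_Year],
#     on=["Year"],
# )
```

The join order is the same as in the chained version, so the first DataFrame in the list still decides which years survive a left outer join.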

A possible solution is given at https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspark/11361#11361. The selected answer is reproduced below:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

unionAll(td2, td3, td4, td5, td6, td7, td8, td9, td10)

However, when I run this in a Databricks notebook, it throws an error:

NameError: name 'functools' is not defined
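A hedged observation on this error: the snippet above never references the name functools, so the message suggests (this is an assumption about what was actually run) that the cell called functools.reduce after only running "from functools import reduce". That form binds the name reduce but not the module name functools. The plain-Python sketch below illustrates the difference in a fresh session; the list of numbers is illustrative only.

```python
from functools import reduce  # binds only the name `reduce`, not `functools`

numbers = [1, 2, 3, 4]

# Works: `reduce` is bound by the import above.
total = reduce(lambda a, b: a + b, numbers)

# In a fresh session, writing functools.reduce(...) at this point would raise
# NameError: name 'functools' is not defined, because the module name itself
# was never imported.

import functools  # binding the module name makes the qualified call valid

total_qualified = functools.reduce(lambda a, b: a + b, numbers)
```

Either import style works as long as the call matches it: reduce(...) with "from functools import reduce", or functools.reduce(...) with "import functools".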

I would really appreciate it if someone could shed more light on this. Thank you very much.

Questioner

mdivk
