Reproduced from stackoverflow.com.
databricks dataframe pyspark union

How to union multiple DataFrames in PySpark within a Databricks notebook

Published on 2020-03-30 21:15:36

The solutions I have found so far on SO and affiliated sites either do not work gracefully or do not work at all in my tests on Databricks. Perhaps I missed the right one.

Here is what I need:

I have four DataFrames, Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year and Avg_Close_By_Year, which all share a common column, 'Year'.

I want to join them together to get a final df with the columns: Year, Open, High, Low, Close

At the moment I have to join them on the 'Year' column in this ugly way:

finalDF = (
    Avg_Open_By_Year
    .join(Avg_High_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Low_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Close_By_Year, on=['Year'], how='left_outer')
)

I think there should be a more graceful way to accomplish this, like UNION ALL in SQL.
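One note before the union approach: a union appends rows, whereas the result described above (Year, Open, High, Low, Close) combines columns, so it is still a join. The chained joins can be folded into one call with reduce. This is a minimal sketch, not a definitive implementation. The helper name join_all is hypothetical, and it assumes every DataFrame has the 'Year' column plus one distinctly named value column.

```python
from functools import reduce


def join_all(dfs, on, how="left_outer"):
    """Join a sequence of DataFrames pairwise, left to right, on the same key."""
    # reduce feeds the running result back in as `left` for the next join,
    # which is equivalent to writing df1.join(df2, ...).join(df3, ...) by hand.
    return reduce(lambda left, right: left.join(right, on=on, how=how), dfs)


# Usage in a notebook where the four DataFrames from the question exist
# (left as a comment because they are not defined in this sketch):
# finalDF = join_all(
#     [Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year, Avg_Close_By_Year],
#     on=["Year"],
# )
```

The helper only relies on each object having a join(other, on=..., how=...) method, so the join order and join type stay the same as in the hand-written chain.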

There is a possible solution here: https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspark/11361#11361. The selected answer is reproduced below:

from functools import reduce  # For Python 3.x
from pyspark.sql import DataFrame

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

unionAll(td2, td3, td4, td5, td6, td7, td8, td9, td10)

However, when I run this in a Databricks notebook, it throws an error:

NameError: name 'functools' is not defined

I would really appreciate it if someone could shed more light on this. Thank you very much.

Questioner: mdivk
Viewed: 72
Ravi 2020-01-31 18:42

As mentioned in the comments by @Mohamed, you have to import functools in order to use it.

import functools
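The snippet quoted in the question imports reduce directly, so the reported NameError suggests that the cell which actually ran referred to functools.reduce without importing the module. That is an assumption, since the original screenshot is not available. The sketch below shows the difference between the two import styles in plain Python. The helper run_in_fresh_namespace is hypothetical and only stands in for a notebook with nothing imported yet.

```python
def run_in_fresh_namespace(source):
    """Execute source in an empty namespace, like a cell in a fresh notebook."""
    namespace = {}
    exec(source, namespace)
    return namespace


# `from functools import reduce` binds only the name `reduce`,
# so calling `reduce(...)` directly works.
ok_bare = run_in_fresh_namespace(
    "from functools import reduce\n"
    "total = reduce(lambda a, b: a + b, [1, 2, 3])"
)

# The module name `functools` was never bound, so the qualified call fails.
try:
    run_in_fresh_namespace(
        "from functools import reduce\n"
        "total = functools.reduce(lambda a, b: a + b, [1, 2, 3])"
    )
    error_message = None
except NameError as exc:
    error_message = str(exc)

# `import functools` binds the module name, which makes the qualified call work.
ok_qualified = run_in_fresh_namespace(
    "import functools\n"
    "total = functools.reduce(lambda a, b: a + b, [1, 2, 3])"
)
```

Either style is fine as long as the name used in the call matches the name that was imported: reduce(...) after from functools import reduce, or functools.reduce(...) after import functools.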