So far, the solutions I have found on SO and its affiliated sites either don't work gracefully or don't work at all in my tests on Databricks; maybe I missed the right one.
Here is the need again:
I have four DataFrames, Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year and Avg_Close_By_Year, all of which share a common column, 'Year'.
I want to join them together to get a final df like:
Year, Open, High, Low, Close
At the moment I join them on the 'Year' column in this ugly way:
finalDF = (
    Avg_Open_By_Year
    .join(Avg_High_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Low_By_Year, on=['Year'], how='left_outer')
    .join(Avg_Close_By_Year, on=['Year'], how='left_outer')
)
I think there should be a more graceful way to accomplish this, like UNION ALL in SQL.
There is a possible solution here: https://datascience.stackexchange.com/questions/11356/merging-multiple-data-frames-row-wise-in-pyspark/11361#11361. The selected answer is reproduced below:
from functools import reduce # For Python 3.x
from pyspark.sql import DataFrame
def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)
unionAll(td2, td3, td4, td5, td6, td7, td8, td9, td10)
However, when I run this in a Databricks notebook, it throws an error:
NameError: name 'functools' is not defined
I would really appreciate it if someone could shed more light on this. Thank you very much.
As mentioned in the comments by @Mohamed, you have to import functools in order to use it.
import functools
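Once the import is in place, the same reduce idea can be applied to the join itself. Note that unionAll stacks rows, whereas the goal in the question is to combine columns on 'Year', so the fold needs to call join rather than unionAll. Below is a minimal sketch under that assumption; join_all is a hypothetical helper name, not part of PySpark, and it relies only on each DataFrame having a join(other, on=..., how=...) method. Importing reduce directly with from functools import reduce also avoids having to reference the functools name at all.

```python
from functools import reduce


def join_all(dfs, on, how="left_outer"):
    """Hypothetical helper: fold a sequence of DataFrames into one by
    joining them left to right on the given column(s)."""
    # reduce applies the lambda pairwise: ((df1 join df2) join df3) join df4 ...
    return reduce(lambda left, right: left.join(right, on=on, how=how), dfs)


# Usage with the DataFrames from the question (assumed to already exist
# in the notebook):
# finalDF = join_all(
#     [Avg_Open_By_Year, Avg_High_By_Year, Avg_Low_By_Year, Avg_Close_By_Year],
#     on=["Year"],
# )
```

This produces the same chain of left outer joins as the long one-liner in the question, but the list of DataFrames can grow without the expression getting longer.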