apache-spark pyspark-sql

Is there any performance (or other) benefit to loading less columns in pyspark dataframe?

Posted on 2020-04-08 09:21:15


Basically my use case is that I have a large table (many rows, many columns) that I am loading as a dataframe and using to filter down another table, based on keys that match in both, so something like...

filter_table = sparksession.read.load("/some/path/to/files").select("PK").dropDuplicates()
table_to_filter = table_to_filter.join(filter_table.select("PK"), "PK", "leftsemi")

My question is: Is there any benefit to loading the table like this

filter_table = sparksession.read.load("/some/path/to/files").select("PK")

vs

filter_table = sparksession.read.load("/some/path/to/files")

I suspect I am getting confused about how Spark's lazy evaluation works (I am very new to Spark), but I would think that since I only ever use the table with .select("PK"), there would be no difference (unless the entire dataframe is stored in memory once loaded, and not only on evaluation)?

Questioner: lampShadesDrifter
Viewed: 68
Salim 2020-02-01 05:46

There is definitely a performance benefit to reading fewer columns; the degree of benefit varies with the data format and source.

If you are using a columnar data source like Parquet, then it helps a lot: only the relevant column chunks are read. That reduces IO, memory footprint, and the time taken to deserialize data. The same benefit applies to columnar databases.

If the data source is not columnar (text, CSV, or Avro files, or row-oriented databases like Oracle or MS SQL Server), then it won't reduce IO, although you may still benefit from a smaller memory footprint and, for databases, lower data transfer cost. For non-columnar files there may not be a significant benefit.
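The reason row-oriented formats cannot skip IO is visible even with the standard library: in a CSV file, every byte of every line must be read and split before any single field can be extracted. A small stdlib-only illustration (the sample data is made up):

```python
import csv
import io

# A tiny row-oriented CSV: values of different columns are interleaved
# within each line, so there is no way to seek past the unwanted ones.
raw = "PK,name,value\n1,a,10.0\n2,b,20.0\n1,c,30.0\n"

# Even though only PK is kept, DictReader still parses name and value
# for every row; the savings show up only after parsing, in memory.
pks = [row["PK"] for row in csv.DictReader(io.StringIO(raw))]
```

So with CSV, selecting fewer columns saves memory and downstream work in Spark, but the scan itself still reads the whole file.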

It may add complexity to your code, especially if you are using a Dataset backed by a case class (in Scala): if you select only a few columns, the result no longer matches the underlying case class. If you are using a dataframe, it is not much of an issue.