apache-spark pyspark-sql

Is there any performance (or other) benefit to loading less columns in pyspark dataframe?

Posted on 2020-04-08 09:21:15


Basically my use case is that I have a large table (many rows, many columns) that I am loading as a dataframe and using to filter down another table, based on keys that match in both, so something like...

filter_table = sparksession.read.load("/some/path/to/files").select("PK").dropDuplicates()
table_to_filter = table_to_filter.join(filter_table.select("PK"), "PK", "leftsemi")

My question is: Is there any benefit to loading the table like this

filter_table = sparksession.read.load("/some/path/to/files").select("PK")

vs

filter_table = sparksession.read.load("/some/path/to/files")

I suspect I am getting confused about how Spark's lazy evaluation works (I am very new to Spark), but I would think that since I only ever use the table with .select("PK"), there would be no difference (unless the entire dataframe is stored in memory once loaded, and not only on evaluation)?

Questioner: lampShadesDrifter
Viewed: 68
Salim 2020-02-01 05:46

There is definitely a performance benefit to reading fewer columns; the degree of benefit varies with the data format and source.

If you are using a columnar data source like Parquet, then it helps a lot: only the relevant column chunks are read. That reduces IO, memory footprint, and the time taken to deserialize data. The same benefit applies to columnar databases.

If the data source is not columnar (text, CSV, or Avro files, or row-oriented databases like Oracle or MS SQL Server), then it won't reduce IO, although you may still benefit from a smaller memory footprint and, for databases, lower data transfer cost. For non-columnar files there may not be a significant benefit.
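The reason row-oriented formats cannot skip IO is visible even with the standard library: in a CSV file, every byte of every line must be read and split before any single field can be extracted. A small stdlib-only illustration (the sample data is made up):

```python
import csv
import io

# A tiny row-oriented CSV: values of different columns are interleaved
# within each line, so there is no way to seek past the unwanted ones.
raw = "PK,name,value\n1,a,10.0\n2,b,20.0\n1,c,30.0\n"

# Even though only PK is kept, DictReader still parses name and value
# for every row; the savings show up only after parsing, in memory.
pks = [row["PK"] for row in csv.DictReader(io.StringIO(raw))]
```

So with CSV, selecting fewer columns saves memory and downstream work in Spark, but the scan itself still reads the whole file.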

It may add complexity to your code, especially if you are using a Dataset backed by a case class (in Scala): if you select only a few columns, the result no longer matches the underlying case class. If you are using a dataframe, it is not much of an issue.