
What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Posted on 2020-11-25 13:34:16

Our team drops parquet files on blob storage, and one of their main uses is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.

We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:

Cell1:

%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument("JobId")}/results_data.parquet" 
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")

The cells following this can then run SQL queries, e.g.:

SELECT * FROM RESULTS LIMIT 5

This takes a bit of time to spin up, but not too much. I'm concerned about two things:

  1. Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results DataFrame, which is only used to create the RESULTS temp view?

  2. Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I don't have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?

Asked by Alain

Answer from Raphael K (2020-12-08 03:47:34)

For your first question:

Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.

As a quick example, you can create a table using SQL or Python:

-- SQL
CREATE TABLE <example-table> (id STRING, value STRING)

# Python
dataframe.write.saveAsTable("<example-table>")

Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.

-- SQL
SELECT * FROM <example-table>

# Python
spark.sql("SELECT * FROM <example-table>")

For your second question:

Performance depends on multiple factors but in general, here are some tips.

  1. If your tables are large (tens or hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause with a date range or a state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
  2. Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark, and ZORDER improves data skipping (a rough sketch of points 1 and 2 follows this list).
  3. Choose Delta Cache accelerated instance types for the cluster that your analysts will be working on.
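
As a rough sketch of points 1 and 2, the write could look like the following. The column names event_date and state and the table name analytics.job_results are made up for the example, and the OPTIMIZE/ZORDER step assumes Delta Lake on Databricks.

# Python
(df_results.write
    .format("delta")
    .partitionBy("event_date")    # partition on a column analysts commonly filter by
    .mode("overwrite")
    .saveAsTable("analytics.job_results"))

# Compact small files and co-locate related rows to improve data skipping
spark.sql("OPTIMIZE analytics.job_results ZORDER BY (state)")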

I know you said there's no need to persist beyond the notebook, but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.
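
If you do keep the temp-view approach as-is, one lighter-weight option for the "few dozen queries in one notebook" case is to cache the view after the first read, so later queries don't go back to storage. This is just a sketch of standard Spark SQL caching, not something specific to your setup.

# Python
df_results.createOrReplaceTempView("RESULTS")
spark.sql("CACHE TABLE RESULTS")      # eager cache for the rest of the Spark session

# ...analysts run their SQL queries against RESULTS...

spark.sql("UNCACHE TABLE RESULTS")    # optional: release the cache when finished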