
What is the most efficient way to load a Parquet file for SQL queries in Azure Databricks?

Posted on 2020-11-25 13:34:16

Our team drops parquet files on blob storage, and one of their main uses is to allow analysts (whose comfort zone is SQL syntax) to query them as tables. They will do this in Azure Databricks.

We've mapped the blob storage and can access the parquet files from a notebook. Currently, they are loaded and "prepped" for SQL querying in the following way:

Cell1:

%python
# Load the data for the specified job
dbutils.widgets.text("JobId", "", "Job Id")
results_path = f"/mnt/{getArgument("JobId")}/results_data.parquet" 
df_results = spark.read.load(results_path)
df_results.createOrReplaceTempView("RESULTS")

The cells following this can then run SQL queries, e.g.:

SELECT * FROM RESULTS LIMIT 5

This takes a bit of time to spin up, but not too much. I'm concerned about two things:

  1. Am I loading this in the most efficient way possible, or is there a way to skip the creation of the df_results DataFrame, which is only used to create the RESULTS temp view?

  2. Am I loading the table for SQL in a way that lets it be used most efficiently? For example, if the user plans to execute a few dozen queries, I don't want to re-read from disk each time if I don't have to, but there's no need to persist beyond this notebook. Is createOrReplaceTempView the right method for this?

Asked by Alain

Answer from Raphael K (2020-12-08 03:47:34)

For your first question:

Yes, you can use the Hive Metastore on Databricks and query any tables in there without first creating DataFrames. The documentation on Databases and Tables is a fantastic place to start.

As a quick example, you can create a table using SQL or Python:

-- SQL
CREATE TABLE <example-table> (id STRING, value STRING)

# Python
dataframe.write.saveAsTable("<example-table>")

Once you've created or saved a table this way, you'll be able to access it directly in SQL without creating a DataFrame or temp view.

-- SQL
SELECT * FROM <example-table>

# Python
spark.sql("SELECT * FROM <example-table>")

For your second question:

Performance depends on multiple factors but in general, here are some tips.

  1. If your tables are large (tens or hundreds of GB at least), you can partition by a predicate commonly used by your analysts to filter data. For example, if you typically include a WHERE clause with a date range or a state, it might make sense to partition the table by one of those columns. The key concept here is data skipping.
  2. Use Delta Lake to take advantage of OPTIMIZE and ZORDER. OPTIMIZE helps right-size files for Spark, and ZORDER improves data skipping (a rough sketch of points 1 and 2 follows this list).
  3. Choose Delta Cache accelerated instance types for the cluster that your analysts will be working on.
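
As a rough sketch of points 1 and 2, the write could look like the following. The column names event_date and state and the table name analytics.job_results are made up for the example, and the OPTIMIZE/ZORDER step assumes Delta Lake on Databricks.

# Python
(df_results.write
    .format("delta")
    .partitionBy("event_date")    # partition on a column analysts commonly filter by
    .mode("overwrite")
    .saveAsTable("analytics.job_results"))

# Compact small files and co-locate related rows to improve data skipping
spark.sql("OPTIMIZE analytics.job_results ZORDER BY (state)")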

I know you said there's no need to persist beyond the notebook, but you can improve performance by creating persistent tables and taking advantage of data skipping, caching, and so on.
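
If you do keep the temp-view approach as-is, one lighter-weight option for the "few dozen queries in one notebook" case is to cache the view after the first read, so later queries don't go back to storage. This is just a sketch of standard Spark SQL caching, not something specific to your setup.

# Python
df_results.createOrReplaceTempView("RESULTS")
spark.sql("CACHE TABLE RESULTS")      # eager cache for the rest of the Spark session

# ...analysts run their SQL queries against RESULTS...

spark.sql("UNCACHE TABLE RESULTS")    # optional: release the cache when finished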