Databricks working with files vs working with external tables

Gracia Espelt, Pol
2023-07-20T10:09:24.52+00:00

Dear community,

I have a question regarding the performance of working with external tables vs working directly with files.

Let me give an example:

I have an Azure Databricks notebook that performs an ETL process: it loads data from the source (Blob Storage) into Spark DataFrames, performs some transformations, and finally stores the result in another Blob Storage location to be consumed by other Databricks processes.

When storing the results of these transformations, I can choose to save them only as files:

df.write.format('parquet').option('path','abfss:xxx@xxxx').save()

Or to store it as an external table:

df.write.format('parquet').option('path','abfss:xxx@xxx').saveAsTable('example')

In the next step of the data process, in another notebook, will there be any performance difference between reading the file directly and loading it as a table (taking into account that I will run the notebook with PySpark code, not SQL)?
df=spark.read.load(path, format = 'parquet')

vs

df=spark.read.table('example')

Thank you very much in advance,

Best regards


Accepted answer
  1. BhargavaGunnam-MSFT, Microsoft Employee
    2023-07-20T22:19:15.88+00:00

    Hello Gracia Espelt, Pol,

    Welcome to the Microsoft Q&A forum

    External tables are metadata definitions that map to files in a data lake. They can benefit from partitioning, indexing, and caching to improve query performance, can be used in joins, and can be queried through both SQL and the Spark APIs.
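    As a minimal sketch of the partitioning benefit mentioned above (the table name, column names, and local path here are hypothetical stand-ins; on Databricks the `spark` session already exists and the path would be an `abfss://` location):

    ```python
    import tempfile

    from pyspark.sql import SparkSession

    # Local session for the sketch; on Databricks, `spark` is provided.
    spark = SparkSession.builder.master("local[1]").appName("sketch").getOrCreate()

    # Tiny example DataFrame standing in for the ETL output.
    df = spark.createDataFrame(
        [("2023-07-01", 1), ("2023-07-02", 2)],
        ["event_date", "value"],
    )

    # Temp dir standing in for the abfss:// storage location.
    path = tempfile.mkdtemp()

    # Writing with partitionBy lets Spark prune files by event_date
    # when the external table is later filtered on that column.
    (df.write
        .format("parquet")
        .partitionBy("event_date")
        .option("path", path)
        .saveAsTable("example_partitioned"))

    # A filter on the partition column only touches matching files.
    result = spark.read.table("example_partitioned").where("event_date = '2023-07-01'")
    ```

    The same `partitionBy` call also works for the plain-file `save()` path; the table simply makes the partition layout visible through the metastore as well.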

    When working directly with files in Azure Databricks, several options are available. You can read and write various formats such as Parquet, CSV, and JSON, and you can use different storage services, such as Azure Blob Storage and Azure Data Lake Storage.

    When working with files in Azure Databricks, you have more control over the data processing pipeline and can optimize the pipeline for your specific use case.

    Regarding performance: if the data is small, storing it as a file is usually faster than storing it as an external table, since the table adds a metastore round trip. However, if the data is large, an external table can be faster because its metadata (schema and partition layout) allows more efficient querying and processing of the data.

    So, storing data as an external table can be more efficient for large datasets.

    For your example:

    There may be differences between reading the files by path and reading the external table, but they are usually small. When reading Parquet files by path, Spark has to list the files and read the schema from the Parquet footers; when reading a table, the schema and file locations come from the metastore, so that discovery step is skipped. In practice, this difference is usually negligible.
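    For the path-based read, you can also narrow the gap by supplying the schema explicitly, which skips the footer-based discovery step. A hedged sketch (the column names are hypothetical, and the read calls are shown commented out because they need a live session and real data):

    ```python
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Hypothetical schema matching the written Parquet files; passing it to
    # the reader means Spark does not have to inspect the files to find it.
    schema = StructType([
        StructField("event_date", StringType()),
        StructField("value", LongType()),
    ])

    # Path read with an explicit schema (no footer-based discovery):
    # df = spark.read.schema(schema).parquet(path)

    # Table read; the schema comes from the metastore instead:
    # df = spark.read.table("example")
    ```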

    Some reference documents:

    https://community.databricks.com/t5/data-engineering/creating-an-external-table-reference-vs-creating-a-view/td-p/9826
    https://learn.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage

    https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/optimize-write-for-apache-spark

    I hope this helps. Please let me know if you have any further questions.

    If this answers your question, please consider accepting the answer by hitting the Accept answer button and up-voting, as it helps the community find answers to similar questions.

