Hello Gracia Espelt, Pol,
Welcome to the Microsoft Q&A forum.
External tables are metadata definitions that map to files in a data lake. They can benefit from partitioning, indexing, and caching to improve query performance, can participate in joins, and can be queried through both SQL and the Spark APIs.
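For illustration, here is a minimal sketch of creating and querying a partitioned external table from a Databricks notebook, where `spark` is the active SparkSession. The storage path, table name, and columns are hypothetical placeholders, not part of your setup:

```python
# Create an external table over Parquet files in the data lake.
# Only metadata is registered; the data stays in the files at LOCATION.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_external (
        order_id   INT,
        amount     DOUBLE,
        order_date DATE
    )
    USING PARQUET
    PARTITIONED BY (order_date)
    LOCATION 'abfss://container@account.dfs.core.windows.net/data/sales/'
""")

# Queries against the table read the underlying Parquet files
spark.sql("""
    SELECT order_date, SUM(amount) AS total
    FROM sales_external
    GROUP BY order_date
""").show()
```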
When working with files in Azure Databricks, several options are available. You can read and write various formats such as Parquet, CSV, and JSON, and you can use different storage services such as Azure Blob Storage and Azure Data Lake Storage. Working directly with files gives you more control over the data processing pipeline, so you can optimize it for your specific use case.
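As a sketch of the file-based approach, again assuming a Databricks notebook where `spark` is available (the paths below are hypothetical placeholders for your own storage account and container):

```python
# Hypothetical ADLS Gen2 base path
base = "abfss://container@account.dfs.core.windows.net/data"

# Read a CSV file with a header row
csv_df = (spark.read
    .format("csv")
    .option("header", "true")
    .load(f"{base}/input.csv"))

# Write the same data back out as Parquet, which is columnar
# and generally faster to query than CSV
csv_df.write.mode("overwrite").parquet(f"{base}/output_parquet")

# JSON works the same way through the generic reader/writer API
json_df = spark.read.json(f"{base}/events.json")
```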
Regarding performance: if the data is small, reading it directly as a file is usually faster than going through an external table. If the data is large, an external table can be faster because the metastore metadata and partitioning allow more efficient querying and processing. In short, an external table tends to be the more efficient option for large datasets.
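To illustrate one reason for this, here is a minimal sketch of writing data partitioned by a column, which is what lets queries against a partitioned external table skip folders that cannot match the filter (partition pruning). The values, columns, and path are hypothetical:

```python
from pyspark.sql import Row

# Tiny hypothetical dataset; in practice this would be a large DataFrame
sample = spark.createDataFrame([
    Row(order_id=1, amount=9.99, order_date="2024-01-01"),
    Row(order_id=2, amount=5.00, order_date="2024-01-02"),
])

# Writing partitioned by order_date creates one folder per date, so a
# query filtering on order_date only scans the matching folders
(sample.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("abfss://container@account.dfs.core.windows.net/data/sales/"))
```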
For your example:
There may be differences between reading a file and reading an external table, but they are usually small. When reading a file, Spark may need to infer the schema of the data, which can take an extra pass over the files (for self-describing formats such as Parquet this is cheap; for CSV or JSON it costs more). When reading an external table, the schema is already stored in the metastore, so this step is not needed. In practice, this difference is usually negligible.
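For example, here is a sketch of the three variants side by side: schema inference on a CSV file, an explicit schema, and a table read. The path is hypothetical, and the table name assumes the external table from the earlier snippet:

```python
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, DateType

path = "abfss://container@account.dfs.core.windows.net/data/input.csv"

# inferSchema forces an extra pass over the data to guess column types
inferred = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(path))

# Supplying the schema explicitly avoids that pass, mirroring what
# happens when the schema is already stored in the metastore
schema = StructType([
    StructField("order_id", IntegerType()),
    StructField("amount", DoubleType()),
    StructField("order_date", DateType()),
])
explicit = (spark.read
    .option("header", "true")
    .schema(schema)
    .csv(path))

# Reading the external table needs no schema step at all
table_df = spark.table("sales_external")
```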
Some reference documents:
https://community.databricks.com/t5/data-engineering/creating-an-external-table-reference-vs-creating-a-view/td-p/9826
https://learn.microsoft.com/en-us/azure/hdinsight/spark/optimize-data-storage
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/optimize-write-for-apache-spark
I hope this helps. Please let me know if you have any further questions.
If this answers your question, please consider accepting it by clicking Accept Answer, and up-vote it, as that helps other community members find answers to similar questions.