Ingest data from cloud object storage
This article lists the ways you can configure incremental ingestion from cloud object storage.
Add data UI
To learn how to use the add data UI to create a managed table from data in cloud object storage, see Load data using a Unity Catalog external location.
Notebook or SQL editor
This section describes options for configuring incremental ingestion from cloud object storage using a notebook or the Databricks SQL editor.
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without additional setup. Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
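The following is a minimal PySpark sketch of reading from the cloudFiles source, not a definitive implementation. The helper names (auto_loader_options, start_ingest) are hypothetical, and the code assumes a Databricks runtime where `spark` is an active SparkSession. The schema location and checkpoint location are not discussed above; they are included on the assumption that schema inference and stream progress tracking each need a storage path.

```python
def auto_loader_options(file_format, schema_location, include_existing=True):
    """Return reader options for the cloudFiles source (hypothetical helper)."""
    return {
        # Format of the incoming files, for example "json", "csv", or "parquet".
        "cloudFiles.format": file_format,
        # Assumed path where Auto Loader stores the inferred schema.
        "cloudFiles.schemaLocation": schema_location,
        # Whether files already in the directory are processed as well.
        "cloudFiles.includeExistingFiles": "true" if include_existing else "false",
    }


def start_ingest(spark, source_path, target_table, checkpoint_path,
                 file_format="json", include_existing=True):
    """Start an Auto Loader stream from source_path into target_table."""
    reader = spark.readStream.format("cloudFiles")
    options = auto_loader_options(file_format, checkpoint_path, include_existing)
    for key, value in options.items():
        reader = reader.option(key, value)

    return (
        reader.load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Process the files that are available now, then stop.
        .trigger(availableNow=True)
        .toTable(target_table)
    )
```

In a notebook, you might call start_ingest(spark, "<input-directory>", "<target-table>", "<checkpoint-path>"), replacing each placeholder with your own values. Because the stream tracks progress in the checkpoint, a later run picks up only files that arrived since the previous run.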
COPY INTO
With COPY INTO, SQL users can idempotently and incrementally ingest data from cloud object storage into Delta tables. You can use COPY INTO in Databricks SQL, notebooks, and Databricks Jobs.
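As a rough illustration, the sketch below assembles a COPY INTO statement as a string in Python. The helper name build_copy_into, the table name, and the storage path are hypothetical, and the helper does not escape its inputs, so treat it as an example of the statement's shape rather than production code.

```python
def build_copy_into(target_table, source_path, file_format, format_options=None):
    """Build a COPY INTO statement (hypothetical helper; inputs are not escaped)."""
    lines = [
        f"COPY INTO {target_table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format.upper()}",
    ]
    if format_options:
        rendered = ", ".join(f"'{k}' = '{v}'" for k, v in format_options.items())
        lines.append(f"FORMAT_OPTIONS ({rendered})")
    return "\n".join(lines)


statement = build_copy_into(
    "main.default.raw_events",       # hypothetical target Delta table
    "s3://example-bucket/events/",   # hypothetical source path
    "csv",
    {"header": "true"},
)
print(statement)
```

In a notebook you could run the result with spark.sql(statement), or paste the printed SQL into the Databricks SQL editor. Because COPY INTO is idempotent, rerunning the same statement is expected to skip files that were already loaded.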
When to use COPY INTO and when to use Auto Loader
Here are a few things to consider when choosing between Auto Loader and COPY INTO:
If you’re going to ingest files on the order of thousands over time, you can use COPY INTO. If you are expecting files on the order of millions or more over time, use Auto Loader. Auto Loader requires fewer total operations to discover files compared to COPY INTO and can split the processing into multiple batches, which means that Auto Loader is less expensive and more efficient at scale.
If your data schema is going to evolve frequently, Auto Loader provides better support for schema inference and evolution. See Configure schema inference and evolution in Auto Loader for more details.
Loading a subset of re-uploaded files can be a bit easier to manage with COPY INTO. With Auto Loader, it’s harder to reprocess a select subset of files. However, you can use COPY INTO to reload the subset of files while an Auto Loader stream is running simultaneously.
For an even more scalable and robust file ingestion experience, Auto Loader enables SQL users to leverage streaming tables. See Load data using streaming tables in Databricks SQL.
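To make the point about reloading re-uploaded files more concrete, here is a hedged sketch that builds a COPY INTO statement restricted to specific files. The helper name build_reload_statement and all table, path, and file names are hypothetical. The sketch assumes the FILES clause and the 'force' copy option are what you want here: 'force' turns off the idempotency check so that previously loaded files are loaded again.

```python
def build_reload_statement(target_table, source_path, file_format, files):
    """Build a COPY INTO statement that reloads named files (hypothetical helper)."""
    if not files:
        raise ValueError("files must name at least one file to reload")
    file_list = ", ".join(f"'{name}'" for name in files)
    return "\n".join([
        f"COPY INTO {target_table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format.upper()}",
        # Restrict the load to the re-uploaded files only.
        f"FILES = ({file_list})",
        # Load the files even if they were ingested before.
        "COPY_OPTIONS ('force' = 'true')",
    ])


reload_sql = build_reload_statement(
    "main.default.raw_events",            # hypothetical target Delta table
    "s3://example-bucket/events/",        # hypothetical source path
    "json",
    ["2024-01-01.json", "2024-01-02.json"],  # hypothetical re-uploaded files
)
print(reload_sql)
```

Note that forcing a reload appends the file contents again, so you may need to remove or deduplicate the earlier rows yourself.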
For a brief overview and demonstration of Auto Loader and COPY INTO, watch the following YouTube video (2 minutes).
Automate ETL with Delta Live Tables and Auto Loader
You can simplify deployment of scalable, incremental ingestion infrastructure with Auto Loader and Delta Live Tables. Delta Live Tables does not use the standard interactive execution found in notebooks; instead, it emphasizes deployment of production-ready infrastructure.
Third-party ingestion tools
Databricks validates technology partner integrations that enable low-code, scalable data ingestion from a variety of sources, including cloud object storage, into Azure Databricks. See Technology partners. Some technology partners are featured in What is Databricks Partner Connect?, which provides a UI that simplifies connecting third-party tools to your lakehouse data.