Using Auto Loader with Unity Catalog

Auto Loader can securely ingest data from external locations configured with Unity Catalog. To learn more about securely connecting storage with Unity Catalog, see Connect to cloud object storage and services using Unity Catalog. Auto Loader relies on Structured Streaming for incremental processing; for recommendations and limitations see Using Unity Catalog with Structured Streaming.

Note

In Databricks Runtime 11.3 LTS and above, you can use Auto Loader with either shared or single user access modes.

Directory listing mode is supported by default. File notification mode is only supported on single user compute.
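Directory listing mode requires no extra configuration. To opt into file notification mode on single user compute, set the standard cloudFiles.useNotifications option. A minimal sketch, where <schema-location> and <source-path> are placeholders for Unity Catalog-managed locations:

# A minimal sketch, assuming single user compute; the paths are placeholders.
df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.useNotifications", "true")  # file notification mode
  .option("cloudFiles.schemaLocation", "<schema-location>")
  .load("<source-path>"))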

Ingesting data from external locations managed by Unity Catalog with Auto Loader

You can use Auto Loader to ingest data from any external location managed by Unity Catalog. You must have READ FILES permissions on the external location.

Note

Azure Data Lake Storage Gen2 is the only Azure storage type supported by Unity Catalog.
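A metastore admin or the owner of the external location can grant this privilege. A minimal sketch using spark.sql, where the location name autoloader_source_location and the group data_engineers are placeholders:

# A minimal sketch: grant READ FILES on the external location to the
# principal running the stream. The names below are placeholders.
spark.sql("""
  GRANT READ FILES
  ON EXTERNAL LOCATION `autoloader_source_location`
  TO `data_engineers`
""")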

Specifying locations for Auto Loader resources for Unity Catalog

The Unity Catalog security model assumes that all storage locations referenced in a workload will be managed by Unity Catalog. Databricks recommends always storing checkpoint and schema evolution information in storage locations managed by Unity Catalog. Unity Catalog does not allow you to nest checkpoint or schema inference and evolution files under the table directory.
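A minimal sketch of the recommended layout, with placeholder paths:

# Recommended: keep checkpoint and schema state in a Unity Catalog-managed
# external location, outside any table directory (paths are placeholders).
checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/my_table"

# Not allowed: nesting this state under the table's own storage, for
# example <table-location>/_checkpoint, is rejected by Unity Catalog.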

Examples

The following examples assume the executing user has owner privileges on the target tables and that the following storage locations and grants are configured:

Storage location | Grant
abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data | READ FILES
abfss://dev-bucket@<storage-account>.dfs.core.windows.net | READ FILES, WRITE FILES, CREATE TABLE

Using Auto Loader to load to a Unity Catalog managed table

# Store checkpoint and schema state in a Unity Catalog-managed external
# location, never nested under the target table directory.
checkpoint_path = "abfss://dev-bucket@<storage-account>.dfs.core.windows.net/_checkpoint/dev_table"

(spark.readStream
  .format("cloudFiles")                                  # Auto Loader source
  .option("cloudFiles.format", "json")                   # format of the source files
  .option("cloudFiles.schemaLocation", checkpoint_path)  # schema inference and evolution state
  .load("abfss://autoloader-source@<storage-account>.dfs.core.windows.net/json-data")
  .writeStream
  .option("checkpointLocation", checkpoint_path)         # streaming checkpoint location
  .trigger(availableNow=True)                            # process all available files, then stop
  .toTable("dev_catalog.dev_database.dev_table"))        # Unity Catalog managed table
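Because this stream runs with trigger(availableNow=True), it processes all files that arrived since the last run and then stops, so the same code can be scheduled as an incremental batch job; omit the trigger to run continuously instead.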