@Hiran Amarathunga - Thanks for the question and for using the MS Q&A platform.
To answer your questions:
When you use file notification mode for multiple sources in different jobs, Auto Loader creates a separate Event Grid subscription (and its backing queue) for each source. This is because each source has its own set of files and needs to be monitored independently.
Contributor access on the storage account is required so that Auto Loader can set up these notification resources and the Spark job can read the input files and write output to the destination. The permission itself does not incur additional cost, but it is important to ensure that the Spark job has the necessary access.
Let's understand everything in detail:
What is Auto Loader file notification mode?
In file notification mode, Auto Loader automatically sets up a notification service and queue service that subscribes to file events from the input directory. You can use file notifications to scale Auto Loader to ingest millions of files an hour. When compared to directory listing mode, file notification mode is more performant and scalable for large input directories or a high volume of files but requires additional cloud permissions.
What are the cloud resources used in Auto Loader file notification mode?
You need elevated permissions to automatically configure cloud infrastructure for file notification mode; contact your cloud administrator or workspace admin if you don't have them.
Auto Loader can set up file notifications for you automatically when you set the option `cloudFiles.useNotifications` to `true` and provide the necessary permissions to create cloud resources. In addition, you might need to provide additional options to grant Auto Loader authorization to create these resources.
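For example, on Azure a file notification stream can be configured like the following minimal sketch. The service principal credentials, resource group, container, and path are placeholders, and `spark` is the session predefined in a Databricks notebook:

```python
# Minimal sketch: Auto Loader in file notification mode on Azure.
# All <...> values are placeholders; the service principal must have
# permission to create the Event Grid subscription and storage queue.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Switch from directory listing mode to file notification mode
    .option("cloudFiles.useNotifications", "true")
    # Credentials Auto Loader uses to create the notification resources
    .option("cloudFiles.subscriptionId", "<azure-subscription-id>")
    .option("cloudFiles.tenantId", "<tenant-id>")
    .option("cloudFiles.clientId", "<service-principal-client-id>")
    .option("cloudFiles.clientSecret", "<service-principal-client-secret>")
    .option("cloudFiles.resourceGroup", "<storage-account-resource-group>")
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/input/")
)
```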
The following table summarizes which resources are created by Auto Loader:

| Cloud storage | Subscription service | Queue service | Resource name prefix | Limit |
|---|---|---|---|---|
| AWS S3 | Amazon SNS | Amazon SQS | databricks-auto-ingest | 100 per S3 bucket |
| ADLS Gen2 / Azure Blob Storage | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account |
| GCS | Google Pub/Sub | Google Pub/Sub | databricks-auto-ingest | 100 per GCS bucket |
Incremental ingestion using Auto Loader with Delta Live Tables?
Databricks recommends Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline with:
- Autoscaling compute infrastructure for cost savings
- Data quality checks with expectations
- Automatic schema evolution handling
- Monitoring via metrics in the event log
You do not need to provide a schema or checkpoint location because Delta Live Tables automatically manages these settings for your pipelines. See Load data with Delta Live Tables.
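As an illustration, a minimal DLT table definition using Auto Loader could look like the sketch below (the source path and table name are placeholders). Note that there is no schema or checkpoint configuration, since DLT manages both:

```python
import dlt

# Minimal sketch: incremental ingestion with Auto Loader inside a
# Delta Live Tables pipeline. The abfss path is a placeholder.
@dlt.table
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://<container>@<storage-account>.dfs.core.windows.net/orders/")
    )
```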
Databricks also recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from cloud object storage. APIs are available in Python and Scala.
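Outside of DLT, the same ingestion in plain Structured Streaming looks like this sketch (paths and the target table name are placeholders); here you manage the schema and checkpoint locations yourself:

```python
# Minimal sketch: Auto Loader with plain Structured Streaming.
# Paths and table name are placeholders.
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Where Auto Loader persists the inferred schema between runs
    .option("cloudFiles.schemaLocation", "/tmp/schemas/orders")
    .load("abfss://<container>@<storage-account>.dfs.core.windows.net/orders/")
    .writeStream
    # The checkpoint tracks which files have already been ingested
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("orders_bronze")
)
```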
Benefits of Auto Loader over using Structured Streaming directly on files?
In Apache Spark, you can read files incrementally using `spark.readStream.format(fileFormat).load(directory)`. Auto Loader provides the following benefits over the file source:
- Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to avoid wasting any compute resources.
- Performance: The cost of discovering files with Auto Loader scales with the number of files that are being ingested instead of the number of directories that the files may land in. See What is Auto Loader directory listing mode?.
- Schema inference and evolution support: Auto Loader can detect schema drift, notify you when schema changes happen, and rescue data that would otherwise have been ignored or lost (a sketch follows this list). See How does Auto Loader schema inference work?.
- Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader’s file notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto Loader can automatically set up file notification services on storage to make file discovery much cheaper.
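To make the schema evolution point concrete, here is a minimal sketch of the rescue behavior (paths are placeholders): with `cloudFiles.schemaEvolutionMode` set to `rescue`, data that does not match the tracked schema is not dropped; it is captured in the `_rescued_data` column instead.

```python
# Minimal sketch: schema drift handling with Auto Loader. Paths are placeholders.
# In "rescue" mode the stream does not fail or evolve on new columns;
# unexpected data lands in the _rescued_data column rather than being lost.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/tmp/landing/events/")
)
```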
For more details, refer to the links below:
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/file-notification-mode
https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/
Hope this helps. If this answers your query, do click **Accept Answer** and **Yes** for "Was this answer helpful". And if you have any further queries, do let us know.