Databricks File Notification Mode Resources

Hiran Amarathunga 95 Reputation points
2024-08-02T03:01:28.23+00:00

I'm implementing file notification mode for Auto Loader, and I want to estimate the new resource usage for cost analysis.

I use the following code with multiple sources in different jobs, but the same Azure credentials.

Ex:

Job 1 : source_data_1

Job 2 : source_data_2

What new resources are created for file notification mode? Will it create a new Event Grid subscription for each source, or a single one?

Why do we need Contributor access to the storage account? (Will it incur more cost?)

Thanks community!

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "avro")
    .option("badRecordsPath", bad_records)
    .option("cloudFiles.schemaLocation", avro_schema_location)
    .option("cloudFiles.maxFileAge", "90 days")
    .option("cloudFiles.backfillInterval", "1 day")
    # Service principal / Azure resource options used by file notification mode:
    .option("cloudFiles.subscriptionId", az_subscriptionId)
    .option("cloudFiles.tenantId", az_tenantId)
    .option("cloudFiles.clientId", az_clientId)
    .option("cloudFiles.clientSecret", az_clientSecret)
    .option("cloudFiles.resourceGroup", az_resourceGroup)
    .option("cloudFiles.useNotifications", "true")
    .load(source_data_1)
)

Accepted answer
  PRADEEPCHEEKATLA 90,616 Reputation points Moderator
    2024-08-05T09:09:00.2533333+00:00

    @Hiran Amarathunga - Thanks for the question and using MS Q&A platform.

    To answer your questions:

    When you use file notification mode with multiple sources in different jobs, Auto Loader creates a separate Event Grid subscription (and storage queue) for each stream's input path; the sources do not share a single subscription. Each source has its own set of files and is monitored independently.

    Contributor access on the storage account is required so that Auto Loader can create and manage the notification resources (the Event Grid subscription and the Azure Queue Storage queue) on your behalf, not for reading the data files themselves. The role assignment itself does not cost anything, but the Event Grid operations and queue transactions it enables are billed, typically at very low per-operation rates.
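To sanity-check resource counts for your setup, here is a minimal back-of-envelope sketch. It assumes one Event Grid subscription and one queue per distinct input path, and uses the 500-event-subscriptions-per-storage-account limit from the linked docs; the paths are hypothetical placeholders.

```python
# Back-of-envelope check of the notification resources Auto Loader would
# create, assuming one Event Grid subscription + one storage queue per
# distinct input path (paths below are hypothetical examples).
EVENT_SUBSCRIPTION_LIMIT = 500  # Event Grid subscriptions per storage account

def notification_resources(source_paths):
    """Return (subscriptions, queues) under the one-per-distinct-path assumption."""
    distinct = set(source_paths)
    return len(distinct), len(distinct)

subs, queues = notification_resources([
    "abfss://landing@myacct.dfs.core.windows.net/source_data_1",  # Job 1
    "abfss://landing@myacct.dfs.core.windows.net/source_data_2",  # Job 2
])
print(subs, queues)  # 2 subscriptions, 2 queues -- one pair per source
assert subs <= EVENT_SUBSCRIPTION_LIMIT
```

Sharing the same Azure credentials across jobs does not change this count; it only means the same service principal owns all the created resources.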

    Let's understand everything in detail:

    What is Auto Loader file notification mode?

    In file notification mode, Auto Loader automatically sets up a notification service and queue service that subscribes to file events from the input directory. You can use file notifications to scale Auto Loader to ingest millions of files an hour. When compared to directory listing mode, file notification mode is more performant and scalable for large input directories or a high volume of files but requires additional cloud permissions.
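The scalability difference can be made concrete with a toy operation-count model. This is an illustrative sketch, not a billing formula: the page size, polling frequency, and per-file-event assumption are all placeholders.

```python
import math

def list_ops_per_day(total_files_in_dir, files_per_list_page=1000, polls_per_day=288):
    """Directory listing mode: every poll re-lists the whole directory.
    288 polls/day = one poll every 5 minutes (assumed, not a real default)."""
    pages = math.ceil(total_files_in_dir / files_per_list_page)
    return pages * polls_per_day

def notification_ops_per_day(new_files_per_day):
    """File notification mode: roughly one queue message per new file,
    regardless of how many files already sit in the directory."""
    return new_files_per_day

# Directory with 1M historical files, 10k new files arriving per day:
print(list_ops_per_day(1_000_000))       # 288000 LIST calls/day
print(notification_ops_per_day(10_000))  # 10000 queue reads/day
```

The point the docs make falls out of the model: listing cost grows with the total directory size, while notification cost grows only with the rate of new files.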

    What are the cloud resources used in Auto Loader file notification mode?

    You need elevated permissions to automatically configure cloud infrastructure for file notification mode. If you do not have them, contact your cloud administrator or workspace admin.

    Auto Loader can set up file notifications for you automatically when you set the option cloudFiles.useNotifications to true and provide the necessary permissions to create cloud resources. You might also need to provide extra options to grant Auto Loader authorization to create these resources.

    The following table summarizes which resources are created by Auto Loader.

    | Cloud storage | Subscription service | Queue service | Prefix | Limit |
    | --- | --- | --- | --- | --- |
    | AWS S3 | AWS SNS | AWS SQS | databricks-auto-ingest | 100 per S3 bucket |
    | ADLS Gen2 and Azure Blob Storage | Azure Event Grid | Azure Queue Storage | databricks | 500 per storage account |
    | GCS | Google Pub/Sub | Google Pub/Sub | databricks-auto-ingest | 100 per GCS bucket |

    Incremental ingestion using Auto Loader with Delta Live Tables?

    Databricks recommends Auto Loader in Delta Live Tables for incremental data ingestion. Delta Live Tables extends functionality in Apache Spark Structured Streaming and allows you to write just a few lines of declarative Python or SQL to deploy a production-quality data pipeline, with autoscaling compute infrastructure, data quality checks with expectations, automatic schema evolution handling, and monitoring via metrics in the event log.

    You do not need to provide a schema or checkpoint location because Delta Live Tables automatically manages these settings for your pipelines. See Load data with Delta Live Tables.

    Databricks also recommends Auto Loader whenever you use Apache Spark Structured Streaming to ingest data from cloud object storage. APIs are available in Python and Scala.

    Benefits of Auto Loader over using Structured Streaming directly on files?

    In Apache Spark, you can read files incrementally using spark.readStream.format(fileFormat).load(directory). Auto Loader provides the following benefits over the file source:

    • Scalability: Auto Loader can discover billions of files efficiently. Backfills can be performed asynchronously to avoid wasting any compute resources.
    • Performance: The cost of discovering files with Auto Loader scales with the number of files that are being ingested instead of the number of directories that the files may land in. See What is Auto Loader directory listing mode?.
    • Schema inference and evolution support: Auto Loader can detect schema drifts, notify you when schema changes happen, and rescue data that would have been otherwise ignored or lost. See How does Auto Loader schema inference work?.
    • Cost: Auto Loader uses native cloud APIs to get lists of files that exist in storage. In addition, Auto Loader’s file notification mode can help reduce your cloud costs further by avoiding directory listing altogether. Auto Loader can automatically set up file notification services on storage to make file discovery much cheaper.
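Since the original question asks for a cost estimate, here is a simple monthly-cost sketch for the notification-mode resources themselves. The unit prices below are assumptions for illustration only; check the current Azure Event Grid and Queue Storage pricing pages before using this for real analysis.

```python
# Illustrative monthly cost model for file-notification-mode resources.
# PLACEHOLDER rates -- assumptions, NOT quoted Azure prices:
EVENT_GRID_PER_MILLION_OPS = 0.60   # USD per 1M Event Grid operations (assumed)
QUEUE_PER_10K_TRANSACTIONS = 0.004  # USD per 10k queue transactions (assumed)

def monthly_notification_cost(new_files_per_month):
    """Assume one Event Grid operation per file event, and roughly three
    queue transactions (put + get + delete) per queued message."""
    event_grid = new_files_per_month / 1_000_000 * EVENT_GRID_PER_MILLION_OPS
    queue = (new_files_per_month * 3) / 10_000 * QUEUE_PER_10K_TRANSACTIONS
    return event_grid + queue

# 1M new files per month under the assumed rates:
print(round(monthly_notification_cost(1_000_000), 2))  # 1.8
```

Even under pessimistic per-operation assumptions, the notification resources tend to be cheap relative to the LIST-call volume that directory listing mode would generate against a large directory.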

    For more details, refer to the below links:
    https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/file-notification-mode

    https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

