Does Azure Databricks keep notebooks in its cloud account free of cost?

Dhruv Singla 105 Reputation points
2024-08-22T18:27:40.5233333+00:00

I was learning about the architecture of Databricks from a video, and I came across the following question.


If notebooks are stored in the control plane on the Databricks cloud account, then does that mean Databricks allocates storage for the client free of cost? I want to confirm that.

Also, is anything else allocated for the client? I know all the compute and storage are billed to the client's account, and a workspace storage account is also allocated in the customer account. I'm just curious to know if anything else is there in the architecture.


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 90,146 Reputation points Microsoft Employee
    2024-08-23T05:26:19.4633333+00:00

    @Dhruv Singla - Thanks for the question and using MS Q&A platform.

    Does Azure Databricks keep notebooks in its cloud account free of cost?

    Yes, there will be a cost associated whether you choose the default option or configure the workspace to store interactive notebook results in the customer account.

    Let us understand this with an example from the Azure Databricks pricing page.

    Depending on the type of workload your cluster runs, you will either be charged for Jobs Compute or All-Purpose Compute workload. For example, if the cluster runs workloads triggered by the Databricks jobs scheduler, you will be charged for the Jobs Compute workload. If your cluster runs interactive features such as ad-hoc commands, you will be billed for All-Purpose Compute workload.

    • If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 Pay-as-You-Go instances, the billing would be the following for the All-Purpose Compute workload:
      • VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
      • DBU cost for the All-Purpose Compute workload for 10 DS13v2 instances: 100 hours x 10 instances x 2.0 DBU per instance per hour x $0.55/DBU = $1,100
      • The total cost would therefore be $598 (VM cost) + $1,100 (DBU cost) = $1,698.
    • If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 Pay-as-You-Go instances, the billing would be the following for the Jobs Compute workload:
      • VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
      • DBU cost for the Jobs Compute workload for 10 DS13v2 instances: 100 hours x 10 instances x 2.0 DBU per instance per hour x $0.30/DBU = $600
      • The total cost would therefore be $598 (VM cost) + $600 (DBU cost) = $1,198.
    • If you run a Premium tier cluster for 100 hours in East US 2 with 10 DS13v2 Pay-as-You-Go instances, the billing would be the following for the Jobs Light Compute workload:
      • VM cost for 10 DS13v2 instances: 100 hours x 10 instances x $0.598/hour = $598
      • DBU cost for the Jobs Light Compute workload for 10 DS13v2 instances: 100 hours x 10 instances x 2.0 DBU per instance per hour x $0.22/DBU = $440
      • The total cost would therefore be $598 (VM cost) + $440 (DBU cost) = $1,038.
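
    Each case above follows one formula: total = (hours x instances x VM rate) + (hours x instances x DBUs per instance-hour x DBU rate). Here is a minimal Python sketch that reproduces the three examples; the rates are the illustrative figures quoted above, not current prices, so always check the pricing page:

    ```python
    # Illustrative Pay-as-You-Go rates from the example above
    # (Premium tier, East US 2, DS13v2). Not current prices.
    VM_RATE = 0.598          # $ per DS13v2 instance per hour
    DBU_PER_INSTANCE = 2.0   # DBUs emitted per DS13v2 instance per hour
    DBU_RATES = {            # $ per DBU, by workload type
        "All-Purpose Compute": 0.55,
        "Jobs Compute": 0.30,
        "Jobs Light Compute": 0.22,
    }

    def cluster_cost(hours: float, instances: int, workload: str) -> float:
        """Total cost = VM cost + DBU cost for one workload type."""
        vm_cost = hours * instances * VM_RATE
        dbu_cost = hours * instances * DBU_PER_INSTANCE * DBU_RATES[workload]
        return vm_cost + dbu_cost

    for workload in DBU_RATES:
        print(f"{workload}: ${cluster_cost(100, 10, workload):,.2f}")
    # All-Purpose Compute: $1,698.00
    # Jobs Compute: $1,198.00
    # Jobs Light Compute: $1,038.00
    ```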

    Important: In addition to VM and DBU charges, you may also be charged for managed disks, public IP addresses, or other resources such as Azure Storage or Azure Cosmos DB, depending on your application. The VMs provisioned for any cluster are charged from the VM "Starting" phase until the cores are no longer allocated to the virtual machine. Visit the Virtual Machine pricing page for details on compute pricing.

    According to the official documentation: Configure notebook result storage location

    Your organization’s privacy requirements may require that you store all interactive notebook results in the workspace storage account in your cloud account, rather than the Databricks-managed control plane default location where some notebook command results are stored.

    Option 1: By default, notebooks are stored in the Azure Databricks control plane (the objects stored in the workspace root folder are folders, notebooks, and files).

    Option 2: You can configure the workspace to store interactive notebook results in the customer account:

    You can configure your workspace to store all interactive notebook results in your Azure subscription, rather than the control plane. You can enable this feature using the admin settings page or REST API. This configuration has no effect on notebooks run as jobs, whose results are already stored in your Azure subscription by default.
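
    For reference, this is roughly what the REST API call looks like. This is a minimal sketch, assuming the workspace URL and a personal access token are available in environment variables; the configuration key name follows the Azure Databricks documentation for this setting, so verify it against the current docs before relying on it:

    ```python
    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]

    # Store interactive notebook results in the customer account instead of the
    # control plane. Key name per the Azure Databricks docs (assumption: verify).
    resp = requests.patch(
        f"{host}/api/2.0/workspace-conf",
        headers={"Authorization": f"Bearer {token}"},
        json={"storeInteractiveNotebookResultsInCustomerAccount": "true"},
    )
    resp.raise_for_status()  # the API returns 204 No Content on success
    ```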


    Important note: We don't recommend storing any data or notebooks in the default DBFS folders.

    Deploying Azure Databricks creates an additional managed resource group in the background, which includes an ADLS Gen2 storage account.

    Reason: Do not Store any Production Data in Default DBFS Folders

    Why shouldn't you store any production data in the default DBFS folders?

    Azure Databricks uses the DBFS root directory as a default location for some workspace actions. Databricks recommends against storing any production data or sensitive information in the DBFS root. This article focuses on recommendations to avoid accidental exposure of sensitive data on the DBFS root.


    Educate users not to store data on the DBFS root

    Because the DBFS root is accessible to all users in a workspace, all users can access any data stored here. It is important to instruct users to avoid using this location for storing sensitive data. The default location for managed tables in the Hive metastore on Azure Databricks is the DBFS root; to prevent end users who create managed tables from writing to the DBFS root, declare a location on external storage when creating databases in the Hive metastore.
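
    For example, here is a minimal sketch of declaring an external location when creating a Hive metastore database. It is meant to run in a Databricks notebook where spark is the active SparkSession, and the abfss path is a placeholder for your own ADLS Gen2 container:

    ```python
    # Placeholder path -- substitute your own container and storage account.
    external_root = "abfss://data@<storage-account>.dfs.core.windows.net/sales_db"

    # Declaring LOCATION here keeps managed tables created in this database
    # out of /user/hive/warehouse on the DBFS root.
    spark.sql(f"CREATE DATABASE IF NOT EXISTS sales_db LOCATION '{external_root}'")

    # This managed table now lives under the external location, not the DBFS root.
    spark.sql("CREATE TABLE IF NOT EXISTS sales_db.orders (id INT, amount DOUBLE)")
    ```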

    For more details, refer to Recommendations for working with DBFS root.

    From Azure Databricks Best Practices: Do not Store any Production Data in Default DBFS Folders


    Reasons for recommending that you store data in a separate storage account (ADLS Gen2) rather than the storage account associated with the Azure Databricks workspace:

    The most important reason, as quoted above: your organization's privacy requirements may require that you store all interactive notebook results in the workspace storage account in your cloud account, rather than the Databricks-managed control plane default location where some notebook command results are stored.

    Here are some other reasons to avoid the default option of storing data on the DBFS root:

    Reason 1: You don't have write permission when you access the same storage account externally, for example via Storage Explorer.

    Reason 2: You cannot use the same storage account for another ADB workspace, or use it as a linked service for Azure Data Factory or an Azure Synapse workspace.

    Reason 3: In the future, you might decide to use Azure Synapse workspaces or Microsoft Fabric instead of Azure Databricks.

    Reason 4: If you delete the existing workspace, you will lose the data stored in it.

    Hope this helps. Do let us know if you have any further queries.



