This article provides an overview of security controls and configurations for deployment and management of Azure Databricks accounts and workspaces. For information about securing your data, see Data governance best practices.
Not all security features are available on all pricing tiers. See the Azure Databricks pricing page to learn how features align to pricing plans.
This article mentions the term data plane, which is the compute layer of the Azure Databricks platform. In the context of this article, data plane refers to the Classic data plane in your Azure subscription. By contrast, the Serverless data plane that supports serverless SQL warehouses (Public Preview) runs in the Azure subscription of Azure Databricks. To learn more, see Serverless compute.
Accounts and workspaces
In Azure Databricks, a workspace is an Azure Databricks deployment in the cloud that serves as the unified environment where a given set of users access all of their Azure Databricks assets. Your organization can choose to have multiple workspaces or just one, depending on your needs.
An Azure Databricks account represents a single entity for purposes of billing and support. An account can include multiple workspaces.
Account admins handle general account management, and workspace admins manage the settings and features of individual workspaces in the account. To learn more about Azure Databricks admins, see Azure Databricks administration guide. Admins can deploy workspaces with security configurations including:
- Deploy a workspace in your own virtual network
- Deploy a workspace with secure cluster connectivity
- Enable Azure Private Link
- Enable double encryption for DBFS
- Enable customer-managed keys for encryption
The default deployment of Azure Databricks creates a new virtual network that is managed by Microsoft. You can choose to create a new workspace in your own customer-managed virtual network (also known as VNet injection) instead, enabling you to:
- Connect Azure Databricks to other Azure services (such as Azure Storage) in a more secure manner using service endpoints or private endpoints.
- Connect to on-premises data sources for use with Azure Databricks, taking advantage of user-defined routes.
- Connect Azure Databricks to a network virtual appliance to inspect all outbound traffic and take actions according to allow and deny rules configured in user-defined routes.
- Configure Azure Databricks to use custom DNS.
- Configure network security group (NSG) rules to specify egress traffic restrictions.
- Deploy Azure Databricks clusters in your existing VNet.
The VNet must include two subnets dedicated to your Azure Databricks workspace: a container subnet and a host subnet. You cannot share subnets across workspaces or deploy other Azure resources on the subnets that are used by your Azure Databricks workspace. If you have multiple workspaces in one VNet, it’s critical to plan your address space within the VNet. To learn more about deploying a workspace in your own virtual network, see Deploy Azure Databricks in your Azure virtual network (VNet injection).
You can create a new workspace with secure cluster connectivity. When secure cluster connectivity is enabled, customer virtual networks have no open ports and Databricks Runtime cluster nodes have no public IP addresses. This simplifies network administration by removing the need to configure ports on security groups or network peering. To learn more about deploying a workspace with secure cluster connectivity, see Secure cluster connectivity (No Public IP / NPIP).
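As an illustrative sketch, the parameters below correspond to those exposed by the published Azure Databricks ARM templates for VNet injection and secure cluster connectivity (`customVirtualNetworkId`, `customPublicSubnetName`, `customPrivateSubnetName`, `enableNoPublicIp`). The workspace name, subnet names, and resource IDs are placeholders to replace with your own values:

```python
# Sketch of parameters for an Azure Databricks workspace ARM template
# deployment with VNet injection and secure cluster connectivity (NPIP).
# All names and resource IDs below are placeholders.
import json

workspace_parameters = {
    "workspaceName": {"value": "my-databricks-workspace"},
    "pricingTier": {"value": "premium"},
    # VNet injection: deploy into your own VNet and its two dedicated
    # subnets (the container subnet and the host subnet).
    "customVirtualNetworkId": {
        "value": (
            "/subscriptions/<subscription-id>/resourceGroups/<rg-name>"
            "/providers/Microsoft.Network/virtualNetworks/<vnet-name>"
        )
    },
    "customPublicSubnetName": {"value": "databricks-host-subnet"},
    "customPrivateSubnetName": {"value": "databricks-container-subnet"},
    # Secure cluster connectivity: no public IPs on cluster nodes.
    "enableNoPublicIp": {"value": True},
}

print(json.dumps(workspace_parameters, indent=2))
```

Passing both sets of parameters together deploys a workspace that combines a customer-managed VNet with no inbound open ports.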
Private Link provides private connectivity from Azure VNets and on-premises networks to Azure services without exposing the traffic to the public network. Azure Databricks supports two different Private Link connection types:
- Front-end Private Link: A front-end Private Link connection allows users to connect to the Azure Databricks web application, REST API, and Databricks Connect API over a VNet interface endpoint.
- Back-end Private Link: A back-end Private Link connection enables private connectivity from Databricks Runtime clusters to an Azure Databricks workspace’s core services.
For more information, see Enable Azure Private Link.
Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. DBFS is implemented as a storage account in your Azure Databricks workspace’s managed resource group. The default storage location in DBFS is known as the DBFS root.
Azure Storage automatically encrypts all data in a storage account, including DBFS root storage. You can optionally enable encryption at the Azure Storage infrastructure level. When infrastructure encryption is enabled, data in a storage account is encrypted twice, once at the service level and once at the infrastructure level, with two different encryption algorithms and two different keys. To learn more about deploying a workspace with infrastructure encryption, see Configure double encryption for DBFS root.
Azure Databricks supports adding a customer-managed key to help protect and control access to data. There are three customer-managed key features for different types of data:
Customer-managed keys for managed disks: Azure Databricks compute workloads in the data plane store temporary data on Azure managed disks. By default, data stored on managed disks is encrypted at rest using server-side encryption with Microsoft-managed keys. You can configure your own key for your Azure Databricks workspace to use for managed disk encryption. See Configure customer-managed keys for Azure managed disks.
Customer-managed keys for managed services: Managed services data in the Azure Databricks control plane is encrypted at rest. You can add a customer-managed key for managed services to help protect and control access to the following types of encrypted data:
- Notebook source files that are stored in the control plane.
- Notebook results for notebooks that are stored in the control plane.
- Secrets stored by the secret manager APIs.
- Databricks SQL queries and query history.
- Personal access tokens or other credentials used to set up Git integration with Databricks Repos.
Customer-managed keys for DBFS root: By default, the storage account is encrypted with Microsoft-managed keys. You can configure your own key to encrypt all the data in the workspace’s root storage account. For more information, see Configure customer-managed keys for DBFS root.
For details about which customer-managed key features in Azure Databricks protect different types of data, see Customer-managed keys for encryption.
Users, groups, and service principals are configured in the Azure Databricks account and workspaces by administrators. For information on how to securely configure identity in Azure Databricks, see Identity best practices.
For REST API authentication, you can use either built-in revocable Azure Databricks personal access tokens or revocable Azure Active Directory tokens. As a security best practice, Databricks recommends using Azure Active Directory tokens for service principals to authenticate to automated tools, systems, scripts, and apps. For details, see Authenticate using Azure Active Directory tokens.
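As a minimal sketch of that recommendation, the following acquires an Azure Active Directory token for a service principal using the standard OAuth 2.0 client credentials flow. The tenant ID, client ID, and client secret are placeholders; `2ff814a6-3304-4ab8-85cb-cd0e6f879c1d` is the well-known application ID that identifies the Azure Databricks service to Azure Active Directory:

```python
# Sketch: acquire an Azure AD token for a service principal and use it as a
# Bearer token against the Databricks REST API. Tenant ID, client ID, and
# client secret are placeholders.
import json
import urllib.parse
import urllib.request

TENANT_ID = "<tenant-id>"
# Well-known application ID of the Azure Databricks service.
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"

token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"
token_request = {
    "grant_type": "client_credentials",
    "client_id": "<service-principal-client-id>",
    "client_secret": "<service-principal-secret>",
    "scope": f"{DATABRICKS_RESOURCE_ID}/.default",
}

def get_aad_token() -> str:
    """POST the client-credentials grant and return the access token."""
    body = urllib.parse.urlencode(token_request).encode()
    req = urllib.request.Request(token_url, data=body)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["access_token"]

# The returned token is then sent as: Authorization: Bearer <token>
```

Because the token is issued to the service principal rather than to a user, it can be rotated and revoked independently of any individual’s credentials.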
Workspace admins can use the Token Management API to review current Azure Databricks personal access tokens, delete tokens, and set the maximum lifetime of new tokens for their workspace. You can use the related Permissions API to control which users can create and use tokens to access workspace REST APIs.
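A sketch of both operations follows, using the Token Management API to list tokens and the workspace configuration endpoint to cap new-token lifetime. The workspace URL, admin token, and the 90-day limit are placeholder values:

```python
# Sketch: review personal access tokens and cap the lifetime of new ones.
# Host and admin token are placeholders.
import json
import urllib.request

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {
    "Authorization": "Bearer <admin-token>",
    "Content-Type": "application/json",
}

list_tokens_url = f"{HOST}/api/2.0/token-management/tokens"
workspace_conf_url = f"{HOST}/api/2.0/workspace-conf"

# Limit new personal access tokens to 90 days (value is a string).
max_lifetime_payload = {"maxTokenLifetimeDays": "90"}

def set_max_token_lifetime() -> None:
    req = urllib.request.Request(
        workspace_conf_url,
        data=json.dumps(max_lifetime_payload).encode(),
        headers=HEADERS,
        method="PATCH",
    )
    urllib.request.urlopen(req)

def list_tokens() -> list:
    req = urllib.request.Request(list_tokens_url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("token_infos", [])
```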
IP access lists
Authentication proves user identity, but it does not enforce the network location of the users. Accessing a cloud service from an unsecured network poses security risks, especially when the user may have authorized access to sensitive or personal data. With IP access lists, you can configure Azure Databricks workspaces so that users connect to the service only through existing networks with a secure perimeter.
Workspace admins can specify the IP addresses (or CIDR ranges) on the public network that are allowed access. These IP addresses could belong to egress gateways or specific user environments. You can also specify IP addresses or subnets to block. For details, see IP access lists.
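The two steps above — enabling the feature and registering an allow list — can be sketched against the IP Access Lists API as follows. The host, admin token, label, and the example ranges (drawn from the reserved TEST-NET documentation blocks) are placeholders:

```python
# Sketch: enable IP access lists, then allow only specified egress ranges.
# Host, token, label, and IP ranges are placeholders.
import json
import urllib.request

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {
    "Authorization": "Bearer <admin-token>",
    "Content-Type": "application/json",
}

enable_payload = {"enableIpAccessLists": "true"}
allow_list_payload = {
    "label": "office-and-vpn",
    "list_type": "ALLOW",  # use "BLOCK" to deny specific addresses instead
    "ip_addresses": ["203.0.113.0/24", "198.51.100.17"],
}

def enable_ip_access_lists() -> None:
    req = urllib.request.Request(
        f"{HOST}/api/2.0/workspace-conf",
        data=json.dumps(enable_payload).encode(),
        headers=HEADERS,
        method="PATCH",
    )
    urllib.request.urlopen(req)

def create_allow_list() -> None:
    req = urllib.request.Request(
        f"{HOST}/api/2.0/ip-access-lists",
        data=json.dumps(allow_list_payload).encode(),
        headers=HEADERS,
    )
    urllib.request.urlopen(req)
```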
You can also use Azure Private Link to block all public internet access to an Azure Databricks workspace.
Azure Databricks provides access to audit logs (also known as diagnostic logs) of activities performed by Azure Databricks users, allowing you to monitor detailed usage patterns. You can configure these logs to flow to your Azure Storage account, Azure Log Analytics workspace, or Azure Event Hubs namespace. See Diagnostic logging in Azure Databricks.
You can use cluster policies to enforce particular cluster settings, such as instance types, number of nodes, attached libraries, and compute cost, and display different cluster-creation interfaces for different user levels. Managing cluster configurations using policies can help enforce universal governance controls and manage the costs of your compute infrastructure. For more information, see Manage cluster policies.
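As an illustrative sketch, the policy below fixes autotermination, restricts instance types to an approved set, and caps cluster size, following the documented Cluster Policies API definition format. The host, token, policy name, and specific values are placeholders:

```python
# Sketch: create a cluster policy that enforces cost and configuration
# controls. Host, token, and the specific limits are placeholders.
import json
import urllib.request

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {
    "Authorization": "Bearer <admin-token>",
    "Content-Type": "application/json",
}

policy_definition = {
    # Shut clusters down after 60 idle minutes; users cannot change this.
    "autotermination_minutes": {"type": "fixed", "value": 60, "hidden": False},
    # Restrict instance types to a small approved set.
    "node_type_id": {
        "type": "allowlist",
        "values": ["Standard_DS3_v2", "Standard_DS4_v2"],
    },
    # Cap cluster size to control compute cost.
    "num_workers": {"type": "range", "maxValue": 10},
}

create_policy_payload = {
    "name": "Cost-controlled small clusters",
    # The API expects the policy definition as a JSON string.
    "definition": json.dumps(policy_definition),
}

def create_policy() -> None:
    req = urllib.request.Request(
        f"{HOST}/api/2.0/policies/clusters/create",
        data=json.dumps(create_policy_payload).encode(),
        headers=HEADERS,
    )
    urllib.request.urlopen(req)
```

Users who are granted access to this policy see a simplified cluster-creation form in which the fixed fields are enforced.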
In Azure Databricks, you can use access control lists (ACLs) to configure permissions to access workspace objects such as notebooks, experiments, models, clusters, jobs, dashboards, queries, and SQL warehouses. All admin users can manage access control lists, as can users who have been delegated permission to manage them. See Access control.
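As a sketch of setting such an ACL programmatically, the request below uses the Permissions API to grant cluster permissions to a group and a user. The host, token, cluster ID, and principal names are placeholders:

```python
# Sketch: set cluster permissions with the Permissions API.
# Host, token, cluster ID, and principal names are placeholders.
import json
import urllib.request

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
CLUSTER_ID = "<cluster-id>"
HEADERS = {
    "Authorization": "Bearer <admin-token>",
    "Content-Type": "application/json",
}

acl_payload = {
    "access_control_list": [
        # A group of engineers may fully manage the cluster...
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE"},
        # ...while an individual analyst may only attach notebooks to it.
        {"user_name": "analyst@example.com", "permission_level": "CAN_ATTACH_TO"},
    ]
}

def set_cluster_permissions() -> None:
    req = urllib.request.Request(
        f"{HOST}/api/2.0/permissions/clusters/{CLUSTER_ID}",
        data=json.dumps(acl_payload).encode(),
        headers=HEADERS,
        method="PUT",
    )
    urllib.request.urlopen(req)
```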
For information about managing access to your organization’s data, see Data governance guide.
You can use Databricks secrets to store credentials and reference them in notebooks and jobs. A secret is a key-value pair that stores secret material for an external data source or other computation, with a key name unique within a secret scope. You should never hard-code secrets or store them in plain text.
You create secrets using either the REST API or CLI, but you must use the Secrets utility (dbutils.secrets) in a notebook or job to read your secrets.
You can also use Databricks secrets to reference secrets stored in an Azure Key Vault. Secrets are stored encrypted at rest, but you can add a customer-managed key to add additional security. See Enable customer-managed keys for encryption.
For information on how to use Databricks secrets, see Secret management.
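The create-then-read pattern described above can be sketched with the Secrets API: one call creates a scope, another stores a secret in it, and a notebook reads the value with `dbutils.secrets`. The host, token, scope, key, and secret value are placeholders:

```python
# Sketch: create a secret scope, store a credential in it, and (from a
# notebook) read it back. Host, token, scope, and key are placeholders.
import json
import urllib.request

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
HEADERS = {
    "Authorization": "Bearer <token>",
    "Content-Type": "application/json",
}

create_scope_payload = {"scope": "jdbc"}
put_secret_payload = {
    "scope": "jdbc",
    "key": "password",
    "string_value": "<secret-value>",
}

def _post(path: str, payload: dict) -> None:
    req = urllib.request.Request(
        f"{HOST}{path}", data=json.dumps(payload).encode(), headers=HEADERS
    )
    urllib.request.urlopen(req)

def store_secret() -> None:
    _post("/api/2.0/secrets/scopes/create", create_scope_payload)
    _post("/api/2.0/secrets/put", put_secret_payload)

# In a notebook or job (dbutils is available only on a cluster), read it with:
#   password = dbutils.secrets.get(scope="jdbc", key="password")
```

Secret values read through `dbutils.secrets.get` are redacted in notebook output, so they are not displayed if a user tries to print them.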
You can automate some of your security configuration tasks by using the Databricks REST APIs. To assist customers with typical deployment scenarios, Databricks provides ARM templates that you can use to deploy workspaces. For large organizations with dozens of workspaces in particular, templates enable fast, consistent, automated workspace deployments.
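As a rough sketch of what such a template contains, the dict below models a minimal ARM template body for a `Microsoft.Databricks/workspaces` resource. The workspace name, managed resource group name, and `apiVersion` are placeholder values; consult the published templates for the authoritative schema:

```python
# Sketch: a minimal ARM template for an Azure Databricks workspace, modeled
# as a Python dict. Names and apiVersion are placeholders to adapt.
import json

arm_template = {
    "$schema": (
        "https://schema.management.azure.com/schemas/2019-04-01/"
        "deploymentTemplate.json#"
    ),
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "Microsoft.Databricks/workspaces",
            "apiVersion": "2018-04-01",  # placeholder; use a current version
            "name": "my-databricks-workspace",
            "location": "[resourceGroup().location]",
            "sku": {"name": "premium"},
            "properties": {
                # Azure Databricks manages this resource group for you.
                "managedResourceGroupId": (
                    "[concat(subscription().id,"
                    " '/resourceGroups/databricks-rg-my-workspace')]"
                )
            },
        }
    ],
}

print(json.dumps(arm_template, indent=2))
```

Checking a parameterized version of this template into source control lets you stamp out identically configured workspaces across subscriptions.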
Here are some resources to help you build a comprehensive security solution that meets your organization’s needs:
- The Databricks Security and Trust Center, which provides information about the ways in which security is built into every layer of the Databricks Lakehouse Platform.
- Security Best Practices, which provides a checklist of security practices, considerations, and patterns that you can apply to your deployment, learned from our enterprise engagements.
- Data governance best practices to implement data governance controls for your organization.