Azure Databricks concepts
This article introduces the set of fundamental concepts you need to understand to use Azure Databricks effectively.
Accounts and workspaces
In Azure Databricks, a workspace is an Azure Databricks deployment in the cloud that functions as an environment for your team to access Databricks assets. Your organization can choose to have either multiple workspaces or just one, depending on its needs.
An Azure Databricks account represents a single entity that can include multiple workspaces. Accounts enabled for Unity Catalog can be used to manage users and their access to data centrally across all of the workspaces in the account.
Billing: Databricks units (DBUs)
Azure Databricks bills based on Databricks units (DBUs), units of processing capability per hour based on VM instance type.
See the Azure Databricks pricing page.
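As a rough sketch of how DBU billing composes, a workload's cost is the instance type's DBU consumption rate multiplied by the hours run and the per-DBU price for the workload tier. The figures below are purely illustrative, not actual Azure rates:

```python
def workload_cost(dbu_per_hour: float, hours: float, price_per_dbu: float) -> float:
    """Estimate workload cost: DBU consumption rate x runtime x per-DBU price."""
    return dbu_per_hour * hours * price_per_dbu

# Illustrative figures only; see the Azure Databricks pricing page for real rates.
print(round(workload_cost(dbu_per_hour=2.0, hours=3.0, price_per_dbu=0.30), 2))  # 1.8
```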
Authentication and authorization
This section describes concepts that you need to know when you manage Azure Databricks identities and their access to Azure Databricks assets.
User
A unique individual who has access to the system. User identities are represented by email addresses. See Manage users.
Service principal
A service identity for use with jobs, automated tools, and systems such as scripts, apps, and CI/CD platforms. Service principals are represented by an application ID. See Manage service principals.
Group
A collection of identities. Groups simplify identity management, making it easier to assign access to workspaces, data, and other securable objects. All Databricks identities can be assigned as members of groups. See Manage groups.
Access control list (ACL)
A list of permissions attached to the workspace, cluster, job, table, or experiment. An ACL specifies which users or system processes are granted access to the objects, as well as what operations are allowed on the assets. Each entry in a typical ACL specifies a subject and an operation. See Access control.
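To make the subject-plus-operation structure concrete, here is a simplified sketch of an ACL in code. The field names loosely mirror the `access_control_list` payload used by the Databricks Permissions API, but this is an illustration, not a complete or authoritative schema:

```python
# A simplified sketch of an ACL. Each entry pairs a subject (user, group, or
# service principal) with a permission level. Field names loosely mirror the
# Permissions API's access_control_list payload but are illustrative only.
acl = {
    "access_control_list": [
        {"user_name": "someone@example.com", "permission_level": "CAN_MANAGE"},
        {"group_name": "data-engineers", "permission_level": "CAN_RESTART"},
    ]
}

def user_has(acl: dict, user: str, level: str) -> bool:
    """Return True if any ACL entry grants `user` the given permission level."""
    return any(
        e.get("user_name") == user and e["permission_level"] == level
        for e in acl["access_control_list"]
    )

print(user_has(acl, "someone@example.com", "CAN_MANAGE"))  # True
```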
Personal access token
An opaque string used to authenticate to the REST API and used by tools in Technology partners to connect to SQL warehouses. See Azure Databricks personal access tokens.
Azure Active Directory tokens can also be used to authenticate to the REST API.
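A minimal sketch of token authentication: the token is sent as a bearer token in the `Authorization` header of each REST API request. The workspace URL and token below are placeholders, and the Clusters API list endpoint is used only as an example:

```python
import urllib.request

# Sketch: authenticating a REST API call with a personal access token.
# Workspace URL and token are placeholders -- never hardcode a real token.
WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"
TOKEN = "dapiXXXXXXXXXXXXXXXX"  # placeholder

req = urllib.request.Request(
    f"{WORKSPACE_URL}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
# The request is only constructed here; send it with urllib.request.urlopen(req).
print(req.get_header("Authorization").startswith("Bearer "))  # True
```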
UI
The Azure Databricks UI is a graphical interface for interacting with features, such as workspace folders and their contained objects, data objects, and computational resources.
Data science & engineering
Data science & engineering tools aid collaboration among data scientists, data engineers, and data analysts. This section describes the fundamental concepts.
Workspace
A workspace is an environment for accessing all of your Azure Databricks assets. A workspace organizes objects (notebooks, libraries, dashboards, and experiments) into folders and provides access to data objects and computational resources.
Notebook
A web-based interface for creating data science and machine learning workflows that can contain runnable commands, visualizations, and narrative text. See Introduction to Databricks notebooks.
Dashboard
An interface that provides organized access to visualizations. See Dashboards.
Library
A package of code available to the notebook or job running on your cluster. Databricks runtimes include many libraries, and you can add your own.
Repo
A folder whose contents are co-versioned together by syncing them to a remote Git repository. Databricks Repos integrate with Git to provide source and version control for your projects.
Experiment
A collection of MLflow runs for training a machine learning model. See Organize training runs with MLflow experiments.
Azure Databricks interfaces
This section describes the interfaces that Azure Databricks supports, in addition to the UI, for accessing your assets: the REST API and the command-line interface (CLI).
REST API
There are three versions of the REST API: 2.1, 2.0, and 1.2. Databricks recommends REST APIs 2.1 and 2.0, which support most of the functionality of the REST API 1.2.
CLI
An open source project hosted on GitHub. The CLI is built on top of the REST API (latest).
Data management
This section describes the objects that hold the data on which you perform analytics and feed into machine learning algorithms.
Databricks File System (DBFS)
A filesystem abstraction layer over a blob store. It contains directories, which can contain files (data files, libraries, and images), and other directories. DBFS is automatically populated with some datasets that you can use to learn Azure Databricks. See What is the Databricks File System (DBFS)?.
Database
A collection of data objects, such as tables or views and functions, that is organized so that it can be easily accessed, managed, and updated. See What is a database?
Table
A representation of structured data. You query tables with Apache Spark SQL and Apache Spark APIs. See What is a table?
Delta table
By default, all tables created in Azure Databricks are Delta tables. Delta tables are based on the Delta Lake open source project, a framework for high-performance ACID table storage over cloud object stores. A Delta table stores data as a directory of files on cloud object storage and registers table metadata to the metastore within a catalog and schema.
Find out more about technologies branded as Delta.
Metastore
The component that stores all the structure information of the various tables and partitions in the data warehouse, including column and column type information, the serializers and deserializers necessary to read and write data, and the corresponding files where the data is stored. See What is a metastore?
Every Azure Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. You also have the option to use an existing external Hive metastore.
Visualization
A graphical presentation of the result of running a query. See Visualizations.
Computation management
This section describes concepts that you need to know to run computations in Azure Databricks.
Cluster
A set of computation resources and configurations on which you run notebooks and jobs. There are two types of clusters: all-purpose and job. See Clusters.
- You create an all-purpose cluster using the UI, CLI, or REST API. You can manually terminate and restart an all-purpose cluster. Multiple users can share such clusters to do collaborative interactive analysis.
- The Azure Databricks job scheduler creates a job cluster when you run a job on a new job cluster and terminates the cluster when the job is complete. You cannot restart a job cluster.
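For a concrete picture of what a cluster configuration looks like, here is a hedged sketch of the JSON payload you might send when creating an all-purpose cluster through the Clusters API (`POST /api/2.0/clusters/create`). The field values are placeholders; check your workspace for valid node types and Databricks Runtime versions:

```python
import json

# A sketch of a cluster-creation payload for the Clusters API. Values are
# placeholders, not recommendations.
cluster_spec = {
    "cluster_name": "shared-analysis",
    "spark_version": "13.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "Standard_DS3_v2",     # placeholder Azure VM size
    "num_workers": 2,
    "autotermination_minutes": 60,         # terminate after an hour idle
}

payload = json.dumps(cluster_spec)
print(json.loads(payload)["num_workers"])  # 2
```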
Pool
A set of idle, ready-to-use instances that reduce cluster start and auto-scaling times. When attached to a pool, a cluster allocates its driver and worker nodes from the pool. See Create a pool.
If the pool does not have sufficient idle resources to accommodate the cluster’s request, the pool expands by allocating new instances from the instance provider. When an attached cluster is terminated, the instances it used are returned to the pool and can be reused by a different cluster.
Databricks runtime
The set of core components that run on the clusters managed by Azure Databricks. See Databricks runtimes. Azure Databricks has the following runtimes:
- Databricks Runtime includes Apache Spark but also adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.
- Databricks Runtime for Machine Learning is built on Databricks Runtime and provides prebuilt machine learning infrastructure that is integrated with all of the capabilities of the Azure Databricks workspace. It contains multiple popular libraries, including TensorFlow, Keras, PyTorch, and XGBoost.
- Databricks Light is the Azure Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits provided by Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.
Workflows
Frameworks to develop and run data processing pipelines:
- Create, run, and manage Azure Databricks Jobs: A non-interactive mechanism for running a notebook or library either immediately or on a scheduled basis.
- What is Delta Live Tables?: A framework for building reliable, maintainable, and testable data processing pipelines.
See What is Azure Databricks Workflows?.
Workload
Azure Databricks identifies two types of workloads subject to different pricing schemes: data engineering (job) and data analytics (all-purpose).
- Data engineering: an (automated) workload that runs on a job cluster, which the Azure Databricks job scheduler creates for each workload.
- Data analytics: an (interactive) workload that runs on an all-purpose cluster. Interactive workloads typically run commands within an Azure Databricks notebook. However, running a job on an existing all-purpose cluster is also treated as an interactive workload.
Execution context
The state for a read–eval–print loop (REPL) environment for each supported programming language. The supported languages are Python, R, Scala, and SQL.
Machine learning
Machine Learning on Azure Databricks is an integrated end-to-end environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
Experiments
The main unit of organization for tracking machine learning model development. See Organize training runs with MLflow experiments. Experiments organize, display, and control access to individual logged runs of model training code.
Feature Store
A centralized repository of features. See Databricks Feature Store. Feature Store enables feature sharing and discovery across your organization and also ensures that the same feature computation code is used for model training and inference.
Models & model registry
Model
A trained machine learning or deep learning model that has been registered in Model Registry.
SQL REST API
An interface that allows you to automate tasks on SQL objects. See Databricks SQL API reference.
Dashboard
A presentation of data visualizations and commentary. See Databricks SQL dashboards.
This section describes concepts that you need to know to run SQL queries in Azure Databricks.
- Query: A valid SQL statement.
- SQL warehouse: A computation resource on which you execute SQL queries.
- Query history: A list of executed queries and their performance characteristics.