Understand key concepts

Azure Databricks is an amalgamation of multiple technologies that enable you to work with data at scale. Before using Azure Databricks, there are some key concepts that you should understand.

Diagram: the key elements of an Azure Databricks solution.

  1. Apache Spark clusters - Spark is a distributed data processing solution that uses clusters to scale processing across multiple compute nodes. Each Spark cluster has a driver node to coordinate processing jobs, and one or more worker nodes on which the processing occurs. This distributed model enables each node to operate on a subset of the job in parallel, reducing the overall time for the job to complete (a minimal cluster sketch follows this list). To learn more about clusters in Azure Databricks, see Clusters in the Azure Databricks documentation.
  2. Data lake storage - While each cluster node has its own local file system (on which operating system and other node-specific files are stored), the nodes in a cluster also have access to a shared, distributed file system in which they can access and operate on data files. This shared data storage, known as a data lake, enables you to mount cloud storage, such as Azure Data Lake Storage or a Microsoft OneLake data store, and use it to work with and persist file-based data in any format (see the data lake sketch after this list).
  3. Metastore - Azure Databricks uses a metastore to define a relational schema of tables over file-based data. The tables are based on the Delta Lake format and can be queried using SQL syntax to access the data in the underlying files. The table definitions, and the file system locations on which they're based, are stored in the metastore; this abstracts the data objects you use for analytics and data processing from the physical storage where the data files reside (see the Delta table sketch after this list). Azure Databricks metastores are managed in Unity Catalog, which provides centralized data storage, access management, and governance. Depending on how your Azure Databricks workspace is configured, you may also use a legacy Hive metastore with data files stored in a Databricks File System (DBFS) data lake.
  4. Notebooks - One of the most common ways for data analysts, data scientists, data engineers, and developers to work with Spark is to write code in notebooks. Notebooks provide an interactive environment in which you can combine text and graphics in Markdown format with cells containing code that you run interactively in the notebook session (see the notebook cell sketch after this list). To learn more about notebooks, see Notebooks in the Azure Databricks documentation.
  5. SQL Warehouses - SQL Warehouses are relational compute resources with endpoints that enable client applications to connect to an Azure Databricks workspace and use SQL to work with data in tables (see the connector sketch after this list). The results of SQL queries can be used to create data visualizations and dashboards to support business analytics and decision making. SQL Warehouses are only available in premium-tier Azure Databricks workspaces. To learn more about SQL Warehouses, see SQL Warehouses in the Azure Databricks documentation.
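
To make the cluster model concrete, here's a minimal PySpark sketch of distributed processing. In an Azure Databricks notebook a `SparkSession` named `spark` is created for you, so the explicit builder call below is only needed when running outside Databricks.

```python
from pyspark.sql import SparkSession

# In an Azure Databricks notebook, a SparkSession named `spark` already
# exists; creating one explicitly is only required outside Databricks.
spark = SparkSession.builder.appName("cluster-demo").getOrCreate()

# The driver node plans the job; each worker node processes a subset
# (partition) of the data in parallel.
df = spark.range(0, 1_000_000)

# Workers compute partial counts over their partitions, and the driver
# combines them into the final result.
print(df.count())  # 1000000
```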
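
The data lake sketch below reads file-based data from a shared cloud storage path. The storage account, container, and folder names are placeholders, not real locations; substitute a path your workspace is configured to access.

```python
# Placeholder path: substitute your own Azure Data Lake Storage account,
# container, and folder. Every node in the cluster can read the same
# shared path, so the load is distributed across the workers.
path = "abfss://data@<storage-account>.dfs.core.windows.net/sales/"

# Read Parquet files from the data lake into a distributed DataFrame.
df = spark.read.parquet(path)
df.show(5)
```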
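
The Delta table sketch below persists a small DataFrame as a Delta table and then queries it by name with SQL. The table name and sample rows are invented for illustration.

```python
# Build a tiny DataFrame (sample data invented for illustration).
df = spark.createDataFrame(
    [(1, "widget", 2.50), (2, "gadget", 9.99)],
    ["id", "product", "price"],
)

# Save it as a Delta table; the metastore records the table definition
# and the file system location of the underlying Delta files.
df.write.format("delta").mode("overwrite").saveAsTable("demo_products")

# Query by table name: the metastore resolves the name to the physical
# files, so the SQL never references a storage path.
spark.sql("SELECT product, price FROM demo_products WHERE price > 5").show()
```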
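
The notebook cell sketch below shows how cells combine Markdown and code. A cell that begins with the `%md` magic renders as formatted text, while code cells run interactively and show their output inline.

```python
# A Markdown cell in a Databricks notebook starts with the %md magic:
#
#   %md
#   ## Sales analysis
#   Narrative text, headings, and images documenting the steps below.
#
# A code cell runs interactively in the notebook session; display()
# renders a DataFrame as an interactive table beneath the cell.
display(spark.sql("SELECT product, price FROM demo_products"))
```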
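
Finally, a hypothetical client-side sketch of connecting to a SQL Warehouse using the Databricks SQL Connector for Python (`pip install databricks-sql-connector`). The hostname, HTTP path, and access token are placeholders; the real values come from the warehouse's connection details in your workspace.

```python
from databricks import sql

# Placeholder connection details: copy the real values from your SQL
# Warehouse's connection details in the Azure Databricks workspace.
with sql.connect(
    server_hostname="<workspace>.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Run a SQL query against tables defined in the metastore.
        cursor.execute("SELECT product, price FROM demo_products")
        for row in cursor.fetchall():
            print(row)
```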