Data management

The Databricks Data Intelligence Platform enables data practitioners throughout your organization to collaborate and productionize data solutions using shared, securely governed data assets and tools.

This article helps you identify the correct starting point for your use case.

Many tasks on Azure Databricks require elevated permissions. Many organizations restrict these elevated permissions to a small number of users or teams. This article distinguishes actions that most workspace users can complete from actions that are restricted to privileged users.

Workspace administrators can help you determine if you should be requesting access to assets or requesting elevated permissions.

Find and access data

This section provides a brief overview of tasks to help you discover data assets available to you. Most of these tasks assume that an admin has configured permissions on data assets. See Configure data access.

Data discovery: For a more detailed overview of data discovery tasks, see Discover data.

Catalogs: Catalogs are the top-level object in the Unity Catalog data governance model. Use Catalog Explorer to find tables, views, and other data assets. See Explore database objects.

- Standard catalogs contain Unity Catalog schemas, tables, volumes, models, and other database objects. See Create catalogs.
- Foreign catalogs contain federated tables from external systems. See Manage and work with foreign catalogs.
- The hive_metastore catalog contains tables that use the built-in legacy Hive metastore instead of Unity Catalog for data governance. See Work with Unity Catalog and the legacy Hive metastore.

Connected storage: If you have access to compute resources, you can use built-in commands to explore files in connected storage. See Explore storage and find data files. (A hedged notebook sketch follows this table.)

Upload local files: By default, users have permission to upload small data files, such as CSVs, from their local machine. See Create or modify a table using file upload.
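
Data discovery works from a notebook as well as from the UI. The following is a minimal sketch, assuming a Unity Catalog-enabled workspace and a notebook attached to compute, where `spark`, `dbutils`, and `display` are provided by the runtime; the catalog, schema, table, and volume names are hypothetical placeholders.

```python
# Minimal data discovery sketch for a Databricks notebook (Python).
# `spark`, `dbutils`, and `display` are provided by the notebook runtime.
# The catalog, schema, table, and volume names below are hypothetical.

# List the catalogs, schemas, and tables visible to you.
display(spark.sql("SHOW CATALOGS"))
display(spark.sql("SHOW SCHEMAS IN main"))
display(spark.sql("SHOW TABLES IN main.sales"))

# Inspect a table's columns and properties.
display(spark.sql("DESCRIBE TABLE EXTENDED main.sales.orders"))

# Browse files in a Unity Catalog volume (connected storage).
for entry in dbutils.fs.ls("/Volumes/main/sales/raw_files/"):
    print(entry.path, entry.size)
```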

Work with data

This section provides an overview of common data tasks and the tools used to perform those tasks.

For all of the tasks described, users must have proper permissions to tools, compute resources, data, and other workspace artifacts. See Configure data access and Configure workspaces and infrastructure.

Database objects: In addition to tables and views, Azure Databricks uses other securable database objects, such as volumes, to securely govern data. See Database objects in Azure Databricks.

Data permissions: Unity Catalog governs all read and write operations in enabled workspaces. You must have adequate permissions to complete these operations. See Securable objects in Unity Catalog.

ETL: Extract, transform, and load (ETL) workloads are among the most common uses for Apache Spark and Azure Databricks, and much of the platform is built and optimized for ETL. See Run your first ETL workload on Azure Databricks.

Queries:
- All transformations, reports, analyses, and model training runs begin with a query against a table, view, or data files. You can query data using either batch or stream processing. See Query data.
- Perform ad hoc queries using the SQL query editor or notebooks to query tables, views, and other data assets. See Write queries and explore data in the SQL editor and Introduction to Databricks notebooks.

Dashboards & insights:
- AI/BI dashboards allow you to extract and visualize insights easily in the UI. See Dashboards.
- Genie spaces use text prompts to answer questions and provide insights informed by your data. See What is an AI/BI Genie space.

Ingest:
- LakeFlow Connect ingests data from popular external systems. See LakeFlow Connect.
- Auto Loader can be used with Delta Live Tables or Structured Streaming jobs to incrementally ingest data from cloud object storage. See What is Auto Loader?. (A hedged streaming sketch follows this table.)
- You can use Delta Live Tables or Structured Streaming to ingest data from message queues, including Kafka. See Query streaming data.

Transformations: Azure Databricks uses common syntax and tooling for transformations that range in complexity from SQL CTAS statements to near real-time streaming applications. For an overview of data transformations, see Transform data. (A minimal query-and-transform sketch follows this table.)
- To learn about using SQL queries for DDL and DML, see Access and manage saved queries.
- For an overview of PySpark, see PySpark on Azure Databricks.
- For details on Structured Streaming, see Streaming on Azure Databricks.

AI and machine learning: The Databricks Data Intelligence Platform provides a suite of tools for data science, machine learning, and AI applications. See AI and machine learning on Databricks.
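
The query and transformation entries above boil down to reading a table and writing a derived one. The following is a minimal sketch, assuming a notebook where `spark` is predefined and you have read access to the source table and write access to the target schema; the table and column names are hypothetical placeholders.

```python
# Minimal batch query and transformation sketch (hypothetical names).
# Assumes a Databricks notebook where `spark` is predefined, SELECT on the
# source table, and CREATE TABLE on the target schema.
from pyspark.sql import functions as F

# Batch query: read a Unity Catalog table into a DataFrame.
orders = spark.table("main.sales.orders")

# Simple transformation: aggregate revenue per day.
daily_revenue = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Persist the result as a managed table (the DataFrame equivalent of a SQL CTAS).
daily_revenue.write.mode("overwrite").saveAsTable("main.sales.daily_revenue")
```

The same logic can be expressed as a single SQL CTAS statement in the SQL editor; the DataFrame form is shown here only to keep the examples in one language.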
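
For the Auto Loader entry above, the following is a hedged sketch of the Structured Streaming form. The volume paths, checkpoint location, and target table name are hypothetical placeholders, and the sketch assumes a notebook where `spark` is predefined.

```python
# Minimal Auto Loader sketch using Structured Streaming (hypothetical paths
# and table names). Assumes a notebook where `spark` is predefined.
(spark.readStream
    .format("cloudFiles")                      # Auto Loader source
    .option("cloudFiles.format", "json")       # format of the incoming files
    .option("cloudFiles.schemaLocation",       # where the inferred schema is tracked
            "/Volumes/main/sales/checkpoints/orders_schema")
    .load("/Volumes/main/sales/raw_files/orders/")
    .writeStream
    .option("checkpointLocation",
            "/Volumes/main/sales/checkpoints/orders_ingest")
    .trigger(availableNow=True)                # process what is available, then stop
    .toTable("main.sales.orders_bronze"))
```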

Configure data access

Most Azure Databricks workspaces rely on a workspace admin or other power users to configure connections to external data sources and enforce privileges on data assets based on team membership, region, or roles. This section provides an overview of common tasks for configuring and controlling data access that require elevated permissions.

Note

Before requesting elevated permissions to configure a new connection to a data source, confirm whether you are just missing privileges on an existing connection, catalog, or table. If a data source is not available, consult your organization's policy for adding new data to your workspace.

Unity Catalog:
- Unity Catalog powers the data governance features built into the Databricks Data Intelligence Platform. See What is Unity Catalog?.
- Databricks account admins, workspace admins, and metastore admins have default privileges to manage Unity Catalog data privileges for users. See Manage privileges in Unity Catalog. (A hedged SQL sketch follows this table.)

Connections and access:
- Configuring secure connections to cloud object storage is a keystone activity and a prerequisite for nearly all admin and end-user tasks. See Manage access to cloud storage using Unity Catalog.
- Configure connections to external systems using Lakehouse Federation. See Overview of Lakehouse Federation setup.
- Unity Catalog extends data governance to provide access from external systems using open source APIs. See Access Databricks data using external systems.
- Service credentials let admins link permissions defined in cloud providers to Unity Catalog, so that users can leverage these credentials when developing workloads with integrated systems. See Manage access to external cloud services using service credentials.

Sharing:
- Delta Sharing is the core of the Azure Databricks secure data sharing platform, which includes Databricks Marketplace and Clean Rooms. See Share data and AI assets securely with users in other organizations.
- Admins can create new catalogs. Catalogs provide a high-level abstraction for data isolation and can either be tied to individual workspaces or shared across all workspaces in an account. See Create catalogs.
- AI/BI dashboards encourage owners to embed their credentials when publishing, ensuring that viewers can gain insights from shared results. For details, see Share a dashboard.
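
To make the privilege model above concrete, an admin with sufficient rights could run statements like the following from a notebook (shown through `spark.sql` to keep the examples in Python). The catalog, schema, and group names are hypothetical placeholders, and the exact grants depend on your organization's policies.

```python
# Hedged sketch of Unity Catalog privilege management (hypothetical names).
# Assumes a notebook where `spark` and `display` are predefined and the
# caller has the required admin or ownership privileges.

# Create a catalog and a schema for a team.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")

# Grant a group the ability to browse the catalog and read one schema.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `sales-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA analytics.sales TO `sales-analysts`")

# Review what has been granted.
display(spark.sql("SHOW GRANTS ON SCHEMA analytics.sales"))
```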

Configure workspaces and infrastructure

This section provides an overview of common tasks associated with administering workspace assets and infrastructure. Broadly defined, workspace assets include the following:

  • Compute resources: Compute resources include all-purpose interactive clusters, SQL warehouses, job clusters, and pipeline compute. A user or workload must have permissions to connect to running compute resources in order to process specified logic.

    Note

    Users who do not have access to connect to any compute resources have very limited functionality on Azure Databricks.

  • Platform tools: The Databricks Data Intelligence Platform provides a suite of tools tailored to different use cases and personas, such as notebooks, Databricks SQL, and Mosaic AI. Admins can customize settings that include default behaviors, optional features, and user access for many of these tools.

  • Artifacts: Artifacts include notebooks, queries, dashboards, files, libraries, pipelines, and jobs. Artifacts contain code and configurations that users author in order to perform desired actions on their data.

Important

The user who creates a workspace asset is assigned the owner role by default. For most assets, owners can grant permissions to any other user or group in the workspace.

To ensure that data and code are secure, Databricks recommends configuring the owner role for all artifacts and compute resources deployed to a production workspace.

Workspace entitlements: Workspace entitlements include basic workspace access, access to Databricks SQL, and unrestricted cluster creation. See Manage entitlements.

Compute resource access & policies:
- Most costs on Azure Databricks are for compute resources. Controlling which users can configure, deploy, start, and use various resources is vital to controlling costs. See Connect to all-purpose and jobs compute. (A hedged sketch for checking your available compute follows this table.)
- Compute policies work in tandem with workspace compute entitlements to ensure that entitled users only deploy compute resources that follow specified configuration rules. See Create and manage compute policies.
- Admins can configure default behaviors, data access policies, and user access to SQL warehouses. See SQL warehouse admin settings.

Platform tools: Use the admin console to configure behaviors ranging from customizing workspace appearance to enabling or disabling products and features. See Manage your workspace.

Workspace ACLs: Workspace access control lists (ACLs) govern how users and groups can interact with workspace assets, including compute resources, code artifacts, and jobs. See Access control lists.
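
If you are unsure which compute resources you can already reach before requesting more access, a quick check from code is possible. The following is a minimal sketch, assuming the Databricks SDK for Python (`databricks-sdk`) is installed and authentication is already configured, for example by running inside a notebook.

```python
# Hedged sketch: list the compute resources currently visible to you using
# the Databricks SDK for Python. Assumes `databricks-sdk` is installed and
# authentication is configured (for example, when run from a notebook).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# All-purpose and job clusters visible to you.
for cluster in w.clusters.list():
    print("cluster:", cluster.cluster_name, cluster.state)

# SQL warehouses visible to you.
for warehouse in w.warehouses.list():
    print("warehouse:", warehouse.name, warehouse.state)
```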

Productionize workloads

All Azure Databricks products are built for scale and stability and to accelerate the path from development to production. This section provides a brief introduction to the suite of tools recommended for getting workloads into production.

ETL pipelines: Delta Live Tables pipelines provide a declarative syntax for building and productionizing ETL pipelines. See What is Delta Live Tables?. (A minimal pipeline sketch follows this table.)

Orchestration: Jobs let you define complex workflows with dependencies, triggers, and schedules. See Overview of orchestration on Databricks.

CI/CD: Databricks Asset Bundles make it easy to manage and deploy data, assets, and artifacts across workspaces. See What are Databricks Asset Bundles?.
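
To illustrate the declarative ETL entry above, the following is a minimal Delta Live Tables sketch in Python. The source path and dataset names are hypothetical placeholders, and the code assumes it runs as part of a configured Delta Live Tables pipeline, where the `dlt` module and `spark` are available.

```python
# Minimal Delta Live Tables sketch (hypothetical paths and table names).
# This code only runs inside a configured Delta Live Tables pipeline,
# where the `dlt` module is available.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested incrementally with Auto Loader.")
def orders_bronze():
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/main/sales/raw_files/orders/"))

@dlt.table(comment="Cleaned orders with a basic quality expectation.")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def orders_silver():
    return (dlt.read_stream("orders_bronze")
        .withColumn("order_date", F.to_date("order_ts")))
```

The pipeline itself (its target catalog and schema, compute, and schedule) is configured in the workspace UI or a Databricks Asset Bundle rather than in this code.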