Connect to dbt Core

Article
09/04/2024

This article explains what dbt is, how to install dbt Core, and how to connect it to Azure Databricks. A hosted version of dbt, called dbt Cloud, is also available. For more information, see Connect to dbt Cloud.

What is dbt?

dbt (data build tool) is a development environment for transforming data by writing select statements. dbt turns these select statements into tables and views. dbt compiles your code into raw SQL and then runs that code on the specified database in Azure Databricks. dbt supports collaborative coding patterns and best practices, including version control, documentation, and modularity.
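
As a rough illustration of the paragraph above, a dbt model is a .sql file that contains a single select statement. The sketch below writes one such file from the shell. The directory `models_sketch`, the model name `daily_orders`, and the table `raw.orders` with its columns are hypothetical placeholders, not names from this article.

```shell
# Hypothetical example: directory, model name, table, and columns are placeholders.
mkdir -p models_sketch

# A dbt model is one select statement; config() asks dbt to build it as a view.
cat > models_sketch/daily_orders.sql <<'EOF'
{{ config(materialized='view') }}

select
    order_date,
    count(*) as order_count
from raw.orders
group by order_date
EOF

# Show the model file that was written.
cat models_sketch/daily_orders.sql
```

In a real project the file would sit in the project's models directory, and `dbt run` would compile it and create the view in the target schema.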

dbt does not extract or load data. dbt focuses on the transformation step only, using a “transform after load” architecture. dbt assumes that you already have a copy of your data in your database.

dbt Core enables you to write dbt code in the IDE of your choice on your local development machine and then run dbt from the command line. dbt Core includes the dbt Command Line Interface (CLI). The dbt CLI is free to use and open source.

dbt Core (and dbt Cloud) can use hosted git repositories. For more information, see Creating a dbt project and Using an existing project on the dbt website.

Installation requirements

Before you install dbt Core, you must install the following on your local development machine:

- Python 3.7 or higher
- A utility for creating Python virtual environments (such as pipenv)

You also need one of the following to authenticate:

- (Recommended) dbt Core enabled as an OAuth application in your account. This is enabled by default.
- A personal access token

Note

As a security best practice when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use OAuth tokens.

If you use personal access token authentication, Databricks recommends using personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

Step 1: Install the dbt Databricks adapter

Databricks recommends using a Python virtual environment because it isolates package versions and code dependencies from those in other environments. This helps reduce unexpected package version mismatches and code dependency collisions.

Databricks recommends version 1.8.0 or greater of the dbt-databricks package.

Important

If your local development machine uses any of the following operating systems, you must complete additional steps first: CentOS, macOS, Ubuntu, Debian, and Windows. See the “Does my operating system have prerequisites” section of Use pip to install dbt on the dbt Labs website.
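
The installation itself could look like the following sketch. It assumes that pipenv is already installed and that a `python3` interpreter is on the PATH; the directory name `dbt_demo` is a placeholder. The version constraint reflects the recommendation above, and the adapter package is expected to pull in dbt Core as a dependency.

```shell
# Create and enter a working directory (the name is a placeholder).
mkdir -p ~/dbt_demo
cd ~/dbt_demo

# Create a virtual environment for this directory using the python3 found on PATH.
pipenv --python "$(command -v python3)"

# Install the Databricks adapter at or above the recommended version.
pipenv install "dbt-databricks>=1.8.0"

# Confirm that dbt and the databricks plugin are available inside the environment.
pipenv run dbt --version
```

For interactive work, `pipenv shell` activates the virtual environment so that later `dbt` commands can be run without the `pipenv run` prefix.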

Step 2: Create a dbt project and specify and test connection settings

In this step, you create a dbt project (a collection of related directories and files required to use dbt). You then configure your connection profiles, which contain the connection settings for Azure Databricks compute, a SQL warehouse, or both. To increase security, dbt projects and profiles are stored in separate locations by default.

With the virtual environment still activated, run the dbt init command with the project name. This example procedure creates a project named my_dbt_demo.
```
dbt init my_dbt_demo
```
When you are prompted to choose a databricks or spark database, enter the number that corresponds to databricks.
When prompted for a host value, do the following:
- For a compute, enter the Server Hostname value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks compute.
- For a SQL warehouse, enter the Server Hostname value from the Connection Details tab for your SQL warehouse.
When prompted for an http_path value, do the following:
- For a compute, enter the HTTP Path value from the Advanced Options, JDBC/ODBC tab for your Azure Databricks compute.
- For a SQL warehouse, enter the HTTP Path value from the Connection Details tab for your SQL warehouse.
To choose an authentication type, enter the number that corresponds with use oauth (recommended) or use access token.
If you chose use access token for your authentication type, enter the value of your Azure Databricks personal access token.

Note

As a security best practice, when you authenticate with automated tools, systems, scripts, and apps, Databricks recommends that you use personal access tokens belonging to service principals instead of workspace users. To create tokens for service principals, see Manage tokens for a service principal.

When prompted for the desired Unity Catalog option value, enter the number that corresponds to use Unity Catalog or not use Unity Catalog.
If you chose to use Unity Catalog, enter the desired value for catalog when prompted.
Enter the desired values for schema and threads when prompted.
dbt writes your entries to a profiles.yml file. The location of this file is listed in the output of the dbt init command. You can also list this location later by running the dbt debug --config-dir command. You can open this file now to examine and verify its contents.
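
For orientation, a profiles.yml for token authentication might resemble the sketch below. The host, HTTP path, catalog, and schema values are placeholders, and the output name `dev` is an assumption. The file is written to a scratch directory rather than to the location dbt uses. The token is read through dbt's `env_var` function from a hypothetically named `DATABRICKS_TOKEN` variable so that it is not stored in the file; the file that dbt init generates may differ.

```shell
# Write an illustrative profile to a scratch directory (not the real dbt location).
mkdir -p profile_sketch

cat > profile_sketch/profiles.yml <<'EOF'
my_dbt_demo:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: <catalog-name>   # only when using Unity Catalog
      schema: <schema-name>
      host: <server-hostname>
      http_path: <http-path>
      token: "{{ env_var('DATABRICKS_TOKEN') }}"   # variable name is an assumption
      threads: 1
EOF

# Show the sketch so it can be compared with the file dbt generated.
cat profile_sketch/profiles.yml
```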

If you chose use oauth for your authentication type, add your machine-to-machine (M2M) or user-to-machine (U2M) authentication profile to profiles.yml.

For examples, see Configure Azure Databricks sign-on from dbt Core with Microsoft Entra ID.

Databricks does not recommend specifying secrets in profiles.yml directly. Instead, set the client ID and client secret as environment variables.
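
One way to follow this recommendation is to export the credentials in the shell and reference them from profiles.yml with dbt's `env_var` function. In this sketch the variable names `DATABRICKS_CLIENT_ID` and `DATABRICKS_CLIENT_SECRET` and the placeholder values are assumptions. The `auth_type`, `client_id`, and `client_secret` keys are assumed to match the M2M profile layout in the linked examples and should be checked against them.

```shell
# Placeholder values; in practice these come from a secret store, not from a script.
export DATABRICKS_CLIENT_ID="<service-principal-application-id>"
export DATABRICKS_CLIENT_SECRET="<service-principal-secret>"

# Write only the authentication lines of an output entry to a scratch file.
mkdir -p oauth_sketch
cat > oauth_sketch/profiles_fragment.yml <<'EOF'
      auth_type: oauth
      client_id: "{{ env_var('DATABRICKS_CLIENT_ID') }}"
      client_secret: "{{ env_var('DATABRICKS_CLIENT_SECRET') }}"
EOF

# Show the fragment; these lines would sit under an output in profiles.yml.
cat oauth_sketch/profiles_fragment.yml
```

Because the values are resolved when dbt runs, the variables must be set in the same shell session (or job environment) that runs dbt.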
Confirm the connection details by running the dbt debug command from the my_dbt_demo directory.

If you chose use oauth for your authentication type, you’re prompted to sign in with your identity provider.

Important

Before you begin, verify that your compute or SQL warehouse is running.

Run the following commands:
```
cd my_dbt_demo
dbt debug
```
You should see output similar to the following:
```
...
Configuration:
  profiles.yml file [OK found and valid]
  dbt_project.yml file [OK found and valid]

Required dependencies:
  - git [OK found]

Connection:
  ...
  Connection test: OK connection ok
```

Next steps

- Create, run, and test dbt Core models locally. See the dbt Core tutorial.
- Run dbt Core projects as Azure Databricks job tasks. See Use dbt transformations in an Azure Databricks job.

Additional resources

- What, exactly, is dbt?
- Analytics Engineering for Everyone: Databricks in dbt Cloud on the dbt website
- dbt Getting Started tutorial
- dbt documentation
- dbt CLI documentation
- dbt + Databricks Demo
- dbt blog
