

Use Databricks SQL in an Azure Databricks job

You can use the SQL task type in an Azure Databricks job, allowing you to create, schedule, operate, and monitor workflows that include Databricks SQL objects such as queries, legacy dashboards, and alerts. For example, your workflow can ingest data, prepare the data, perform analysis using Databricks SQL queries, and then display the results in a legacy dashboard.

This article provides an example workflow that creates a legacy dashboard displaying metrics for GitHub contributions. In this example, you will:

  • Ingest GitHub data using a Python script and the GitHub REST API.
  • Transform the GitHub data using a Delta Live Tables pipeline.
  • Trigger Databricks SQL queries performing analysis on the prepared data.
  • Display the analysis in a legacy dashboard.

GitHub analysis dashboard

Before you begin

You need the following to complete this walkthrough:

  • A GitHub personal access token. You store this token in a Databricks secret scope in Step 1.
  • The Databricks CLI, used to create the secret scope and store the token.
  • A serverless or pro SQL warehouse to run the Databricks SQL tasks in the workflow.

Step 1: Store the GitHub token in a secret

Instead of hardcoding credentials such as the GitHub personal access token in a job, Databricks recommends using a secret scope to store and manage secrets securely. The following Databricks CLI commands are an example of creating a secret scope and storing the GitHub token in a secret in that scope:

databricks secrets create-scope <scope-name>
databricks secrets put-secret <scope-name> <token-key> --string-value <token>
  • Replace <scope-name> with the name of an Azure Databricks secret scope to store the token.
  • Replace <token-key> with the name of a key to assign to the token.
  • Replace <token> with the value of the GitHub personal access token.
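
Once the token is stored, code running in the workspace reads it with the Databricks secrets utility instead of handling the raw value. The ingestion script in Step 2 retrieves it like this:

    # Read the token back from the secret scope; use the same scope and key names
    # you passed to the Databricks CLI above.
    token = dbutils.secrets.get(scope="<scope-name>", key="<token-key>")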

Step 2: Create a script to fetch GitHub data

The following Python script uses the GitHub REST API to fetch data on commits and contributions from a GitHub repo. Input arguments specify the GitHub repo. The records are saved to a location in DBFS specified by another input argument.

This example uses DBFS to store the Python script, but you can also use Databricks Git folders or workspace files to store and manage the script.

  • Save this script to a location on your local disk:

    import requests
    import sys

    api_url = "https://api.github.com"

    def get_commits(owner, repo, token, path):
      # Fetch the commit history for the repository and write it under <path>/commits.
      request_url = f"{api_url}/repos/{owner}/{repo}/commits"
      get_response(request_url, f"{path}/commits", token)

    def get_contributors(owner, repo, token, path):
      # Fetch the contributor list for the repository and write it under <path>/contributors.
      request_url = f"{api_url}/repos/{owner}/{repo}/contributors"
      get_response(request_url, f"{path}/contributors", token)

    def get_response(request_url, path, token):
      # Page through the GitHub REST API until an empty page is returned,
      # writing each page of results to a separate JSON file.
      page = 1
      more = True

      while more:
        response = requests.get(request_url, params={'page': page}, headers={'Authorization': "token " + token})
        if response.text != "[]":
          write(path + "/records-" + str(page) + ".json", response.text)
          page += 1
        else:
          more = False

    def write(filename, contents):
      # Save the API response to the DBFS output location.
      dbutils.fs.put(filename, contents)

    def main():
      args = sys.argv[1:]
      if len(args) < 6:
        print("Usage: github-api.py owner repo request output-dir secret-scope secret-key")
        sys.exit(1)

      owner = sys.argv[1]
      repo = sys.argv[2]
      request = sys.argv[3]
      output_path = sys.argv[4]
      secret_scope = sys.argv[5]
      secret_key = sys.argv[6]

      # Read the GitHub token from the secret scope created in Step 1.
      token = dbutils.secrets.get(scope=secret_scope, key=secret_key)

      if request == "commits":
        get_commits(owner, repo, token, output_path)
      elif request == "contributors":
        get_contributors(owner, repo, token, output_path)

    if __name__ == "__main__":
      main()
    
  • Upload the script to DBFS:

    1. Go to your Azure Databricks landing page and click Catalog in the sidebar.
    2. Click Browse DBFS.
    3. In the DBFS file browser, click Upload. The Upload Data to DBFS dialog appears.
    4. Enter a path in DBFS to store the script, then drag the Python script to Drop files to upload, or click to browse and select the script.
    5. Click Done.
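
If you want to verify the GitHub API calls before wiring the script into a job, the same pagination logic can be exercised locally, writing pages to local files instead of DBFS. A minimal sketch, assuming a token in the GITHUB_TOKEN environment variable and the example databrickslabs/overwatch repository:

    import os
    import requests

    def fetch_pages(request_url, token):
      # GitHub list endpoints return "[]" once you page past the last result.
      page = 1
      while True:
        response = requests.get(request_url, params={"page": page},
                                headers={"Authorization": "token " + token})
        if response.text == "[]":
          break
        with open(f"records-{page}.json", "w") as f:
          f.write(response.text)
        page += 1

    fetch_pages("https://api.github.com/repos/databrickslabs/overwatch/commits",
                os.environ["GITHUB_TOKEN"])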

Step 3: Create a Delta Live Tables pipeline to process the GitHub data

In this section, you create a Delta Live Tables pipeline to convert the raw GitHub data into tables that can be analyzed by Databricks SQL queries. To create the pipeline, perform the following steps:

  1. In the sidebar, click New and select Notebook from the menu. The Create Notebook dialog appears.

  2. Enter a name for the notebook and select Python as the Default Language. You can leave Cluster set to the default value. The Delta Live Tables runtime creates a cluster before it runs your pipeline.

  3. Click Create.

  4. Copy the Python code example and paste it into your new notebook. You can add the example code to a single cell of the notebook or multiple cells.

    import dlt
    from pyspark.sql.functions import *
    
    def parse(df):
       return (df
         .withColumn("author_date", to_timestamp(col("commit.author.date")))
         .withColumn("author_email", col("commit.author.email"))
         .withColumn("author_name", col("commit.author.name"))
         .withColumn("comment_count", col("commit.comment_count"))
         .withColumn("committer_date", to_timestamp(col("commit.committer.date")))
         .withColumn("committer_email", col("commit.committer.email"))
         .withColumn("committer_name", col("commit.committer.name"))
         .withColumn("message", col("commit.message"))
         .withColumn("sha", col("commit.tree.sha"))
         .withColumn("tree_url", col("commit.tree.url"))
         .withColumn("url", col("commit.url"))
         .withColumn("verification_payload", col("commit.verification.payload"))
         .withColumn("verification_reason", col("commit.verification.reason"))
         .withColumn("verification_signature", col("commit.verification.signature"))
         .withColumn("verification_verified", col("commit.verification.signature").cast("string"))
         .drop("commit")
       )
    
    @dlt.table(
       comment="Raw GitHub commits"
    )
    def github_commits_raw():
      df = spark.read.json(spark.conf.get("commits-path"))
      return parse(df.select("commit"))
    
    @dlt.table(
      comment="Info on the author of a commit"
    )
    def commits_by_author():
      return (
        dlt.read("github_commits_raw")
          .withColumnRenamed("author_date", "date")
          .withColumnRenamed("author_email", "email")
          .withColumnRenamed("author_name", "name")
          .select("sha", "date", "email", "name")
      )
    
    @dlt.table(
      comment="GitHub repository contributors"
    )
    def github_contributors_raw():
      return(
        spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .load(spark.conf.get("contribs-path"))
      )
    
  5. In the sidebar, click Workflows, click the Delta Live Tables tab, and click Create Pipeline.

  6. Give the pipeline a name, for example, Transform GitHub data.

  7. In the Notebook libraries field, enter the path to your notebook or click the file picker to select the notebook.

  8. Click Add configuration. In the Key text box, enter commits-path. In the Value text box, enter the DBFS path where the GitHub commit records will be written. This can be any path you choose and is the same path you’ll use when configuring the first Python task when you create the workflow.

  9. Click Add configuration again. In the Key text box, enter contribs-path. In the Value text box, enter the DBFS path where the GitHub contributor records will be written. This can be any path you choose and is the same path you’ll use when configuring the second Python task when you create the workflow.

  10. In the Target field, enter a target database, for example, github_tables. Setting a target database publishes the output data to the metastore and is required for the downstream queries analyzing the data produced by the pipeline.

  11. Click Save.
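
After the pipeline has run (you create and trigger the workflow that runs it in the next steps), you can spot-check the published tables from a notebook. A minimal sketch, assuming the example target database github_tables:

    # Inspect the tables the pipeline publishes to the target database.
    commits = spark.table("github_tables.commits_by_author")
    commits.printSchema()
    commits.show(10, truncate=False)

    contributors = spark.table("github_tables.github_contributors_raw")
    print(contributors.count())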

Step 4: Create a workflow to ingest and transform GitHub data

Before analyzing and visualizing the GitHub data with Databricks SQL, you need to ingest and prepare the data. To create a workflow to complete these tasks, perform the following steps:

Create an Azure Databricks job and add the first task

  1. Go to your Azure Databricks landing page and do one of the following:

    • In the sidebar, click Workflows and click Create Job.
    • In the sidebar, click New and select Job from the menu.
  2. In the task dialog box that appears on the Tasks tab, replace Add a name for your job… with your job name, for example, GitHub analysis workflow.

  3. In Task name, enter a name for the task, for example, get_commits.

  4. In Type, select Python script.

  5. In Source, select DBFS / S3.

  6. In Path, enter the path to the script in DBFS.

  7. In Parameters, enter the following arguments for the Python script:

    ["<owner>","<repo>","commits","<DBFS-output-dir>","<scope-name>","<github-token-key>"]

    • Replace <owner> with the name of the repository owner. For example, to fetch records from the github.com/databrickslabs/overwatch repository, enter databrickslabs.
    • Replace <repo> with the repository name, for example, overwatch.
    • Replace <DBFS-output-dir> with a path in DBFS to store the records fetched from GitHub.
    • Replace <scope-name> with the name of the secret scope you created to store the GitHub token.
    • Replace <github-token-key> with the name of the key you assigned to the GitHub token.
  8. Click Save task.

Add another task

  1. Click Add task below the task you just created.

  2. In Task name, enter a name for the task, for example, get_contributors.

  3. In Type, select the Python script task type.

  4. In Source, select DBFS / S3.

  5. In Path, enter the path to the script in DBFS.

  6. In Parameters, enter the following arguments for the Python script:

    ["<owner>","<repo>","contributors","<DBFS-output-dir>","<scope-name>","<github-token-key>"]

    • Replace <owner> with the name of the repository owner. For example, to fetch records from the github.com/databrickslabs/overwatch repository, enter databrickslabs.
    • Replace <repo> with the repository name, for example, overwatch.
    • Replace <DBFS-output-dir> with a path in DBFS to store the records fetched from GitHub.
    • Replace <scope-name> with the name of the secret scope you created to store the GitHub token.
    • Replace <github-token-key> with the name of the key you assigned to the GitHub token.
  7. Click Save task.

Add a task to transform the data

  1. Click Add task below the task you just created.
  2. In Task name, enter a name for the task, for example, transform_github_data.
  3. In Type, select Delta Live Tables pipeline.
  4. In Pipeline, select the pipeline created in Step 3: Create a Delta Live Tables pipeline to process the GitHub data.
  5. Click Create.

Step 5: Run the data transformation workflow

Click Run Now to run the workflow. To view details for the run, click the link in the Start time column for the run in the job runs view. Click on each task to view details for the task run.
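
If you prefer to trigger the run programmatically rather than from the UI, a minimal sketch using the Databricks SDK for Python (an optional approach not covered by this walkthrough; the job ID shown is hypothetical and appears on the job's page in the Workflows UI):

    from databricks.sdk import WorkspaceClient

    w = WorkspaceClient()  # Picks up workspace URL and authentication from the environment.

    # Trigger the job and block until the run reaches a terminal state.
    run = w.jobs.run_now(job_id=123456789).result()
    print(f"Run finished with state: {run.state.result_state}")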

Step 6: (Optional) View the output data

To view the output data after the workflow run completes, perform the following steps:

  1. In the run details view, click on the Delta Live Tables task.
  2. In the Task run details panel, click on the pipeline name under Pipeline. The Pipeline details page displays.
  3. Select the commits_by_author table in the pipeline DAG.
  4. Click the table name next to Metastore in the commits_by_author panel. The Catalog Explorer page opens.

In Catalog Explorer, you can view the table schema, sample data, and other details for the data. Follow the same steps to view data for the github_contributors_raw table.

Step 7: Remove the GitHub data

In a real-world application, you might be continuously ingesting and processing data. Because this example downloads and processes the entire data set, you must remove the already downloaded GitHub data to prevent an error when re-running the workflow. To remove the downloaded data, perform the following steps:

  1. Create a new notebook and enter the following commands in the first cell:

    dbutils.fs.rm("<commits-path", True)
    dbutils.fs.rm("<contributors-path", True)
    

    Replace <commits-path> and <contributors-path> with the DBFS paths you configured when creating the Python tasks.

  2. Click the Run menu and select Run Cell.

You can also add this notebook as a task in the workflow.
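
If you do add the cleanup notebook as a task, one way to avoid hardcoding the paths is to read them from notebook parameters. A sketch, assuming hypothetical widget names commits_path and contributors_path that you would pass as task parameters:

    # Read the paths to delete from notebook parameters so the cleanup can run as a
    # notebook task in the workflow. The widget names here are arbitrary.
    dbutils.widgets.text("commits_path", "")
    dbutils.widgets.text("contributors_path", "")

    for path in [dbutils.widgets.get("commits_path"), dbutils.widgets.get("contributors_path")]:
      if path:
        dbutils.fs.rm(path, True)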

Step 8: Create the Databricks SQL queries

After running the workflow and creating the required tables, create queries to analyze the prepared data. To create the example queries and visualizations, perform the following steps:

Display the top 10 contributors by month

  1. Click the icon below the Databricks logo in the sidebar and select SQL.

  2. Click Create a query to open the Databricks SQL query editor.

  3. Make sure the catalog is set to hive_metastore. Click default next to hive_metastore and set the database to the Target value you set in the Delta Live Tables pipeline.

  4. In the New Query tab, enter the following query:

    SELECT
      date_part('YEAR', date) AS year,
      date_part('MONTH', date) AS month,
      name,
      count(1)
    FROM
      commits_by_author
    WHERE
      name IN (
        SELECT
          name
        FROM
          commits_by_author
        GROUP BY
          name
        ORDER BY
          count(name) DESC
        LIMIT 10
      )
      AND
        date_part('YEAR', date) >= 2022
    GROUP BY
      name, year, month
    ORDER BY
      year, month, name
    
  5. Click the New query tab and rename the query, for example, Commits by month top 10 contributors.

  6. By default, the results are displayed as a table. To change how the data is visualized, for example to a bar chart, click the kebab menu in the Results panel and click Edit.

  7. In Visualization type, select Bar.

  8. In X column, select month.

  9. In Y columns, select count(1).

  10. In Group by, select name.

  11. Click Save.

Display the top 20 contributors

  1. Click + > Create new query and make sure the catalog is set to hive_metastore. Click default next to hive_metastore and set the database to the Target value you set in the Delta Live Tables pipeline.

  2. Enter the following query:

    SELECT
      login,
      cast(contributions AS INTEGER)
    FROM
      github_contributors_raw
    ORDER BY
      contributions DESC
    LIMIT 20
    
  3. Click the New query tab and rename the query, for example, Top 20 contributors.

  4. To change the visualization from the default table, click the kebab menu in the Results panel and click Edit.

  5. In Visualization type, select Bar.

  6. In X column, select login.

  7. In Y columns, select contributions.

  8. Click Save.

Display the total commits by author

  1. Click + > Create new query and make sure the catalog is set to hive_metastore. Click default next to hive_metastore and set the database to the Target value you set in the Delta Live Tables pipeline.

  2. Enter the following query:

    SELECT
      name,
      count(1) commits
    FROM
      commits_by_author
    GROUP BY
      name
    ORDER BY
      commits DESC
    LIMIT 10
    
  3. Click the New query tab and rename the query, for example, Total commits by author.

  4. To change the visualization from the default table, click the kebab menu in the Results panel and click Edit.

  5. In Visualization type, select Bar.

  6. In X column, select name.

  7. In Y columns, select commits.

  8. Click Save.
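
If you want to sanity-check any of these queries from a notebook before adding them to the dashboard, you can run the same SQL with spark.sql. A sketch for the last query, assuming the example target database github_tables:

    # Run the "total commits by author" query from a notebook against the pipeline output.
    spark.sql("USE github_tables")
    spark.sql("""
      SELECT name, count(1) AS commits
      FROM commits_by_author
      GROUP BY name
      ORDER BY commits DESC
      LIMIT 10
    """).show(truncate=False)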

Step 9: Create a dashboard

  1. In the sidebar, click Dashboards.
  2. Click Create dashboard.
  3. Enter a name for the dashboard, for example, GitHub analysis.
  4. For each query and visualization created in Step 8: Create the Databricks SQL queries, click Add > Visualization and select each visualization.

Step 10: Add the SQL tasks to the workflow

To add the query tasks to the workflow you created in Create an Azure Databricks job and add the first task, repeat the following steps for each query you created in Step 8: Create the Databricks SQL queries:

  1. Click Workflows in the sidebar.
  2. In the Name column, click the job name.
  3. Click the Tasks tab.
  4. Click Add task below the last task.
  5. Enter a name for the task, in Type select SQL, and in SQL task select Query.
  6. Select the query in SQL query.
  7. In SQL warehouse, select a serverless SQL warehouse or a pro SQL warehouse to run the task.
  8. Click Create.

Step 11: Add a dashboard task

  1. Click Add task below the last task.
  2. Enter a name for the task, in Type, select SQL, and in SQL task select Legacy dashboard.
  3. Select the dashboard created in Step 9: Create a dashboard.
  4. In SQL warehouse, select a serverless SQL warehouse or a pro SQL warehouse to run the task.
  5. Click Create.

Step 12: Run the complete workflow

To run the workflow, click Run Now. To view details for the run, click the link in the Start time column for the run in the job runs view.

Step 13: View the results

To view the results when the run completes, click the final dashboard task and click the dashboard name under SQL dashboard in the right panel.