Understanding the costs of Azure Databricks jobs

WeirdMan 220 Reputation points
2024-05-31T07:58:49.75+00:00

I have an Azure Data Factory pipeline that executes multiple Databricks Notebooks using job clusters. I need to track the cost of these job clusters, including both the Databricks and the underlying VM costs, specifically for this set of jobs.

Currently, I can filter by Tag and jobid, but this process is manual and cumbersome, especially since the jobid changes with each pipeline run.

Is there a more automated way to tag or filter these costs, perhaps using the service principal that runs these jobs or another method?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Amira Bedhiafi 18,261 Reputation points
    2024-05-31T08:10:21.8166667+00:00

    Azure Cost Management is an integrated tool within the Azure Portal designed to monitor and understand usage costs for Azure services, including Azure Databricks.

    When you create an Azure Databricks Workspace, an associated Managed Resource Group is automatically generated. This managed resource group contains essential components such as the default storage account, virtual machines for cluster nodes, disks for the nodes, networking resources, and more. The screenshots below show the Azure Databricks Workspace resource alongside its associated Managed Resource Group.

    For monitoring usage via cluster, pool, and workspace tags, the official documentation provides comprehensive details about tags and their propagation to resources.

    To access the Cost Analysis section in the Azure Portal, search for "Cost Management + Billing," navigate to "Cost Management," and then select "Cost Analysis."

    To view total Databricks costs grouped by Meter Category, apply a filter using the Vendor tag with the value "Databricks," and group by either Meter Category or Meter Subcategory. You can also group by Meter for a more detailed breakdown or select "None" for a single line item. There are various other grouping options available for experimentation.
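    The same filter-and-group step can also be reproduced offline on an exported cost file. The following is a minimal sketch, not an official tool: it assumes a CSV export with columns named MeterCategory, CostInBillingCurrency, and Tags (a JSON-style string). Column names and tag formatting differ between export types, so treat these names, the helper functions, and the sample rows as assumptions.

```python
import csv
import io
import json
from collections import defaultdict


def parse_tags(raw):
    """Parse a Tags cell into a dict; tolerate exports that omit the outer braces."""
    raw = (raw or "").strip()
    if not raw:
        return {}
    if not raw.startswith("{"):
        raw = "{" + raw + "}"
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable tag cells are treated as "no tags" rather than failing.
        return {}


def cost_by_meter_category(rows, tag_key, tag_value):
    """Sum cost per MeterCategory for rows whose tag_key equals tag_value."""
    totals = defaultdict(float)
    for row in rows:
        tags = parse_tags(row.get("Tags"))
        if tags.get(tag_key) == tag_value:
            totals[row["MeterCategory"]] += float(row["CostInBillingCurrency"])
    return dict(totals)


# Made-up sample rows, only to show the assumed shape of the input.
SAMPLE = '''MeterCategory,CostInBillingCurrency,Tags
Azure Databricks,1.50,"{""Project"": ""YourProjectName"", ""Vendor"": ""Databricks""}"
Virtual Machines,2.25,"{""Project"": ""YourProjectName"", ""Vendor"": ""Databricks""}"
Virtual Machines,9.00,"{""Project"": ""Other""}"
'''

if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(SAMPLE))
    print(cost_by_meter_category(rows, "Project", "YourProjectName"))
```

    Filtering on a tag that stays the same across pipeline runs (here the hypothetical Project tag) captures both the Databricks charges and the underlying VM charges in one pass, without looking up each run's job ID.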


1 additional answer

  1. Vinodh247-1375 12,426 Reputation points
    2024-05-31T08:55:07.06+00:00
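    1. You can automate tagging with the Databricks REST API or the Databricks CLI in your pipeline by including tags when you create the cluster. The Clusters API names this field custom_tags. Replace the spark_version value with a runtime your workspace currently supports.

    Databricks CLI (pipelineRunId is a shell variable you set beforehand; the quoting lets the shell expand it inside the JSON):

    databricks clusters create --json '{ "cluster_name": "my-cluster", "spark_version": "5.3.x-scala2.11", "node_type_id": "Standard_DS3_v2", "num_workers": 2, "custom_tags": { "Environment": "Production", "Project": "YourProjectName", "RunId": "'"${pipelineRunId}"'" } }'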
    
    2. Automate tagging in the ADF pipeline

    Add a Web Activity in your ADF pipeline to call the Databricks REST API to create a cluster with tags.

    {
      "name": "CreateDatabricksCluster",
      "type": "WebActivity",
      "linkedServiceName": {
        "referenceName": "DatabricksLinkedService",
        "type": "LinkedServiceReference"
      },
      "typeProperties": {
        "method": "POST",
        "url": "https://<databricks-instance>/api/2.0/clusters/create",
        "headers": {
          "Authorization": "Bearer <token>"
        },
        "body": {
          "cluster_name": "my-cluster",
          "spark_version": "5.3.x-scala2.11",
          "node_type_id": "Standard_DS3_v2",
          "num_workers": 2,
          "custom_tags": {
            "Environment": "Production",
            "Project": "YourProjectName",
            "RunId": "@pipeline().RunId"
          }
        }
      }
    }
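
    If the call is made from a script rather than a Web Activity, the same request can be assembled as follows. This is a sketch under stated assumptions: the endpoint path comes from the example above, while the function names, the host, the token, and the tag values are hypothetical placeholders. The request is only built here, not sent.

```python
import json
import urllib.request


def build_cluster_spec(run_id, project, spark_version):
    """Return a cluster spec whose custom_tags identify the project and the run."""
    return {
        "cluster_name": "my-cluster",
        "spark_version": spark_version,  # choose a runtime your workspace supports
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        "custom_tags": {
            "Environment": "Production",
            "Project": project,  # stable across runs, useful as a cost filter
            "RunId": run_id,     # changes on every pipeline run
        },
    }


def build_create_request(host, token, spec):
    """Build (but do not send) the POST request to the clusters/create endpoint."""
    return urllib.request.Request(
        url=f"https://{host}/api/2.0/clusters/create",
        data=json.dumps(spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    spec = build_cluster_spec("<run-id>", "YourProjectName", "<spark-version>")
    req = build_create_request("<databricks-instance>", "<token>", spec)
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req)
```

    Carrying both a stable tag (Project) and a per-run tag (RunId) allows cost to be filtered for the whole set of jobs or for a single run.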
    
    3. Monitor costs using Azure Log Analytics:
      • Enable diagnostic logging for your Databricks workspace and send the logs to Azure Log Analytics. You can then write custom queries to track the activity associated with specific job runs. Treat the query below as a template and modify it to suit your requirements: the Tags and Cost columns are placeholders, and diagnostic logs record activity rather than billed amounts, so cost figures may need to come from Cost Management data.
         AzureDiagnostics
         | where ResourceType == "DATABRICKS" and Tags["RunId"] == "<run-id>"
         | summarize TotalCost = sum(Cost) by ResourceGroup, ResourceType, Tags
         
      