Tutorial: Calculate container statistics by using Databricks
This tutorial shows you how to gather statistics about your containers by using Azure Blob Storage inventory along with Azure Databricks.
In this tutorial, you learn how to:
- Generate an inventory report
- Create an Azure Databricks workspace, cluster, and notebook
- Read the blob inventory file
- Get the number and total size of blobs, snapshots, and versions
- Get the number of blobs by blob type and content type
Prerequisites
- An Azure subscription - create an account for free
- An Azure storage account - create a storage account
- Your user identity must have the Storage Blob Data Contributor role assigned to it.
Generate an inventory report
Enable blob inventory reports for your storage account. See Enable Azure Storage blob inventory reports.
Use the following configuration settings:
| Setting | Value |
|---|---|
| Rule name | blobinventory |
| Container | `<name of your container>` |
| Object type to inventory | Blob |
| Blob types | Block blobs, page blobs, and append blobs |
| Subtypes | Include blob versions, include snapshots, include deleted blobs |
| Blob inventory fields | All |
| Inventory frequency | Daily |
| Export format | CSV |
After you enable inventory reports, you might have to wait up to 24 hours for the first report to be generated.
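While you wait, you can optionally check whether a report has landed by listing the destination container. The following is a minimal sketch that uses the azure-storage-blob Python package (an assumption; it isn't part of this tutorial's Databricks steps), with the same placeholder values used later in the notebook code:

```python
# Optional sketch: poll the inventory destination container for report files.
# Assumes the azure-storage-blob package is installed (pip install azure-storage-blob).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential="<storage-account-key>",
)
container_client = service.get_container_client("<container-name>")

# Inventory runs write CSV files under a date-based path, for example:
# 2023/02/02/02-16-17/blobinventory/blobinventory_1000000_0.csv
for blob in container_client.list_blobs():
    if blob.name.endswith(".csv"):
        print(blob.name, blob.size)
```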
Configure Azure Databricks
In this section, you create an Azure Databricks workspace, cluster, and notebook. Later in this tutorial, you paste code snippets into notebook cells, and then run them to gather container statistics.
1. Create an Azure Databricks workspace. See Create an Azure Databricks workspace.
2. Create a cluster. See Create a cluster.
3. Create a notebook, and choose Python as the default language of the notebook. See Create a notebook.
Read the blob inventory file
Copy and paste the following code block into the first cell, but don't run this code yet.
```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pyspark.sql.functions as F

storage_account_name = "<storage-account-name>"
storage_account_key = "<storage-account-key>"
container = "<container-name>"
blob_inventory_file = "<blob-inventory-file-name>"
hierarchical_namespace_enabled = False

if not hierarchical_namespace_enabled:
    # Authenticate with the account key, and read the inventory CSV over the Blob endpoint.
    spark.conf.set("fs.azure.account.key.{0}.blob.core.windows.net".format(storage_account_name), storage_account_key)
    df = spark.read.csv("wasbs://{0}@{1}.blob.core.windows.net/{2}".format(container, storage_account_name, blob_inventory_file), header='true', inferSchema='true')
else:
    # Accounts with a hierarchical namespace use the Data Lake Storage (dfs) endpoint instead.
    spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name), storage_account_key)
    df = spark.read.csv("abfss://{0}@{1}.dfs.core.windows.net/{2}".format(container, storage_account_name, blob_inventory_file), header='true', inferSchema='true')
```
In this code block, replace the following values:

- Replace the `<storage-account-name>` placeholder value with the name of your storage account.
- Replace the `<storage-account-key>` placeholder value with the account key of your storage account (or read the key from a secret scope, as shown in the sketch after this list).
- Replace the `<container-name>` placeholder value with the name of the container that holds the inventory reports.
- Replace the `<blob-inventory-file-name>` placeholder with the fully qualified name of the inventory file (for example: `2023/02/02/02-16-17/blobinventory/blobinventory_1000000_0.csv`).
- If your account has a hierarchical namespace, set the `hierarchical_namespace_enabled` variable to `True`.
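Pasting the account key directly into a notebook is fine for a tutorial, but in a shared workspace you might prefer to read it from a Databricks secret scope instead. A minimal sketch, assuming a secret scope named `storage-scope` and a key named `account-key` have already been created (both names are hypothetical):

```python
# Hypothetical scope and key names; create them first with the Databricks CLI
# (databricks secrets create-scope) before running this cell.
storage_account_key = dbutils.secrets.get(scope="storage-scope", key="account-key")
```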
Press the SHIFT + ENTER keys to run the code in this block.
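Before moving on, you can optionally confirm that the inventory file loaded with the schema you expect. For example, run the following in a new cell:

```python
# Inspect the inferred schema and preview a few rows of the inventory report.
df.printSchema()
display(df.limit(5))
```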
Get blob count and size
In a new cell, paste the following code:
print("Number of blobs in the container:", df.count()) print("Number of bytes occupied by blobs in the container:", df.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
Press SHIFT + ENTER to run the cell.
The notebook displays the number of blobs in a container and the number of bytes occupied by blobs in the container.
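The byte count that this cell prints can be hard to read for large containers. If you want a human-readable figure, a small helper like the following (a sketch, not part of the original tutorial) converts it:

```python
def format_bytes(num_bytes):
    """Convert a raw byte count into a human-readable string."""
    for unit in ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]:
        if num_bytes < 1024 or unit == "PiB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

# The aggregate returns None for an empty container, so default to 0.
total_bytes = df.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'] or 0
print("Total size of blobs in the container:", format_bytes(total_bytes))
```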
Get snapshot count and size
In a new cell, paste the following code:
```python
from pyspark.sql.functions import *

dfT = df.where(~(col("Snapshot")).like("Null"))
print("Number of snapshots in the container:", dfT.count())
print("Number of bytes occupied by snapshots in the container:", dfT.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
```
Press SHIFT + ENTER to run the cell.
The notebook displays the number of snapshots and total number of bytes occupied by blob snapshots.
Get version count and size
In a new cell, paste the following code:
```python
from pyspark.sql.functions import *

dfT = df.where(~(col("VersionId")).like("Null"))
print("Number of versions in the container:", dfT.count())
print("Number of bytes occupied by versions in the container:", dfT.agg({'Content-Length': 'sum'}).first()['sum(Content-Length)'])
```
Press SHIFT + ENTER to run the cell.
The notebook displays the number of blob versions and total number of bytes occupied by blob versions.
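The snapshot and version counts above each scan the DataFrame separately. If you'd like them in a single pass, conditional aggregation works too; the sketch below reuses the same "Null" string filters as the previous cells:

```python
import pyspark.sql.functions as F

# Count total rows, snapshots, and versions in one pass over the inventory.
summary = df.agg(
    F.count("*").alias("Total rows"),
    F.sum(F.when(~F.col("Snapshot").like("Null"), 1).otherwise(0)).alias("Snapshots"),
    F.sum(F.when(~F.col("VersionId").like("Null"), 1).otherwise(0)).alias("Versions"),
)
display(summary)
```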
Get blob count by blob type
In a new cell, paste the following code:
```python
display(df.groupBy('BlobType').count().withColumnRenamed("count", "Total number of blobs in the container by BlobType"))
```
Press SHIFT + ENTER to run the cell.
The notebook displays the number of blobs of each blob type.
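You can extend the same grouping to report size alongside count. For example, the following cell sums Content-Length per blob type:

```python
import pyspark.sql.functions as F

# Blob count and total bytes, grouped by blob type.
display(
    df.groupBy("BlobType").agg(
        F.count("*").alias("Blob count"),
        F.sum("Content-Length").alias("Total bytes"),
    )
)
```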
Get blob count by content type
In a new cell, paste the following code:
```python
display(df.groupBy('Content-Type').count().withColumnRenamed("count", "Total number of blobs in the container by Content-Type"))
```
Press SHIFT + ENTER to run the cell.
The notebook displays the number of blobs associated with each content type.
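If the container holds many content types, sorting the result makes the largest groups easier to spot. For example:

```python
import pyspark.sql.functions as F

# Show the most common content types first.
display(df.groupBy("Content-Type").count().orderBy(F.desc("count")))
```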
Terminate the cluster
To avoid unnecessary billing, make sure to terminate the cluster. See Terminate a cluster.
Next steps
- Learn how to use Azure Synapse to calculate the blob count and total size of blobs per container. See Calculate blob count and total size per container using Azure Storage inventory.
- Learn how to generate and visualize statistics that describe containers and blobs. See Tutorial: Analyze blob inventory reports.
- Learn about ways to optimize your costs based on the analysis of your blobs and containers. See these articles:
  - Plan and manage costs for Azure Blob Storage