The Fabric Apache Spark diagnostic emitter extension is a library that enables Apache Spark applications to emit logs, event logs, and metrics to multiple destinations, including Azure Log Analytics, Azure Storage, and Azure Event Hubs.
In this tutorial, you learn how to use the Fabric Apache Spark diagnostic emitter extension to send Apache Spark application logs, event logs, and metrics to your Azure Storage account.
To collect diagnostic logs and metrics, you can use an existing Azure Storage account. If you don't have one, you can create an Azure blob storage account or create a storage account to use with Azure Data Lake Storage Gen2.
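If you'd rather script this prerequisite, the following is a minimal sketch using the azure-identity and azure-mgmt-storage Python packages; it isn't part of the emitter setup, the subscription ID, resource group, account name, and region are placeholders, and is_hns_enabled is only needed if you want Azure Data Lake Storage Gen2 capabilities.
# Minimal sketch: create a StorageV2 account to hold the diagnostic output.
# Requires: pip install azure-identity azure-mgmt-storage
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
poller = client.storage_accounts.begin_create(
    "<resource-group>",
    "<my-blob-storage>",  # placeholder storage account name
    StorageAccountCreateParameters(
        location="<region>",
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        is_hns_enabled=True,  # True only if you need Data Lake Storage Gen2
    ),
)
print(poller.result().provisioning_state)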
Create a Fabric Environment Artifact in Fabric
Add the following Spark properties with the appropriate values to the environment artifact, or select Add from .yml in the ribbon to download the sample yaml file, which already contains the following properties.
spark.synapse.diagnostic.emitters: MyStorageBlob
spark.synapse.diagnostic.emitter.MyStorageBlob.type: "AzureStorage"
spark.synapse.diagnostic.emitter.MyStorageBlob.categories: "DriverLog,ExecutorLog,EventLog,Metrics"
spark.synapse.diagnostic.emitter.MyStorageBlob.uri: "https://<my-blob-storage>.blob.core.windows.net/<container-name>/<folder-name>"
spark.synapse.diagnostic.emitter.MyStorageBlob.auth: "AccessKey"
spark.synapse.diagnostic.emitter.MyStorageBlob.secret: <storage-access-key>
spark.fabric.pools.skipStarterPools: "true" //Add this Spark property when using the default pool.
Fill in the following parameters in the configuration file: <my-blob-storage>, <container-name>, <folder-name>, <storage-access-key>. For more details on these parameters, see Azure Storage configurations.
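Optionally, you can sanity-check these values before publishing the environment. The following sketch uses the azure-storage-blob Python package and the same placeholders; it isn't part of the emitter configuration.
# Optional check: confirm the access key and container name are valid.
# Requires: pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<my-blob-storage>.blob.core.windows.net",
    credential="<storage-access-key>",
)
print("Container exists:", service.get_container_client("<container-name>").exists())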
Note
Known issue: Sessions currently fail to start when using Option 2. Storing secrets in Key Vault prevents Spark sessions from starting, so for now, prioritize configuring the emitter using the method outlined in Option 1.
Ensure that users who submit Apache Spark applications are granted read secret permissions. For more information, see Provide access to Key Vault keys, certificates, and secrets with an Azure role-based access control.
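To quickly confirm that an identity can read the secret, you can run a short check like the following sketch, which uses the azure-identity and azure-keyvault-secrets Python packages; it isn't part of the emitter configuration, and the vault and secret names are the same placeholders used below.
# Optional check: verify the identity can read the secret from Key Vault.
# Requires: pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<AZURE_KEY_VAULT_NAME>.vault.azure.net",
    credential=DefaultAzureCredential(),
)
secret = client.get_secret("<AZURE_KEY_VAULT_SECRET_KEY_NAME>")
print("Secret retrieved; value length:", len(secret.value))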
To configure Azure Key Vault for storing the workspace key:
Create and go to your key vault in the Azure portal.
On the settings page for the key vault, select Secrets, then Generate/Import.
On the Create a secret screen, enter the <storage-access-key> as the value of the secret, then select Create.
Create a Fabric Environment Artifact in Fabric.
Add the following Spark properties, or select Add from .yml on the ribbon to upload the sample yaml file, which includes the following Spark properties.
spark.synapse.diagnostic.emitters: MyStorageBlob
spark.synapse.diagnostic.emitter.MyStorageBlob.type: "AzureStorage"
spark.synapse.diagnostic.emitter.MyStorageBlob.categories: "DriverLog,ExecutorLog,EventLog,Metrics"
spark.synapse.diagnostic.emitter.MyStorageBlob.uri: "https://<my-blob-storage>.blob.core.windows.net/<container-name>/<folder-name>"
spark.synapse.diagnostic.emitter.MyStorageBlob.auth: "AccessKey"
spark.synapse.diagnostic.emitter.MyStorageBlob.secret.keyVault: <AZURE_KEY_VAULT_NAME>
spark.synapse.diagnostic.emitter.MyStorageBlob.secret.keyVault.secretName: <AZURE_KEY_VAULT_SECRET_KEY_NAME>
spark.fabric.pools.skipStarterPools: "true" //Add this Spark property when using the default pool.
Fill in the following parameters in the configuration file: <my-blob-storage>, <container-name>, <folder-name>, <AZURE_KEY_VAULT_NAME>, <AZURE_KEY_VAULT_SECRET_KEY_NAME>. For more details on these parameters, see Azure Storage configurations.
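For illustration only, a completed Key Vault-based configuration might look like the following; the storage account, container, folder, vault, and secret names are hypothetical placeholders, not values to copy.
spark.synapse.diagnostic.emitters: MyStorageBlob
spark.synapse.diagnostic.emitter.MyStorageBlob.type: "AzureStorage"
spark.synapse.diagnostic.emitter.MyStorageBlob.categories: "DriverLog,ExecutorLog,EventLog,Metrics"
spark.synapse.diagnostic.emitter.MyStorageBlob.uri: "https://contosologs.blob.core.windows.net/sparkdiagnostics/fabric-logs"
spark.synapse.diagnostic.emitter.MyStorageBlob.auth: "AccessKey"
spark.synapse.diagnostic.emitter.MyStorageBlob.secret.keyVault: contoso-keyvault
spark.synapse.diagnostic.emitter.MyStorageBlob.secret.keyVault.secretName: storage-access-key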
Save and publish changes.
To attach the environment to Notebooks or Spark job definitions:
To set the environment as the workspace default:
Note
Only workspace admins can manage workspace configurations. Changes made here will apply to all notebooks and Spark job definitions attached to the workspace settings. For more information, see Fabric Workspace Settings.
After submitting a job to the configured Spark session, you can view the logs and metrics files in the destination storage account. The logs are stored in corresponding paths based on different applications, identified by <workspaceId>.<fabricLivyId>. All log files are in JSON Lines format (also known as newline-delimited JSON or ndjson), which is convenient for data processing.
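Because the files are newline-delimited JSON, Spark can read them directly. The following is a minimal sketch that assumes it runs in a Fabric notebook (where spark is predefined), that the session can reach the container over the abfss scheme, and that the path placeholders match the configuration above.
# Minimal sketch: load the emitted diagnostic records into a DataFrame.
logs_path = (
    "abfss://<container-name>@<my-blob-storage>.dfs.core.windows.net/"
    "<folder-name>/<workspaceId>.<fabricLivyId>/"
)
df = spark.read.json(logs_path)  # spark.read.json expects JSON Lines by default

# Example: count records per category (Log, EventLog, Metrics).
df.groupBy("category").count().show()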
| Configuration | Description |
|---|---|
| spark.synapse.diagnostic.emitters | Required. The comma-separated destination names of diagnostic emitters. For example, MyDest1,MyDest2. |
| spark.synapse.diagnostic.emitter.<destination>.type | Required. Built-in destination type. To enable the Azure Storage destination, AzureStorage needs to be included in this field. |
| spark.synapse.diagnostic.emitter.<destination>.categories | Optional. The comma-separated selected log categories. Available values include DriverLog, ExecutorLog, EventLog, Metrics. If not set, the default value is all categories. |
| spark.synapse.diagnostic.emitter.<destination>.auth | Required. AccessKey for using storage account access key authorization. SAS for shared access signatures authorization. |
| spark.synapse.diagnostic.emitter.<destination>.uri | Required. The destination blob container folder URI. Should match the pattern https://<my-blob-storage>.blob.core.windows.net/<container-name>/<folder-name>. |
| spark.synapse.diagnostic.emitter.<destination>.secret | Optional. The secret (AccessKey or SAS) content. |
| spark.synapse.diagnostic.emitter.<destination>.secret.keyVault | Required if .secret isn't specified. The Azure Key Vault name where the secret (AccessKey or SAS) is stored. |
| spark.synapse.diagnostic.emitter.<destination>.secret.keyVault.secretName | Required if .secret.keyVault is specified. The Azure Key Vault secret name where the secret (AccessKey or SAS) is stored. |
| spark.synapse.diagnostic.emitter.<destination>.filter.eventName.match | Optional. The comma-separated Spark event names; specifies which events to collect. For example: SparkListenerApplicationStart,SparkListenerApplicationEnd. |
| spark.synapse.diagnostic.emitter.<destination>.filter.loggerName.match | Optional. The comma-separated Log4j logger names; specifies which logs to collect. For example: org.apache.spark.SparkContext,org.example.Logger. |
| spark.synapse.diagnostic.emitter.<destination>.filter.metricName.match | Optional. The comma-separated Spark metric name suffixes; specifies which metrics to collect. For example: jvm.heap.used. |
| spark.fabric.pools.skipStarterPools | Required. This Spark property is used to force an on-demand Spark session. Set the value to true when using the default pool so that the libraries emit logs and metrics. |
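As an illustration of the filter settings above, the following hypothetical properties would limit a destination named MyStorageBlob to two listener events, one logger, and one metric suffix; they're optional and shown only as an example.
spark.synapse.diagnostic.emitter.MyStorageBlob.filter.eventName.match: "SparkListenerApplicationStart,SparkListenerApplicationEnd"
spark.synapse.diagnostic.emitter.MyStorageBlob.filter.loggerName.match: "org.apache.spark.SparkContext"
spark.synapse.diagnostic.emitter.MyStorageBlob.filter.metricName.match: "jvm.heap.used"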
Here's a sample log record in JSON format:
{
"timestamp": "2024-09-06T03:09:37.235Z",
"category": "Log|EventLog|Metrics",
"fabricLivyId": "<fabric-livy-id>",
"applicationId": "<application-id>",
"applicationName": "<application-name>",
"executorId": "<driver-or-executor-id>",
"fabricTenantId": "<my-fabric-tenant-id>",
"capacityId": "<my-fabric-capacity-id>",
"artifactType": "SynapseNotebook|SparkJobDefinition",
"artifactId": "<my-fabric-artifact-id>",
"fabricWorkspaceId": "<my-fabric-workspace-id>",
"fabricEnvId": "<my-fabric-environment-id>",
"executorMin": "<executor-min>",
"executorMax": "<executor-max>",
"isHighConcurrencyEnabled": "true|false",
"properties": {
// The message properties of logs, events and metrics.
"timestamp": "2024-09-06T03:09:37.235Z",
"message": "Initialized BlockManager: BlockManagerId(1, vm-04b22223, 34319, None)",
"logger_name": "org.apache.spark.storage.BlockManager",
"level": "INFO",
"thread_name": "dispatcher-Executor"
//...
}
}
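If you download one of the log files locally, a plain Python sketch like the following can pull out the Log4j messages; the file name is hypothetical, and the field names follow the sample record above.
# Minimal sketch: scan a downloaded log file (newline-delimited JSON)
# and print the Log4j messages it contains.
import json

with open("spark-diagnostics-sample.ndjson") as f:  # hypothetical local file name
    for line in f:
        record = json.loads(line)
        if record.get("category") == "Log":  # "Log" records carry Log4j messages
            props = record.get("properties", {})
            print(record.get("executorId"), props.get("level"), props.get("message"))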
Create a managed private endpoint for the target Azure Blob Storage. For detailed instructions, refer to Create and use managed private endpoints in Microsoft Fabric - Microsoft Fabric.
Once the managed private endpoint is approved, users can begin emitting logs and metrics to the target Azure Blob Storage.