Azure Data Explorer is an analytics service that enables you to ingest, store, and query large volumes of data with low latency. It's commonly used for log analytics, telemetry, and time-series workloads that require fast querying over large datasets.
When you use Azure, reliability is a shared responsibility. Microsoft provides a range of capabilities to support resiliency and recovery. You're responsible for understanding how those capabilities work within all of the services you use, and selecting the capabilities you need to meet your business objectives and uptime goals.
This article describes how to make Azure Data Explorer resilient to various potential outages and problems, including transient faults, availability zone failures, and region-wide failures. It also describes backup and restore options and resilience to service maintenance, and highlights key information about the Azure Data Explorer service-level agreement (SLA).
Production deployment recommendations for reliability
For production workloads, we recommend that you take the following steps to improve the reliability of your Azure Data Explorer cluster:
- Deploy a full cluster. Azure Data Explorer provides free clusters for trial purposes. For production workloads, deploy a full cluster.
- Enable availability zone support. Azure Data Explorer supports availability zones. When availability zone support is enabled, compute nodes are distributed across multiple availability zones and data is stored using zone-redundant storage. This configuration improves resilience to availability zone failures.
Reliability architecture overview
This section describes some of the important aspects of how the service works that are most relevant from a reliability perspective. The section introduces the logical architecture, which includes some of the resources and features that you deploy and use. It also discusses the physical architecture, which provides details on how the service works under the covers.
Logical architecture
The primary resource you deploy is a cluster, which represents the infrastructure you need to ingest, store, and query your data. With a cluster, you create databases, which in turn contain tables.
Clusters perform ingestion to retrieve data from other data sources and load it into a table in the cluster. Data can then be queried by using the Kusto Query Language (KQL) syntax. Clusters also have a set of management operations that you can perform.
Physical architecture
An Azure Data Explorer cluster has two primary layers that are applicable to its reliability configuration:
Compute layer: Azure Data Explorer is a distributed computing platform. A cluster has two or more node virtual machines (VMs), depending on scale and node role type. Nodes handle data ingestion and query processing work. You don't see or manage the node VMs directly. The platform automatically manages instance creation, health monitoring, and replacement of unhealthy nodes. When your cluster is configured to use availability zones, the nodes are spread among different datacenters.
Storage layer: Azure Data Explorer uses Azure Storage as its durable persistence layer. Azure Storage automatically provides fault tolerance, with the default setting offering locally redundant storage (LRS) within a datacenter. Three replicas are persisted. If a replica is lost while in use, another is deployed without disruption. When your cluster is configured to use multiple availability zones, the replicas are spread among different datacenters.
For more information, see How Azure Data Explorer works.
Resilience to transient faults
Transient faults are short, intermittent failures in components. They occur frequently in a distributed environment like the cloud, and they're a normal part of operations. Transient faults correct themselves after a short period of time. It's important that your applications can handle transient faults, usually by retrying affected requests.
All cloud-hosted applications should follow the Azure transient fault handling guidance when they communicate with any cloud-hosted APIs, databases, and other components. For more information, see Recommendations for handling transient faults.
To build resilience to transient faults when you use Azure Data Explorer, follow these practices:
- When you use queued ingestion, rely on the built-in retry behavior.
- Use Microsoft-provided client libraries and SDKs, which automatically retry when transient faults occur.
- If you use Azure Data Explorer REST APIs directly, retry any queries and management operations that fail due to a transient fault.
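For callers that use the REST APIs directly, the retry practice above can be sketched as a small wrapper with exponential backoff and jitter. This is a minimal illustration, not the behavior of any Microsoft SDK: `TransientError`, `retry_with_backoff`, and all of the parameter defaults are hypothetical names and values chosen for the example, and you would need to decide which failures in your own client count as transient.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.5,
                       max_delay=30.0, sleep=time.sleep):
    """Call operation() and retry on TransientError with capped backoff.

    The delay ceiling doubles on each attempt (0.5s, 1s, 2s, ...) up to
    max_delay, and the actual wait is a random value below that ceiling
    so that many clients don't retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last failure.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable only so that the wrapper can be exercised without real waiting. Non-transient errors, such as a malformed query, aren't caught and propagate immediately.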
Resilience to availability zone failures
Availability zones are physically separate groups of datacenters within an Azure region. When one zone fails, services can fail over to one of the remaining zones.
Azure Data Explorer supports two types of availability zone configuration:
Zone-redundant (recommended): When you enable availability zones on your cluster, your cluster's nodes are spread across multiple zones. Microsoft manages the distribution of nodes across the selected availability zones and handles detection and response to availability zone failures. A zone-redundant cluster is resilient to an availability zone outage.
When you configure your cluster to be zone-redundant, your data is stored using Azure Storage zone-redundant storage (ZRS), which synchronously replicates at least three copies of the data across multiple availability zones.
Zonal: You can optionally select a single zone when you enable availability zones on your cluster. Microsoft places all of your compute nodes into that zone. This is a zonal (single-zone) cluster. This configuration might occasionally help if you have an unusually latency-sensitive workload, but it doesn't provide resilience to zone outages.
Important
Pinning to a single availability zone is only recommended when cross-zone latency is too high for your needs and after you verify that the latency doesn't meet your requirements. By itself, a zonal resource doesn't provide resiliency to an availability zone outage. To improve the resiliency of a zonal resource, you need to explicitly deploy separate resources into multiple availability zones and configure traffic routing and failover. For more information, see Zonal resources and zone resiliency.
Your zone selection only applies to your compute nodes. For a zonal cluster, your storage data continues to use LRS, and might be stored in a different zone to your compute nodes.
If you don't enable availability zones, the cluster is nonzonal, which means Azure selects the availability zone for each node and your data. If any availability zone in the region has an outage, it might affect your cluster's nodes, data, or both. We don't recommend a nonzonal configuration because it doesn't provide protection against availability zone outages.
Requirements
Region support: Availability zone support is available in Azure regions that support availability zones.
However, some compute node types and sizes are only available in specific regions, or specific zones within a region.
Full clusters: Availability zone support is available with full clusters. It's not available with free clusters.
Considerations
Zone selection: For compute nodes, you choose which availability zones to use. Storage zone placement is managed by Microsoft, and storage replicas might be placed in different zones to your compute nodes.
Cost
Enabling availability zone support incurs extra costs for zone-redundant storage, which is billed at a higher rate than locally redundant storage. For more information, see Azure Storage pricing.
Compute nodes are charged at the same rate whether you use availability zone support or not. For more information, see Azure Data Explorer pricing.
Configure availability zone support
Create a new cluster with availability zone support: You can enable availability zone support when you create a new Azure Data Explorer cluster. For more information, see Create a cluster and database.
When you create an availability zone-enabled cluster by using the Azure portal, it's automatically zone-redundant, and Microsoft selects the zones.
To select zones yourself, or to create a zonal cluster, use another deployment approach like Azure Resource Manager APIs or Bicep. For most situations, we recommend that you create a zone-redundant cluster and that you use all of the zones in the region.
Note
When you select which availability zones to use, you're actually selecting the logical availability zone. If you deploy other workload components in a different Azure subscription, they might use a different logical availability zone number to access the same physical availability zone. For more information, see Physical and logical availability zones.
Enable availability zones on an existing cluster (preview): You can migrate an existing nonzonal cluster to use availability zones. This capability is in preview. For more information, see Migrate your cluster to support multiple availability zones.
Reconfigure availability zones on an existing cluster (preview): You can change the zones used for a cluster. This capability is in preview. For more information, see Migrate your cluster to support multiple availability zones.
Disable availability zone support on an existing cluster: After a cluster is configured with availability zones, you can't change the cluster to not use availability zones.
Verify availability zone configuration for clusters: You can use the cluster's zone status property (the zoneStatus property in the REST API) to verify the availability zone configuration of a cluster.
If the value is Zonal, it means the cluster has been configured to use availability zones. However, the cluster might be zonal or zone-redundant. To determine which, use the zones property. If the zones list has one zone listed, the cluster is zonal (single-zone). If it has multiple zones listed, it's zone-redundant.
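The verification rule can be expressed as a small decision function. The sketch below assumes that you have already read the zone status value and the zones list from the cluster resource. It doesn't call any Azure API, and the function name and the returned labels are hypothetical. Only the `Zonal` status value is taken from this article, so any other value is reported as not confirmed.

```python
def classify_zone_configuration(zone_status, zones):
    """Classify a cluster from its zone status value and zones list.

    zone_status: the zoneStatus value as a string.
    zones: the zones list (may be None or empty).
    Returns one of the hypothetical labels below.
    """
    zone_list = list(zones or [])
    if zone_status != "Zonal":
        # This article only describes the "Zonal" value, so other values
        # are not interpreted here.
        return "not-confirmed-zone-enabled"
    if len(zone_list) == 1:
        return "zonal"           # Pinned to a single zone.
    if len(zone_list) > 1:
        return "zone-redundant"  # Spread across multiple zones.
    return "zone-enabled-zones-unknown"  # Status says Zonal, no zones listed.
```

For example, `classify_zone_configuration("Zonal", ["1", "2", "3"])` returns `"zone-redundant"`, and `classify_zone_configuration("Zonal", ["2"])` returns `"zonal"`.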
Capacity planning and management
When an availability zone is unavailable, any nodes in that zone might be temporarily unavailable, which reduces your cluster's compute capacity until the zone recovers.
If your cluster can't tolerate the loss of capacity, consider overprovisioning your cluster. This approach allows the solution to tolerate some capacity loss and continue to function without degraded performance. However, when you overprovision your cluster, your cluster might have an unbalanced number of nodes across zones.
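As a rough way to size the overprovisioning, the following sketch computes how many nodes to provision so that the required number of nodes remains after one zone is lost. It rests on a simplifying assumption that isn't guaranteed by the service: nodes are spread as evenly as possible across the zones, so the largest zone holds the total divided by the zone count, rounded up. The function name is hypothetical.

```python
def nodes_to_provision(required_nodes, zone_count):
    """Smallest node count that still leaves required_nodes after losing
    the most populated zone, assuming a near-even spread across zones."""
    if required_nodes < 1:
        raise ValueError("required_nodes must be at least 1")
    if zone_count < 2:
        raise ValueError("zone_count must be at least 2")
    total = required_nodes
    while True:
        largest_zone = -(-total // zone_count)  # Ceiling division.
        if total - largest_zone >= required_nodes:
            return total
        total += 1
```

Under this assumption, a workload that needs 4 nodes across 3 zones would provision 6 nodes (2 per zone), and one that needs 5 nodes across 3 zones would provision 8. Because the actual distribution is best effort, treat the result as a starting point rather than a guarantee.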
Instance distribution across zones
The cluster's compute layer uses a best-effort approach to evenly spread instances across the zones you select.
Behavior when all zones are healthy
This section describes what to expect when you configure a cluster for availability zone support, and all zones are operational.
Cross-zone operation: During normal operation, Azure Data Explorer uses all available compute nodes for ingestion, query processing, and other operations. Work is distributed across nodes regardless of their availability zone.
Cross-zone data replication: The cross-zone data replication behavior depends on the availability zone configuration that your cluster uses.
Zone-redundant: Data is synchronously replicated across availability zones by using Azure Storage zone-redundant storage. This provides a high level of data consistency and minimizes the risk of data loss during a zone failure.
Zonal: Data is stored using Azure Storage locally redundant storage, which means all three copies might be in a single availability zone.
Behavior during a zone failure
This section describes what to expect when you configure a cluster for availability zone support, and there's an outage in one of the zones.
Detection and response: Responsibility for detection and response depends on the availability zone configuration that your cluster uses.
Zone-redundant: Microsoft detects availability zone failures and manages the response for Azure Data Explorer. You don't need to do anything to initiate a zone failover.
Zonal: You're responsible for detecting a failure that affects an availability zone used by your cluster. You're also responsible for any response you decide to initiate, such as switching to a second cluster you previously created in a different availability zone.
Notification: Microsoft doesn't automatically notify you when a zone is down. However, you can use Azure Service Health to understand the overall health of the service, including any zone failures, and you can set up Service Health alerts to notify you of problems.
Active requests: Active requests that rely on compute or storage resources in the failed zone might be terminated and should be retried by the client. Ensure that your applications are prepared by following transient fault handling guidance.
Expected data loss: The expected data loss depends on the availability zone configuration that your cluster uses.
Zone-redundant: No data loss is expected during an availability zone outage because data is synchronously replicated across zones.
Zonal: Data is unavailable until the zone recovers. In the unlikely event of a permanent loss of a zone that contains all of your storage replicas, the data might be permanently lost.
Expected downtime: The expected downtime depends on the availability zone configuration that your cluster uses.
Zone-redundant: A brief service interruption might occur while traffic is redirected to healthy availability zones. Ensure that your applications are prepared by following transient fault handling guidance.
Zonal: Your cluster's compute nodes are unavailable until the availability zone recovers. You also might not be able to access your cluster's data during a zone failure.
Redistribution: The traffic rerouting behavior depends on the availability zone configuration that your cluster uses.
Zone-redundant: Azure Data Explorer routes new requests to compute and storage resources in the remaining healthy zones.
Zonal: Your cluster is unavailable until the availability zone recovers.
Zone recovery
When the failed availability zone recovers, Microsoft recreates the cluster nodes and storage replicas in that zone and restores normal traffic distribution across all zones. No customer action is required.
Test for zone failures
The options for testing for zone failures depend on the availability zone configuration that your cluster uses.
Zone-redundant: Availability zone failover and recovery for Azure Data Explorer are fully managed by Microsoft. You don’t need to initiate or validate availability zone failure processes.
Zonal: To partially simulate the loss of all of the compute nodes during a zone outage, you can stop your cluster. You can use this approach to validate parts of your own zone-down detection and failover processes.
Resilience to region-wide failures
An Azure Data Explorer cluster is deployed into a single Azure region. If that region becomes unavailable, the cluster and its data are unavailable.
Custom multi-region solutions for resiliency
To minimize the business impact of a region outage, you can deploy separate Azure Data Explorer clusters in multiple regions. Each cluster is independent, and you’re responsible for managing each cluster, and for coordinating data replication, traffic routing, and failover between regions.
You can decide between different types of multi-region cluster configurations, which each support different levels of recovery time, potential data loss, effort, and cost. You can select Azure regions for each cluster that support your latency and data residency requirements. For more information about multi-region cluster configurations and patterns you can follow, see Outage of an Azure region.
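Because traffic routing and failover between independent clusters are your responsibility, one simple client-side pattern is to try an ordered list of cluster endpoints and use the first one that responds. The sketch below illustrates that idea only. `ClusterUnavailable`, `query_with_failover`, and the `run_query` callable are hypothetical, the example URIs are placeholders, and the sketch assumes the same data is already present in every cluster.

```python
class ClusterUnavailable(Exception):
    """Hypothetical error raised when a cluster can't be reached."""


def query_with_failover(cluster_uris, run_query):
    """Try each cluster in priority order and return (uri, result).

    run_query is a caller-supplied function that takes a cluster URI,
    runs the query there, and raises ClusterUnavailable on an outage.
    """
    failed = []
    for uri in cluster_uris:
        try:
            return uri, run_query(uri)
        except ClusterUnavailable:
            failed.append(uri)  # Remember the failure and try the next one.
    raise ClusterUnavailable("all clusters failed: " + ", ".join(failed))


if __name__ == "__main__":
    # Placeholder endpoints; the first one simulates a regional outage.
    uris = ["https://primary.example.net", "https://secondary.example.net"]

    def fake_query(uri):
        if "primary" in uri:
            raise ClusterUnavailable(uri)
        return 42

    print(query_with_failover(uris, fake_query))
```

This ordered approach corresponds to an active-passive layout. Other multi-region configurations trade off recovery time, potential data loss, effort, and cost differently.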
Backup and restore
For most solutions, you shouldn't rely exclusively on backups. Instead, use the other capabilities described in this guide to support your resiliency requirements. However, backups protect against some risks that other approaches don't. For more information, see What are redundancy, replication, and backup?.
Azure Data Explorer doesn't provide a native backup and restore capability. If you need to perform backups of your data, you can consider the following approaches:
- Continuous export, which periodically exports data to external storage, and supports exactly-once export of supported data.
- Data export to cloud storage, which enables you to manually export data to external storage.
- Ingest raw data to Azure Data Explorer from an upstream source, like a data lake, that you can back up separately.
Resilience to accidental deletion
Azure Data Explorer includes several mechanisms to help you protect against accidental deletion of clusters, databases, tables, and external tables:
Accidental cluster or database deletion: Accidental cluster or database deletion is an irrecoverable action. You can prevent data loss by enabling a delete lock on the cluster or database resource.
Accidental table deletion: Users with table admin permissions or higher are allowed to drop tables. If one of those users accidentally drops a table, you can recover it using the .undo drop table command. For this command to be successful, you must first enable the recoverability property in the retention policy.
Accidental external table deletion: External tables are Kusto query schema entities that reference data stored outside the database. Deletion of an external table only deletes the table metadata. You can recover it by re-executing the table creation command.
For Azure Blob Storage and Azure Data Lake external tables, use the soft delete capability to protect against accidental deletion or overwrite of a blob for a user-configured amount of time.
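For the table recovery path described above, the following sketch builds the text of the two management commands involved: one that turns on recoverability in a table's retention policy, and one that restores a dropped table. The helper names are hypothetical, and the command text reflects our understanding of the management command syntax, including the assumption that the database version is passed in a form like v24.3. Check the command reference before you run either command against a cluster.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _check_name(name):
    # Accept only simple identifiers so that arbitrary text can't be
    # injected into the command string.
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported table name: {name!r}")
    return name


def enable_recoverability_command(table):
    """Command text that enables recoverability in the retention policy."""
    return (f".alter-merge table {_check_name(table)} "
            "policy retention recoverability = enabled")


def undo_drop_table_command(table, database_version, new_name=None):
    """Command text that restores a dropped table.

    database_version is assumed to be a string such as "v24.3".
    """
    rename = f" as {_check_name(new_name)}" if new_name else ""
    return (f".undo drop table {_check_name(table)}{rename} "
            f"version={database_version}")
```

Recoverability must be enabled before the table is dropped, so the first command belongs in your table provisioning steps rather than in your recovery runbook.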
Resilience to service maintenance
Azure Data Explorer regularly applies service updates and performs routine maintenance. The Azure platform handles these activities automatically while remaining within the availability levels specified in the SLA. Ensure that your applications are prepared for occasional loss in connectivity during service maintenance by following transient fault handling guidance.
To learn about upcoming maintenance, use Azure Service Health.
Service-level agreement
The service-level agreement (SLA) for Azure services describes the expected availability of each service and the conditions that your solution must meet to achieve that availability expectation. For more information, see SLAs for online services.
To be eligible for the Azure Data Explorer availability SLA, your application needs to handle transient faults by retrying failed requests.