Create a chaos experiment that uses a Chaos Mesh fault with the Azure CLI
You can use a chaos experiment to verify that your application is resilient to failures by causing those failures in a controlled environment. In this article, you cause periodic Azure Kubernetes Service (AKS) pod failures on a namespace by using a chaos experiment and Azure Chaos Studio. Running this experiment can help you defend against service unavailability when there are sporadic failures.
Chaos Studio uses Chaos Mesh, a free, open-source chaos engineering platform for Kubernetes, to inject faults into an AKS cluster. Chaos Mesh faults are service-direct faults that require Chaos Mesh to be installed on the AKS cluster. You can use these same steps to set up and run an experiment for any AKS Chaos Mesh fault.
Prerequisites
- An Azure subscription. If you don't have an Azure subscription, create an Azure free account before you begin.
- An AKS cluster with Linux node pools. If you don't have an AKS cluster, see the AKS quickstart that uses the Azure CLI, Azure PowerShell, or the Azure portal.
Limitations
- You can use Chaos Mesh faults with private clusters by configuring VNet Injection in Chaos Studio. Any commands issued to the private cluster, including the steps in this article to set up Chaos Mesh, need to follow the private cluster guidance. Recommended methods include connecting from a VM in the same virtual network or using the AKS command invoke feature.
- AKS Chaos Mesh faults are only supported on Linux node pools.
- If your AKS cluster is configured to only allow authorized IP ranges, you need to allow Chaos Studio's IP ranges. You can find them by querying the ChaosStudio service tag with the Service Tag Discovery API or downloadable JSON files.
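As a rough sketch of the last limitation, the following function lists the address prefixes published for the ChaosStudio service tag by using az network list-service-tags. The function name and the centralus default are illustrative assumptions, and how you apply the resulting ranges to your cluster's authorized IP list depends on your setup.

```shell
# Hypothetical helper: print the address prefixes of the ChaosStudio service tag.
# $1 is the Azure region to query (centralus is only an example default).
chaos_studio_ip_ranges() {
  local location="${1:-centralus}"
  az network list-service-tags \
    --location "$location" \
    --query "values[?name=='ChaosStudio'].properties.addressPrefixes[]" \
    --output tsv
}

# Usage (requires a signed-in Azure CLI):
#   chaos_studio_ip_ranges eastus
```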
Open Azure Cloud Shell
Azure Cloud Shell is a free interactive shell that you can use to run the steps in this article. It has common Azure tools preinstalled and configured to use with your account.
To open Cloud Shell, select Try it in the upper-right corner of a code block. You can also open Cloud Shell (Bash) in a separate browser tab. Select Copy to copy a block of code, paste it into Cloud Shell, and select Enter to run it.
If you prefer to install and use the CLI locally, this tutorial requires Azure CLI version 2.0.30 or later. Run az --version to find the version. If you need to install or upgrade, see Install Azure CLI.
Note
These instructions use a Bash terminal in Cloud Shell. Some commands might not work as described if you run the CLI locally or in a PowerShell terminal.
Set up Chaos Mesh on your AKS cluster
Before you can run Chaos Mesh faults in Chaos Studio, you must install Chaos Mesh on your AKS cluster.
Run the following commands in a Cloud Shell window where you have the active subscription set to be the subscription where your AKS cluster is deployed. Replace $RESOURCE_GROUP and $CLUSTER_NAME with the resource group and name of your cluster resource.
az aks get-credentials -g $RESOURCE_GROUP -n $CLUSTER_NAME
helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update
kubectl create ns chaos-testing
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=chaos-testing --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock
Verify that the Chaos Mesh pods are installed by running the following command:
kubectl get po -n chaos-testing
You should see output similar to the following example (a chaos-controller-manager and one or more chaos-daemons):
NAME READY STATUS RESTARTS AGE
chaos-controller-manager-69fd5c46c8-xlqpc 1/1 Running 0 2d5h
chaos-daemon-jb8xh 1/1 Running 0 2d5h
chaos-dashboard-98c4c5f97-tx5ds 1/1 Running 0 2d5h
You can also use the installation instructions on the Chaos Mesh website.
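If you want to script the verification step, the following sketch reads the kubectl get po output and succeeds only when every listed pod reports a Running status. The function name is hypothetical, and the check assumes the default column layout shown in the example output (STATUS in the third column).

```shell
# Hypothetical helper: read `kubectl get po` output on stdin and succeed only
# if at least one pod is listed and every listed pod has STATUS "Running".
all_pods_running() {
  awk 'NR > 1 { total++; if ($3 != "Running") bad++ }
       END { exit !(total > 0 && bad == 0) }'
}

# Usage (requires access to the cluster):
#   kubectl get po -n chaos-testing | all_pods_running && echo "Chaos Mesh pods are running"
```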
Enable Chaos Studio on your AKS cluster
Chaos Studio can't inject faults against a resource unless that resource is added to Chaos Studio first. To add a resource to Chaos Studio, create a target and capabilities on the resource. AKS clusters have only one target type (service-direct), but other resources might have up to two target types. One target type is for service-direct faults. Another target type is for agent-based faults. Each type of Chaos Mesh fault is represented as a capability like PodChaos, NetworkChaos, and IOChaos.
Create a target by replacing $SUBSCRIPTION_ID, $resourceGroupName, and $AKS_CLUSTER_NAME with the relevant strings of the AKS cluster you're adding.
az rest --method put --url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$resourceGroupName/providers/Microsoft.ContainerService/managedClusters/$AKS_CLUSTER_NAME/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh?api-version=2024-01-01" --body "{\"properties\":{}}"
Create the capabilities on the target by replacing $SUBSCRIPTION_ID, $resourceGroupName, and $AKS_CLUSTER_NAME with the relevant strings of the AKS cluster you're adding. Replace $CAPABILITY with the "Capability Name" of the fault you're adding.
az rest --method put --url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$resourceGroupName/providers/Microsoft.ContainerService/managedClusters/$AKS_CLUSTER_NAME/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/$CAPABILITY?api-version=2024-01-01" --body "{\"properties\":{}}"
Here's an example of enabling the PodChaos capability for your reference:
az rest --method put --url "https://management.azure.com/subscriptions/b65f2fec-d6b2-4edd-817e-9339d8c01dc4/resourceGroups/myRG/providers/Microsoft.ContainerService/managedClusters/myCluster/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/PodChaos-2.2?api-version=2024-01-01" --body "{\"properties\":{}}"
This step must be done for each capability you want to enable on the cluster.
You've now successfully added your AKS cluster to Chaos Studio.
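Because the capability call has to be repeated for each capability, a small loop can help. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the capability names you pass in are assumed to come from the Chaos Studio fault library.

```shell
# Hypothetical helper: build the ARM URL for a capability on the AKS Chaos Mesh target.
# Args: subscription ID, resource group, cluster name, capability name.
capability_url() {
  printf 'https://management.azure.com/subscriptions/%s/resourceGroups/%s/providers/Microsoft.ContainerService/managedClusters/%s/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh/capabilities/%s?api-version=2024-01-01\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: enable every capability named after the first three arguments.
enable_capabilities() {
  local sub="$1" rg="$2" cluster="$3" cap
  shift 3
  for cap in "$@"; do
    az rest --method put \
      --url "$(capability_url "$sub" "$rg" "$cluster" "$cap")" \
      --body '{"properties":{}}'
  done
}

# Usage (requires a signed-in Azure CLI):
#   enable_capabilities "$SUBSCRIPTION_ID" "$resourceGroupName" "$AKS_CLUSTER_NAME" "$CAPABILITY"
```

Quoting the URL keeps the shell from treating the ? in the query string as a wildcard.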
Create an experiment
Now you can create your experiment. A chaos experiment defines the actions you want to take against target resources. The actions are organized and run in sequential steps. The chaos experiment also defines the actions you want to take against branches, which run in parallel.
Create a Chaos Mesh jsonSpec:
See the Chaos Mesh documentation for a fault type, for example, the PodChaos type.
Formulate the YAML configuration for that fault type by using the Chaos Mesh documentation.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-example
  namespace: chaos-testing
spec:
  action: pod-failure
  mode: all
  duration: '600s'
  selector:
    namespaces:
      - default
Remove any YAML outside of the spec, including the spec property name. Remove the indentation of the spec details. The duration parameter isn't necessary, but is used if provided. In this case, remove it.
action: pod-failure
mode: all
selector:
  namespaces:
    - default
Use a YAML-to-JSON converter to convert the Chaos Mesh YAML to JSON and minimize it.
{"action":"pod-failure","mode":"all","selector":{"namespaces":["default"]}}
Use a JSON string escape tool to escape the JSON spec, or change the double quotes to single quotes.
{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"default\"]}}
{'action':'pod-failure','mode':'all','selector':{'namespaces':['default']}}
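If you'd rather escape the spec in the shell than use an online tool, sed can do it. The following is a minimal sketch: the function name is hypothetical, and it assumes the spec is already minified onto one line and contains no control characters.

```shell
# Hypothetical helper: escape backslashes and double quotes so that a minified
# JSON document on stdin can be embedded as a JSON string value.
escape_json_spec() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Minified spec from the previous step.
spec='{"action":"pod-failure","mode":"all","selector":{"namespaces":["default"]}}'

# Prints the escaped form of the spec.
printf '%s\n' "$spec" | escape_json_spec
```

Backslashes are escaped before quotes so that the backslashes added for the quotes aren't doubled.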
Create your experiment JSON by starting with the following JSON sample. Modify the JSON to correspond to the experiment you want to run by using the Create Experiment API, the fault library, and the jsonSpec created in the previous step.
{
  "location": "centralus",
  "identity": {
    "type": "SystemAssigned"
  },
  "properties": {
    "steps": [
      {
        "name": "AKS pod kill",
        "branches": [
          {
            "name": "AKS pod kill",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT10M",
                "parameters": [
                  {
                    "key": "jsonSpec",
                    "value": "{\"action\":\"pod-failure\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"default\"]}}"
                  }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          {
            "type": "ChaosTarget",
            "id": "/subscriptions/bbbb1b1b-cc2c-dd3d-ee4e-ffffff5f5f5f/resourceGroups/myRG/providers/Microsoft.ContainerService/managedClusters/myCluster/providers/Microsoft.Chaos/targets/Microsoft-AzureKubernetesServiceChaosMesh"
          }
        ]
      }
    ]
  }
}
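To avoid hand-editing the sample, you could generate the file from shell variables. The following sketch writes the same experiment structure with a here-document. The function name and argument order are hypothetical, and the location, step names, duration, and fault URN are copied from the sample and may need to change for your experiment.

```shell
# Hypothetical helper: write an experiment definition to a file.
# Args: output file, Chaos target resource ID, escaped jsonSpec string.
write_experiment_json() {
  local out_file="$1" target_id="$2" escaped_spec="$3"
  cat > "$out_file" <<EOF
{
  "location": "centralus",
  "identity": { "type": "SystemAssigned" },
  "properties": {
    "steps": [
      {
        "name": "AKS pod kill",
        "branches": [
          {
            "name": "AKS pod kill",
            "actions": [
              {
                "type": "continuous",
                "selectorId": "Selector1",
                "duration": "PT10M",
                "parameters": [
                  { "key": "jsonSpec", "value": "$escaped_spec" }
                ],
                "name": "urn:csci:microsoft:azureKubernetesServiceChaosMesh:podChaos/2.2"
              }
            ]
          }
        ]
      }
    ],
    "selectors": [
      {
        "id": "Selector1",
        "type": "List",
        "targets": [
          { "type": "ChaosTarget", "id": "$target_id" }
        ]
      }
    ]
  }
}
EOF
}

# Usage:
#   write_experiment_json experiment.json "$TARGET_ID" "$ESCAPED_SPEC"
```

Here $TARGET_ID and $ESCAPED_SPEC are placeholder variables that you set yourself. Values substituted into a here-document aren't processed again by the shell, so the backslashes in the escaped spec are written unchanged.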
Create the experiment by using the Azure CLI. Replace $SUBSCRIPTION_ID, $RESOURCE_GROUP, and $EXPERIMENT_NAME with the properties for your experiment. Make sure you've saved and uploaded your experiment JSON. Update experiment.json with your JSON filename.
az rest --method put --uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME?api-version=2023-11-01" --body @experiment.json
Each experiment creates a corresponding system-assigned managed identity. Note the principal ID for this identity in the response for the next step.
Give the experiment permission to your AKS cluster
When you create a chaos experiment, Chaos Studio creates a system-assigned managed identity that executes faults against your target resources. This identity must be given appropriate permissions to the target resource for the experiment to run successfully.
- Retrieve the $EXPERIMENT_PRINCIPAL_ID by running the following command and copying the PrincipalID from the response. Replace $SUBSCRIPTION_ID, $RESOURCE_GROUP, and $EXPERIMENT_NAME with the properties for your experiment.
az rest --method get --uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME?api-version=2024-01-01"
- Give the experiment access to your resources by using the following commands. Replace $EXPERIMENT_PRINCIPAL_ID with the principal ID from the previous step. Replace $SUBSCRIPTION_ID, $resourceGroupName, and $AKS_CLUSTER_NAME with the relevant strings of the AKS cluster.
az role assignment create --role "Azure Kubernetes Service RBAC Admin Role" --assignee-principal-type "ServicePrincipal" --assignee-object-id $EXPERIMENT_PRINCIPAL_ID --scope subscriptions/$SUBSCRIPTION_ID/resourceGroups/$resourceGroupName/providers/Microsoft.ContainerService/managedClusters/$AKS_CLUSTER_NAME
az role assignment create --role "Azure Kubernetes Service Cluster User Role" --assignee-principal-type "ServicePrincipal" --assignee-object-id $EXPERIMENT_PRINCIPAL_ID --scope subscriptions/$SUBSCRIPTION_ID/resourceGroups/$resourceGroupName/providers/Microsoft.ContainerService/managedClusters/$AKS_CLUSTER_NAME
If you prefer to create custom roles instead of the built-in AKS roles, follow the instructions on the Supported resource types and role assignments for Chaos Studio page to list the role-based access control operations needed for a specific fault and add them to a manually created custom role.
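The two steps above can be combined into a pair of functions, sketched here under some assumptions: the function names are hypothetical, the principal ID is assumed to appear at identity.principalId in the experiment response, and the scope is written as a full resource ID with a leading slash.

```shell
# Hypothetical helper: print the principal ID of the experiment's managed identity.
# Args: subscription ID, resource group, experiment name.
experiment_principal_id() {
  az rest --method get \
    --url "https://management.azure.com/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Chaos/experiments/$3?api-version=2024-01-01" \
    --query identity.principalId --output tsv
}

# Hypothetical helper: assign both built-in AKS roles used in this article.
# Args: principal ID, subscription ID, resource group, cluster name.
grant_experiment_access() {
  local principal="$1" role
  # Assumption: scope is given as a full resource ID with a leading slash.
  local scope="/subscriptions/$2/resourceGroups/$3/providers/Microsoft.ContainerService/managedClusters/$4"
  for role in "Azure Kubernetes Service RBAC Admin Role" "Azure Kubernetes Service Cluster User Role"; do
    az role assignment create --role "$role" \
      --assignee-principal-type ServicePrincipal \
      --assignee-object-id "$principal" \
      --scope "$scope"
  done
}

# Usage (requires a signed-in Azure CLI):
#   EXPERIMENT_PRINCIPAL_ID="$(experiment_principal_id "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$EXPERIMENT_NAME")"
#   grant_experiment_access "$EXPERIMENT_PRINCIPAL_ID" "$SUBSCRIPTION_ID" "$resourceGroupName" "$AKS_CLUSTER_NAME"
```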
Run your experiment
You're now ready to run your experiment. To see the effect, we recommend that you open your AKS cluster overview and go to Insights in a separate browser tab. Live data for the Active Pod Count shows the effect of running your experiment.
Start the experiment by using the Azure CLI. Replace $SUBSCRIPTION_ID, $RESOURCE_GROUP, and $EXPERIMENT_NAME with the properties for your experiment.
az rest --method post --uri "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Chaos/experiments/$EXPERIMENT_NAME/start?api-version=2024-01-01"
The response includes a status URL that you can use to query experiment status as the experiment runs.
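If you start the same experiment repeatedly, wrapping the call in a function saves retyping the URL. The following is a minimal sketch with a hypothetical function name. It prints the raw response and doesn't try to parse the status URL, because the exact response shape isn't described in this article.

```shell
# Hypothetical helper: start an experiment and print the raw response.
# Args: subscription ID, resource group, experiment name.
start_experiment() {
  az rest --method post \
    --url "https://management.azure.com/subscriptions/$1/resourceGroups/$2/providers/Microsoft.Chaos/experiments/$3/start?api-version=2024-01-01"
}

# Usage (requires a signed-in Azure CLI):
#   start_experiment "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$EXPERIMENT_NAME"
```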
Next steps
Now that you've run an AKS Chaos Mesh service-direct experiment, you're ready to: