Azure Arc-enabled Kubernetes and GitOps troubleshooting
This document provides troubleshooting guides for issues with Azure Arc-enabled Kubernetes connectivity, permissions, and agents. It also provides troubleshooting guides for Azure GitOps, which can be used in either Azure Arc-enabled Kubernetes or Azure Kubernetes Service (AKS) clusters.
General troubleshooting
Azure CLI
Before using az connectedk8s
or az k8s-configuration
CLI commands, check that Azure CLI is set to work against the correct Azure subscription.
az account set --subscription 'subscriptionId'
az account show
Azure Arc agents
All agents for Azure Arc-enabled Kubernetes are deployed as pods in the azure-arc
namespace. All pods should be running and passing their health checks.
First, verify the Azure Arc Helm Chart release:
$ helm --namespace default status azure-arc
NAME: azure-arc
LAST DEPLOYED: Fri Apr 3 11:13:10 2020
NAMESPACE: default
STATUS: deployed
REVISION: 5
TEST SUITE: None
If the Helm Chart release isn't found or missing, try connecting the cluster to Azure Arc again.
If the Helm Chart release is present with STATUS: deployed
, check the status of the agents using kubectl
:
$ kubectl -n azure-arc get deployments,pods
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/clusteridentityoperator 1/1 1 1 16h
deployment.apps/config-agent 1/1 1 1 16h
deployment.apps/cluster-metadata-operator 1/1 1 1 16h
deployment.apps/controller-manager 1/1 1 1 16h
deployment.apps/flux-logs-agent 1/1 1 1 16h
deployment.apps/metrics-agent 1/1 1 1 16h
deployment.apps/resource-sync-agent 1/1 1 1 16h
NAME READY STATUS RESTART AGE
pod/cluster-metadata-operator-7fb54d9986-g785b 2/2 Running 0 16h
pod/clusteridentityoperator-6d6678ffd4-tx8hr 3/3 Running 0 16h
pod/config-agent-544c4669f9-4th92 3/3 Running 0 16h
pod/controller-manager-fddf5c766-ftd96 3/3 Running 0 16h
pod/flux-logs-agent-7c489f57f4-mwqqv 2/2 Running 0 16h
pod/metrics-agent-58b765c8db-n5l7k 2/2 Running 0 16h
pod/resource-sync-agent-5cf85976c7-522p5 3/3 Running 0 16h
All pods should show STATUS
as Running
with either 3/3
or 2/2
under the READY
column. Fetch logs and describe the pods returning an Error
or CrashLoopBackOff
. If any pods are stuck in Pending
state, there might be insufficient resources on cluster nodes. Scaling up your cluster can get these pods to transition to Running
state.
Connecting Kubernetes clusters to Azure Arc
Connecting clusters to Azure Arc requires access to an Azure subscription and cluster-admin
access to a target cluster. If you can't reach the cluster, or if you have insufficient permissions, connecting the cluster to Azure Arc will fail. Make sure you've met all of the prerequisites to connect a cluster.
Tip
For a visual guide to troubleshooting these issues, see Diagnose connection issues for Arc-enabled Kubernetes clusters.
DNS resolution issues
If you see an error message about an issue with the DNS resolution on your cluster, there are a few things you can try in order to diagnose and resolve the problem.
For more information, see Debugging DNS Resolution.
Outbound network connectivity issues
Issues with outbound network connectivity from the cluster may arise for different reasons. First make sure all of the network requirements have been met.
If you encounter this issue, and your cluster is behind an outbound proxy server, make sure you have passed proxy parameters during the onboarding of your cluster and that the proxy is configured correctly. For more information, see Connect using an outbound proxy server.
Unable to retrieve MSI certificate
Problems retrieving the MSI certificate are usually due to network issues. Check to make sure all of the network requirements have been met, then try again.
Azure CLI is unable to download Helm chart for Azure Arc agents
With Helm version >= 3.7.0, you may run into the following error when using az connectedk8s connect
to connect the cluster to Azure Arc:
az connectedk8s connect -n AzureArcTest -g AzureArcTest
Unable to pull helm chart from the registry 'mcr.microsoft.com/azurearck8s/batch1/stable/azure-arc-k8sagents:1.4.0': Error: unknown command "chart" for "helm"
Run 'helm --help' for usage.
To resolve this issue, you'll need to install a prior version of Helm 3, where the version is less than 3.7.0. After you've installed that version, run the az connectedk8s connect
command again to connect the cluster to Azure Arc.
Insufficient cluster permissions
If the provided kubeconfig file doesn't have sufficient permissions to install the Azure Arc agents, the Azure CLI command will return an error.
az connectedk8s connect --resource-group AzureArc --name AzureArcCluster
Ensure that you have the latest helm version installed before proceeding to avoid unexpected errors.
This operation might take a while...
Error: list: failed to list: secrets is forbidden: User "myuser" cannot list resource "secrets" in API group "" at the cluster scope
To resolve this issue, the user connecting the cluster to Azure Arc should have the cluster-admin
role assigned to them on the cluster.
Unable to connect OpenShift cluster to Azure Arc
If az connectedk8s connect
is timing out and failing when connecting an OpenShift cluster to Azure Arc:
Ensure that the OpenShift cluster meets the version prerequisites: 4.5.41+ or 4.6.35+ or 4.7.18+.
Before you run
az connectedk8s connnect
, run this command on the cluster:oc adm policy add-scc-to-user privileged system:serviceaccount:azure-arc:azure-arc-kube-aad-proxy-sa
Installation timeouts
Connecting a Kubernetes cluster to Azure Arc-enabled Kubernetes requires installation of Azure Arc agents on the cluster. If the cluster is running over a slow internet connection, the container image pull for agents may take longer than the Azure CLI timeouts.
az connectedk8s connect --resource-group AzureArc --name AzureArcCluster
Ensure that you have the latest helm version installed before proceeding to avoid unexpected errors.
This operation might take a while...
Helm timeout error
You may see the following Helm timeout error:
az connectedk8s connect -n AzureArcTest -g AzureArcTest
Unable to install helm release: Error: UPGRADE Failed: time out waiting for the condition
To resolve this issue, try the following steps.
Run the following command:
kubectl get pods -n azure-arc
Check if the
clusterconnect-agent
or theconfig-agent
pods are showingcrashloopbackoff
, or if not all containers are running:NAME READY STATUS RESTARTS AGE cluster-metadata-operator-664bc5f4d-chgkl 2/2 Running 0 4m14s clusterconnect-agent-7cb8b565c7-wklsh 2/3 CrashLoopBackOff 0 1m15s clusteridentityoperator-76d645d8bf-5qx5c 2/2 Running 0 4m15s config-agent-65d5df564f-lffqm 1/2 CrashLoopBackOff 0 1m14s
If the certificate below isn't present, the system assigned managed identity hasn't been installed.
kubectl get secret -n azure-arc -o yaml | grep name:
name: azure-identity-certificate
To resolve this issue, try deleting the Arc deployment by running the
az connectedk8s delete
command and reinstalling it. If the issue continues to happen, it could be an issue with your proxy settings. In that case, try connecting your cluster to Azure Arc via a proxy to connect your cluster to Arc via a proxy. Please also verify if all the network prerequisites have been met.If the
clusterconnect-agent
and theconfig-agent
pods are running, but thekube-aad-proxy
pod is missing, check your pod security policies. This pod uses theazure-arc-kube-aad-proxy-sa
service account, which doesn't have admin permissions but requires the permission to mount host path.If the
kube-aad-proxy
pod is stuck inContainerCreating
state, check whether the kube-aad-proxy certificate has been downloaded onto the cluster.kubectl get secret -n azure-arc -o yaml | grep name:
name: kube-aad-proxy-certificate
If the certificate is missing, delete the deployment and re-onboard with a different name for the cluster. If the problem continues, please contact support.
Helm validation error
Helm v3.3.0-rc.1
version has an issue where helm install/upgrade (used by the connectedk8s
CLI extension) results in running of all hooks leading to the following error:
az connectedk8s connect -n AzureArcTest -g AzureArcTest
Ensure that you have the latest helm version installed before proceeding.
This operation might take a while...
Please check if the azure-arc namespace was deployed and run 'kubectl get pods -n azure-arc' to check if all the pods are in running state. A possible cause for pods stuck in pending state could be insufficientresources on the Kubernetes cluster to onboard to arc.
ValidationError: Unable to install helm release: Error: customresourcedefinitions.apiextensions.k8s.io "connectedclusters.arc.azure.com" not found
To recover from this issue, follow these steps:
Delete the Azure Arc-enabled Kubernetes resource in the Azure portal.
Run the following commands on your machine:
kubectl delete ns azure-arc kubectl delete clusterrolebinding azure-arc-operator kubectl delete secret sh.helm.release.v1.azure-arc.v1
Install a stable version of Helm 3 on your machine instead of the release candidate version.
Run the
az connectedk8s connect
command with the appropriate values to connect the cluster to Azure Arc.
CryptoHash module error
When attempting to onboard Kubernetes clusters to the Azure Arc platform, the local environment (for example, your client console) may return the following error message:
Cannot load native module 'Crypto.Hash._MD5'
Sometimes, dependent modules fail to download successfully when adding the extensions connectedk8s
and k8s-configuration
through Azure CLI or Azure PowerShell. To fix this problem, manually remove and then add the extensions in the local environment.
To remove the extensions, use:
az extension remove --name connectedk8s
az extension remove --name k8s-configuration
To add the extensions, use:
az extension add --name connectedk8s
az extension add --name k8s-configuration
GitOps management
Flux v1 - General
Note
Eventually Azure will stop supporting GitOps with Flux v1, so begin using Flux v2 as soon as possible.
To help troubleshoot issues with sourceControlConfigurations
resource (Flux v1), run these Azure CLI commands with --debug
parameter specified:
az provider show -n Microsoft.KubernetesConfiguration --debug
az k8s-configuration create <parameters> --debug
Flux v1 - Create configurations
Write permissions on the Azure Arc-enabled Kubernetes resource (Microsoft.Kubernetes/connectedClusters/Write
) are necessary and sufficient for creating configurations on that cluster.
sourceControlConfigurations
remains Pending
(Flux v1)
kubectl -n azure-arc logs -l app.kubernetes.io/component=config-agent -c config-agent
$ k -n pending get gitconfigs.clusterconfig.azure.com -o yaml
apiVersion: v1
items:
- apiVersion: clusterconfig.azure.com/v1beta1
kind: GitConfig
metadata:
creationTimestamp: "2020-04-13T20:37:25Z"
generation: 1
name: pending
namespace: pending
resourceVersion: "10088301"
selfLink: /apis/clusterconfig.azure.com/v1beta1/namespaces/pending/gitconfigs/pending
uid: d9452407-ff53-4c02-9b5a-51d55e62f704
spec:
correlationId: ""
deleteOperator: false
enableHelmOperator: false
giturl: git@github.com:slack/cluster-config.git
helmOperatorProperties: null
operatorClientLocation: azurearcfork8s.azurecr.io/arc-preview/fluxctl:0.1.3
operatorInstanceName: pending
operatorParams: '"--disable-registry-scanning"'
operatorScope: cluster
operatorType: flux
status:
configAppliedTime: "2020-04-13T20:38:43.081Z"
isSyncedWithAzure: true
lastPolledStatusTime: ""
message: 'Error: {exit status 1} occurred while doing the operation : {Installing
the operator} on the config'
operatorPropertiesHashed: ""
publicKey: ""
retryCountPublicKey: 0
status: Installing the operator
kind: List
metadata:
resourceVersion: ""
selfLink: ""
Flux v2 - General
To help troubleshoot issues with fluxConfigurations
resource (Flux v2), run these Azure CLI commands with the --debug
parameter specified:
az provider show -n Microsoft.KubernetesConfiguration --debug
az k8s-configuration flux create <parameters> --debug
Flux v2 - Webhook/dry run errors
If you see Flux fail to reconcile with an error like dry-run failed, error: admission webhook "<webhook>" does not support dry run
, you can resolve the issue by finding the ValidatingWebhookConfiguration
or the MutatingWebhookConfiguration
and setting the sideEffects
to None
or NoneOnDryRun
:
For more information, see How do I resolve webhook does not support dry run
errors?
Flux v2 - Error installing the microsoft.flux
extension
The microsoft.flux
extension installs the Flux controllers and Azure GitOps agents into your Azure Arc-enabled Kubernetes or Azure Kubernetes Service (AKS) clusters. If the extension isn't already installed in a cluster and you create a GitOps configuration resource for that cluster, the extension will be installed automatically.
If you experience an error during installation, or if the extension is in a failed state, run a script to investigate. The cluster-type parameter can be set to connectedClusters
for an Arc-enabled cluster or managedClusters
for an AKS cluster. The name of the microsoft.flux
extension will be "flux" if the extension was installed automatically during creation of a GitOps configuration. Look in the "statuses" object for information.
One example:
az k8s-extension show -g <RESOURCE_GROUP> -c <CLUSTER_NAME> -n flux -t <connectedClusters or managedClusters>
"statuses": [
{
"code": "InstallationFailed",
"displayStatus": null,
"level": null,
"message": "unable to add the configuration with configId {extension:flux} due to error: {error while adding the CRD configuration: error {Operation cannot be fulfilled on extensionconfigs.clusterconfig.azure.com \"flux\": the object has been modified; please apply your changes to the latest version and try again}}",
"time": null
}
]
Another example:
az k8s-extension show -g <RESOURCE_GROUP> -c <CLUSTER_NAME> -n flux -t <connectedClusters or managedClusters>
"statuses": [
{
"code": "InstallationFailed",
"displayStatus": null,
"level": null,
"message": "Error: {failed to install chart from path [] for release [flux]: err [cannot re-use a name that is still in use]} occurred while doing the operation : {Installing the extension} on the config",
"time": null
}
]
Another example from the portal:
{'code':'DeploymentFailed','message':'At least one resource deployment operation failed. Please list
deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.
','details':[{'code':'ExtensionCreationFailed', 'message':' Request failed to https://management.azure.com/
subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.ContainerService/
managedclusters/<CLUSTER_NAME>/extensionaddons/flux?api-version=2021-03-01. Error code: BadRequest.
Reason: Bad Request'}]}
For all these cases, possible remediation actions are to force delete the extension, uninstall the Helm release, and delete the flux-system
namespace from the cluster.
az k8s-extension delete --force -g <RESOURCE_GROUP> -c <CLUSTER_NAME> -n flux -t <managedClusters OR connectedClusters>
helm uninstall flux -n flux-system
kubectl delete namespaces flux-system
Some other aspects to consider:
For an AKS cluster, assure that the subscription has the
Microsoft.ContainerService/AKS-ExtensionManager
feature flag enabled.az feature register --namespace Microsoft.ContainerService --name AKS-ExtensionManager
Assure that the cluster doesn't have any policies that restrict creation of the
flux-system
namespace or resources in that namespace.
With these actions accomplished, you can either recreate a flux configuration, which will install the flux extension automatically, or you can reinstall the flux extension manually.
Flux v2 - Installing the microsoft.flux
extension in a cluster with Azure AD Pod Identity enabled
If you attempt to install the Flux extension in a cluster that has Azure Active Directory (Azure AD) Pod Identity enabled, an error may occur in the extension-agent pod.
{"Message":"2021/12/02 10:24:56 Error: in getting auth header : error {adal: Refresh request failed. Status Code = '404'. Response body: no azure identity found for request clientID <REDACTED>\n}","LogType":"ConfigAgentTrace","LogLevel":"Information","Environment":"prod","Role":"ClusterConfigAgent","Location":"westeurope","ArmId":"/subscriptions/<REDACTED>/resourceGroups/<REDACTED>/providers/Microsoft.Kubernetes/managedclusters/<REDACTED>","CorrelationId":"","AgentName":"FluxConfigAgent","AgentVersion":"0.4.2","AgentTimestamp":"2021/12/02 10:24:56"}
The extension status also returns as "Failed".
"{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ExtensionCreationFailed\",\"message\":\" error: Unable to get the status from the local CRD with the error : {Error : Retry for given duration didn't get any results with err {status not populated}}\"}]}}",
The extension-agent pod is trying to get its token from IMDS on the cluster in order to talk to the extension service in Azure, but the token request is intercepted by the pod identity).
You can fix this issue by upgrading to the latest version of the microsoft.flux
extension. For version 1.6.1 or earlier, the workaround is to create an AzurePodIdentityException
that will tell Azure AD Pod Identity to ignore the token requests from flux-extension pods.
apiVersion: aadpodidentity.k8s.io/v1
kind: AzurePodIdentityException
metadata:
name: flux-extension-exception
namespace: flux-system
spec:
podLabels:
app.kubernetes.io/name: flux-extension
Flux v2 - Installing the microsoft.flux
extension in a cluster with Kubelet Identity enabled
When working with Azure Kubernetes clusters, one of the authentication options is kubelet identity using a user-assigned managed identity. Using kubelet identity can reduce operational overhead and increases security when connecting to Azure resources such as Azure Container Registry.
To let Flux use kubelet identity, add the parameter --config useKubeletIdentity=true
when installing the Flux extension. This option is supported starting with version 1.6.1 of the extension.
az k8s-extension create --resource-group <resource-group> --cluster-name <cluster-name> --cluster-type managedClusters --name flux --extension-type microsoft.flux --config useKubeletIdentity=true
Flux v2 - microsoft.flux
extension installation CPU and memory limits
The controllers installed in your Kubernetes cluster with the Microsoft Flux extension require the following CPU and memory resource limits to properly schedule on Kubernetes cluster nodes.
Container Name | CPU limit | Memory limit |
---|---|---|
fluxconfig-agent | 50m | 150Mi |
fluxconfig-controller | 100m | 150Mi |
fluent-bit | 20m | 150Mi |
helm-controller | 1000m | 1Gi |
source-controller | 1000m | 1Gi |
kustomize-controller | 1000m | 1Gi |
notification-controller | 1000m | 1Gi |
image-automation-controller | 1000m | 1Gi |
image-reflector-controller | 1000m | 1Gi |
If you have enabled a custom or built-in Azure Gatekeeper Policy, such as Kubernetes cluster containers CPU and memory resource limits should not exceed the specified limits
, that limits the resources for containers on Kubernetes clusters, you will need to either ensure that the resource limits on the policy are greater than the limits shown above or the flux-system
namespace is part of the excludedNamespaces
parameter in the policy assignment.
Monitoring
Azure Monitor for Containers requires its DaemonSet to run in privileged mode. To successfully set up a Canonical Charmed Kubernetes cluster for monitoring, run the following command:
juju config kubernetes-worker allow-privileged=true
Cluster connect
Old version of agents used
Some older agent versions didn't support the Cluster Connect feature. If you use one of these versions, you may see this error:
az connectedk8s proxy -n AzureArcTest -g AzureArcTest
Hybrid connection for the target resource does not exist. Agent might not have started successfully.
Be sure to use the connectedk8s
Azure CLI extension with version >= 1.2.0, then connect your cluster again to Azure Arc. Also, verify that you've met all the network prerequisites needed for Arc-enabled Kubernetes.
If your cluster is behind an outbound proxy or firewall, verify that websocket connections are enabled for *.servicebus.windows.net
, which is required specifically for the Cluster Connect feature.
Cluster Connect feature disabled
If the clusterconnect-agent
and kube-aad-proxy
pods are missing, then the cluster connect feature is likely disabled on the cluster, and az connectedk8s proxy
will fail to establish a session with the cluster.
az connectedk8s proxy -n AzureArcTest -g AzureArcTest
Cannot connect to the hybrid connection because no agent is connected in the target arc resource.
To resolve this error, enable the Cluster Connect feature on your cluster.
az connectedk8s enable-features --features cluster-connect -n $CLUSTER_NAME -g $RESOURCE_GROUP
Enable custom locations using service principal
When connecting your cluster to Azure Arc or enabling custom locations on an existing cluster, you may see the following warning:
Unable to fetch oid of 'custom-locations' app. Proceeding without enabling the feature. Insufficient privileges to complete the operation.
This warning occurs when you use a service principal to log into Azure. The service principal doesn't have permissions to get information of the application used by Azure Arc service. To avoid this error, execute the following steps:
Sign in into Azure CLI using your user account. Fetch the Object ID of the Azure AD application used by Azure Arc service:
az ad sp show --id bc313c14-388c-4e7d-a58e-70017303ee3b --query objectId -o tsv
Sign in into Azure CLI using the service principal. Use the
<objectId>
value from above step to enable custom locations on the cluster:To enable custom locations when connecting the cluster to Arc, run the following command:
az connectedk8s connect -n <cluster-name> -g <resource-group-name> --custom-locations-oid <objectId>
To enable custom locations on an existing Azure Arc-enabled Kubernetes cluster, run the following command:
az connectedk8s enable-features -n <cluster-name> -g <resource-group-name> --custom-locations-oid <objectId> --features cluster-connect custom-locations
Azure Arc-enabled Open Service Mesh
The steps below provide guidance on validating the deployment of all the Open Service Mesh (OSM) extension components on your cluster.
Check OSM Controller Deployment
kubectl get deployment -n arc-osm-system --selector app=osm-controller
If the OSM Controller is healthy, you'll see output similar to the following:
NAME READY UP-TO-DATE AVAILABLE AGE
osm-controller 1/1 1 1 59m
Check the OSM Controller Pod
kubectl get pods -n arc-osm-system --selector app=osm-controller
If the OSM Controller is healthy, you'll see output similar to the following:
NAME READY STATUS RESTARTS AGE
osm-controller-b5bd66db-wglzl 0/1 Evicted 0 61m
osm-controller-b5bd66db-wvl9w 1/1 Running 0 31m
Even though one controller was Evicted at some point, there's another which is READY 1/1
and Running
with 0
restarts. If the column READY
is anything other than 1/1
, the service mesh would be in a broken state. Column READY
with 0/1
indicates the control plane container is crashing. Use the following command to inspect controller logs:
kubectl logs -n arc-osm-system -l app=osm-controller
Column READY
with a number higher than 1 after the /
would indicate that there are sidecars installed. OSM Controller would most likely not work with any sidecars attached to it.
Check OSM Controller Service
kubectl get service -n arc-osm-system osm-controller
If the OSM Controller is healthy, you'll see the following output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
osm-controller ClusterIP 10.0.31.254 <none> 15128/TCP,9092/TCP 67m
Note
The CLUSTER-IP
would be different. The service NAME
and PORT(S)
must be the same as seen in the output.
Check OSM Controller Endpoints
kubectl get endpoints -n arc-osm-system osm-controller
If the OSM Controller is healthy, you'll see output similar to the following:
NAME ENDPOINTS AGE
osm-controller 10.240.1.115:9092,10.240.1.115:15128 69m
If the user's cluster has no ENDPOINTS
for osm-controller
, the control plane is unhealthy. This unhealthy state may be caused by the OSM Controller pod crashing, or the pod may never have been deployed correctly.
Check OSM Injector Deployment
kubectl get deployments -n arc-osm-system osm-injector
If the OSM Injector is healthy, you'll see output similar to the following:
NAME READY UP-TO-DATE AVAILABLE AGE
osm-injector 1/1 1 1 73m
Check OSM Injector Pod
kubectl get pod -n arc-osm-system --selector app=osm-injector
If the OSM Injector is healthy, you'll see output similar to the following:
NAME READY STATUS RESTARTS AGE
osm-injector-5986c57765-vlsdk 1/1 Running 0 73m
The READY
column must be 1/1
. Any other value would indicate an unhealthy osm-injector pod.
Check OSM Injector Service
kubectl get service -n arc-osm-system osm-injector
If the OSM Injector is healthy, you'll see output similar to the following:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
osm-injector ClusterIP 10.0.39.54 <none> 9090/TCP 75m
Ensure the IP address listed for osm-injector
service is 9090
. There should be no EXTERNAL-IP
.
Check OSM Injector Endpoints
kubectl get endpoints -n arc-osm-system osm-injector
If the OSM Injector is healthy, you'll see output similar to the following:
NAME ENDPOINTS AGE
osm-injector 10.240.1.172:9090 75m
For OSM to function, there must be at least one endpoint for osm-injector
. The IP address of your OSM Injector endpoints will be different. The port 9090
must be the same.
Check Validating and Mutating webhooks
kubectl get ValidatingWebhookConfiguration --selector app=osm-controller
If the Validating webhook is healthy, you'll see output similar to the following:
NAME WEBHOOKS AGE
osm-validator-mesh-osm 1 81m
kubectl get MutatingWebhookConfiguration --selector app=osm-injector
If the Mutating webhook is healthy, you'll see output similar to the following:
NAME WEBHOOKS AGE
arc-osm-webhook-osm 1 102m
Check for the service and the CA bundle of the Validating webhook by using the following command:
kubectl get ValidatingWebhookConfiguration osm-validator-mesh-osm -o json | jq '.webhooks[0].clientConfig.service'
A well configured Validating webhook configuration will have output similar to the following:
{
"name": "osm-config-validator",
"namespace": "arc-osm-system",
"path": "/validate",
"port": 9093
}
Check for the service and the CA bundle of the Mutating webhook by using the following command:
kubectl get MutatingWebhookConfiguration arc-osm-webhook-osm -o json | jq '.webhooks[0].clientConfig.service'
A well configured Mutating webhook configuration will have output similar to the following:
{
"name": "osm-injector",
"namespace": "arc-osm-system",
"path": "/mutate-pod-creation",
"port": 9090
}
Check whether OSM Controller has given the Validating (or Mutating) webhook a CA Bundle by using the following command:
kubectl get ValidatingWebhookConfiguration osm-validator-mesh-osm -o json | jq -r '.webhooks[0].clientConfig.caBundle' | wc -c
kubectl get MutatingWebhookConfiguration arc-osm-webhook-osm -o json | jq -r '.webhooks[0].clientConfig.caBundle' | wc -c
Example output:
1845
The number in the output indicates the number of bytes, or the size of the CA Bundle. If this is empty, 0, or a number under 1000, the CA Bundle is not correctly provisioned. Without a correct CA Bundle, the ValidatingWebhook
will throw an error.
Check the osm-mesh-config
resource
Check for the existence of the resource:
kubectl get meshconfig osm-mesh-config -n arc-osm-system
Check the content of the OSM MeshConfig:
kubectl get meshconfig osm-mesh-config -n arc-osm-system -o yaml
apiVersion: config.openservicemesh.io/v1alpha1
kind: MeshConfig
metadata:
creationTimestamp: "0000-00-00A00:00:00A"
generation: 1
name: osm-mesh-config
namespace: arc-osm-system
resourceVersion: "2494"
uid: 6c4d67f3-c241-4aeb-bf4f-b029b08faa31
spec:
certificate:
certKeyBitSize: 2048
serviceCertValidityDuration: 24h
featureFlags:
enableAsyncProxyServiceMapping: false
enableEgressPolicy: true
enableEnvoyActiveHealthChecks: false
enableIngressBackendPolicy: true
enableMulticlusterMode: false
enableRetryPolicy: false
enableSnapshotCacheMode: false
enableWASMStats: true
observability:
enableDebugServer: false
osmLogLevel: info
tracing:
enable: false
sidecar:
configResyncInterval: 0s
enablePrivilegedInitContainer: false
logLevel: error
resources: {}
traffic:
enableEgress: false
enablePermissiveTrafficPolicyMode: true
inboundExternalAuthorization:
enable: false
failureModeAllow: false
statPrefix: inboundExtAuthz
timeout: 1s
inboundPortExclusionList: []
outboundIPRangeExclusionList: []
outboundPortExclusionList: []
kind: List
metadata:
resourceVersion: ""
selfLink: ""
osm-mesh-config
resource values:
Key | Type | Default Value | Kubectl Patch Command Examples |
---|---|---|---|
spec.traffic.enableEgress | bool | false |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"traffic":{"enableEgress":false}}}' --type=merge |
spec.traffic.enablePermissiveTrafficPolicyMode | bool | true |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"traffic":{"enablePermissiveTrafficPolicyMode":true}}}' --type=merge |
spec.traffic.outboundPortExclusionList | array | [] |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"traffic":{"outboundPortExclusionList":[6379,8080]}}}' --type=merge |
spec.traffic.outboundIPRangeExclusionList | array | [] |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"traffic":{"outboundIPRangeExclusionList":["10.0.0.0/32","1.1.1.1/24"]}}}' --type=merge |
spec.traffic.inboundPortExclusionList | array | [] |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"traffic":{"inboundPortExclusionList":[6379,8080]}}}' --type=merge |
spec.certificate.serviceCertValidityDuration | string | "24h" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"certificate":{"serviceCertValidityDuration":"24h"}}}' --type=merge |
spec.observability.enableDebugServer | bool | false |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"observability":{"enableDebugServer":false}}}' --type=merge |
spec.observability.osmLogLevel | string | "info" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"observability":{"tracing":{"osmLogLevel": "info"}}}}' --type=merge |
spec.observability.tracing.enable | bool | false |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"observability":{"tracing":{"enable":true}}}}' --type=merge |
spec.sidecar.enablePrivilegedInitContainer | bool | false |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"sidecar":{"enablePrivilegedInitContainer":true}}}' --type=merge |
spec.sidecar.logLevel | string | "error" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"sidecar":{"logLevel":"error"}}}' --type=merge |
spec.featureFlags.enableWASMStats | bool | "true" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableWASMStats":"true"}}}' --type=merge |
spec.featureFlags.enableEgressPolicy | bool | "true" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableEgressPolicy":"true"}}}' --type=merge |
spec.featureFlags.enableMulticlusterMode | bool | "false" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableMulticlusterMode":"false"}}}' --type=merge |
spec.featureFlags.enableSnapshotCacheMode | bool | "false" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableSnapshotCacheMode":"false"}}}' --type=merge |
spec.featureFlags.enableAsyncProxyServiceMapping | bool | "false" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableAsyncProxyServiceMapping":"false"}}}' --type=merge |
spec.featureFlags.enableIngressBackendPolicy | bool | "true" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableIngressBackendPolicy":"true"}}}' --type=merge |
spec.featureFlags.enableEnvoyActiveHealthChecks | bool | "false" |
kubectl patch meshconfig osm-mesh-config -n arc-osm-system -p '{"spec":{"featureFlags":{"enableEnvoyActiveHealthChecks":"false"}}}' --type=merge |
Check namespaces
Note
The arc-osm-system namespace will never participate in a service mesh and will never be labeled or annotated with the key/values below.
We use the osm namespace add
command to join namespaces to a given service mesh. When a Kubernetes namespace is part of the mesh, confirm the following:
View the annotations of the namespace bookbuyer
:
kubectl get namespace bookbuyer -o json | jq '.metadata.annotations'
The following annotation must be present:
{
"openservicemesh.io/sidecar-injection": "enabled"
}
View the labels of the namespace bookbuyer
:
kubectl get namespace bookbuyer -o json | jq '.metadata.labels'
The following label must be present:
{
"openservicemesh.io/monitored-by": "osm"
}
If you aren't using osm
CLI, you could also manually add these annotations to your namespaces. If a namespace isn't annotated with "openservicemesh.io/sidecar-injection": "enabled"
, or isn't labeled with "openservicemesh.io/monitored-by": "osm"
, the OSM Injector will not add Envoy sidecars.
Note
After osm namespace add
is called, only new pods will be injected with an Envoy sidecar. Existing pods must be restarted with kubectl rollout restart deployment
command.
Verify the SMI CRDs
Check whether the cluster has the required Custom Resource Definitions (CRDs) by using the following command:
kubectl get crds
Ensure that the CRDs correspond to the versions available in the release branch. For example, if you're using OSM-Arc v1.0.0-1, navigate to the SMI supported versions page and select v1.0 from the Releases dropdown to check which CRDs versions are in use.
Get the versions of the CRDs installed with the following command:
for x in $(kubectl get crds --no-headers | awk '{print $1}' | grep 'smi-spec.io'); do
kubectl get crd $x -o json | jq -r '(.metadata.name, "----" , .spec.versions[].name, "\n")'
done
If CRDs are missing, use the following commands to install them on the cluster. If you're using a version of OSM-Arc that's not v1.0, ensure that you replace the version in the command (for example, v1.1.0 would be release-v1.1).
kubectl apply -f https://raw.githubusercontent.com/openservicemesh/osm/release-v1.0/cmd/osm-bootstrap/crds/smi_http_route_group.yaml
kubectl apply -f https://raw.githubusercontent.com/openservicemesh/osm/release-v1.0/cmd/osm-bootstrap/crds/smi_tcp_route.yaml
kubectl apply -f https://raw.githubusercontent.com/openservicemesh/osm/release-v1.0/cmd/osm-bootstrap/crds/smi_traffic_access.yaml
kubectl apply -f https://raw.githubusercontent.com/openservicemesh/osm/release-v1.0/cmd/osm-bootstrap/crds/smi_traffic_split.yaml
To see CRD changes between releases, refer to the OSM release notes.
Troubleshoot certificate management
For information on how OSM issues and manages certificates to Envoy proxies running on application pods, see the OSM docs site.
Upgrade Envoy
When a new pod is created in a namespace monitored by the add-on, OSM will inject an Envoy proxy sidecar in that pod. If the Envoy version needs to be updated, follow the steps in the Upgrade Guide on the OSM docs site.
Feedback
Submit and view feedback for