Fix known issues and errors when managing storage in AKS Arc

Use this article to help you troubleshoot and resolve storage-related issues in AKS Arc.

Configuring persistent volume claims results in the error: "Unable to initialize agent. Error: mkdir /var/log/agent: permission denied"

This permission denied error indicates that the default storage class may not be suitable for your workloads. It occurs in Linux workloads running on Kubernetes version 1.19.x or later. Following security best practices, many Linux workloads specify the securityContext fsGroup setting for a pod. These workloads fail to start on AKS on Azure Stack HCI because the default storage class doesn't specify the fstype (=ext4) parameter, so Kubernetes can't change the ownership of files and persistent volumes to match the fsGroup requested by the workload.
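
For illustration, the following is a minimal sketch of a pod that sets fsGroup; the pod name and PVC name are hypothetical. Kubernetes applies fsGroup by changing the group ownership of the mounted volume, which is the step that fails here:

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-demo              # hypothetical name
spec:
  securityContext:
    fsGroup: 2000                 # Kubernetes chowns mounted volumes to this group ID
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc         # hypothetical PVC provisioned from the storage class
EOF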

To resolve this issue, define a custom storage class that sets the fsType parameter, and use that class to provision your PVCs.
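
The following is a minimal sketch of such a storage class. It assumes the disk CSI provisioner used by AKS on Azure Stack HCI (disk.csi.akshci.com); confirm the provisioner and any other required parameters by inspecting your existing default class with kubectl get storageclass default -o yaml:

cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: custom-ext4               # hypothetical name
provisioner: disk.csi.akshci.com  # assumed provisioner; verify against your default class
parameters:
  fsType: ext4                    # the parameter the default class omits
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
EOF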

Container storage interface pod stuck in a 'ContainerCreating' state

A new Kubernetes workload cluster was created with Kubernetes version 1.16.10 and then updated to 1.16.15. After the update, the csi-msk8scsi-node-9x47m pod was stuck in the ContainerCreating state, and the kube-proxy-qqnkr pod was stuck in the Terminating state as shown in the output below:

kubectl.exe get nodes
NAME              STATUS     ROLES    AGE     VERSION
moc-lf22jcmu045   Ready      <none>   5h40m   v1.16.15
moc-lqjzhhsuo42   Ready      <none>   5h38m   v1.16.15
moc-lwan4ro72he   NotReady   master   5h44m   v1.16.15

kubectl.exe get pods -A

NAMESPACE     NAME                        READY   STATUS              RESTARTS   AGE
kube-system   csi-msk8scsi-node-9x47m     0/3     ContainerCreating   0          5h44m
kube-system   kube-proxy-qqnkr            1/1     Terminating         0          5h44m

Because the kubelet is in a bad state and can no longer talk to the API server, the only solution is to restart the kubelet service. After the restart, the cluster returns to a running state.
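
As a sketch, you can restart the kubelet from a shell on the affected node (in the output above, the master node moc-lwan4ro72he); the systemctl commands are standard, but confirm the service name on your node image:

# Restart the kubelet and confirm it comes back up
sudo systemctl restart kubelet
sudo systemctl status kubelet --no-pager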

Disk storage filled up from crash dump logs

Disk storage can fill up with crash dump logs created because of an expired Geneva agent client certificate. The symptoms can include the following:

  • Services fail to start.
  • Kubernetes pods, deployments, and other resources fail to start due to insufficient resources.

Important

This issue can impact all new Mariner management and target cluster nodes created after April 18, 2023 on releases from April 2022 to March 2023. The issue is fixed in the 2023-05-09 release and later.

This issue can impact any operation that involves allocating disk space or writing new files, so any "insufficient disk space/resources" error is a good hint. To check if this issue is present on a given node, run the following shell command:

clouduser@moc-lwm2oudnskl $ sudo du -h /var/lib/systemd/coredump/

This command reports the storage space consumed by the diagnostic files.
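
If that directory holds many large dump files, the node is likely affected. As an additional sketch, you can check the overall free space on the node with a standard disk-usage command (the path to check may vary by deployment):

sudo df -h /var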

Root cause

The client certificate that authenticates the Geneva agent to the service endpoint expires, which causes the agent to crash and produce a crash dump. At initial startup, the agent's crash/retry loop repeats about every 5 seconds, and there is no timeout. Each crash writes a new file of about 330 MB to the node's file system; at that rate, roughly 4 GB accumulates per minute, which can rapidly consume disk storage.

Mitigation

The preferred mitigation is to upgrade to the latest release, version 1.10.18.10425, which has an updated certificate. To do so, first manually upgrade your workload clusters to any supported minor version before you update your AKS-HCI host.
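
As a sketch using the AksHci PowerShell module (the cluster name and Kubernetes version below are placeholders; verify available values with Get-AksHciCluster and Get-AksHciKubernetesVersion on your deployment):

# Upgrade each workload cluster to a supported Kubernetes version first
Update-AksHciCluster -name mycluster -kubernetesVersion v1.24.11
# Then update the AKS-HCI host itself
Update-AksHci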

For more information about AKS Arc releases, and all the latest AKS-HCI news, subscribe to the AKS releases page.

If upgrading is not an option, you can turn off the mdsd service. For each Mariner node:

  1. Turn off the Geneva agent with the following shell command:

    sudo systemctl disable --now mdsd
    
  2. Verify that the Geneva agent was successfully disabled; the output should report the mdsd unit as inactive (dead):

    sudo systemctl status mdsd
    
  3. Delete the accumulated files with the following commands:

    # Remove core dump files older than one minute
    sudo find /var/lib/systemd/coredump/ -type f -mmin +1 -exec rm -f {} \;
    # Clean up leftover systemd-coredump propagation entries
    sudo find /run/systemd/propagate -name 'systemd-coredump@*' -delete
    # Rotate the journal and cap its size at 500 MB
    sudo journalctl --rotate && sudo journalctl --vacuum-size=500M
    
  4. Reboot the node:

    sudo reboot
    

Storage pod crashes and the logs say that the `createSubDir` parameter is invalid

An error can occur if you have an SMB or NFS CSI driver installed in your deployment and you upgrade to the May build from an older version. One of the parameters, called createSubDir, is no longer accepted. If this applies to your deployment, the storage pod crashes and the logs indicate that the createSubDir parameter is invalid.

To resolve the failure, delete the affected storage class and recreate it without the createSubDir parameter.
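
As a minimal sketch, assuming the upstream SMB CSI driver (smb.csi.k8s.io) and hypothetical names for the class, share, and credentials secret, the recreated class simply omits createSubDir:

kubectl delete storageclass smb-example
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: smb-example                                        # hypothetical name
provisioner: smb.csi.k8s.io
parameters:
  source: //smb-server.example.com/share                   # hypothetical share
  csi.storage.k8s.io/node-stage-secret-name: smbcreds      # hypothetical secret
  csi.storage.k8s.io/node-stage-secret-namespace: default
  # no createSubDir entry; the parameter is no longer accepted
reclaimPolicy: Retain
volumeBindingMode: Immediate
EOF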

When creating a persistent volume, an attempt to mount the volume fails

After you delete a persistent volume or a persistent volume claim in an AKS Arc environment, a new persistent volume is created to map to the same share. However, when you attempt to mount the volume, the mount fails, and the pod times out with the error NewSmbGlobalMapping failed.

To work around the failure to mount the new volume, you can SSH into the Windows node, run Remove-SMBGlobalMapping, and provide the share that corresponds to the volume. After running this command, attempts to mount the volume should succeed.
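
As a sketch, from a PowerShell session on the Windows node (the share path below is hypothetical; substitute the share backing the volume):

Remove-SmbGlobalMapping -RemotePath \\smb-server.example.com\share -Force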