Hi,
Many thanks for your reply and your efforts to reproduce the issue. Here are the outputs from my system:
crictl ps|egrep rabbit\|wordpress
55c0e11ff37f4 87505dc99f218 10 days ago Running rabbitmq 0 11b08bab81bdc
a8e4aeabe747e cfb931188dab8 2 weeks ago Running wordpress 0 a64c38b0ee2ef
crictl inspect 55c0e11ff37f4 a8e4aeabe747e | grep pvc | grep hostPath
"hostPath": "/var/lib/kubelet/pods/96efafa7-660e-41a9-800f-8e5089c582e4/volumes/kubernetes.io~azure-disk/pvc-69707b58-30d9-4001-9004-76bbd0db1a36",
"hostPath": "/var/lib/kubelet/pods/8f218053-7c93-4c75-97b2-e55e06f417e4/volumes/kubernetes.io~azure-disk/pvc-c66de91e-602c-4934-b513-b930993ad8b7",
I'm afraid I can't do the diff on the vol_data.json, because the file doesn't exist. It appears that it was introduced in a more recent version: the cluster that has the issue is on the slightly out-of-date version v1.19.9, and I only find the file on one of my newer clusters that runs v1.22. I realise I need to upgrade, but I'm a bit worried that this will make things worse.
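As an alternative to the vol_data.json diff, I could compare what the two per-pod volume directories actually resolve to on the node, something like this (paths taken from the crictl inspect output above):
# which block device backs each per-pod azure-disk mount?
findmnt -o TARGET,SOURCE /var/lib/kubelet/pods/96efafa7-660e-41a9-800f-8e5089c582e4/volumes/kubernetes.io~azure-disk/pvc-69707b58-30d9-4001-9004-76bbd0db1a36
findmnt -o TARGET,SOURCE /var/lib/kubelet/pods/8f218053-7c93-4c75-97b2-e55e06f417e4/volumes/kubernetes.io~azure-disk/pvc-c66de91e-602c-4934-b513-b930993ad8b7
If both report the same SOURCE device, that would confirm the two PVCs are bind-mounted from the same disk on the node.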
df -h | grep pvc doesn't show me the volumes, but I can find them with:
root@aks-a2m-15622180-vmss000002:/# df -h |grep '/dev/sd'
/dev/sda1 97G 21G 77G 21% /
/dev/sda15 105M 4.4M 100M 5% /boot/efi
/dev/sde1 20G 44M 19G 1% /mnt
/dev/sdf 4.8G 615M 4.2G 13% /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m716043828
/dev/sdd 148G 26G 122G 18% /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m1670688361
It's the /dev/sdf disk: ls -l /var/lib/kubelet/plugins/kubernetes.io/azure-disk/mounts/m716043828 shows me the files I can see in both the wordpress and the rabbitmq container. The /dev/sdd disk belongs to the other WordPress pod (from a different deployment), which also suddenly shares storage with an Elasticsearch pod (same issue, different pair of pods).
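If it's useful, I can also map those device names back to Azure LUNs; assuming the standard Azure udev rules are present on the node image, there should be symlinks like:
# LUN -> device mapping created by the Azure udev rules
ls -l /dev/disk/azure/scsi1/
That would tell me which LUNs /dev/sdf and /dev/sdd are attached as, and I could compare that against the data disks shown for this VMSS instance in the portal.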
root@aks-a2m-15622180-vmss000002:/# blkid
/dev/sda1: LABEL="cloudimg-rootfs" UUID="5a9997c3-aafd-46e9-954c-781f2b11fb68" TYPE="ext4" PARTUUID="cbc2fcb7-e40a-4fec-a370-51888c246f12"
/dev/sda15: LABEL="UEFI" UUID="2FBA-C33A" TYPE="vfat" PARTUUID="53fbf8ed-db79-4c52-8e42-78dbf30ff35c"
/dev/sdb: UUID="2bf2a47b-e77e-4648-9ba6-7fcd0cb2a1cd" TYPE="ext4"
/dev/sdc: UUID="64e259b0-9ca4-421f-b23a-90b304bcb383" TYPE="ext4"
/dev/sdd: UUID="3c76aea0-b4e1-4013-a1b0-ff6b67bc88df" TYPE="ext4"
/dev/sde1: UUID="9b50c84e-44f9-4fd8-b4b1-32ab7d16de1e" TYPE="ext4" PARTUUID="dc2b2422-01"
/dev/sda14: PARTUUID="de01bd39-4bfe-4bc8-aff7-986e694f7972"
/dev/sdf: UUID="c4ba423b-1ada-4ff4-bfe6-098caf5fdb08" TYPE="ext4"
Some additional information: the problem started on the 16th of November, after I drained and rebooted all the nodes in turn. The cluster is almost 2 years old and I have patched and upgraded it several times without major issues. The configuration of the affected deployments/StatefulSets hasn't changed in over a year.
Some of the PVs took ages to re-attach after the reboot (about 30-45 minutes). All I did to try to resolve this was delete the affected pods occasionally, and they attached eventually. Afterwards things looked OK, until I discovered that some pods share a volume. All the affected nodes are in availability zone 3.
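One more thing I could collect if you think it's relevant: what the attach/detach controller believes is attached to the affected node, for example (node name from the prompt above; assuming jq is available):
# volumes the control plane thinks are attached to / in use on this node
kubectl get node aks-a2m-15622180-vmss000002 -o json | jq '.status.volumesAttached'
kubectl get node aks-a2m-15622180-vmss000002 -o json | jq '.status.volumesInUse'
If the same disk shows up more than once, or with an unexpected devicePath, that might help narrow down where things went wrong after the reboot.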
Any ideas what I could do to fix this?
Thanks again for your help with this!
Karsten