Customize node configuration for Azure Kubernetes Service (AKS) node pools

Customizing your node configuration allows you to adjust operating system (OS) settings or kubelet parameters to match the needs of your workloads. When you create an AKS cluster or add a node pool to your cluster, you can customize a subset of commonly used OS and kubelet settings. To configure settings beyond this subset, you can use a daemon set to apply the configurations you need without losing AKS support for your nodes.

Create an AKS cluster with a customized node configuration

Create config files

OS and kubelet configuration changes require you to create a new configuration file with the parameters and your desired settings. If a value for a parameter is not specified, then the value will be set to the default.

Kubelet configuration

Create a linuxkubeletconfig.json file with the following contents:

{
 "cpuManagerPolicy": "static",
 "cpuCfsQuota": true,
 "cpuCfsQuotaPeriod": "200ms",
 "imageGcHighThreshold": 90,
 "imageGcLowThreshold": 70,
 "topologyManagerPolicy": "best-effort",
 "allowedUnsafeSysctls": [
  "kernel.msg*",
  "net.*"
 ],
 "failSwapOn": false
}

OS configuration

Create a linuxosconfig.json file with the following contents:

{
 "transparentHugePageEnabled": "madvise",
 "transparentHugePageDefrag": "defer+madvise",
 "swapFileSizeMB": 1500,
 "sysctls": {
  "netCoreSomaxconn": 163849,
  "netIpv4TcpTwReuse": true,
  "netIpv4IpLocalPortRange": "32000 60000"
 }
}

Create a new cluster using custom configuration files

When creating a new cluster, you can use the customized configuration files created in the previous steps to specify the kubelet configuration, OS configuration, or both.

Note

If you specify a configuration when creating a cluster, only the nodes in the initial node pool will have that configuration applied. Any settings not configured in the JSON file will retain the default value. CustomLinuxOsConfig isn't supported for OS type: Windows.

Create a new cluster by using the az aks create command and specifying your configuration files. The following example command creates a new cluster with the custom ./linuxkubeletconfig.json and ./linuxosconfig.json files:

az aks create --name myAKSCluster --resource-group myResourceGroup --kubelet-config ./linuxkubeletconfig.json --linux-os-config ./linuxosconfig.json

Add a node pool using custom configuration files

When adding a node pool to a cluster, you can use the customized configuration file created in the previous step to specify the kubelet configuration. CustomKubeletConfig is supported for Linux and Windows node pools.

Note

When you add a Linux node pool to an existing cluster, you can specify the kubelet configuration, OS configuration, or both. When you add a Windows node pool to an existing cluster, you can only specify the kubelet configuration. If you specify a configuration when adding a node pool, only the nodes in the new node pool will have that configuration applied. Any settings not configured in the JSON file will retain the default value.

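Add a node pool by using the az aks nodepool add command and specifying your configuration files. The following example command adds a node pool with the custom ./linuxkubeletconfig.json file:

az aks nodepool add --name mynodepool1 --cluster-name myAKSCluster --resource-group myResourceGroup --kubelet-config ./linuxkubeletconfig.json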

Other configurations

The following settings can be used to modify other operating system settings:

Message of the Day

Pass the --message-of-the-day flag with the location of the file to replace the Message of the Day on Linux nodes at cluster creation or node pool creation.

Cluster creation

az aks create --name myAKSCluster --resource-group myResourceGroup --message-of-the-day ./newMOTD.txt

Node pool creation

az aks nodepool add --name mynodepool1 --cluster-name myAKSCluster --resource-group myResourceGroup --message-of-the-day ./newMOTD.txt

Confirm settings have been applied

After you apply a custom node configuration, you can confirm that the settings have been applied by connecting to the host and verifying that the sysctl or configuration changes have been made on the filesystem.
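As a minimal sketch of that check, the following shell snippet reads sysctl values directly from /proc/sys once you are connected to a node. The helper names sysctl_proc_path and show_sysctl are hypothetical and not part of AKS or the Azure CLI; the setting names are the ones used in the example linuxosconfig.json earlier in this article.

```shell
# Map a sysctl name to its file under /proc/sys (dots become slashes).
# Hypothetical helper, for illustration only.
sysctl_proc_path() {
  printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the current value of a sysctl, or a note if it is not readable here.
show_sysctl() {
  path=$(sysctl_proc_path "$1")
  if [ -r "$path" ]; then
    printf '%s = %s\n' "$1" "$(cat "$path")"
  else
    printf '%s: not readable at %s\n' "$1" "$path"
  fi
}

# Settings from the example linuxosconfig.json in this article.
for name in net.core.somaxconn net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range; do
  show_sysctl "$name"
done
```

Compare the printed values with the ones in your configuration file. Any setting you did not specify should show its default.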

Custom node configuration supported parameters

Kubelet custom configuration

Kubelet custom configuration is supported for Linux and Windows node pools. Supported parameters differ and are documented below.

Linux Kubelet custom configuration

| Parameter | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| cpuManagerPolicy | none, static | none | The static policy allows containers in Guaranteed pods with integer CPU requests access to exclusive CPUs on the node. |
| cpuCfsQuota | true, false | true | Enable/Disable CPU CFS quota enforcement for containers that specify CPU limits. |
| cpuCfsQuotaPeriod | Interval in milliseconds (ms) | 100ms | Sets CPU CFS quota period value. |
| imageGcHighThreshold | 0-100 | 85 | The percent of disk usage after which image garbage collection is always run. Minimum disk usage that will trigger garbage collection. To disable image garbage collection, set to 100. |
| imageGcLowThreshold | 0-100, no higher than imageGcHighThreshold | 80 | The percent of disk usage before which image garbage collection is never run. Minimum disk usage that can trigger garbage collection. |
| topologyManagerPolicy | none, best-effort, restricted, single-numa-node | none | Optimizes NUMA node alignment. For details, see the Kubernetes Topology Manager documentation. |
| allowedUnsafeSysctls | kernel.shm*, kernel.msg*, kernel.sem, fs.mqueue.*, net.* | None | Allowed list of unsafe sysctls or unsafe sysctl patterns. |
| containerLogMaxSizeMB | Size in megabytes (MB) | 50 | The maximum size (for example, 10 MB) of a container log file before it's rotated. |
| containerLogMaxFiles | ≥ 2 | 5 | The maximum number of container log files that can be present for a container. |
| podMaxPids | -1 to kernel PID limit | -1 (∞) | The maximum number of process IDs that can be running in a pod. |

Windows Kubelet custom configuration

| Parameter | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| imageGcHighThreshold | 0-100 | 85 | The percent of disk usage after which image garbage collection is always run. Minimum disk usage that will trigger garbage collection. To disable image garbage collection, set to 100. |
| imageGcLowThreshold | 0-100, no higher than imageGcHighThreshold | 80 | The percent of disk usage before which image garbage collection is never run. Minimum disk usage that can trigger garbage collection. |
| containerLogMaxSizeMB | Size in megabytes (MB) | 10 | The maximum size (for example, 10 MB) of a container log file before it's rotated. |
| containerLogMaxFiles | ≥ 2 | 5 | The maximum number of container log files that can be present for a container. |

Linux custom OS configuration settings

Important

To simplify search and readability, the OS settings are displayed in this article by their name, but they should be added to the configuration JSON file or AKS API using camelCase capitalization convention.

For example, if you modify the 'vm.max_map_count' setting, you should reformat it to 'vmMaxMapCount' in the configuration JSON file.
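The renaming rule can be sketched as a small shell function. This is an illustration only: sysctl_to_camel is a hypothetical helper, it assumes the name is split on '.', '_', and '-', and the property names it produces should still be checked against the AKS API reference before use, since this sketch applies only the general rule.

```shell
# Convert a sysctl-style name to camelCase by splitting on '.', '_' and '-'.
# Hypothetical helper, for illustration only.
sysctl_to_camel() {
  printf '%s\n' "$1" | awk -F'[._-]' '{
    out = $1
    for (i = 2; i <= NF; i++) {
      # Capitalize the first letter of every part after the first.
      out = out toupper(substr($i, 1, 1)) substr($i, 2)
    }
    print out
  }'
}

sysctl_to_camel vm.max_map_count        # prints vmMaxMapCount
sysctl_to_camel net.ipv4.tcp_tw_reuse   # prints netIpv4TcpTwReuse
sysctl_to_camel fs.file-max             # prints fsFileMax
```

The second result matches the netIpv4TcpTwReuse key used in the example linuxosconfig.json earlier in this article.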

File handle limits

When serving a lot of traffic, the traffic commonly comes from a large number of local files. You can adjust the following kernel settings and built-in limits to handle more open files, at the cost of some system memory.

| Setting | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| fs.file-max | 8192 - 12000500 | 709620 | Maximum number of file handles that the Linux kernel will allocate. Increasing this value increases the maximum number of open files permitted. |
| fs.inotify.max_user_watches | 781250 - 2097152 | 1048576 | Maximum number of file watches allowed by the system. Each watch is roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit kernel. |
| fs.aio-max-nr | 65536 - 6553500 | 65536 | The aio-nr shows the current system-wide number of asynchronous I/O requests. aio-max-nr allows you to change the maximum value aio-nr can grow to. |
| fs.nr_open | 8192 - 20000500 | 1048576 | The maximum number of file handles a process can allocate. |
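To make the memory cost concrete, the following sketch estimates the kernel memory consumed if every inotify watch allowed by fs.inotify.max_user_watches were in use. The function name inotify_watch_mib is hypothetical, and the figure of 160 bytes per watch is the rough 64-bit value quoted in the table, so treat the result as an upper-bound estimate rather than a measurement.

```shell
# Approximate memory (in MiB) used when all inotify watches are in use.
# $1 = number of watches, $2 = bytes per watch (about 160 on a 64-bit kernel).
# Hypothetical helper, for illustration only.
inotify_watch_mib() {
  echo $(( $1 * $2 / 1024 / 1024 ))
}

inotify_watch_mib 1048576 160   # default limit, prints 160
inotify_watch_mib 2097152 160   # maximum allowed value, prints 320
```

Under these assumptions, raising the setting from its default to the maximum allowed value roughly doubles the worst-case memory used by watches.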

Socket and network tuning

Agent nodes are often expected to handle very large numbers of concurrent sessions. For these nodes, you can tune the following subset of TCP and network options per node pool.

| Setting | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| net.core.somaxconn | 4096 - 3240000 | 16384 | Maximum number of connection requests that can be queued for any given listening socket. An upper limit for the value of the backlog parameter passed to the listen(2) function. If the backlog argument is greater than somaxconn, then it's silently truncated to this limit. |
| net.core.netdev_max_backlog | 1000 - 3240000 | 1000 | Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than the kernel can process them. |
| net.core.rmem_max | 212992 - 134217728 | 212992 | The maximum receive socket buffer size in bytes. |
| net.core.wmem_max | 212992 - 134217728 | 212992 | The maximum send socket buffer size in bytes. |
| net.core.optmem_max | 20480 - 4194304 | 20480 | Maximum ancillary buffer size (option memory buffer) allowed per socket. Socket option memory is used in a few cases to store extra structures relating to usage of the socket. |
| net.ipv4.tcp_max_syn_backlog | 128 - 3240000 | 16384 | The maximum number of queued connection requests that have still not received an acknowledgment from the connecting client. If this number is exceeded, the kernel will begin dropping requests. |
| net.ipv4.tcp_max_tw_buckets | 8000 - 1440000 | 32768 | Maximum number of time-wait sockets held by the system simultaneously. If this number is exceeded, the time-wait socket is immediately destroyed and a warning is printed. |
| net.ipv4.tcp_fin_timeout | 5 - 120 | 60 | The length of time an orphaned (no longer referenced by any application) connection will remain in the FIN_WAIT_2 state before it's aborted at the local end. |
| net.ipv4.tcp_keepalive_time | 30 - 432000 | 7200 | How often TCP sends out keepalive messages when keepalive is enabled. |
| net.ipv4.tcp_keepalive_probes | 1 - 15 | 9 | How many keepalive probes TCP sends out, until it decides that the connection is broken. |
| net.ipv4.tcp_keepalive_intvl | 10 - 75 | 75 | How frequently the probes are sent out. Multiplied by tcp_keepalive_probes, it makes up the time to kill a connection that isn't responding, after probes started. |
| net.ipv4.tcp_tw_reuse | 0 or 1 | 0 | Allows reuse of TIME-WAIT sockets for new connections when it's safe from a protocol viewpoint. |
| net.ipv4.ip_local_port_range | First: 1024 - 60999 and Last: 32768 - 65000 | First: 32768 and Last: 60999 | The local port range that is used by TCP and UDP traffic to choose the local port. Consists of two numbers: the first number is the first local port allowed for TCP and UDP traffic on the agent node, the second is the last local port number. |
| net.ipv4.neigh.default.gc_thresh1 | 128 - 80000 | 4096 | Minimum number of entries that may be in the ARP cache. Garbage collection won't be triggered if the number of entries is below this setting. |
| net.ipv4.neigh.default.gc_thresh2 | 512 - 90000 | 8192 | Soft maximum number of entries that may be in the ARP cache. This setting is arguably the most important, as ARP garbage collection will be triggered about 5 seconds after reaching this soft maximum. |
| net.ipv4.neigh.default.gc_thresh3 | 1024 - 100000 | 16384 | Hard maximum number of entries in the ARP cache. |
| net.netfilter.nf_conntrack_max | 131072 - 1048576 | 131072 | nf_conntrack is a module that tracks connection entries for NAT within Linux. The nf_conntrack module uses a hash table to record the established connection record of the TCP protocol. nf_conntrack_max is the maximum number of nodes in the hash table, that is, the maximum number of connections supported by the nf_conntrack module or the size of the connection tracking table. |
| net.netfilter.nf_conntrack_buckets | 65536 - 147456 | 65536 | nf_conntrack is a module that tracks connection entries for NAT within Linux. The nf_conntrack module uses a hash table to record the established connection record of the TCP protocol. nf_conntrack_buckets is the size of the hash table. |
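The three keepalive settings in the table above work together. As a rough worked example, the sketch below adds the idle time before the first probe (tcp_keepalive_time) to the probing window (tcp_keepalive_probes multiplied by tcp_keepalive_intvl) to approximate how long an unresponsive idle connection survives. The function name keepalive_timeout_seconds is hypothetical, and the result is an approximation that ignores scheduling and retransmission details.

```shell
# Approximate seconds until an idle, unresponsive connection is dropped.
# $1 = tcp_keepalive_time, $2 = tcp_keepalive_probes, $3 = tcp_keepalive_intvl
# Hypothetical helper, for illustration only.
keepalive_timeout_seconds() {
  echo $(( $1 + $2 * $3 ))
}

keepalive_timeout_seconds 7200 9 75   # defaults from the table, prints 7875
keepalive_timeout_seconds 30 1 10     # minimum allowed values, prints 40
```

With the defaults, the probing window alone is 9 × 75 = 675 seconds, so lowering tcp_keepalive_time has by far the largest effect on how quickly dead connections are detected.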

Worker limits

Like file descriptor limits, the number of workers or threads that a process can create is limited by both a kernel setting and user limits. The user limit on AKS is unlimited.

| Setting | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| kernel.threads-max | 20 - 513785 | 55601 | Processes can spin up worker threads. The maximum number of all threads that can be created is set with the kernel setting kernel.threads-max. |

Virtual memory

The settings below can be used to tune the operation of the virtual memory (VM) subsystem of the Linux kernel and the writeout of dirty data to disk.

| Setting | Allowed values/interval | Default | Description |
| --- | --- | --- | --- |
| vm.max_map_count | 65530 - 262144 | 65530 | This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap, mprotect, and madvise, and also when loading shared libraries. |
| vm.vfs_cache_pressure | 1 - 100 | 100 | This percentage value controls the tendency of the kernel to reclaim the memory that is used for caching of directory and inode objects. |
| vm.swappiness | 0 - 100 | 60 | This control is used to define how aggressively the kernel will swap memory pages. Higher values increase aggressiveness, lower values decrease the amount of swap. A value of 0 instructs the kernel not to initiate swap until the amount of free and file-backed pages is less than the high water mark in a zone. |
| swapFileSizeMB | 1 MB - Size of the temporary disk (/dev/sdb) | None | SwapFileSizeMB specifies the size in MB of a swap file that will be created on the agent nodes from this node pool. |
| transparentHugePageEnabled | always, madvise, never | always | Transparent Hugepages is a Linux kernel feature intended to improve performance by making more efficient use of your processor's memory-mapping hardware. When enabled, the kernel attempts to allocate hugepages whenever possible, and any Linux process will receive 2-MB pages if the mmap region is 2 MB naturally aligned. In certain cases when hugepages are enabled system-wide, applications may end up allocating more memory resources. An application may mmap a large region but only touch 1 byte of it; in that case, a 2-MB page might be allocated instead of a 4-KB page for no good reason. This scenario is why it's possible to disable hugepages system-wide or to only have them inside MADV_HUGEPAGE madvise regions. |
| transparentHugePageDefrag | always, defer, defer+madvise, madvise, never | madvise | This value controls whether the kernel should make aggressive use of memory compaction to make more hugepages available. |
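To check which Transparent Hugepages values are active on a node, you can read the kernel's sysfs files, where the active option is shown in square brackets (for example, "always [madvise] never"). The following is a minimal sketch to run on the node itself; thp_active_value is a hypothetical helper name, and the sketch assumes the standard /sys/kernel/mm/transparent_hugepage location.

```shell
# Extract the active value, which the kernel marks with square brackets.
# Hypothetical helper, for illustration only.
thp_active_value() {
  sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print the active setting for each file that is readable on this machine.
for knob in enabled defrag; do
  file="/sys/kernel/mm/transparent_hugepage/$knob"
  if [ -r "$file" ]; then
    printf '%s: %s\n' "$knob" "$(thp_active_value < "$file")"
  fi
done
```

The printed values should match transparentHugePageEnabled and transparentHugePageDefrag in your OS configuration file, or the defaults listed in the table if you did not set them.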

Next steps