@Matias First, apologies for the delay in responding and for any inconvenience this issue may have caused.
Below is the response I got from our internal team:
If the following does not help you determine the exact cause, you should open a support case with Microsoft.
The following are general answers to your questions.
Is it possible to find out why a node transitioned to NodeDown?
For Windows-based clusters, some information about why a node went down is available in the Events tab for that node.
• Node Deactivation events should have information. The BatchId field indicates who initiated the deactivation, depending on which of the following it contains:
o Tenant Update / Tenant Maintenance – initiated by customer or on behalf of the customer through VMSS
o Platform Update / Platform Maintenance – initiated by Azure for updating some underlying infrastructure
o POA / POS – initiated by the Patch Orchestration service deployed by the customer to install OS updates
o Client – initiated by the customer by calling one of the SF commands to deactivate the node
• No Deactivation events, but there is a Node Closed event
o SF closed the node for some reason. Currently, it is not possible to determine the cause without looking at the detailed SF traces
• No Deactivation or Node Closed events before the Node Down event – this typically means an unplanned event has happened
o Networking issues preventing the node from communicating with other nodes
o Underlying VM was terminated ungracefully
o Automatic Windows Update is enabled in the OS – this results in an ungraceful restart of the VM from an SF point of view
Use the VMSS automatic OS upgrade configuration to convert this into a graceful operation in SF. This is the preferred option.
POA is an alternative if OS upgrades should be scheduled at a specific time. Automatic OS upgrade currently does not allow specifying a schedule for applying updates.
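The decision steps above can be sketched in a few lines of Python. This is only an illustration: it assumes you have copied the events that precede a Node Down event out of the Events tab by hand, and the names classify_node_down, "kind" and "batch_id" are hypothetical, not part of any Service Fabric API. Matching the BatchId markers as case-sensitive substrings is also an assumption.

```python
# Markers the answer above lists for the BatchId field, mapped to the initiator.
# Case-sensitive substring matching is an assumption of this sketch.
BATCH_ID_MARKERS = [
    ("Tenant", "customer, or VMSS on the customer's behalf"),
    ("Platform", "Azure infrastructure update"),
    ("POA", "Patch Orchestration OS update"),
    ("POS", "Patch Orchestration OS update"),
    ("Client", "customer-issued SF deactivate command"),
]


def classify_node_down(events_before_down):
    """Suggest a likely cause from the events seen before a Node Down event.

    Each event is a dict with a hypothetical "kind" key and, for
    deactivations, a hypothetical "batch_id" key.
    """
    for event in events_before_down:
        if event.get("kind") == "NodeDeactivation":
            batch_id = event.get("batch_id", "")
            for marker, initiator in BATCH_ID_MARKERS:
                if marker in batch_id:
                    return "planned: " + initiator
            return "planned: initiator not recognized"
    if any(e.get("kind") == "NodeClosed" for e in events_before_down):
        return "closed by SF: detailed SF traces needed"
    return "unplanned: networking, ungraceful VM termination, or Windows Update"


print(classify_node_down([{"kind": "NodeDeactivation", "batch_id": "POA_1"}]))
print(classify_node_down([]))
```

The result is only a hint about where to look next. The exact cause may still require a support case.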
Will Azure Service Fabric bring down nodes for maintenance whenever they feel like?
No. Service Fabric will bring down a node for maintenance only for cluster upgrades. The Details tab at the cluster level should show the duration of the upgrade, and Events at the cluster level should show the exact time each UD was updated. Corresponding Node Close and Node Down events during that period will be due to the SF upgrade.
• An SF version upgrade is initiated automatically if the cluster settings specify Automatic Upgrade. These typically happen on weekdays during business hours, PST / PDT
• Other customer-initiated SF config upgrades can also bring down a node to apply the change
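To check whether a Node Down event lines up with an upgrade, you can compare its timestamp against the upgrade window shown at the cluster level. A minimal Python sketch, assuming you have noted the timestamps manually (the function name and the sample values are hypothetical and purely illustrative):

```python
from datetime import datetime


def down_events_in_upgrade(upgrade_start, upgrade_end, node_down_times):
    """Return the Node Down timestamps that fall inside the upgrade window."""
    return [t for t in node_down_times if upgrade_start <= t <= upgrade_end]


# Illustrative timestamps only, not taken from a real cluster.
start = datetime(2021, 3, 2, 10, 0)
end = datetime(2021, 3, 2, 11, 30)
downs = [datetime(2021, 3, 2, 10, 20), datetime(2021, 3, 3, 4, 5)]
print(down_events_in_upgrade(start, end, downs))
```

Node Down events outside the window would need one of the other explanations in this answer.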
Are nodes of the primary node type more prone to be restarted by Azure or Service Fabric?
No
Any general advice or pointers are appreciated as well!
The following are other reasons we have seen nodes go down:
• High resource usage that makes the VM slow or unresponsive can trigger a VM reboot at the OS level
• High resource usage that makes the VM slow or unresponsive can prevent SF services running on the VM from communicating with other nodes
• Automatic Windows Updates enabled in the OS
Hope it helps.
Please 'Accept as answer' if it helped, so that it can help others in the community looking for help on similar topics