Should I expect Node Down in Azure Service Fabric?

Matias 21 Reputation points
2020-11-17T13:18:30.037+00:00

During the last month I have observed a new behavior: nodes transition to state NodeDown and a minute later back to NodeUp. The virtual machine is not restarted, so I suspect that Azure restarts the local Service Fabric processes for some reason.

When this happens, two problems occur:

  1. Some services are handling continuous media streams that are terminated - not good at all.
  2. When the affected node is up again no services are moved back to it - not good in the long run since it leaves the load unevenly distributed.

Questions

  1. Is it possible to find out why a node transitioned to NodeDown?
  2. Will Azure Service Fabric bring down nodes for maintenance whenever they feel like?
  3. Are nodes of the primary node type more prone to be restarted by Azure or Service Fabric?
  4. Any general advice or pointers are appreciated as well!

Cluster configuration
Durability: Silver
Reliability: Silver
One node type (primary)
9 Virtual Machines (5 Seed nodes)
~50 applications
~300 services (stateless guest executables)

Azure Service Fabric

Accepted answer
  1. prmanhas-MSFT 17,901 Reputation points Microsoft Employee
    2020-11-22T15:30:53.303+00:00

    @Matias Firstly, apologies for the delay in responding and for any inconvenience this issue may have caused.

    Below is the response I got from our internal team:

    If the following does not help you determine the exact cause, you should open a support case with Microsoft.

    The following are general answers to the questions.

    Is it possible to find out why a node transitioned to NodeDown?

    For Windows-based clusters, the Events tab for the node usually contains some information about why the node went down.
    • If there are Node Deactivation events, check their BatchId field. What it contains indicates who initiated the deactivation:
        o Tenant Update / Tenant Maintenance – initiated by the customer, or on behalf of the customer through VMSS
        o Platform Update / Platform Maintenance – initiated by Azure to update some underlying infrastructure
        o POA / POS – initiated by the Patch Orchestration service deployed by the customer to install OS updates
        o Client – initiated by the customer by calling one of the SF commands to deactivate the node
    • If there are no Deactivation events but there is a Node Closed event:
        o SF closed the node for some reason. Currently it is not possible to determine the cause without looking at the detailed SF traces
    • If there are no Deactivation or Node Closed events before the Node Down event, this typically means an unplanned event has happened:
        o Networking issues preventing the node from communicating with other nodes
        o The underlying VM was terminated ungracefully
        o Automatic Windows Update is enabled in the OS – from SF's point of view this results in an ungraceful restart of the VM
            ▪ Use the VMSS automatic OS upgrade configuration to turn these restarts into graceful operations in SF. This is the preferred option
            ▪ POA is an alternative if OS upgrades must be scheduled at a specific time. Automatic OS upgrade currently does not allow specifying a schedule for applying updates
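
The decision order described above (deactivation with a BatchId, then Node Closed, then neither) can be sketched in a few lines of Python. This is only an illustration, not a Service Fabric API: the event dictionaries, the `kind` and `batch_id` keys, the event kind strings, and the returned labels are all assumptions, and the BatchId matching is a plain substring heuristic based on the markers listed in the answer.

```python
from typing import Mapping, Sequence

# Hypothetical event kinds; the names in a real event export may differ.
DEACTIVATED = "NodeDeactivate"
CLOSED = "NodeClosed"

# Markers from the answer, lowercased with spaces removed, mapped to
# illustrative labels. Order matters: the more specific markers come first.
BATCH_ID_HINTS = (
    ("tenantupdate", "tenant"),
    ("tenantmaintenance", "tenant"),
    ("platformupdate", "platform"),
    ("platformmaintenance", "platform"),
    ("client", "client"),
    ("poa", "patch-orchestration"),
    ("pos", "patch-orchestration"),
)


def classify_batch_id(batch_id: str) -> str:
    """Map a BatchId string to a coarse initiator label (substring heuristic)."""
    normalised = batch_id.replace(" ", "").lower()
    for needle, label in BATCH_ID_HINTS:
        if needle in normalised:
            return label
    return "deactivated-unknown"


def classify_node_down(events_before_down: Sequence[Mapping[str, str]]) -> str:
    """Classify a Node Down event from the events that preceded it."""
    deactivations = [e for e in events_before_down if e.get("kind") == DEACTIVATED]
    if deactivations:
        # Use the most recent deactivation before the node went down.
        return classify_batch_id(deactivations[-1].get("batch_id", ""))
    if any(e.get("kind") == CLOSED for e in events_before_down):
        # SF closed the node; the cause needs the detailed SF traces.
        return "closed-by-sf"
    # Neither deactivation nor close: treat as an unplanned event.
    return "unplanned"
```

A label of `unplanned` or `closed-by-sf` only narrows the search; as the answer notes, finding the exact cause in those cases may require detailed traces or a support case.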

    Will Azure Service Fabric bring down nodes for maintenance whenever they feel like?

    No. Service Fabric brings a node down for maintenance only during cluster upgrades. The Details tab at the cluster level should show the duration of the upgrade, and Events at the cluster level should show the exact time a UD (upgrade domain) was updated. Node Closed and Node Down events that fall within that period are due to the SF upgrade.
    • An SF version upgrade is initiated automatically if the cluster settings specify Automatic Upgrade. These typically happen on weekdays during business hours PST / PDT
    • Other customer-initiated SF config upgrades can also bring down a node to apply the change
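
As a rough way to apply the check above, you could note the upgrade start and end times from the cluster-level Details tab and Events, and see which Node Down timestamps fall inside that window. The sketch below assumes you have already copied those timestamps out by hand; the function name and the sample times are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List


def downs_during_upgrade(
    down_times: Iterable[datetime],
    upgrade_start: datetime,
    upgrade_end: datetime,
) -> List[datetime]:
    """Return the Node Down timestamps that fall inside the upgrade window."""
    return [t for t in down_times if upgrade_start <= t <= upgrade_end]


# Hypothetical sample values, all in the same time zone.
upgrade_start = datetime(2020, 11, 10, 9, 0)
upgrade_end = datetime(2020, 11, 10, 11, 30)
node_down_times = [
    datetime(2020, 11, 10, 9, 45),   # inside the window
    datetime(2020, 11, 12, 3, 15),   # outside the window
]

explained = downs_during_upgrade(node_down_times, upgrade_start, upgrade_end)
```

Node Down events left outside every upgrade window are the ones worth checking against the Deactivation and Node Closed events for the node.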

    Are nodes of the primary node type more prone to be restarted by Azure or Service Fabric?

    No

    Any general advice or pointers are appreciated as well!

    The following are other reasons we have seen nodes go down:
    • High resource usage that makes the VM unresponsive or slow can trigger a VM reboot at the OS level
    • High resource usage that makes the VM unresponsive or slow can prevent SF services running on the VM from communicating with other nodes
    • Automatic Windows Updates enabled in the OS

    Hope it helps.

    Please 'Accept as answer' if it helped, so that it can help others in the community looking for help on similar topics

