Redeploying due to host failure

Question

nimi 91

We got a resource health event for our Azure VM: "Redeployed due to host failure".
Please answer the questions below one by one. I would be grateful.

What is the cause of this issue?

What happens to our VM after redeploying?

Our VM was rebooted and the reboot took 30 minutes to complete. What is the reason for that?

What is MS doing to avoid this type of issue? It is not happening in AWS.

How can we mitigate this type of issue?

Accepted answer

Answer 1

Alan Kinane 16,951 MVP Volunteer Moderator

What is the cause of this issue?
This should be listed under Service Health (Resource Health) for the VM; it may take some time for the report to appear. https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview

What happens to our VM after redeploying?
The VM just gets moved to a healthy node (physical host). You will experience some downtime while it is moved, but that should be all. https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/redeploy-to-new-node-windows

Our VM was rebooted and the reboot took 30 minutes to complete. What is the reason for that?
The fabric controller automatically moves your VM to a healthy physical node which requires some downtime. The time it takes can depend on what you have deployed.

What is MS doing to avoid this type of issue? It is not happening in AWS.
Hardware components are always subject to failure, so in this instance Microsoft automatically moved your VM to a healthy host. I'm quite sure AWS has a very similar process, but I can't comment on how AWS manages fault tolerance. Something like this can happen to anyone; likewise, I know customers who have never experienced this after many years of usage.

How can we mitigate this type of issue?
If you can't afford to risk any downtime, you would need to deploy multiple instances of your VMs using either availability sets or availability zones, so that they are spread across separate physical hosts. https://learn.microsoft.com/en-us/azure/architecture/example-scenario/infrastructure/iaas-high-availability-disaster-recovery
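
To illustrate why spreading instances across separate physical hosts helps, here is a minimal sketch. It assumes host failures are independent of each other, and the per-instance figure of 0.999 is a made-up number for illustration, not an Azure value. The function name is hypothetical.

```python
def combined_availability(instance_availability: float, instances: int) -> float:
    """Probability that at least one instance is up, assuming independent failures."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # The service is down only if every instance is down at the same time.
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    # 0.999 is a hypothetical per-instance availability, not a published SLA value.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.999, n):.9f}")
```

In practice failures are not fully independent (a shared rack, power feed, or region can fail together), which is the reason availability sets and zones place instances in separate fault domains.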

nimi 91 Reputation points

2022-03-16T13:53:59.873+00:00

Thank you for your answers.

Could you please give me the information below?

Did this host failure happen because of something that happened at the datacenter?

Which component caused this issue: disk, CPU, network, or some other component?

Does Microsoft provide an SLA for the downtime? If yes, for how much time?
Alan Kinane 16,951 Reputation points MVP Volunteer Moderator

2022-03-16T14:21:34.103+00:00

Most likely this was due to a hardware component failure in the datacenter. Your report in service health should provide more details on the nature of your issue.

Here are the SLA details for Azure Virtual Machines: https://azure.microsoft.com/en-us/support/legal/sla/virtual-machines/v1_9/
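
As a rough aid for reading an SLA percentage, the sketch below converts an uptime percentage into the downtime it allows over a period. The percentages in the loop are example inputs only; take the real commitments and the exact way downtime is measured from the linked SLA document. The function name and the 30-day period are assumptions for illustration.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime that fit in the period without dropping below sla_percent."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    if period_days < 1:
        raise ValueError("period_days must be at least 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    # Example percentages only; check the SLA page for the values that apply to you.
    for pct in (99.9, 99.95, 99.99):
        print(f"{pct}% over 30 days -> {allowed_downtime_minutes(pct):.2f} minutes")
```

For example, a hypothetical 99.9% commitment over 30 days leaves a budget of about 43.2 minutes.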
Ramesh Badam 0 Reputation points

2023-04-23T16:44:17.7866667+00:00

What if it is a critical server in Azure? Hardware on Microsoft Azure should not have such problems; otherwise there is no difference between keeping servers on-premises and migrating them to the Azure cloud. Azure should check why we have hardware issues and what can be done to avoid such issues in future. My only ask is this: anything that can be done on the hardware side would really help customers who rely on the cloud and cannot afford multiple VM instances.

Answer 2

nimi 91

Thank you for your answers.
"The fabric controller automatically moves your VM to a healthy physical node which requires some downtime. The time it takes can depend on what you have deployed."

Could you please elaborate on this?

What exactly does the time depend on?

Also, how can we find out which hardware component caused the failure? It is not mentioned in the resource health event.

Alan Kinane 16,951 Reputation points MVP Volunteer Moderator

2022-03-16T15:23:32.517+00:00

This article is worth reading - https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-disaster-recovery#handling-single-failures

Very difficult to summarise how this all works on here.

Check the service health history for more information but it can take up to 72 hours for a root cause analysis report to appear.
https://learn.microsoft.com/en-us/azure/service-health/resource-health-overview#root-cause-information
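
If you prefer to check the health status outside the portal, Resource Health is also exposed through the Azure Resource Manager REST API. The sketch below only builds the request URLs; it does not send anything, and a real call also needs an Azure AD bearer token, which is not shown. The helper names are hypothetical, and the `api-version` value is an assumption that you should confirm against the current REST reference.

```python
from urllib.parse import quote

ARM_ENDPOINT = "https://management.azure.com"
API_VERSION = "2020-05-01"  # assumption: confirm against the REST API reference


def vm_resource_id(subscription_id: str, resource_group: str, vm_name: str) -> str:
    """Build the ARM resource ID of a virtual machine."""
    return (
        f"/subscriptions/{quote(subscription_id, safe='')}"
        f"/resourceGroups/{quote(resource_group, safe='')}"
        f"/providers/Microsoft.Compute/virtualMachines/{quote(vm_name, safe='')}"
    )


def availability_status_url(resource_id: str, history: bool = False) -> str:
    """URL for the current availability status, or the status history if requested."""
    suffix = "" if history else "/current"
    return (
        f"{ARM_ENDPOINT}{resource_id}/providers/Microsoft.ResourceHealth"
        f"/availabilityStatuses{suffix}?api-version={API_VERSION}"
    )


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    rid = vm_resource_id("00000000-0000-0000-0000-000000000000", "my-rg", "my-vm")
    print(availability_status_url(rid))
    print(availability_status_url(rid, history=True))
```

A GET on the first URL returns the current status for the VM; the second returns past transitions, which is where an event such as "Redeployed due to host failure" would be recorded.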
nimi 91 Reputation points

2022-03-16T16:35:57.083+00:00

Thank you very much for the answers.

Please give me an answer to this.

When deploying a VM, what factors determine how long the deployment takes?
Alan Kinane 16,951 Reputation points MVP Volunteer Moderator

2022-03-16T16:58:40.887+00:00

If this is in relation to the automated redeployment you experienced, have a look at the link below. I can't tell you why the process took 30 minutes in your case, but it should explain the process that occurred.

https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/understand-vm-reboot#host-server-faults
