Azure Service Fabric Application stuck in rollback

Kaveh Hadjari 1 Reputation point
2020-06-25T07:12:17.63+00:00

We have a standalone Service Fabric installation hosted on a cluster of 9 machines in total (5 management nodes, 2 frontend nodes, and 2 backend nodes).

While we were upgrading an application using the monitored automatic upgrade mode, an error occurred that caused a rollback. The rollback is now stuck and has been for almost 24 hours. How can we recover from this state?

Our Service Fabric runtime version is 7.0.470.9590.

This is the output of Get-ServiceFabricApplicationUpgrade for the affected application:

ApplicationName               : fabric:/XXXXXXXX.ServiceFabric
ApplicationTypeName           : XXXXXXXX.ServiceFabricType
TargetApplicationTypeVersion  : 1.0.0.20200611.2
ApplicationParameters         : { "XXXXXXX_InstanceCount" = "-1";
                                "XXXXXXX_NodeType" = "(NodeType == BackEndNodeType)" }
StartTimestampUtc             : 2020-06-24 11:36:29
FailureTimestampUtc           : 2020-06-24 11:47:30
FailureReason                 : UpgradeDomainTimeout
UpgradeState                  : RollingBackInProgress
UpgradeDuration               : 00:11:00
CurrentUpgradeDomainDuration  : 00:00:00
NextUpgradeDomain             : UD6
UpgradeDomainsStatus          : { "UD0" = "Completed";
                                "UD1" = "Completed";
                                "UD2" = "Completed";
                                "UD3" = "Completed";
                                "UD4" = "Completed";
                                "UD5" = "Completed";
                                "UD6" = "Pending";
                                "UD7" = "Pending";
                                "UD8" = "Pending" }
UpgradeKind                   : Rolling
RollingUpgradeMode            : UnmonitoredAuto
ForceRestart                  : True
UpgradeReplicaSetCheckTimeout : 00:20:00

The following events are logged on the machine in UD5 (which the query above reports as completed); they keep repeating every minute or so.

Canceled pending requests for storeRelativePath:Store\XXXXXXXXXX.ServiceFabricType\XXXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81 count:1

Chunk download reply received for storeRelativePath:Store\XXXXXXXXXXX.ServiceFabricType\XXXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81 sequenceNumber:0 error:FABRIC_E_CANNOT_CONNECT retryCount:1

Redownload file chunks attempted 5 tries; failing the redownload operation. Number of chunks downloaded:0 remaining:1, storeRelativePath:Store\XXXXXXXX.ServiceFabricType\XXXXXXXXPkg.Code.1.0.0.20200624.1.checksum sessionId:16d82b1f-03f9-435f-abc6-e4d98deb5f81

End(BeginDownloadAndActivate): Error=HostingDeploymentInProgress, VersionedServiceTypeId={XXXXXXXXXXXX.ServiceFabricType_App42:XXXXXXXXXXXPkg:XXXXXXXXServiceType,1.0:1.37:131619518549526910}, ActivationContext=7fe20d1e-9054-4a5f-9f98-f0e8bff56d37, ServicePackagePublicActivationId=827f43f5-2cfb-4df0-bdaa-3b7b2c1568a4, SequenceNumber=88

Azure Service Fabric

1 answer

  1. Kaveh Hadjari 1 Reputation point
    2020-06-29T11:00:40.247+00:00

    There was a glitch in the Service Fabric Image Service: it would not handle deleting application packages or copying new ones to it, which caused the rollback and other deployments to fail. We confirmed this by checking the event log on the machine that was the primary node of the Image Service.

    After we restarted the primary node of the Image Service, the issue was resolved.
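
    For anyone who wants to do the same from PowerShell, the steps could look roughly like the sketch below. It is not the exact procedure used above. It assumes that the "Image Service" is the built-in system service fabric:/System/ImageStoreService, which does not apply if the cluster uses a file-share image store. The connection endpoint is a placeholder, and a secured cluster will need additional credential parameters on Connect-ServiceFabricCluster.

    ```powershell
    # Sketch only. Assumption: the cluster uses the built-in Image Store Service.
    # 'localhost:19000' is a placeholder endpoint; adjust it for your cluster.
    Connect-ServiceFabricCluster -ConnectionEndpoint 'localhost:19000'

    # Look up the partition of the Image Store system service.
    $partition = Get-ServiceFabricPartition -ServiceName 'fabric:/System/ImageStoreService'

    # Find the replica that currently holds the Primary role and the node hosting it.
    $primary = Get-ServiceFabricReplica -PartitionId $partition.PartitionId |
        Where-Object { $_.ReplicaRole -eq 'Primary' }
    $primary | Format-List NodeName, ReplicaRole, ReplicaStatus, HealthState

    # Restart that node; Verify waits until the restart has completed.
    Restart-ServiceFabricNode -NodeName $primary.NodeName -CommandCompletionMode Verify

    # Afterwards, check whether the rollback is making progress again.
    Get-ServiceFabricApplicationUpgrade -ApplicationName 'fabric:/XXXXXXXX.ServiceFabric'
    ```

    Restarting a node fails over every replica hosted on it, so it is worth checking the event log on that node first, as described above, to confirm that the Image Service really is the component at fault.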