Microsoft.Hpc.Activation.NodeManagerException - Master and compute nodes are disjoining from the domain automatically

zainul abiddin 1 Reputation point
2021-02-16T11:23:33.593+00:00

Hi Team,

Environment: Microsoft HPC Pack 2019, one head node and one compute node

I am trying to submit an MPI job to the master (head) node and the compute node. The first submission succeeds and produces output.
However, on the next run I get the error below:

Error from node: HEADNODE:System.ServiceModel.FaultException`1[Microsoft.Hpc.ExceptionWrapper]: The security database on the server does not have a computer account for this workstation trust relationshipException of type 'Microsoft.Hpc.Activation.NodeManagerException' was thrown. (Fault Detail is equal to Microsoft.Hpc.ExceptionWrapper).

I then checked the event logs:
An unexpected exception occurred. For more information about this exception, see the Details tab.
Additional data:
We can't sign you in with this credential because your domain isn't available. Make sure your device is connected to your organization's network and try again. If you previously signed in on this device with another credential, you can sign in with that credential.

Exception detail: System.Security.SecurityException: We can't sign you in with this credential because your domain isn't available. Make sure your device is connected to your organization's network and try again. If you previously signed in on this device with another credential, you can sign in with that credential.

at System.Security.Principal.WindowsIdentity.KerbS4ULogon(String upn, SafeAccessTokenHandle& safeTokenHandle)
at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName, String type)
at System.Security.Principal.WindowsIdentity..ctor(String sUserPrincipalName)
at Microsoft.Hpc.Diagnostics.Controller.Utilities.ImpersonateWhenDomainJoinedT
at Microsoft.Hpc.Diagnostics.Controller.Utilities.CreateJob(ISchedulerStore store, String requestedBy, StoreProperty[] jobProps)
at Microsoft.Hpc.Diagnostics.Controller.PreStepFinishedHandler.ScheduleRunWithTaskResult(DiagnosticTestRun testRun, DiagnosticTest test, StepResult result)
at Microsoft.Hpc.Diagnostics.Controller.PreStepFinishedHandler.ExecuteInternal(DiagnosticTestRun testRun)
at Microsoft.Hpc.Diagnostics.Controller.StateHandlerBase.Execute()

After this, both the master and compute nodes are removed from the domain automatically. We rejoined both servers to the domain and tested the MPI job again, but the same problem came back: both servers were again removed from the domain automatically.

Please help us to resolve this issue.

Regards,
Zain

Azure Virtual Machines

2 answers

  1. zainul abiddin 1 Reputation point
    2021-02-19T06:43:02.29+00:00

    Hi Team,

    This setup is an on-premises two-node Windows HPC cluster (one master node and one compute node).

    Environment:
    Master node: Windows Server 2019 as the base OS, with HPC Pack 2019 installed on top.
    Compute node: Windows Server 2019 as the base OS, with HPC Pack 2019 installed on top and added as a compute node.
    Both servers show Status: Online and Health: OK.

    HPC Pack 2019 server version: 6.0.7205.0; client version: 6.0.7205.0

    When we submit a job to both nodes using HPC Job Manager, we get the error below:

    "System.ServiceModel.FaultException`1[Microsoft.Hpc.ExceptionWrapper]: The security database on the server does not have a computer account for this workstation trust relationshipException of type 'Microsoft.Hpc.Activation.NodeManagerException' was thrown. (Fault Detail is equal to Microsoft.Hpc.ExceptionWrapper)."

    At the same time, when we log in to the servers again, the login screen on both the master and compute nodes shows the message below:

    "The security database on the server does not have a computer account for this workstation trust relationship"

    Please help us to resolve this issue.

    Regards,
    Zain


  2. Yutong Sun 261 Reputation points Microsoft Employee
    2021-05-28T05:52:29.28+00:00

    Hi Zain,

    It looks like the machines have lost their domain trust relationship. Please check the AD configuration and the date and time settings on both machines. Also try resetting the machine's secure channel to the domain using 'netdom reset', or rejoin the machines to the domain.

    Regards,
    Yutong Sun
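
The checks suggested in the answer above can be sketched in a few lines of Python. This is only a rough illustration, not an official HPC Pack procedure. It assumes the default Kerberos clock-skew tolerance of 5 minutes, and the domain name `corp.example.com` and node name `COMPUTENODE01` are hypothetical placeholders (only `HEADNODE` appears in the error above). The script does not run anything; it only builds the command lines for `w32tm`, `nltest` and `netdom` so they can be reviewed and then run from an elevated prompt on each node.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the domain uses the default Kerberos tolerance of 5 minutes.
# A larger difference between node and domain controller clocks makes
# Kerberos authentication fail.
DEFAULT_MAX_SKEW = timedelta(minutes=5)


def clock_skew_ok(node_time, dc_time, max_skew=DEFAULT_MAX_SKEW):
    """Return True if the node clock is within max_skew of the DC clock."""
    return abs(node_time - dc_time) <= max_skew


def trust_check_commands(domain):
    """Read-only checks: time service status and secure channel verification."""
    return [
        ["w32tm", "/query", "/status"],
        ["nltest", f"/sc_verify:{domain}"],
    ]


def netdom_reset_command(machine, domain):
    """Build the 'netdom reset' call mentioned in the answer."""
    return ["netdom", "reset", machine, f"/domain:{domain}"]


if __name__ == "__main__":
    domain = "corp.example.com"  # hypothetical domain name
    nodes = ["HEADNODE", "COMPUTENODE01"]  # second name is hypothetical

    # Example timestamps only; in practice compare real node and DC times.
    node_time = datetime(2021, 2, 16, 11, 23, tzinfo=timezone.utc)
    dc_time = datetime(2021, 2, 16, 11, 31, tzinfo=timezone.utc)
    if not clock_skew_ok(node_time, dc_time):
        print("Clock skew exceeds the tolerance; fix time sync first.")

    # Print the commands instead of executing them.
    for node in nodes:
        for cmd in trust_check_commands(domain):
            print(f"[{node}]", " ".join(cmd))
        print(f"[{node}]", " ".join(netdom_reset_command(node, domain)))
```

If the secure channel check keeps failing after a reset, rejoining the machine to the domain, as the answer also suggests, is the remaining option.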

