Share via


SQL Server AG Authentication issue

Question

Monday, June 17, 2019 5:23 PM

SQL Server 2016 running on Windows Server 2012 participating in three replica availability group. All three nodes using gMSA for SQL Server services. gMSA has CONNECT to the instance and the endpoint. System's been working smoothly for over a year.

two out of three nodes behaving normally following windows patching and reboot.

third node can no longer connect to endpoints on other nodes, reporting :

Login failed for user 'domain\machine$'. Reason: Token-based server access validation failed with an infrastructure error. Login lacks Connect SQL permission. [CLIENT: <ip address>].

As I say, the sql services are running under a group managed service account and the AG's have behaved normally over the last year, which includes routine OS patching and restarts.

The computer object in the domain is not disabled nor is the SQL gMSA. The VCO is enabled. 

Obvious remedy is to grand the machine account connect (instance and endpoint) on each node, but that doesn't answer they why this happened and what's going on.

I've restarted the SQL Service several times, even rebooted. SQL Server service comes back up just fine, but I can't get the primary replica on this AG to come up since it can't connect to the secondaries. Nor can I create new AGs for the same reason. I have deleted this particular AG and restored with recovery the databases to get the application back on line. If I can't figure this out then I'm left with building a new cluster node.

I'm not entirely sure I understand the error, but it seems as though the machine can't access the service account. Anyone have any suggestions? 

All replies (2)

Tuesday, June 18, 2019 6:04 AM

Hi electronjockey,

 

>>Obvious remedy is to grand the machine account connect (instance and endpoint) on each node, but that doesn't answer they why this happened and what's going on.

 

Would you please show us some log message? We cannot  judge what happened is only from your description and our guess.

 

Error log on both  primary replica and secondary replica(C:\Program Files\Microsoft SQL Server\MSSQL13.<instance_name>\MSSQL\Log) and the  windows system log in secondary replica ("Event Viewer"->"windows logs"->"system")and Windows Server cluster logs for all nodes:

 You can get the cluster via the below steps:

 

(1) Connect to the Windows cluster current host server.

(2) Run the power shell as administrator.

(3) Run the commands in a PowerShell window with Administrative privileges. C:\log is the location for these logs, you can change it.

 Import-Module FailoverClusters Get-ClusterLog -Destination C:\Log

 

Best regards,

Dedmon Dai

MSDN Community Support
Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com


Tuesday, June 18, 2019 2:46 PM

So here's the relevant section:

000007a4.00000db0::2019/06/17-12:46:36.357 INFO  rcm::RcmResource::OnlineWorker[RCM] Issuing Online(Sharepoint_AG) to RHS.
000007a4.00000adc::2019/06/17-12:46:36.357 INFO  [RCM] HandleMonitorReply: ONLINERESOURCE for 'Sharepoint_AG', gen(25053) result 997/0.
000007a4.00000adc::2019/06/17-12:46:36.357 INFO  [RCM] Res Sharepoint_AG: OnlineCallIssued -> OnlinePending( StateUnknown )
000007a4.00000adc::2019/06/17-12:46:36.357 INFO  [RCM] TransitionToState(Sharepoint_AG) OnlineCallIssued-->OnlinePending.
00000ff8.00001418::2019/06/17-12:46:36.357 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] The DeadLockTimeout property has a value of 300000
00000ff8.00001418::2019/06/17-12:46:36.357 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] The PendingTimeout property has a value of 180000
00000ff8.00001418::2019/06/17-12:46:36.357 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] Connect to SQL Server ...
00000ff8.00001418::2019/06/17-12:46:36.373 ERR   [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] ODBC Error: [28000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Login failed for user 'DOMAIN\NODE02$'. (18456)
00000ff8.00001418::2019/06/17-12:46:36.373 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] Could not connect to SQL Server (rc -1)
00000ff8.00001418::2019/06/17-12:46:36.373 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] SQLDisconnect returns following information
00000ff8.00001418::2019/06/17-12:46:36.373 ERR   [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] ODBC Error: [08003] [Microsoft][ODBC Driver Manager] Connection not open (0)
00000ff8.00001418::2019/06/17-12:46:36.373 ERR   [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] Failed to connect to SQL Server
00000ff8.00001418::2019/06/17-12:46:36.373 ERR   [RHS] Online for resource Sharepoint_AG failed.
000007a4.00000104::2019/06/17-12:46:36.373 WARN  [RCM] HandleMonitorReply: ONLINERESOURCE for 'Sharepoint_AG', gen(25053) result 5018/0.
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Res Sharepoint_AG: OnlinePending -> ProcessingFailure( StateUnknown )
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] TransitionToState(Sharepoint_AG) OnlinePending-->ProcessingFailure.
000007a4.00000104::2019/06/17-12:46:36.373 ERR   [RCM] rcm::RcmResource::HandleFailure: (Sharepoint_AG)
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] resource Sharepoint_AG: failure count: 2707, restartAction: 2 persistentState: 1.
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] numDependents is zero, auto-returning true
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Resource Sharepoint_AG is causing group Sharepoint_AG to failover.
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] rcm::RcmGroup::Failover: (Sharepoint_AG)
000007a4.00000104::2019/06/17-12:46:36.373 WARN  [RCM] Not failing over group Sharepoint_AG, failoverCount 171357, failoverThresholdSetting 4294967295, lastFailover 2019/06/17-04:35:56.922
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Will retry online from long delay restart of Sharepoint_AG in 3600000 milliseconds.
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Sharepoint_AG_10.1.142.52 merits neither a notification nor a delayed restart
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Sharepoint_AG_EC9PRPBSQLHA01B merits neither a notification nor a delayed restart
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Res Sharepoint_AG: ProcessingFailure -> WaitingToTerminate( Failed )
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] TransitionToState(Sharepoint_AG) ProcessingFailure-->[WaitingToTerminate to Failed].
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] Res Sharepoint_AG: [WaitingToTerminate to Failed] -> Terminating( Failed )
000007a4.00000104::2019/06/17-12:46:36.373 INFO  [RCM] TransitionToState(Sharepoint_AG) [WaitingToTerminate to Failed]-->[Terminating to Failed].
00000ff8.00001168::2019/06/17-12:46:36.373 INFO  [RES] SQL Server Availability Group <Sharepoint_AG>: [hadrag] Online operation was not issued for the availability group. Simply returning from the Terminate call without the offline operation
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] HandleMonitorReply: TERMINATERESOURCE for 'Sharepoint_AG', gen(25054) result 0/0.
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] Res Sharepoint_AG: [Terminating to Failed] -> Failed( StateUnknown )
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] TransitionToState(Sharepoint_AG) [Terminating to Failed]-->Failed.
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] rcm::RcmGroup::UpdateStateIfChanged: (Sharepoint_AG, Pending --> Failed)
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] moved 0 tasks from staging set to task set.  TaskSetSize=0
000007a4.00000a90::2019/06/17-12:46:36.373 INFO  [RCM] rcm::RcmPriorityManager::StartGroups: [RCM] done, executed 0 tasks
000007a4.000009dc::2019/06/17-12:46:36.373 INFO  [RCM] ignored non-local state Failed for group Sharepoint_AG
000007a4.00000104::2019/06/17-12:46:36.404 INFO  [RCM] rcm::RcmApi::AddPossibleOwner: (Sharepoint_AG, 3)
000007a4.00000104::2019/06/17-12:46:36.404 INFO  [GUM] Node 3: executing request locally, gumId:669, my action: /rcm/gum/AddPossibleOwner, # of updates: 1
000007a4.00000104::2019/06/17-12:46:36.404 INFO  [RCM] rcm::RcmGum::AddPossibleOwner(Sharepoint_AG,3)
000007a4.00000104::2019/06/17-12:46:36.404 ERR   mscs::GumAgent::ExecuteHandlerLocally: (5010)' because of 'The specified node is already a possible owner.'
000007a4.00000104::2019/06/17-12:46:36.404 WARN  [DM] Aborting group transaction 24:24:51025+1
000007a4.00000104::2019/06/17-12:46:36.404 ERR   [RCM] rcm::RcmApi::AddPossibleOwner: (5010)' because of 'Gum handler completed as failed'

Another symptom is that if I run setspn -q for the service from the troubled node, I get the following error:

Ldap Error(0x31 -- Invalid Credentials): ldap_bind_sW
Failed to retrieve DN for domain "" : 0x00000031

I found Edwin's post : https://www.edwinmsarmiento.com/why-we-need-to-understand-how-active-directory-affects-sql-server-high-availability/ That I thought might take me in the right direction so I've deleted and recreated the VCO as well, but just trying to create an empty AG with "CREATE AVAILABILITY GROUP my_ag ..." fails with the machine name authentication, when run from the troubled node. I can create an AG from either of the other nodes and even join this node to the new AG, but cannot failover to it.

Right now I'm leaning toward an issue with the machine account in AD, but I've got no evidence on the DC of any problems.

I'm the most experienced with clusters in my office, but this one has me stumped. I was hoping that I might be able to correct the issue, but it's looking more and more like I'll have to R & R this node.