Failover Cluster DNS error, event 1257 keeps coming back

James Edmonds 816

Hi,

I have two failover clusters created, for which I did NOT pre-create the DNS records for the cluster or role names.
The DNS records are created when the cluster/role is brought online.

On one of the clusters, I keep getting event ID 1257, where it fails to register or update the DNS entry for the role running on the cluster.
If I delete the existing record, and restart the role, it creates successfully.

I am trying to understand why this error keeps coming back if the cluster itself creates the record.
What can I do to stop it constantly complaining about this, when both cluster nodes have access to that DNS record?

Thanks
James

James Edmonds 816 Reputation points

2022-06-15T10:16:32.077+00:00

The issue started occurring again this morning.
I am able to resolve it by taking the clustered role name resource offline and bringing it online again.

What I have noticed is that, for a clustered role on another cluster, the permissions on the DNS entry for the role name resource list the cluster name object as having permissions.
On the one that was having issues this morning, the role name object has permissions on its own entry, rather than the cluster name object.

When the role name resource is brought online, should the cluster name object, rather than the role name object, be the one given permissions on the DNS entry?

I'm banging my head against a brick wall here and starting to get a bit frustrated.

Cheers
James
Jack Dobiash 6 Reputation points

2022-06-15T17:54:43.587+00:00

Hey James, I checked my cluster objects, and at least in our version (Server 2016), the cluster itself 'owns' and updates the DNS objects for itself and for any roles that it is hosting. If you check the 'Advanced' security settings of the DNS record, it should show you who owns it as well. We haven't had any issues with the individual roles themselves (so far); it's just the main CNO object that isn't updating its DNS record.

I did notice that the pwdLastSet timestamp of my CNO is different than the individual roles (although they are within a day or so of each other), meaning they don't all get updated at exactly the same time. Did the password on that role object recently get updated? Just curious if your issue is similar to mine.

Lastly, if you check the Diagnostic Failover Cluster event logs, you might be able to see something that occurs at the time the DNS renewal is attempted. Those logs are pretty bloated, however, so it can be a bit of a pain to find stuff (try to filter out 'Information' events), and they usually roll over pretty quickly (if the log size isn't increased), so you almost need to catch it while the event is occurring.
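For anyone who wants to script that filtering, here is a minimal Python sketch. It assumes the channel has already been exported to XML, for example with `wevtutil qe Microsoft-Windows-FailoverClustering/Diagnostic /f:xml > diag.xml` (the channel name and export command are assumptions and may need adjusting for your nodes), and that the export is a run of bare `<Event>` elements. It keeps only Critical, Error and Warning events inside a time window around the failed renewal.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Namespace used by Windows event XML (<Event xmlns="...">).
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Standard Windows event levels: 1 Critical, 2 Error, 3 Warning, 4 Information, 5 Verbose.
MAX_LEVEL = 3  # keep Critical, Error and Warning; drop Information and Verbose


def parse_time(stamp: str) -> datetime:
    """Parse a SystemTime value such as '2022-06-15T10:16:32.0770000Z' as UTC."""
    date_part, _, frac = stamp.rstrip("Z").partition(".")
    base = datetime.strptime(date_part, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
    micros = int((frac + "000000")[:6]) if frac else 0  # truncate to microseconds
    return base + timedelta(microseconds=micros)


def interesting_events(xml_text: str, start: datetime, end: datetime):
    """Yield (time, event_id, level) for non-Information events in [start, end]."""
    # The export is a sequence of <Event> elements with no root, so wrap them.
    root = ET.fromstring("<Events>" + xml_text + "</Events>")
    for event in root.findall("e:Event", NS):
        system = event.find("e:System", NS)
        level = int(system.findtext("e:Level", default="0", namespaces=NS))
        when = parse_time(system.find("e:TimeCreated", NS).get("SystemTime"))
        event_id = int(system.findtext("e:EventID", default="0", namespaces=NS))
        # Level 0 ("LogAlways") is skipped along with Information and Verbose.
        if 0 < level <= MAX_LEVEL and start <= when <= end:
            yield when, event_id, level
```

Each result is a (UTC time, event ID, level) tuple, so the output can be sorted or grouped by event ID to spot what happened just before a 1257.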

Hope this helps!
James Edmonds 816 Reputation points

2022-06-16T10:59:24.99+00:00

Hi Jack,

I'm interested to know what roles you are hosting.
I think my issue is that, for both my CNOs and my SQL role DNS name resource, the owners of the DNS objects are the cluster name objects themselves.
In the case of my problematic file server role, the owner is the role itself.

See below. Top two entries are the CNOs. Bottom left is the problematic file server role, and bottom right is SQL role:

It seems as though when the file server role DNS name resource is brought online, it is not setting ownership of the DNS entry correctly?

I am not sure whether my password reset times suggest my issue is the same as yours.

Cheers
James
Jack Dobiash 6 Reputation points

2022-06-28T00:24:44.247+00:00

So aside from Hyper-V, the two 'Roles' we have on our non-Hyper-V cluster are DHCP and the 'Generic' one running Certificate Services. No issues with either of those. Our main CNO Name issue came back today on our Hyper-V cluster, right on schedule (22 days between password changes).

I just finished some more testing, and the results are 'interesting', to say the least. Instead of just stopping and starting the 'Name' role to fix it again, I deleted the DNS record right before it was due to attempt a renewal. To my surprise, it registered the new record in DNS, but even more surprising, the 'owner' of the record was no longer the CNO but one of the 'Roles' on the cluster: the Hyper-V Replica role, to be precise. Keep in mind this is the DNS record for the actual CNO, not the role, yet it used the Replica role account to register the DNS name. The Replica role has its own DNS record, which is owned by the main CNO, and it's been updating fine this whole time.

As another test, I've now 'stop/started' the 'Name' role on the cluster, and I'm going to see if, when it updates the DNS record in ~24 hours, it fails again. I suspect it will, as then it will be trying to use the proper Cluster account name again and it won't have access anymore.

It'll be interesting to see what happens when I try this on my other non-Hyper-V cluster, once it starts having the problem (which should be here in the next day or so).
James Edmonds 816 Reputation points

2022-06-28T09:04:04.817+00:00

I think this too is what we are seeing on one of our clusters.
The owner of the DNS record after deleting it manually, and having the cluster recreate it by bringing it offline and online, is one of the roles rather than the cluster itself.

If your environment then behaves like mine, it'll be fine for ages, then at some yet-to-be-determined interval it will start triggering event ID 1257, saying the record cannot be updated.

I'm still monitoring ours, but due to a number of power outages recently, our clusters have been fully stopped a number of times, so I have no idea what the real interval between event ID 1257s is at the moment.

Let me know how yours goes.

Cheers
James
Jack Dobiash 6 Reputation points

2022-06-30T02:03:21.46+00:00

So our other cluster was due for the problem as well, so I waited until it started throwing event ID 1257 again, then deleted the DNS record for that cluster's main CNO. As before, once it tried again (it retries every 15 minutes once it fails), it registered in DNS, but now the "DHCP" role owns it. It's almost like I'm having the exact opposite issue to yours: my main CNO DNS object is now 'owned' by the roles it maintains, rather than by itself. So far, no errors have ever occurred when registering the roles' DNS records, but it might be that I'm not waiting long enough after the main CNO stops being able to update itself.

If I delete the DNS record and then stop/start the 'Name' role on the cluster itself, it re-registers the DNS record under its own name properly. The next time the password is updated, ownership switches to one of the roles.

I'm still leaning towards the thought I had when I originally posted: my cluster basically 'forgets' how to log in under its own name after it changes the password on the CNO, but somehow falls back to using one of the roles' credentials. I suppose I could just give both the roles and the main CNO object write access to the DNS record. This will probably make it work, but I would still consider it a 'workaround' for a bug that was introduced within the last 6 months or so.
James Edmonds 816 Reputation points

2022-07-01T18:46:31.437+00:00

Yes, to me it feels like a bug, as the cluster CNO should take ownership of its own DNS record.
I can see no logical reason it wouldn't do that, unless there was a bug that meant it didn't try, or skipped to using the role CNO if the cluster is not able to for some reason.

Very odd.

I cannot raise with Microsoft sadly, as we have no premier support, but perhaps if enough people notice the issue and post here, they may look into it.

Cheers
James

15 answers

Marcus Abarbanel 6 Reputation points

2022-09-16T20:51:29.517+00:00

The silence from Microsoft on this is deafening. To think that an enterprise-level feature is basically broken, and has been for a while now. I just closed a ticket with Microsoft support regarding this issue and they were of no help at all. The tech did indicate that this is a known issue at Microsoft, but could not give me an ETA on a fix or patch. He also refused to share Microsoft's 'internal' documentation on this 'Known Issue'.

The real kicker is that the symptom can be cleared by offlining the cluster, running the Update-ClusterNetworkNameResource command via PowerShell (which works sporadically), running a repair on the cluster, and/or sometimes it just self-heals. Because of this, the Microsoft tech always has an out: they are technically break/fix, so if something isn't currently broken they refuse to help and tell you to submit a root cause analysis ticket to Microsoft Premier Support.

My argument is that these issues are SYMPTOMS of the real issue. If Microsoft knows the cause, they need to explain what causes it, HOW TO FIX IT, and/or provide a patch or an ETA for one.

1 person found this answer helpful.

Steve Cahill 11 Reputation points

2022-11-23T03:20:55.307+00:00

I logged a case with Microsoft last year, and the issue was acknowledged and fixed in the November 2021 update KB5007266: "Addresses an issue that prevents Failover Clustering from updating Domain Name Server (DNS) records".

The issue then reared its head again in Feb 2022, but it is intermittent for us, depending on which node in which site is rebooted in which order, etc. (normally during patching). I finally gave in and logged another call last month, and just today we closed it with the "workaround" being the only solution they can provide for now, because supposedly they have not heard of it before! I referred them to this thread and the November 2021 CU, but unfortunately both were brushed over, and the focus shifted instead to capturing more logs and recreating/repairing. :(

Anyway, the fix for us was to grant each CNO account Full Control on its respective CNO DNS record, and to do the same for any related VCO accounts. In some cases, we also did the same on the VCO records (adding Full Control for the CNO and the VCO objects themselves), though I am not entirely convinced that was necessary.
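The workaround above boils down to a list of (DNS record, account) Full Control grants. This Python sketch only builds that list from a hypothetical model of a cluster; the `Cluster` class, the example names, and the reading that the optional second step grants the CNO and each VCO on that VCO's own record are assumptions. Applying the grants would still be done in DNS Manager or with your usual AD permission tooling. Computer accounts are written in their NAME$ form.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model: a cluster's CNO name plus the VCO names of its roles."""
    cno: str
    vcos: list[str] = field(default_factory=list)


def full_control_grants(cluster: Cluster, include_vco_records: bool = False) -> list[tuple[str, str]]:
    """Return (dns_record_name, computer_account) pairs that should get Full Control."""
    cno_account = cluster.cno + "$"  # computer accounts use the NAME$ form
    vco_accounts = [vco + "$" for vco in cluster.vcos]

    # Step 1: the CNO account and every related VCO account on the CNO's own record,
    # since a role's account may end up registering the CNO record.
    grants = [(cluster.cno, account) for account in [cno_account] + vco_accounts]

    if include_vco_records:
        # Optional step 2 (assumed reading): the CNO and the VCO itself on each VCO record.
        for vco, vco_account in zip(cluster.vcos, vco_accounts):
            grants.append((vco, cno_account))
            grants.append((vco, vco_account))
    return grants


if __name__ == "__main__":
    example = Cluster("CLUSTER01", ["FS01", "SQL01"])  # hypothetical names
    for record, account in full_control_grants(example, include_vco_records=True):
        print(f"{record}: grant Full Control to {account}")
```

Keeping step 2 behind a flag mirrors the post, where it was only done "in some cases".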

As part of "archiving" the case, I asked Microsoft to update their official documentation to reflect this workaround is required to overcome Event 1257 intermittently nutting off when a node might be rebooted.

1 person found this answer helpful.
James Edmonds 816 Reputation points

2022-11-23T12:27:55.73+00:00

Damn, that's a shame! Although I cannot say I am surprised by this outcome: in my limited past experience, Microsoft support can be a bit hit and miss for less superficial issues like this.

We will look at implementing the same manual workaround for now, as we don't have the ability to log a case ourselves.

Thanks for your efforts Steve!

yawe12323 1 Reputation point

2022-11-23T15:17:41.543+00:00

Our "fix" was to change all of the DNS records to static and uncheck the DNS registration option on each server's network connection. We did not want to give Active Directory objects special permissions, and since there is no official fix from Microsoft, we are left to make our own.

Steve Cahill 11 Reputation points

2022-11-23T19:57:01.213+00:00

Forgot to add: of course the cluster service needs a restart too. I did a few manually to validate this fix, then let the monthly patching reboots sort out the rest, which appears to have worked.

Steve Cahill 11 Reputation points

2022-11-23T19:58:15.3+00:00

Take Offline/Bring Online to be specific!
AL-TechAdmin 1 Reputation point

2022-04-24T19:55:21.907+00:00

James,

I have the same problem, which I have not been able to resolve yet.

This is what I noticed. I don't think we can simply delete the CNO's DNS record, as Microsoft recommends, or re-create it manually and grant it the appropriate permissions (aka “Allow any authenticated user to update DNS record with the same owner name”).

On day 1, after I manually created it, the error didn't recur, but it hadn't actually gone away.
On day 7, DNS error event ID 1257 came back.

The action above only creates the DNS record as STATIC (as opposed to dynamic), with a date/time stamp of the last update by the virtual cluster server.

On another set of cluster servers (different DNS server, different domain), I noticed that the CNO record is dynamically created, not static.

I wonder, how do we get the CNO's DNS record created dynamically in the first place?
James Edmonds 816 Reputation points

2022-04-25T09:37:11.353+00:00

Sounds like what I am seeing.

Interestingly, I don't see the issue at the moment, so will check back in a week to see if it reappears.
If it does, I will give details of the node, cluster and role names, along with ACLs on the DNS entries.

For ref, I think that a dynamic DNS entry is created when the cluster name resource is brought online (if a record doesn't already exist).
I think Microsoft say you can create a record manually, but dynamic should also work as far as I know. When the record is created, I would expect it to apply appropriate permissions on the record so the nodes/cluster name object can successfully update it.

Cheers
James

James Edmonds 816 Reputation points

2022-05-09T10:19:36.107+00:00

I've been off for a week, so don't know when it started, but my Cluster 02 (SQL cluster), has now started generating these event IDs again.
We did do a failover just before I went on holiday, so perhaps that is what triggered it.

I can see the cluster name shows as "DNS Operation Refused":

The owner node is currently server 04:

If I check the DNS record, it shows that the clustered SQL role name has basically full control over the record, but not the individual nodes:

This doesn't seem right to me, as I assume the nodes themselves need permissions on the entry, but given this record was automatically created by the cluster itself, I assume it must be what Microsoft intend?
I will try failing the role back to server 3, but otherwise my workaround is to delete the record, then offline/online the cluster name resource to have it recreate it.

It's a minor but frustrating issue, and I'd love to get it permanently resolved.
I want the cluster to be able to automatically manage this in the event of a name or IP change.

James Edmonds 816 Reputation points

2022-05-16T15:31:57.093+00:00

Since I got the cluster to recreate the DNS entry dynamically last week, it seems OK.
I will give it another few weeks and report back, though I see no reason why it would work this time when it didn't before.

Cheers
James

James Edmonds 816 Reputation points

2022-06-13T09:46:20.257+00:00

Not sure what I did differently this time when I manually deleted the record and let the cluster recreate it by taking the cluster name offline and online, but here we are about a month later and it seemingly isn't happening anymore!

Will monitor for a while, but maybe some updates have fixed the issue.
Jack Dobiash 6 Reputation points

2022-06-03T20:50:21.227+00:00

Hey all, we are also experiencing this on two of our clusters. I'm pretty sure it's a bug that was introduced by a patch sometime near the start of the year. We have a third cluster which has not had the issue, but it hasn't been updated in a while. We are running Server 2016 clusters.

I know pretty much exactly WHAT the problem is, but not how to fix it. The issue occurs every time the CNO password gets updated, which happens roughly every 21 days (at least on our system). Once the password has been updated by the 'core' node, the cluster somehow 'forgets' the new password, at least when trying to update the DNS registration. The underlying cluster event logs even indicate that it's basically failing to log in to the DNS server when attempting to update the DNS record.

If we just move the core resources from one node to another (and even back to the first node), things start working again until the next time the CNO password is updated. The other option is to take the 'Name' resource offline and bring it back online; that also fixes it until the next password change.

If you want to see when the last time your CNO password was updated, check the Attributes of the actual Cluster Object in AD and look for 'pwdLastSet'. In our case, it's been like clockwork each time the password is updated on both clusters. Within 24 hours of the update it starts to complain again (since it updates DNS once a day).
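The raw pwdLastSet value is a Windows FILETIME: a count of 100-nanosecond intervals since 1 January 1601 UTC. A minimal Python sketch that converts it and estimates when the next 1257 would be expected, assuming the roughly 21-day rotation and once-a-day DNS refresh described above (observations from this thread, not documented defaults):

```python
from datetime import datetime, timedelta, timezone

# pwdLastSet is a Windows FILETIME: 100-nanosecond ticks since 1601-01-01 UTC.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def pwd_last_set_to_datetime(value: int) -> datetime:
    """Convert a raw pwdLastSet value to an aware UTC datetime."""
    return FILETIME_EPOCH + timedelta(microseconds=value // 10)


def expected_1257_window(pwd_last_set: int, rotation_days: float = 21,
                         dns_refresh_hours: float = 24) -> tuple[datetime, datetime]:
    """Return (next_password_change, latest_expected_first_1257).

    rotation_days and dns_refresh_hours are the intervals observed in this
    thread (assumptions), not documented cluster defaults.
    """
    last_change = pwd_last_set_to_datetime(pwd_last_set)
    next_change = last_change + timedelta(days=rotation_days)
    return next_change, next_change + timedelta(hours=dns_refresh_hours)
```

Adjust `rotation_days` to whatever interval your own pwdLastSet history shows (22 days on one of the clusters mentioned in this thread).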

I'm hoping someone else can confirm they are in the same situation. Thanks!
James Edmonds 816 Reputation points

2022-06-13T09:49:23.84+00:00

Interesting!

I've just come back from a week's holiday, and don't see any errors logged at the moment.
The last time I manually removed the record, and allowed the cluster to recreate, was about a month ago.
I see our pwdLastSet for both clusters is the 7th and 11th (this is the same for both the CNO and the role Name Objects).

I don't think I've completed any CAU updates since I last had the issue, and I am running Server 2022, but maybe some update broke it and a newer one has fixed it.

I'll continue monitoring and feedback if my issue reappears.

Jack Dobiash 6 Reputation points

2022-06-13T17:39:40.72+00:00

Thanks for the reply James! I get the feeling that your issue is probably different from ours, so hopefully yours is fixed. Ours definitely isn't, as we just had to 'offline/online' our 'Name' role again to get it working. I don't think deleting the DNS record will fix it in our case, BUT as a last-ditch effort we'll give it a try the next time it occurs. Of course, it takes 21 days each time to find out whether it worked :)

Take care!

Joe Bruns 21 Reputation points

2022-07-01T17:59:23.493+00:00

Following

Joe Bruns 21 Reputation points

2022-07-01T18:39:58.197+00:00

Ours started at 4/24/2022 8:20:51 PM, so the bug must have been introduced in the April 2022 cumulative update. I wonder if anyone has opened a case with MS yet?

James Edmonds 816 Reputation points

2022-07-01T18:47:58.187+00:00

Sadly we aren't able to do so. Hopefully others will start noticing similar and post in here, and we can get a Microsoft rep to report it internally for investigation.

I don't think it's anything we are doing if at least 3 or 4 of us so far have seen this behaviour!

Cheers
James

Jack Dobiash 6 Reputation points

2022-07-01T18:54:42.563+00:00

I suspect, at least for Windows 2016, that it was in the March 2022 updates - That is when it started happening for us. James has been having his issue even longer, but I think his is possibly a separate, though similar, issue from mine. My issue is very specifically related to the password changing on the CNO object causing it. I literally know exactly when it will occur each time, as the CNO password changes every 22 days on our system.

James Edmonds 816 Reputation points

2022-07-01T19:00:42.047+00:00

I haven't yet been able to confirm if mine is occurring at password expiry time, but will advise if the next time it happens, the timelines coincide with the CNO password expiry date/time.
As I'm a one man band, it's difficult for me to dedicate enough time to look for it regularly, so sometimes I might only spot it days after it starts happening.
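One way to check the timelines is to record pwdLastSet over time (AD only keeps the most recent value) and compare each change with the first 1257 that follows it. A hypothetical Python sketch, assuming you already have both lists of timestamps; the 24-hour window reflects Jack's observation that the cluster refreshes DNS once a day:

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def first_event_delays(pwd_changes, event_times):
    """For each password change, return (change, delay to first event at or after it).

    Only events before the next password change count; the delay is None if there
    was no event in that interval. All datetimes must be either naive or aware.
    """
    changes = sorted(pwd_changes)
    events = sorted(event_times)
    delays = []
    for i, change in enumerate(changes):
        next_change = changes[i + 1] if i + 1 < len(changes) else None
        j = bisect_left(events, change)  # index of first event >= change
        if j < len(events) and (next_change is None or events[j] < next_change):
            delays.append((change, events[j] - change))
        else:
            delays.append((change, None))
    return delays


def matches_password_theory(delays, window=timedelta(hours=24)):
    """True if every password change was followed by an event within `window`."""
    return all(delay is not None and delay <= window for _, delay in delays)
```

Because a failing cluster keeps logging 1257 every 15 minutes, only the first event after each change is compared; later repeats are ignored.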

Note, I'm using Server 2022, so it might not be OS-specific.

Cheers and have a good weekend.
James

Joe Bruns 21 Reputation points

2022-07-01T19:24:31.907+00:00

We patch monthly and it (1257's) didn't show up until 4/24/22, so it had to be April maintenance. We already had March installed and nothing.

Jack Dobiash 6 Reputation points

2022-07-01T19:51:26.337+00:00

We never applied the April updates due to some other reported issues, so for us at least, it was the March updates :) But really I guess it doesn't matter - We can agree that something happened around that timeframe to introduce this issue. Is yours Server 2016 as well?

Joe Bruns 21 Reputation points

2022-07-01T20:38:46.03+00:00

Yep. I will have to get around to opening a case. I will reply back with the number if you are interested.

Jack Dobiash 6 Reputation points

2022-07-01T20:50:35.517+00:00

Absolutely!

James Edmonds 816 Reputation points

2022-07-01T21:09:23.437+00:00

Yes please.

Joe Bruns 21 Reputation points

2022-07-01T22:31:01.783+00:00

2207010030001944

You won't be able to see the contents but if and when a hotfix is issued, I will reply with that.

Plotnikov Sergey 1 Reputation point

2022-07-16T06:17:04.863+00:00

Following too

Joe Bruns 21 Reputation points

2022-07-18T14:47:11.93+00:00

MS is still researching where the bug was introduced. I think I have convinced them it is on their side, since we do not create cluster DNS records manually.

They said they have to go back to November of last year, since 2016 maintenance is a little different from more modern OSes.

Once I have an answer, I will post back.

James Edmonds 816 Reputation points

2022-07-18T15:51:58.093+00:00

Thanks for your efforts!
Hutton, Gregory 1 Reputation point

2022-07-18T14:37:03.593+00:00

Following as well