Very disappointed with SQL Server/Microsoft

Carlos D 51 Reputation points
2022-12-13T19:19:09.22+00:00

Facts:

  • 2 Node SQL Server 2019 - Windows 2019 Always On cluster.
  • 96 GB RAM - 8 CPU Cores
  • 185 Databases in 1 Always On group with a Listener

Action: Scheduled (planned) failover from the primary to the replica node during business hours.

  • Around 2,200 connections at the start of the failover.
  • Most of the 185 databases use full-text indexing.
  • Half or more of the databases have memory-optimized tables.
  • All databases were in "Synchronized" state, and the Always On group reported "Healthy".
  • Initiated the failover from the Always On properties/options (a T-SQL equivalent is sketched below).
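
For reference, a planned manual failover can also be issued in T-SQL; a minimal sketch, assuming a hypothetical group name [MyAG] (the real name was not posted):

    -- Run on the target secondary replica that should become primary.
    -- A planned (no-data-loss) failover requires synchronous commit and
    -- a SYNCHRONIZED secondary; otherwise the statement is refused.
    ALTER AVAILABILITY GROUP [MyAG] FAILOVER;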

Result of the failover:

  • The failover GUI indicated a successful failover.
  • Only a very few databases were synchronized.
  • The replica went into "Resolving" mode.
  • Both servers LOCKED UP with CPU at 100%, not allowing anything to be clicked or opened.
  • Once logged out of either node, you could no longer log back in. Completely frozen, and the clock was ticking.

Actions to resolve:

  • Rebooted the servers one at a time - Did not help.
  • Increased the CPU cores from 8 to 32 - Did not help.
  • Failed the Always On group back to the original node - Did not help.
  • Stopped the SQL Server services on the replica node - Did not help.
  • Just sat and waited until the servers (at least one of them) became responsive again.
  • After 6-7 hours, 140 of the 185 databases were synchronized; 45 were still pending or in some other mode.
  • Suspended data movement for those 45 databases (see the T-SQL sketch after this list).
  • Removed the 45 databases from the Always On group.
  • Suspended data movement on the remaining 140 databases.
  • Removed the remaining 140 databases from the Always On group.
  • Restored to a point in time the 6-7 databases that were in "Suspect" or "Restoring" mode.
  • Now the server runs like a standalone, with no connection to Always On.
  • And yes, went through all the error logs, both SQL Server and cluster.
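
For reference, a minimal T-SQL sketch of the suspend/remove/restore steps above, assuming hypothetical names [MyAG] and [SomeDb] and hypothetical backup paths (none of these were posted):

    -- Suspend data movement for one availability database (run per database).
    ALTER DATABASE [SomeDb] SET HADR SUSPEND;

    -- Remove the database from the availability group (run on the primary).
    ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [SomeDb];

    -- Point-in-time restore for a database stuck in "Suspect"/"Restoring",
    -- assuming a full backup plus log backups covering the target time.
    RESTORE DATABASE [SomeDb] FROM DISK = N'X:\Backup\SomeDb_full.bak'
        WITH NORECOVERY, REPLACE;
    RESTORE LOG [SomeDb] FROM DISK = N'X:\Backup\SomeDb_log.trn'
        WITH STOPAT = N'2022-12-13 10:00:00', RECOVERY;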

The customer lost 6-7 business hours; who knows how much that cost them, on top of the hit to their reputation.

The servers were running fine with 8 CPU cores before the failover and had zero issues. The only cause I can think of is that the servers ran out of worker threads, and synchronization had to wait for worker threads to be released back to the thread pool before the remaining unsynchronized databases could finish. I am not sure why a failover should need so many worker threads, or extra CPU power, to complete in an acceptable time (NOT 6 hours).
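
If worker thread exhaustion is the suspect, it can be checked live; a diagnostic sketch over the standard DMVs (not part of the original troubleshooting):

    -- Configured worker thread ceiling (the default depends on CPU count).
    SELECT max_workers_count FROM sys.dm_os_sys_info;

    -- Per-scheduler worker usage; a growing work_queue_count means tasks
    -- are queued waiting for a free worker.
    SELECT scheduler_id, current_tasks_count, current_workers_count,
           active_workers_count, work_queue_count
    FROM sys.dm_os_schedulers
    WHERE status = N'VISIBLE ONLINE';

    -- Accumulated waits for a worker thread since instance startup.
    SELECT wait_type, waiting_tasks_count, wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type = N'THREADPOOL';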

Went through this article, and NONE of the scenarios applies:

https://learn.microsoft.com/en-us/troubleshoot/sql/availability-groups/troubleshooting-automatic-failover-problems

Conclusion: no more recommending SQL Server to customers. Pick any other DBMS. Never use Always On on VMs, or don't use Always On at all and use traditional failover clustering instead. Time for Oracle/MySQL training.

Lost the customer.


5 answers

  1. Bruce (SqlWork.com) 69,976 Reputation points
    2022-12-13T20:02:56.317+00:00

    Your memory-optimized tables require fast SSDs. SQL Server is a scale-up technology, so it generally performs faster on comparable hardware; its clusters are really about high availability. You should configure for faster failover (at a cost of some performance):

    https://learn.microsoft.com/en-us/sql/relational-databases/logs/database-checkpoints-sql-server?view=sql-server-ver16#IndirectChkpt
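
    The link is about indirect checkpoints; the corresponding setting is a one-liner, sketched here with a hypothetical database name:

        -- Enable indirect checkpoints with a 60-second recovery target
        -- (the default for databases created on SQL Server 2016 and later).
        ALTER DATABASE [SomeDb] SET TARGET_RECOVERY_TIME = 60 SECONDS;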

    You can switch to Oracle, which uses true cluster technology and thus fails over faster. Oracle uses a distributed cache for in-memory columns, so when a server drops out of the cluster, that cache is lost; you might want to mirror the caches. Of course, it is a more expensive solution.


  2. Bjoern Peters 8,896 Reputation points
    2022-12-14T01:08:07.83+00:00

    Hi @Carlos D

    Sorry that you had such a bad experience with SQL Server at your customer. As Erland asked: "Was this procedure ever tested before, in a TEST environment?"
    I also have several questions regarding that solution and/or process, trying to understand your problem better so I can tell you what went wrong and what you (or your customer) could do better next time:

    If this was a scheduled (aka planned) failover, why not pause the data movement first to reduce the load?

    And you didn't tell us about the applications using these databases. Had they been shut down, or how did they behave during the failover?
    My gut feeling says that 96 GB RAM for ~100 databases with memory-optimized tables is somewhat too low => wrong sizing.

    You wrote something about "Virtual machines"...
    which kind of virtualization?
    How is the storage connected?
    How are the VMs configured?
    How does the Hypervisor behave during the failover?
    Are there any numbers or alerts during the failover?

    Could you possibly share the error log from right after the failover?
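
    If it helps, a sketch for pulling those entries with sp_readerrorlog (log 0 = current, type 1 = SQL Server; the search string is just an example):

        -- Search the current SQL Server error log for Always On entries.
        EXEC sp_readerrorlog 0, 1, N'availability';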


  3. YufeiShao-msft 7,131 Reputation points
    2022-12-14T09:30:41.937+00:00

    Hi @Carlos D ,

    Some recommendations:

    When failover does not work and the secondary replica is stuck in the Resolving state, check your error log and cluster log; you may also find related events in Failover Cluster Manager. You can try changing the cluster failover threshold, increasing the value as your situation requires (a query for inspecting replica state follows below).

    When you do a failover, you can choose to test it in a sandbox environment first.
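
    To see which replica is stuck in Resolving, a sketch over the standard AG DMVs:

        -- Role and health of each replica; a troubled replica shows
        -- RESOLVING in role_desc until the cluster settles ownership.
        SELECT ag.name AS ag_name, ars.role_desc,
               ars.operational_state_desc, ars.synchronization_health_desc
        FROM sys.dm_hadr_availability_replica_states AS ars
        JOIN sys.availability_groups AS ag ON ag.group_id = ars.group_id;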



  4. Carlos D 51 Reputation points
    2022-12-14T13:47:10.427+00:00

    All of that has been looked at. What I missed in the post:

    • Yes, this was tested before, and it all worked at the beginning, when we migrated the customer 2-3 years ago, also with around 200 databases. The customer doesn't have the luxury of a test environment, and there is rarely time for maintenance testing, as their customers run call centers.
    • The database sizes are not too large: the largest is around 65 GB, the second 41 GB, the third 17 GB, and the rest are less than 10 GB each.
    • Yes, there are cluster errors, but they were caused by the servers hitting 100% CPU and becoming unresponsive, hence being thrown out of the cluster. The cluster is set up as 2 nodes with a file share witness quorum, which all works under normal circumstances.
    • Tried all related AG and cluster settings, including the failover threshold (as this happened once before), but no difference. (A query for checking per-database synchronization state is sketched after this list.)
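
    For databases hanging in "pending or some other mode", a per-database sketch over the standard DMV shows where synchronization is stuck:

        -- Synchronization state and queue sizes per local availability database;
        -- a large redo_queue_size marks a database still catching up.
        SELECT DB_NAME(drs.database_id) AS db_name,
               drs.synchronization_state_desc,
               drs.log_send_queue_size, drs.redo_queue_size
        FROM sys.dm_hadr_database_replica_states AS drs
        WHERE drs.is_local = 1;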

    My conclusion is that SQL Server is not suitable for more than 50-60 databases per AG/instance.


  5. Erland Sommarskog 115.9K Reputation points MVP
    2022-12-14T22:10:42.917+00:00

    Yes, this was tested before, and it all worked at the beginning, when we migrated the customer 2-3 years ago, also with around 200 databases. The customer doesn't have the luxury of a test environment, and there is rarely time for maintenance testing, as their customers run call centers.

    I can understand that there are budget constraints, but if you have a critical business where downtime is not permissible, having a test environment for testing advanced operations is essential. In the end it is a matter of "how much does that extra environment cost" vs. "how much does a disaster cost us".

    Even if you had tested this and felt confident about it, I'm still a bit surprised that you did this during business hours... Then again, if there is something like a hardware failure, you want a failover to happen anyway, and in a lot less than 6-7 hours.

    My conclusion is that SQL Server is not suitable for more than 50-60 databases per AG/instance.

    As I indicated, 185 databases in an AG is quite excessive. I googled around and found this document, which says:

    "The actual number of databases and availability groups you can put on a computer (VM or physical) depends on the hardware and workload, but there is no enforced limit. Microsoft has tested up to 10 AGs and 100 DBs per physical machine; however, this is not a binding limit."

    If I recall correctly (I'm more of a developer than an HA expert), each database replica needs worker threads, so only 8 cores sounds a bit underprovisioned to me. The size of the databases does not really matter. A rough calculation follows.
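
    As a back-of-the-envelope check (my numbers, not from the post): on 64-bit SQL Server with more than 4 and up to 64 logical CPUs, the default max worker threads is 512 + (CPUs - 4) * 16, and the documented guidance is that availability groups use at most "max worker threads" minus 40:

        max worker threads (default, 8 CPUs) = 512 + (8 - 4) * 16 = 576
        approximate AG thread budget         = 576 - 40           = 536

    With 185 databases all doing failover and redo work at once, each needing several threads, that budget is plausibly exhausted, which would match the observed stall.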

