How It Works: SQL Server 2012 Database Engine Task Scheduling
Over the years the SQL Server scheduling algorithms have been documented in various publications. Specifically, ‘The Guru’s Guide to SQL Server Architecture and Internals’ has a chapter, written by the scheduler developer (Sameer) and Ken; and I reviewed the technical content, covering the details.
This post outlines a few of the changes that were made in SQL Server 2012. The post is not intended to cover all the nuances(there are far too many), instead I will be highlighting a portion of the new algorithm so you understand the SQL Server behavior. I also take several liberties with the description of the algorithm so this post does not turn into a whitepaper.
Algorithm Recap
Scheduling assignment starts at the NUMA node level. The basic algorithm is a round robin assignment for new connections. As each NEW connection arrives it is assigned to a scheduler based on round robin, NUMA node connection assignment.
Note: SQL Server Books Online outlines how to associate listeners with specific NUMA nodes.
A new connection is assigned to the scheduler with the smallest load factor within the same NUMA node. Load factor is loosely equivalent to the number of tasks assigned to the scheduler. You can view the scheduler information in the DMV (sys.dm_os_schedulers.) The scheduler choice becomes the preferred (or hint) scheduler for the life of the connection.
The task assignment boundary is when the client submits a new command (Batch, RPC, etc…) The associated task in SQL Server is assigned to a scheduler and the task remains associated with the scheduler for the lifetime of the command (until the batch completes.)
The task assignment is also based on load factor. If the preferred scheduler (after connection) has 20% more load than the other schedulers, for the same NUMA node, the task is assigned to the scheduler with the least load on the same NUMA node. This is often referred to at the 120% rule. Once your preferred scheduler achieves 120% more load than the other schedulers, on the same NUMA node, new tasks are assigned to other schedulers within the same NUMA node.
SQL Server 2012 Changes – EE SKU Only
Out of the gate, SQL Server 2012 does not make any changes. What - Why did I write the blog? Am I just wasting your time? No, I say this because all SKUs other then Enterprise Edition use the same, fundamental logic that was introduced in SQL Server 7.0 (UMS scheduling) and as described above. The EE SKU has been updated to further accommodate CPU resource governance. If, and only if, you are running the SQL Server Enterprise Edition SKU do these new changes apply.
New Connection Assignments: Assigned round-robin to the nodes. 1st to Node 1, 2nd to Node 2, 3rd to Node 1 and so forth. The core algorithms for connection assignment remains the same in SQL Server 2012, all SKUs. As you can see from the diagram Port 1433 is bound to both NUMA nodes so the round-robin will occur between nodes. The load of the schedulers within the target node is queried and the new connection is assigned to the scheduler with the least load. This scheduler becomes the preferred scheduler for the life of the connection.
A new connection does not have a preferred scheduler. This means the scheduler assignment queries the schedulers, within the target node, and the scheduler with the least load is used. In the example Session 1 is assigned to Node 1, Scheduler 2. This algorithm is unchanged across all SQL Server 2012 SKUs. The connection has not been assigned to a pool yet so the initial logic to assign the connection task has to work from a basic load factor of the schedulers.
Least Load – After Connection Made and Preferred Scheduler Assigned
Prior to SQL 2012 or SQL Server 2012 NON-EE SKUs
The new task request (Batch, RPC, Disconnect, …) uses the preferred scheduler assignment. In our example the preferred scheduler is Scheduler #2. If the current scheduler hint (preferred scheduler) has exceeded a load factor of 120% of load on other schedulers within the same node ,the task is assigned to a different scheduler. The preferred scheduler remains the same for the connection.
In this example the new batch arrives. The preferred scheduler is Scheduler #2 but the load factor has increased to 13 with is > 120% of 10. So the batch would be assigned to Scheduler #1 because it is deemed to have less load and is likely to provide more CPU resources for the task.
SQL Server 2012 EE SKU
Starting with SQL Server 2012 EE SKU the behavior changes for a new task (Batch, RPC, Disconnect, …) only after the connection has been established and pool assignment made. (No connection no preferred scheduler yet! )
Note: Even if the RG is disabled the DEFAULT pool is used internally.
The process of assigning a new task to a specific scheduler is where the SQL Server 2012 EE SKU logic was upgraded from the basic, load factor algorithm. The idea is to find a scheduler that can provide the best scheduler, CPU capacity for the pool the connection is assigned to, within the same NUMA node while minimizing the overall, scheduler assignment randomization.
Each scheduler has per resource pool CPU targeting and the associated tracking capabilities along with the traditional load factor tracking. Instead of using the traditional, task count based load factor a CPU resource targeting load factor is used for the task assignments. In fact, SQL Server tracks the average CPU targeting per resource pool, per node.
The scheduler assignment starts with the preferred scheduler (hinted scheduler for the connection.) As long as adding the new task to the preferred scheduler does not drop the targeted resource pool consumption capabilities below 80% of the average for the pool across all schedulers, on the same NUMA node, the preferred scheduler is used as the target.
Once adding another task to the preferred scheduler, within the same resource pool causes the targeted resources to fall below 80% of the average the algorithm will then select the scheduler with the most targeted resources available for the pool.
Let’s do a mathematical example to show this in action, at a high level.
Note: The math included here is to show the basics and is not the exact algorithm.
In this example, tasks with the most resources available for the pool would currently be assigned to Scheduler #2. On average scheduler #2 can currently provide 6.25 (resources) per task vs the 5 (resources) on Scheduler #1 for the same pool. In very simple terms the 10 tasks on Scheduler #1 each get 5% of this CPU and the 8 tasks on Scheduler #2 get 6.25% of the CPU resources.
Scheduler | RG Pool Target | Pool Runnable Tasks | Avg Pool/Task | |
1 | 50 | 10 | 5 | |
2 | 50 | 8 | 6.25 | Currently Best Target – More resources to provide for tasks in the same pool |
Current Avg: 5.625 = (6.25 + 5 / 2)
Now, assume I have a new task to assign and the preferred scheduler is Scheduler #1. SQL Server 2012 EE SKU uses the current average and increased resource usage to help determine the assignment.
80th percentile average = 4.5008 (Average resource units a task within the same pool achieves across all schedulers of the same node.)
Scheduler | RG Pool Target | Pool Runnable Tasks +1 Additional | Avg Pool/Task +1 | |
1 | 50 | 11 | 4.5454 | Not below 80th percentile |
2 | 50 | 9 | 5.55 |
Adding another task, to the same pool on Scheduler #1 does not drop the average below the 80th percentile for the entire NUMA node. The assignment of this request will still occur on Scheduler #1.
Current Avg Adjusted: 5.3977 (6.25 + 4.5454/2) and 80th percentile = 4.3181
Now lets attempt to add a task to Scheduler #1 again. This time assigning the task to Scheduler #1 would drop below the 80th percentile average. SQL Server will instead find the scheduler with the most resources for the pool and assign the task. In our example this is Scheduler #2.
Scheduler | RG Pool Target | Pool Runnable Tasks +1 Additional | Avg Pool/Task +1 | |
1 | 50 | 12 | 4.1666 | Below 80th percentile |
2 | 50 | 9 | 5.55 |
The adjusted load would then become the following.
Scheduler | RG Pool Target | Pool Runnable Tasks | Avg Pool/Task | |
1 | 50 | 11 | 4.5454 | |
2 | 50 | 9 | 5.55 | Added task |
Current Avg Adjusted: 5.047 (5.55 + 4.5454 / 2)
Overall the assignments of 2 new tasks within the same resource pool and NUMA node did not cause the targeted resources for the pool to vary significantly and maintained the goal of scheduler assignment locality and reduced movement.
TRACE FLAGS USE WITH CAUTION
I considered not documenting the trace flags as I have never seen us use them for a production system. There are some trace flags to control the behavior. As always, these are not intended for extended use and should only be used under the guidance of Microsoft SQL Server support. Because these change the task assignment algorithm they can impact the concurrency and performance of your system.
-T8008 - Force the scheduler hint to be ignored. Always assign to the scheduler with the least load (pool based on SQL 2012 EE SKU or Load Factor for previous versions and SKUs.)
-T8016 - Force load balancing to be ignored. Always assign to the preferred scheduler.
Bob Dorr - Principal SQL Server Escalation Engineer