SQL 2016 - It Just Runs Faster: Dynamic Memory Object (CMemThread) Partitioning

CMemThread waits (PWAIT_MEMTHREAD) can become a point of contention as machine sizes grow. The CMemThread object type is used by hundreds of objects throughout the SQL Server code base and can be partitioned globally, by node, or by CPU.

The vast majority of CMemThread objects use global partitioning. Trace flag -T8048 only converts node-based partitioning to CPU-based partitioning, so it does nothing for a globally partitioned object. As a result, each time a highly contended, globally partitioned object was encountered, a hotfix was required to partition that object.

SQL Server 2005, 2012 and 2014 contain dynamic latch promotion and demotion logic (Super/Sub-Latch). The concept is to watch the number of acquires on a latch and compare sampled acquire times against how long it should take to acquire the latch without contention. When a shared latch can be promoted (partitioned) to improve performance and scalability, SQL Server does so.

Taking a page from the latch design, SQL Server 2016 dynamically promotes a contended CMemThread object. It detects contention on a specific CMemThread object and promotes that object to a per-node or per-CPU implementation. Once promoted, the object remains promoted until the SQL Server instance is restarted.
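The promotion idea can be illustrated with a small, standalone C++ sketch. This is not SQL Server's implementation: the class name PromotableMemObject, the fixed contention threshold, and the use of a failed try_lock as the contention signal are all assumptions made for illustration. The sketch only shows the shape of the technique: start with one shared partition, count contention, and move one way from global to node to CPU partitioning.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical partition levels: one shared partition, one per node, one per CPU.
enum class PartitionLevel : int { Global = 1, Node = 2, Cpu = 3 };

// Illustrative only; names and thresholds are assumptions, not SQL Server internals.
class PromotableMemObject {
public:
    PromotableMemObject(std::size_t nodes, std::size_t cpusPerNode, unsigned threshold)
        : cpusPerNode_(cpusPerNode), threshold_(threshold),
          parts_(nodes * cpusPerNode) {
        assert(nodes > 0 && cpusPerNode > 0 && threshold > 0);
    }

    PartitionLevel Level() const { return static_cast<PartitionLevel>(level_.load()); }

    // Partition that a worker running on `cpu` uses at the current level.
    std::size_t IndexFor(std::size_t cpu) const {
        cpu %= parts_.size();
        switch (Level()) {
            case PartitionLevel::Global: return 0;
            case PartitionLevel::Node:   return cpu / cpusPerNode_;
            default:                     return cpu;
        }
    }

    // Account for `bytes` allocated by a worker running on `cpu`.
    void Allocate(std::size_t cpu, std::size_t bytes) {
        Partition& p = parts_[IndexFor(cpu)];
        if (!p.lock.try_lock()) {   // lock already held: count it as contention
            NoteContention();
            p.lock.lock();
        }
        p.bytes += bytes;
        p.lock.unlock();
    }

    // Public so the promotion path can be exercised without real thread collisions.
    void NoteContention() {
        if (contention_.fetch_add(1) + 1 < threshold_) return;
        int cur = level_.load();
        if (cur < static_cast<int>(PartitionLevel::Cpu) &&
            level_.compare_exchange_strong(cur, cur + 1)) {
            contention_.store(0);   // start counting afresh at the new level
        }
    }

    // Sum over all partitions; each partition is read under its own lock.
    std::size_t TotalBytes() {
        std::size_t total = 0;
        for (Partition& p : parts_) {
            std::lock_guard<std::mutex> guard(p.lock);
            total += p.bytes;
        }
        return total;
    }

private:
    struct Partition { std::mutex lock; std::size_t bytes = 0; };

    std::size_t cpusPerNode_;
    unsigned threshold_;
    std::vector<Partition> parts_;
    std::atomic<int> level_{static_cast<int>(PartitionLevel::Global)};
    std::atomic<unsigned> contention_{0};
};
```

In this sketch the level only ever increases, mirroring the statement that a promoted object stays promoted until restart. Because every partition carries its own lock, a worker that picked its partition just before a promotion still updates consistent state, and the total is simply the sum across partitions.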

 

The dynamic CMemThread design is a superset of the -T8048 behavior. You can deploy SQL Server 2016 on a wide variety of hardware, and it automatically adjusts CMemThread behavior to the applicable load. Partitioning contended CMemThread objects removes the associated waits and allows SQL Server 2016 to scale to your application's needs.


The sys.dm_os_memory_objects DMV has been extended with additional contention information, which can be used to determine the promotion level. New XEvents are also provided to monitor promotion activity.

'It Just Runs Faster' - Apply SQL Server 2016 and SQL Server dynamically partitions any CMemThread object that encounters contention, increasing the scalability of the instance.

Ajay Jagannathan - Principal SQL Server Program Manager

Benjamin Satzger - SQL Server Software Engineer

Bob Dorr - Principal SQL Server Software Engineer

 

DEMO - It Just Runs Faster: Dynamic Memory Object Partitioning

Overview

Creating a demonstration for a thread synchronization object, such as CMemThread, is difficult. Not only are there 1000s of memory objects (sys.dm_os_memory_objects), but whenever we find a hot spot the development team fixes it. The other issue with such a demonstration is machine variance: creating a hot spot requires tight collisions across multiple CPUs.

Actual Scenarios

SQL Server 2016 has been vetted by a wide range of customers. The positive impact of these changes has been realized as follows:

  • Every customer gets the advantage of the CMemThread improvements if any one of the 1000s of objects encounters contention
  • A shipping customer was able to increase overall batch rate significantly
  • A web hosting service improved web page response times by a third

Demonstration Transcript

The associated video shows the impact of dynamic CMemThread partitioning on a heavily contended memory object.

The demonstration uses a custom extended procedure (xp_alloc) and the srv_alloc / srv_free ODS APIs. The MEMOBJ_XP object is global by default, so this stress scenario drives heavy contention on it until dynamic partitioning takes effect.

The routine accepts the number of loops and size of allocation and uses srv_alloc and srv_free to access the MEMOBJ_XP memory object.

 

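SRV_RESULT xp_alloc(SRVPROC* pSrvProc)
{
    // ... get parameter values iLoops and iSize from pSrvProc (elided)

    for (int iLoop = iLoops; iLoop > 0; iLoop--)
    {
        // Each iteration allocates from, then frees back to, the MEMOBJ_XP memory object.
        BYTE* pData = static_cast<BYTE*>(srv_alloc(iSize));
        if (nullptr != pData)
            srv_free(pData);
    }

    // ... return status (elided)
}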

 

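The stress test uses the RML Ostress utility to run 48 threads (-n48), each repeating the batch 300000 times (-r300000). Every batch performs 100 loops of 1-byte allocations, which adds up to 1.44 billion allocate/free pairs and maximizes pressure on the MEMOBJ_XP CMemThread object.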

ostress.exe -S.\SQLPMO -E -d"master" -Q"execute xp_alloc 100, 1" -n48 -r300000 -b -q
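For readers without a SQL Server test instance, the shape of this workload can be approximated in a standalone C++ sketch. It is only an analogy: AllocLoop and RunStress are hypothetical names, std::malloc and std::free stand in for srv_alloc and srv_free, and the code exercises the C runtime heap rather than the MEMOBJ_XP memory object.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

// Stand-in for one xp_alloc call: `loops` allocate/free pairs of `size` bytes.
// Returns the number of successful allocations.
std::uint64_t AllocLoop(int loops, std::size_t size) {
    std::uint64_t ok = 0;
    for (int i = loops; i > 0; --i) {
        // Request at least one byte so the write below is always valid.
        volatile unsigned char* p =
            static_cast<unsigned char*>(std::malloc(size ? size : 1));
        if (p != nullptr) {
            p[0] = 1;   // touch the block so the pair is not optimized away
            std::free(const_cast<unsigned char*>(p));
            ++ok;
        }
    }
    return ok;
}

// Stand-in for ostress -n<threads> -r<repeats>: every thread runs the batch
// `repeats` times. Returns the total number of successful allocations.
std::uint64_t RunStress(unsigned threads, unsigned repeats, int loops, std::size_t size) {
    std::vector<std::uint64_t> counts(threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counts, t, repeats, loops, size] {
            for (unsigned r = 0; r < repeats; ++r)
                counts[t] += AllocLoop(loops, size);   // each thread owns its slot
        });
    }
    for (std::thread& w : workers) w.join();

    std::uint64_t total = 0;
    for (std::uint64_t c : counts) total += c;
    return total;
}
```

Calling RunStress(48, 300000, 100, 1) would mirror the parameters of the ostress command line above; smaller values are more practical for a quick experiment.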

 

The video walks through the following steps:

  • Shows xp_alloc is installed on the system
  • Uses the dm_os_nodes and dm_os_schedulers DMVs to show the large number of CPUs on the demonstration system. Note: A CMemThread object does not require lots of CPUs to become hot. More CPUs make it more likely, but heavy activity on a few CPUs can also cause bottlenecks, depending on the query pattern and how the CMemThread object is used.
  • Uses dm_os_memory_objects to display the MEMOBJ_XP details before stress is applied
  • Clears statistics so the behavior is easy to observe
  • Creates and starts an XEvent session to capture promotion activities
  • Enables performance counters to show the contention and batch rate behavior changes
  • Starts stress of the MEMOBJ_XP object as outlined above
  • Revisits the DMV information showing the contention
  • Resets the wait statistics and shows the contention has been removed from the system
  • Points to the partition_type change (3) in dm_os_memory_objects, indicating CPU-level partitioning has been achieved
  • Highlights the performance monitor capture, clearly showing the contention change and the batch rate more than doubling

In tests done in a lab environment, the performance monitor chart below shows up to 3x improvement in throughput from 3000 batch requests/sec to 8000 batch requests/sec (green line) and 60x reduction in waits from an average of 57000 waits/sec to 0 waits/sec (red line).

[Image: performance monitor chart showing batch requests/sec (green line) and waits/sec (red line)]

The demonstration closes by opening the captured XEL file, which shows that dynamic partitioning first took MEMOBJ_XP to node partitioning, still saw contention, and then promoted it to CPU-level partitioning.