How Can Reference Counting Be A Leading Memory Scribbler Cause?
The concept of the memory scribbler comes up quite a bit in support. The term is often overused, but I ran into a specific example that commonly fools people, including support engineers. Scribblers are random by nature and the resulting behaviors are so broad that these issues often take quite a bit of troubleshooting before the cause is determined.
Hint: Repro is the fastest way to get to the solution.
What Is A Scribbler?
A scribbler is defined as an action taken by code that results in memory that it does not own being changed.
The definition is pretty techy so I like to describe this as coloring outside the lines. The code is supposed to run exactly as intended, but a scribbler bug causes the logic to step outside the expected behavior.
The picture I included clearly shows a drawing of a crab, but there are some locations where the color comes outside the shell - scribbled.
Let's look at a simple example. You would not intentionally write code this way but it only takes 3 commands to show you what an un-owned memory scribble looks like.
BYTE * pData = new BYTE[10]; // Allocates memory
delete [] pData; // Releases memory
// code path no longer OWNS the memory as it has been returned to the memory manager
memset(pData, 'X', 10); // Scribbles on the memory it does not own
Visualized this would look like a heap with free memory regions.
An allocation removes an entry from the free list and assigns it to the caller (owner of the memory.)
The delete returns the memory to the heap free list but the local variable still points to the memory address. The value assigned to pData should no longer be used because the memory has been returned to the memory manager, which can hand it to another caller or release it to the operating system. At this juncture pData should be treated as (NO ACCESS.)
A second allocation request occurs and the memory is handed out to pDataANOTHER. The pData was not set to NULL so we have two variables that point to the same memory address. The pData should be considered (stale/old) and not reused.
Now the code mistakenly has logic that results in the use of pData after it was released to the memory manager (NOT OWNED by pData). The example shows the memset scribbling (sometimes called stomping) on memory that pDataANOTHER now owns.
Because this type of bug can scribble any number of bits at any random location, the result can be anything from exceptions to invalid results to other unexpected behavior, just to mention a few. This is why it is imperative to break such a problem down to a reproducible scenario, allowing the exact steps that cause the bug to be studied and corrected.
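The walk-through above can be sketched in a form that is safe to run. A real use-after-free is undefined behavior, so this sketch assumes a hypothetical one-block heap (ToyHeap, not any real memory manager) whose buffer never goes away. The names ToyHeap and StalePointerScribbles are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical one-block "heap": a fixed buffer plus an in-use flag.
// The buffer itself stays alive, so writing through a stale pointer here is
// well-defined C++ and safe to run, unlike a real use-after-free.
class ToyHeap {
public:
    static constexpr std::size_t kBlockSize = 10;

    // Hand the single block to a caller, who becomes its owner.
    unsigned char* Alloc() {
        assert(!m_inUse && "toy heap has only one block");
        m_inUse = true;
        return m_block;
    }

    // Return the block to the free list. The caller's pointer is now stale.
    void Free(unsigned char* p) {
        assert(p == m_block && m_inUse);
        m_inUse = false;
    }

private:
    unsigned char m_block[kBlockSize] = {};
    bool m_inUse = false;
};

// Replays the steps: allocate, free, reallocate, write through the stale pointer.
// Returns true when the second owner's data was changed behind its back.
bool StalePointerScribbles() {
    ToyHeap heap;

    unsigned char* pData = heap.Alloc();           // pData owns the block
    heap.Free(pData);                              // pData is stale from here on

    unsigned char* pDataANOTHER = heap.Alloc();    // same address handed out again
    std::memset(pDataANOTHER, 0, ToyHeap::kBlockSize);

    std::memset(pData, 'X', ToyHeap::kBlockSize);  // scribble through the stale pointer

    return pData == pDataANOTHER && pDataANOTHER[0] == 'X';
}
```

Calling StalePointerScribbles() returns true in this sketch: both variables hold the same address, and the bytes pDataANOTHER zeroed now contain 'X' even though pDataANOTHER never wrote them.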
What Is A Bit Flip?
When working with scribblers the term 'bit flip' often comes up. A bit flip is when a single bit appears to have been changed. For example, 2 and 3 have the following binary representations.
- 2 = 0010
- 3 = 0011
The difference between them is a single bit change, or as if a single bit is flipped from 0 to 1 or 1 to 0. Hence the term bit flipped.
Most people think of this as a stale pointer combined with a bitwise operation. Going back to pData and pDataANOTHER, assume the first byte of pDataANOTHER holds 0x2 and the following code is executed.
pData[0] |= 1;
This changes the value in pDataANOTHER to 0x3 (from 0010 to 0011) and appears as a single bit flip.
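The arithmetic can be checked with a small sketch that works on a copy of the byte rather than on real stale memory. The helper names BitsChanged and SetLowBit are hypothetical and exist only for this illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bit positions that differ between two byte values.
std::size_t BitsChanged(std::uint8_t before, std::uint8_t after) {
    return std::bitset<8>(static_cast<unsigned>(before ^ after)).count();
}

// Applies the same operation as `pData[0] |= 1` to a copy of the byte.
std::uint8_t SetLowBit(std::uint8_t value) {
    return static_cast<std::uint8_t>(value | 1u);
}
```

With these helpers, SetLowBit(0x2) yields 0x3 and BitsChanged(0x2, 0x3) is 1. If the byte already held 0x3, the same operation changes nothing, which is one reason a scribbler can go unnoticed for a long time.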
Stale ~= Reference Counting
In practice I don't find the single bit flip to be as common (don't get me wrong, it can and does happen). Instead, I find the behavior looks like a bit flip but the issue arises from a reference counting operation.
Let's advance the example a bit and build the following two classes in our code.
MyOldClass (pOld) has a reference counter located at the start of the class and MyNewClass (pNew) has a status value at that same offset within the class. AddRef and Release perform InterlockedIncrement (++) and InterlockedDecrement (--) operations on the reference count member.
Assume the same type of activity has taken place and pOld memory was released back to the memory manager but a bug allows pOld->Release() to be called after the memory was released. The pNew allocation has already reused the memory address so where the m_dwStatus physically resides is where pOld thinks the m_dwRefCount is also located.
The logic in ::Release() is to decrement the value located in m_dwRefCount, I.E. InterlockedDecrement(&m_dwRefCount) or m_dwRefCount--. Since we are only subtracting 1, it can look like a bit flip or a damaged byte in memory.
- 0x03 (0011) and a -1 results in 0x2 (0010) -- Looking like a bit flip
- 0x0 (0000) and a -1 can result in 0xF (1111) -- Multiple bits are changed
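The stale Release() can be simulated without touching freed memory. This is a minimal sketch under stated assumptions: the layouts below are guessed from the description (only m_dwRefCount and m_dwStatus come from the text, m_dwPayload is hypothetical), the members are taken to be 32 bits wide, and a plain decrement stands in for InterlockedDecrement. The list above uses 4-bit values for readability, so a decrement from 0 shows up here as 0xFFFFFFFF rather than 0xF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed layouts: every member other than m_dwRefCount and m_dwStatus is hypothetical.
struct MyOldClass {
    std::uint32_t m_dwRefCount;  // reference counter at offset 0
    std::uint32_t m_dwPayload;   // hypothetical
};

struct MyNewClass {
    std::uint32_t m_dwStatus;    // status value at offset 0
    std::uint32_t m_dwPayload;   // hypothetical
};

static_assert(offsetof(MyOldClass, m_dwRefCount) == offsetof(MyNewClass, m_dwStatus),
              "the two members must overlap for the example to apply");
static_assert(sizeof(MyOldClass) == sizeof(MyNewClass), "same block size");

// What a stale pOld->Release() does to the raw block: decrement the 32-bit value
// stored where MyOldClass keeps m_dwRefCount. memcpy keeps the simulation well-defined.
void StaleRelease(unsigned char* block) {
    std::uint32_t count;
    std::memcpy(&count, block + offsetof(MyOldClass, m_dwRefCount), sizeof(count));
    --count;  // unsigned arithmetic wraps: 0 becomes 0xFFFFFFFF
    std::memcpy(block + offsetof(MyOldClass, m_dwRefCount), &count, sizeof(count));
}

// Places a MyNewClass with the given status in a block that pOld used to own,
// runs the stale Release, and returns the status the new owner now sees.
std::uint32_t StatusAfterStaleRelease(std::uint32_t status) {
    alignas(MyNewClass) unsigned char block[sizeof(MyNewClass)] = {};
    MyNewClass fresh{status, 0};
    std::memcpy(block, &fresh, sizeof(fresh));  // pNew now owns the block

    StaleRelease(block);                        // bug: pOld->Release() after free

    MyNewClass seen;
    std::memcpy(&seen, block, sizeof(seen));
    return seen.m_dwStatus;
}
```

StatusAfterStaleRelease(0x3) returns 0x2, which looks like a single bit flip, while StatusAfterStaleRelease(0x0) returns 0xFFFFFFFF, where every bit has changed.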
What's The Big Deal?
It is obvious to you that this is not the desired behavior and it needs to be corrected to provide proper stability to the application. However, there are other side effects that you may not have considered.
What if this is SQL Server memory management and the stale pointer ends up pointing back to a data page because the memory page was reused to support a data page when the pointer was released to the memory manager? Now the scribble behavior can happen on the actual data stored on the page. If this impacts the row tracking structures you can see corruption issues reported by DBCC or at runtime, but if the scribble impacts the actual data storage bytes you may not notice until your customer complains that their name is not spelled correctly or that the $500 they deposited in the account has become $499 and their statement won't balance.
Security: More concerning is when the problem can be used for a security exploit. There are various ways that the behavior might be susceptible to an exploit. If you really are interested look up 'Heap Exploit and Heap Spraying' on Bing. This is why every exception reported to Microsoft is checked by our security teams for exploitable possibilities.
One way is to take the class example and extend it to include a virtual method so the class contains a VTABLE. Now the pOld overwrite can change the VTABLE pointer. If the user can influence the overwrite, they could potentially point the VTABLE functions at code they want to execute instead of the proper code, and you have a security exploit.
How Are You Protected?
There is no such thing as 100% protection, but Microsoft products go to great lengths to make sure this does not happen. In fact, you can read about the extended heap protection the operating system provides out of the box to help prevent exploits from any application. (https://blogs.msdn.com/b/b8/archive/2011/09/15/protecting-you-from-malware.aspx)
Our policy is that any heap or memory manager must attempt to protect itself against such an attack. Thus, anytime the internal structures of the memory manager are compromised it is a requirement that the process be terminated.
For SQL Server you may see the (ex_terminator) handler. SQL Server installs this termination handler, a structured exception handler, around all memory manager activities (I.E. Alloc, Free, ….). If any exception occurs or an assertion in the code fails, the termination handler captures a dump (using SQLDumper as an external process so we are not relying on the compromised SQLServer.exe process) and SQL Server is terminated.
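The terminate-on-compromise policy can be illustrated with a small sketch. Everything here is an assumption made for illustration: the BlockHeader layout, the kGuard value, and the function names are hypothetical and do not describe how any Microsoft memory manager is actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical block header: a real memory manager's internal structures differ.
struct BlockHeader {
    std::uint32_t guard;  // expected to hold kGuard while the header is intact
    std::uint32_t size;   // size of the user region in bytes
};

constexpr std::uint32_t kGuard = 0xFEEDFACEu;  // arbitrary illustrative value

// True when the header still carries the expected guard value.
bool HeaderIsIntact(const BlockHeader& header) {
    return header.guard == kGuard;
}

// Would be called on every Alloc/Free path in this sketch. If the header was
// scribbled on, stop the process rather than continue with compromised bookkeeping.
void VerifyOrTerminate(const BlockHeader& header) {
    if (!HeaderIsIntact(header)) {
        std::abort();
    }
}
```

The design choice is that a damaged header is never repaired or ignored. Once the bookkeeping cannot be trusted, ending the process is safer than letting an attacker or a bug steer later allocations.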
You can also reference the following SQL Server protection mechanisms that can help locate a possible scribbler source.
- Database Checksum
- Backup Checksum
- Database Constant Page Protection (https://support.microsoft.com/kb/2015759)
- Extended Page Heap Activities: -T3654 and -T8809
- Latch Enforcement Activities: -T815
Note: Trace flags should be used with caution and under the guidance of Microsoft.
Bob Dorr - Principal SQL Server Escalation Engineer