Chapter Two, Assembly Code, Exception, Memory, Synchronization and Debugger
(To save time, I will not translate the full Chinese content. I will skip some knowledge introduction)
This chapter covers user-mode debugging knowledge and tools, including assembly language, exceptions, memory layout, the CRT, handles, CriticalSection, thread switching, WinDbg…
It focuses on how to use this knowledge and these tools for effective debugging. For the knowledge itself, the following two books cover it in detail:
Programming Applications for Microsoft Windows
Debugging Applications for Windows
The following sections use WinDbg to demonstrate the debugging. If you want to set up WinDbg to follow along, please go to Section 4, which discusses WinDbg.
Section 1: Assembly code, the minimal unit for CPU execution
Assembly code is the smallest unit of CPU execution. Assembly code analysis is necessary in the following situations:
1. The high-level code, such as C++ or C#, looks correct, but the execution result is wrong. The compiler or the CPU is then a suspect.
2. The source code is not available, for example when a Windows API behaves strangely.
3. The application crashes, for example with an Access Violation. The first-hand information in the debugger is the assembly instruction that accessed the memory.
Intel Architecture Manual volume 1,2,3
https://www.intel.com/design/pentium4/manuals/index_new.htm
Case Study: Analyzing the VC compiler's optimization
Problem Description:
The customer is developing a performance-sensitive application. He wants to know how the VC compiler optimizes the following code:
int hgt=4;
int wid=7;
for (i=0; i<hgt; i++)
for (j=0; j<wid; j++)
A[i*wid+j] = exp(-(i*i+j*j));
The direct way is to check the binary code generated by the compiler. Here is my analysis. You may want to try it yourself first and then compare your result with mine.
My analysis:
My analysis is based on VC6 with the default release-mode settings. (The customer was using VC6 at that time. VC6 has since reached the end of its product life cycle, and Microsoft no longer supports it.)
int hgt=4;
int wid=7;
24: for (i=0; i<hgt; i++)
0040107A xor ebp,ebp
0040107C lea edi,[esp+10h]
25: for (j=0; j<wid; j++)
26: A[i*wid+j] = exp(-(i*i+j*j));
00401080 mov ebx,ebp
00401082 xor esi,esi
// The result of i*i is saved in ebx
00401084 imul ebx,ebp
00401087 mov eax,esi
// Only one imul occurs in every inner loop (j*j)
00401089 imul eax,esi
// Use the saved i*i in ebx directly. !!Optimized!!
0040108C add eax,ebx
0040108E neg eax
00401090 push eax
00401091 call @ILT+0(exp) (00401005)
00401096 add esp,4
// Save the result back to A[]. The addr of current offset in A[] is saved in edi
00401099 mov dword ptr [edi],eax
0040109B inc esi
// Simply add 4 to edi. i*wid is never recalculated, so no imul is needed for the index. !!Optimized!!
0040109C add edi,4
0040109F cmp esi,7
004010A2 jl main+17h (00401087)
004010A4 inc ebp
004010A5 cmp ebp,4
004010A8 jl main+10h (00401080)
The optimizations are:
1. Since the result of i*i does not change in the inner loop, the compiler caches it in the ebx register.
2. For the store to A[i*wid+j], only j changes in the inner loop, and j increases by one each time. Since A is an int array, the target address increases by 1*sizeof(int), which is 4, on each iteration. The compiler keeps the address of the current element in edi and uses “add edi,4” instead of recalculating i*wid+j.
3. The compiler keeps temporary variables in registers to avoid memory accesses.
Can you do a better job than the compiler by writing the assembly code by hand?
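To make the points above concrete, here is a minimal C++ sketch that applies the same transformations by hand in source form. It is not the compiler's output. The function g is a hypothetical stand-in for the call made inside the loop, and the names fill_naive and fill_hoisted are invented for this illustration.

```cpp
// Hypothetical stand-in for the function called inside the loop.
int g(int x) { return x * 3 + 1; }

// The loop as written in the source: i*wid+j and i*i are
// recomputed on every inner iteration.
void fill_naive(int* A, int hgt, int wid) {
    for (int i = 0; i < hgt; i++)
        for (int j = 0; j < wid; j++)
            A[i * wid + j] = g(-(i * i + j * j));
}

// Hand-applied version of what the listing shows.
void fill_hoisted(int* A, int hgt, int wid) {
    int* p = A;                     // plays the role of edi
    for (int i = 0; i < hgt; i++) {
        const int ii = i * i;       // plays the role of ebx, computed once per row
        for (int j = 0; j < wid; j++) {
            *p = g(-(ii + j * j));  // only j*j is multiplied in the inner loop
            ++p;                    // "add edi,4": step to the next int
        }
    }
}
```

Both functions fill the array with the same values. The second one simply spells out in C++ what the optimizer already did, which is why hand-tuning this loop at the source level is unlikely to gain anything.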
Case Study: A VC2003 compiler bug. The C++ application works in debug, crashes in release
Compiler bugs do exist. If you build the following code in VS2003, it crashes in release mode:
// The following code crashes or behaves abnormally in a release build when "whole program optimization /GL"
// is set. The bug is fixed in VS2005
#include <string>
#pragma warning( push )
#pragma warning( disable : 4702 ) // unreachable code in <vector>
#include <vector>
#pragma warning( pop )
#include <algorithm>
#include <iostream>
//vcsig
// T = float, U = std::string
template <typename T, typename U> T func_template( const U & u )
{
std::cout<<u<<std::endl;
const char* str=u.c_str();
printf(str);
return static_cast<T>(0);
}
void crash_in_release()
{
std::vector<std::string> vStr;
vStr.push_back("1.0");
vStr.push_back("0.0");
vStr.push_back("4.4");
std::vector<float> vDest( vStr.size(), 0.0 );
std::vector<std::string>::iterator _First=vStr.begin();
std::vector<std::string>::iterator _Last=vStr.end();
std::vector<float>::iterator _Dest=vDest.begin();
std::transform( _First,_Last,_Dest, func_template<float,std::string> );
_First=vStr.begin();
_Last=vStr.end();
_Dest=vDest.begin();
for (; _First != _Last; ++_First, ++_Dest)
*_Dest = func_template<float,std::string>(*_First);
}
int main(int, char*)
{
getchar();
crash_in_release();
return 0;
}
The compiler settings are:
1. Disable precompiled headers.
2. Use /O2 /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /GS /Fo"Release/" /Fd"Release/vc70.pdb" /W4 /nologo /c /Wp64 /Zi /TP
Trace assembly execution:
Based on the initial analysis below, the problem is likely a compiler bug:
1. Besides cout and printf, there are no system API calls. All the operations work on registers and memory, so environment and system factors are not involved.
2. The code looks correct. If we move std::transform after the for loop, the problem no longer occurs.
3. The compiler settings matter.
In the code, both std::transform and the for loop perform the same vector conversion by calling func_template. Comparing the two may help. I set a breakpoint at the entry of main, pressed Ctrl+Alt+D to switch to the disassembly view, traced the execution step by step, and found the following.
In the STL implementation in VS2003, std::transform uses the following code to invoke the conversion function:
*_Dest = _Func(*_First);
The compiler handles the call with the following:
EAX = 0012FEA8 EBX = 0037138C ECX = 003712BC EDX = 00371338 ESI = 00371338 EDI = 003712B0 EIP = 00402228 ESP = 0012FE70 EBP = 0012FEA8 EFL = 00000297
388: *_Dest = _Func(*_First);
00402228 push esi
00402229 call dword ptr [esp+28h]
0040222D fstp dword ptr [edi]
The parameter is held in ESI and pushed onto the stack before the call. In other words, std::transform passes the parameter to func_template on the stack.
In the for loop, however, the compiler handles the statement
*_Dest = func_template<float,std::string>(*_First);
with the following:
EAX = 003712B0 EBX = 00371338 ECX = 003712BC EDX = 00000000 ESI = 00371338 EDI = 0037138C EIP = 00401242 ESP = 0012FE98 EBP = 003712B0 EFL = 00000297
37: *_Dest = func_template<float,std::string>(*_First);
00401240 mov ebx,esi
00401242 call func_template <float,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > (4021A0h)
00401247 fstp dword ptr [ebp]
As shown above, in the for loop the compiler uses a mov instruction to put the parameter into ebx and then calls func_template.
Finally, let’s check how func_template handles the incoming parameter:
004021A0 push esi
004021A1 push edi
16: std::cout<<u<<std::endl;
004021A2 push ebx
004021A3 push offset std::cout (414170h)
004021A8 call std::operator<<<char,std::char_traits<char>,std::allocator<char> > (402280h)
Here it pushes ebx onto the stack and then calls operator<< for std::cout, without reading the incoming parameter from the stack. This means that func_template, the callee, expects the parameter to be passed in a register. However, std::transform, the caller, passes the parameter on the stack. This mismatch causes the crash.
But why does the problem occur only in release mode and not in debug mode? You can analyze it in the same way.
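The register-versus-stack mismatch can be modeled with a small toy simulation. This is only a sketch: the Machine struct and the function names are hypothetical, and real calls cannot be expressed this way in portable C++. It shows that a callee reading ebx sees whatever happened to be in that register when the caller pushed the argument instead.

```cpp
#include <cstdint>
#include <vector>

// Toy model of the machine state that matters here: one register and a stack.
struct Machine {
    std::uint32_t ebx = 0;
    std::vector<std::uint32_t> stack;
};

// Callee that expects its argument in ebx, as func_template did.
std::uint32_t callee_reads_register(const Machine& m) { return m.ebx; }

// Caller that follows the same convention: mov ebx, arg; call.
std::uint32_t call_via_register(Machine& m, std::uint32_t arg) {
    m.ebx = arg;
    return callee_reads_register(m);
}

// Caller that pushes the argument instead: push arg; call.
// This is what the std::transform call site did.
std::uint32_t call_via_stack(Machine& m, std::uint32_t arg) {
    m.stack.push_back(arg);
    std::uint32_t seen = callee_reads_register(m);  // the stack is never read
    m.stack.pop_back();
    return seen;
}
```

With the register values from the std::transform dump above (EBX = 0037138C, ESI = 00371338), call_via_stack returns 0x0037138C, the stale content of ebx, rather than the pushed 0x00371338. The callee then treats an unrelated address as a std::string.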
Case Study: How the notorious DLL Hell causes Server Unavailable in ASP.NET
The ASP.NET web site reports Server Unavailable for every page request. When the w3wp.exe host process starts, it crashes immediately while handling the request. Further analysis in the debugger shows that the crash is caused by a null pointer access. However, every frame on the call stack comes from w3wp.exe and the .NET Framework; the customer’s code is never reached. Checking the related code shows that the null pointer is passed from the caller to the callee, and the problem occurs when the callee uses the parameter. The obvious next step is to find out why the caller does not pass a correct parameter to the callee.
The strange thing is that the pointer is already initialized in the caller. It is a valid pointer just before the call instruction, and it is pushed onto the stack correctly. Why does the callee get an invalid pointer when it fetches it from the stack? Single-stepping through the execution reveals the problem: the caller puts the parameter at the callee’s [ebp+8], while the callee fetches the parameter from [ebp+c]. Does it look similar to the previous case? The culprit this time is not the compiler but the DLL version. The caller and the callee reside in two different DLLs. The caller’s DLL is the .NET Framework 1.1 version, while the callee’s DLL is the .NET Framework 1.1 SP1 version. In CLR 1.1 the callee accepts 4 parameters, while the new version accepts 5. The caller passes the parameters the old way, while the callee fetches them the new way. This is a typical DLL Hell problem. After reinstalling CLR 1.1 SP1 so that the DLL versions match, and rebooting, everything works fine.
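The offset shift can be illustrated with a toy model of the callee's stack frame. This is a sketch under assumptions: the Frame type and function names are hypothetical, the argument values are made up, and it assumes the extra parameter in the newer DLL comes before the pointer, which is what the [ebp+8] versus [ebp+c] observation suggests.

```cpp
#include <cstdint>
#include <vector>

// Toy view of a callee's frame: slots[k] models the DWORD at [ebp + 4*k].
struct Frame {
    std::vector<std::uint32_t> slots;
    std::uint32_t at(std::uint32_t ebp_offset) const {
        return slots.at(ebp_offset / 4);
    }
};

// Frame as the callee sees it after the caller pushed the arguments
// right to left, the call pushed a return address, and the callee pushed ebp.
Frame make_frame(const std::vector<std::uint32_t>& args) {
    Frame f;
    f.slots.push_back(0);  // [ebp+0]  saved ebp (placeholder)
    f.slots.push_back(0);  // [ebp+4]  return address (placeholder)
    for (std::uint32_t a : args) f.slots.push_back(a);  // first argument at [ebp+8]
    return f;
}

// Old callee (4 parameters): the pointer is the first argument.
std::uint32_t old_callee_pointer(const Frame& f) { return f.at(0x8); }

// New callee (5 parameters, assumed layout): the pointer is now the second argument.
std::uint32_t new_callee_pointer(const Frame& f) { return f.at(0xC); }
```

If an old-style caller builds the frame from four arguments such as {0x00371338, 0, 1, 2}, the old callee finds the pointer at [ebp+8], while the new callee reads [ebp+c] and gets 0. The pointer was valid in the caller and was pushed correctly, yet the callee dereferences null.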
Possible causes for the mismatched DLL versions:
1. After .NET Framework 1.1 SP1 was installed, the customer did not reboot. Some DLLs in use were locked and were pending update until the reboot.
2. The web server is a node of an Application Center cluster. The cluster server syncs the file versions across all the nodes. If .NET Framework 1.1 SP1 is applied to a single node only, the file versions may be rolled back by Application Center:
PRB: Application Center Cluster Members Are Automatically Synchronized After Rebooting
https://support.microsoft.com/kb/282278/en-us
Discussion:
Is a release build always faster than a debug build?
Run the following code in release and debug mode and compare the performance. The debug version will be faster than the release version. Why?
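#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main()
{
    long nSize = 200;
    char* pSource = (char *)malloc(nSize+1);
    char* pDest = (char *)malloc(nSize+1);
    memset(pSource, 'a', nSize);
    pSource[nSize] = '\0';
    DWORD dwStart = GetTickCount();
    for(int i=0; i<5000000; i++)
    {
        strcpy(pDest, pSource);
    }
    DWORD dwEnd = GetTickCount();
    printf("%lu", dwEnd-dwStart); // elapsed milliseconds; DWORD is unsigned long
    free(pSource);
    free(pDest);
    return 0;
}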
Write your own strcpy function and compare its performance with the default version shipped with VS. Can you make it faster than the default version?
In the samples above, the decisive factors are:
1. On 32-bit chips, try to move one DWORD at a time instead of moving a byte four times. Pay attention to alignment on 4-byte boundaries.
2. Here strcpy is called many times. Using an inline version avoids the high cost of the call instruction.
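As a starting point for the exercise, here is a minimal sketch of the DWORD-at-a-time idea. It is not the CRT implementation. The name strcpy_by_words is hypothetical, and the function assumes that both buffers have a capacity rounded up to a multiple of 4 bytes and that the padding bytes are initialized, so that whole-word reads and writes never run past the buffers. The loads go through memcpy, which sidesteps the alignment question at the source level; a hand-written assembly version would align the pointers first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if any of the four bytes in v is zero (a well-known bit trick).
inline bool has_zero_byte(std::uint32_t v) {
    return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
}

// Hypothetical helper. Precondition (assumption): the capacity of both
// buffers is a multiple of 4 bytes and the padding is initialized.
inline char* strcpy_by_words(char* dst, const char* src) {
    std::size_t i = 0;
    for (;;) {
        std::uint32_t w;
        std::memcpy(&w, src + i, sizeof w);  // one 4-byte load
        if (has_zero_byte(w)) break;         // the terminator is inside this word
        std::memcpy(dst + i, &w, sizeof w);  // one 4-byte store
        i += sizeof w;
    }
    // Copy the last 1 to 4 bytes one at a time, including the '\0'.
    while ((dst[i] = src[i]) != '\0') ++i;
    return dst;
}
```

For the 200-character string in the test program, this needs buffers of 204 bytes rather than 201. Whether it actually beats the shipped strcpy depends on the compiler and its settings, so measure it with the same loop before drawing conclusions.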
Next, I will discuss Section 2, exception handling.