SQL Server on Linux: An LLDB Debugging Tale
You are aware of our statements indicating “Microsoft Loves Linux.” Over the last couple of years the open source activities at Microsoft have accelerated all around me. Recently I blogged about the design of of the debugger bridge and the use of LLDB. In this post I want to highlight my recent debugging session into LLDB and show you how we are contributing to LLDB.
Scenario
We captured a core dump on a 2TB RAM system. When loading it with LLDB the core dump would open but we could not symbolize stacks or see specific memory that was expected to be present in the dump. Opening the dump with gdb allowed us to see the memory region and dumping the PT_LOAD headers with readelf showed the memory region was present in the dump.
Debugging Steps
I started with LLDB logging. LLDB has logging capabilities to assist in understanding what is transpiring.
Command Example: log enable -gv –- lldb dyld module platform
You can use the command “log list” to see the channel (gdb-remote, LLDB, …) and supported logging categories. Using the -O command line parameter, when you start LLDB, indicates to LLDB to execute the command before the files are loaded. This lets you log information while loading the core dump.
Launch: lldb -c core.sqlservr.8917 -f ./bin/sqlservr.dbg -O"log enable -gv -- lldb dyld platform"
- LLDB loads the target and associated shared libraries (I.E. DLLs) during the target create action.
- In the log you can see LLDB is setting the path to that of the specified executable and using that path to look for shared libraries (.so files.)
- The rendezvous logic is where LLDB attempts to find information in the dump overlapping the shared modules. When I stepped through the logic the DYLD is the key. DYLD (dynamic loading) is where LLDB is using the core dump information (named memory regions such as [vdso]) to locate the dynamic object (list maps) and other information in an attempt to obtain the shared module details.
(lldb) log enable -gv -- lldb dyld platform
(lldb) target create "./bin/sqlservr.dbg" --core "core.sqlservr.8917"
DYLDRendezvous::DYLDRendezvous exe module executable path set: './bin/sqlservr.dbg'
DynamicLoaderPOSIXDYLD::DidAttach() pid 8917
…
DynamicLoaderPOSIXDYLD::ResolveExecutableModule - got executable by pid 8917: ./bin/sqlservr.dbg
DynamicLoaderPOSIXDYLD::DidAttach pid 8917 executable './bin/sqlservr.dbg', load_offset 0x7f5e4bd16000
DynamicLoaderPOSIXDYLD::DidAttach pid 8917 added executable './bin/sqlservr.dbg' to module load list…
You can see LLDB reading over the sections by name but LLDB terminates the search early and can’t find the symbol sections. Without the symbol sections variables and stack decoding is limited.
…
SectionLoadList::SetSectionLoadAddress (section = 0x5642dc4a62e0 (./bin/sqlservr.dbg..bss), load_addr = 0x00007f5e4bfb5e80) module = 0x5642dc4919b0
DYLDRendezvous::Resolve address size: 8, padding 4
ResolveRendezvousAddress info_location = 0x7f5e4bfb4dc0
ResolveRendezvousAddress reading pointer (8 bytes) from 0x7f5e4bfb4dc0
ResolveRendezvousAddress FAILED - could not read from the info location: core file does not contain 0x7f5e4bfb4dc0
DYLDRendezvous::Resolve cursor = 0xffffffffffffffff
DynamicLoaderPOSIXDYLD::DidAttach() pid 8917 rendezvous could not yet resolve: adding breakpoint to catch future rendezvous setup
DynamicLoaderPOSIXDYLD::ProbeEntry pid 8917 GetEntryPoint() returned address 0x7f5e4bd2d83c, setting entry breakpoint
DynamicLoaderPOSIXDYLD::DidAttach told the target about the modules that loaded:
-- [module] ./bin/sqlservr.dbg (pid 8917) Core file '/media/rdorr/Scratch1/Temp/CoreWontLoad/core.sqlservr.8917' (x86_64) was loaded.
Why is the read failing?
Taking a closer look at the log and stepping though the LLDB code I found the read failure occurring because the memory address could not be located in the tracked VM map. The VM map is built from the PT_LOAD entries in the core dump file. PT_LOAD is an entry containing the address range, permissions and offset in the dump where the memory region is stored.
I turned to readelf –h to dump the ELF header associated with the core dump file.
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: CORE (Core file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 64 (bytes into file)
Start of section headers: 16880770624 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 65535 (106714)
Size of section headers: 64 (bytes)
Number of section headers: 0 (106716)
Section header string table index: 65535 (106715)
Interestingly, I found the program_headers printed 65535 (106714). What this tells us is the ELF header has a special marker (0xFFFF=65535.) When this marker is placed in the program and section header count members, it indicates that section_header[0] holds the actual count values. The ELF header is limited to a 16 bit value or max of 0xFFFF and on the 2TB system the number of memory regions exceeded 65535 when the dump was captured.
You can use readelf –l or –S to see the program/section header information as well as dump the PT_LOAD. Using this technique I was able to confirm the PT_LOAD entries existed for the region in which the read was returning failure on. I could also deduce that the PT_LOAD entry for this region was beyond the 65535th entry in the core dump.
LOAD 0x00000003ee293e28 0x00007f5e4bfb1000 0x0000000000000000
0x0000000000004000 0x0000000000004000 R 1ResolveRendezvousAddress FAILED - could not read from the info location: core file does not contain 0x7f5e4bfb4dc0
LOAD 0x00000003ee297e28 0x00007f5e4bfb5000 0x0000000000000000 0x0000000000001000 0x0000000000001000 RW 1
So why is LLDB indicating the entry is not part of is VM map when gdb and readelf confirm the PT_LOAD’s existence.
(gdb)
x/100x 0x7f5e4bfb4dc00x7f5e4bfb4dc0 0x00000001 0x00000000 0x0000019b 0x00000000
With a bit more stepping I found that LLDB’s, load core logic is looping over the program headers and building up the VM map. The loop is controlled by the ELF header value, in this case 65535. LLDB didn’t have logic to detect the 0xFFFF signature and read the section_header[0], at the ‘Start of section headers’ offset, in order to obtain the actual counts. The gdb and readelf utilities read the proper value of (106714) but LLDB used 65535. Thus, LLDB is not reading the available PT_LOAD headers and the region of memory needed appears after the 65535th entry.
In parallel I made changes to an internal utility (future blog to come) and my esteemed collogue Eugene, who did the majority of heavy lifting for dbgbridge, made changes to LLDB code. Once Eugene allowed LLDB to read all program and section headers we are able to load and properly debug the large, core dump. The change is now under review so it can be added back to the LLDB code base.
Along with this contribution we have uncovered a few other issues, provided possible fixes, and shared these with the LLDB community. The lldb-dev community has been very responsive to engage and help address issues. This is really a neat experience and I can truly see our use of LLDB driving more advancements for everyone.
FEB 1, 2017 UPDATE – CHANGE ACCEPTED
This revision was automatically updated to reflect the committed changes.
Commit rL293714: Open ELF core dumps with more than 64K sections
Changed prior to commit:
https://reviews.llvm.org/D29095?vs=86148&id=86507#toc
Repository:
https://reviews.llvm.org/D29095
Bob Dorr – Principal Software Engineer SQL Server