Friday, October 05, 2007

Fun With WinDbg

I did some debugging on an old legacy reporting system this week, using WinDbg. The reporting engine terminated prematurely after something like 1000 printouts.

After attaching WinDbg and letting the reporter run for half an hour, a first chance exception breakpoint hit because of this memory access violation:

(aa8.a14): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=665b0006 ecx=7c80ff98 edx=00000000 esi=00000000 edi=00000000
eip=665a384f esp=0012bdc4 ebp=00000005 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246
*** ERROR: Symbol file could not be found. Defaulted to export symbols for GEEI11.dll -
GEEI11!FUNPower+0x15f:
665a384f 668b7804 mov di,word ptr [eax+4] ds:0023:00000004=????


Trying to access address 0x0000004 ([EAX+4]), one of the reporting DLLs was obviously doing some pointer arithmetics on a NULL pointer. The previous command was a call to GEEI11!WEP+0xb47c, which happened to be the fixup for GlobalAlloc:

665a3849 ff157c445b66 call dword ptr [GEEI11!WEP+0xb47c (665b447c)]
665a384f 668b7804 mov di,word ptr [eax+4] ds:0023:00000004=????


GlobalLock takes a global memory handle, locks it, and returns a pointer to the actual memory block, or NULL in case of an error. According to the Win32 API calling conventions (stdcall), EAX is used for 32bit return values.

The reporting engine code calling into GlobalLock was too optimistic and did not test for a NULL return value.

The next question was, why would GlobalLock return NULL? Most likely because of an invalid handle passed in. Where could the parameter be found? At the ESI register - it was the one pushed onto the stack before the call to GlobalAlloc, thus must be the one and only function parameter, and it is callee-saved, so GlobalAlloc had restored it in its epilog.

665a3848 56 push esi
665a3849 ff157c445b66 call dword ptr [GEEI11!WEP+0xb47c (665b447c)]

0:000> r
eax=00000000 ebx=665b0006 ecx=7c80ff98 edx=00000000 esi=00000000 edi=00000000
eip=665a384f esp=0012bdc4 ebp=00000005 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010246


As expected, ESI was 0x00000000, and GetLastError confirmed this as well:

0:000> !gle
LastErrorValue: (Win32) 0x6 (6) - Das Handle ist ungültig.
LastStatusValue: (NTSTATUS) 0xc0000034 - Der Objektname wurde nicht gefunden.


Doing some further research, I found out that the global memory handle was NULL because a prior invocation of GlobalAlloc had been unsuccessful. Again, the caller had not checked for NULL at that point. And GlobalAlloc failed because the system had run out of global memory handles, as there is an upper limit of 65535. The reporting engine leaked those handles, neglecting to call GlobalDelete() on time, and after a while (1000 reports) had run out of handles.

By the way, I could not figure out how to dump the global memory handle table in WinDbg. It seems to support all kinds of Windows handles, with the exception of global memory handles. Please drop me a line in case you know how to do that.

Now, there is no way to patch the reporting engine as it's an old third party binary, so the solution we will most likely implement is to restart the engine process after a while, so all handles are free'd by terminating the old process.