2008-11-27:

Freedom for everything - total annihilation of process memory

c++:assembler:windows:winapi:medium
Sitting in my hotel room at the Polish edition of PyCon, I started to wonder what would happen if a normal Windows process wiped out (almost) all of its memory. By "wipe out" I mean freeing/unmapping whatever is possible (VirtualFree and UnmapViewOfFile) and overwriting the rest with zeroes. I started to experiment with this, wanting to know how the system and other applications would react to this uncommon process condition. Below I describe the creation of a test application (I've found a few interesting (imho) problems), and a funny thing OllyDbg does while attaching to such a process.

I had a few ideas about how to create a process with wiped-out memory (should the process clean its own memory? or maybe a parent/debugger/other process should clean it? etc.), and finally I decided that the process should clean the memory itself.

The annihilating process is split into two parts. The first part is written in C++, comfortably uses WinAPI functions, and frees what is possible using the VirtualFree function. It also copies the second part into the Thread Environment Block (TEB) and calls it.
The second part is written in assembler (the Netwide Assembler aka NASM dialect) and uses syscalls to unmap the rest of the memory.
The whole algorithm looks like this:

1. The process copies the second part's code to the TEB.
2. Then it uses VirtualProtect with PAGE_EXECUTE_READWRITE on the user memory space (from 00000000 to 7FFFFFFF in my case; I didn't use the /3GB flag, and the application itself was 32-bit x86).
3. Then it does VirtualFree with MEM_RELEASE on the whole user-space memory except for the main thread's stack.
4. And it jumps to the second part of the code in the TEB.
5. The second part starts with a loop that, using the syscall NtUnmapViewOfSection, unmaps the rest of the memory.
6. And then it falls into an infinite loop, so I can attach at the end with some debugger/monitor and see what's up.

Of course there were a few complications. The first among them: how to get the TEB's linear address?

As one may know, the TEB is located at virtual address 0 in the FS segment, but the C++ language without inline assembler can only operate on linear addresses. To solve this I've used a short assembler function that does push fs; pop eax, and the WinAPI function GetThreadSelectorEntry, which returns the LDT/GDT entry related to the given selector. So the code acquiring the TEB linear address looks like this:

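// push fs; pop eax (opcodes 0F A0 58) - the selector is returned in AX
WORD GetFS(void) { __asm__(".ascii \"\\x0F\\xA0\\x58\""); }
 [...]
 LDT_ENTRY ldt;
 HANDLE thdl = GetCurrentThread();
 GetThreadSelectorEntry(thdl, GetFS(), &ldt);

 // Glue the segment base together from its three descriptor fields
 DWORD Off = ldt.BaseLow | ((DWORD)ldt.HighWord.Bytes.BaseMid << 16) | ((DWORD)ldt.HighWord.Bytes.BaseHi << 24);
 BYTE *Teb = (BYTE*)Off;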
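
The way the base address is glued together from the three descriptor fields can be tried out without any WinAPI at all. Below is a minimal, portable sketch; the struct and function names are hypothetical stand-ins (they are not the real LDT_ENTRY layout), and the sample values are only illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the three base fields of a segment descriptor
// (named after BaseLow, BaseMid and BaseHi from LDT_ENTRY).
struct DescriptorBaseParts {
  std::uint16_t base_low;  // bits 0..15 of the segment base
  std::uint8_t  base_mid;  // bits 16..23
  std::uint8_t  base_hi;   // bits 24..31
};

// Combine the three pieces into one 32-bit linear address.
// The casts keep every shift unsigned, so a base_hi of 0x80 or more is safe.
inline std::uint32_t descriptor_base(const DescriptorBaseParts& d) {
  return static_cast<std::uint32_t>(d.base_low) |
         (static_cast<std::uint32_t>(d.base_mid) << 16) |
         (static_cast<std::uint32_t>(d.base_hi) << 24);
}

// Small self-check with an illustrative (assumed) descriptor.
inline void descriptor_base_demo() {
  const DescriptorBaseParts sample = {0xF000, 0xFD, 0x7F};
  assert(descriptor_base(sample) == 0x7FFDF000u);
}
```

Calling descriptor_base_demo() checks that the parts F000 / FD / 7F give the base 7FFDF000.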


I'll just mention that a small page (0x1000 bytes) is allocated in memory for the TEB, and the TEB doesn't use all of it, so enough room is left to fit some code there. Using empirical methods I found out that TEB+800h is a good place to store the second part's code.

The second complication was the position of the stack.

The problem with the stack is that in the beginning only a small fraction of it is allocated (committed), but a much larger amount is reserved for future allocation (the stack grows as the "push" instruction etc. touches new pages, and the memory is allocated on demand). So if the VirtualFree function hits the reserved but not yet allocated region of the stack, the whole stack is freed anyway, and I don't want that to happen. So the stack size has to be found. It can be done in a few ways:

- One can get the address of the "end" of the stack (the highest address) from the TEB, and then, moving towards lower addresses, "touch" the bytes there (so that the allocation is triggered) until an exception is raised; then stop and write down the last good address, or check the value of the "start" of the stack in the TEB (which changes with each allocation).
- One can get the address of the "end" of the stack from the TEB, and then read the OptionalHeader.SizeOfStackReserve field from the main executable's PE header, do a subtraction, and thanks to that get the address of the stack start. However there are a few problems... 1st, this field is optional, and 2nd, it's just a suggestion, so the PE loader might ignore it.
- One can get the address of the "end" of the stack from the TEB, and then, using the VirtualQuery function, get information about fragments of the memory, backing up into lower addresses until a page with the MEM_FREE state is found (the stack pages have either the MEM_COMMIT or the MEM_RESERVE state).

I've used the last method, but I think the first one is even better (however I'm open to any suggestions, since to tell you the truth, I don't like any of these ideas ;>). The code for the last method looks like this:

 DWORD ThreadStackTop    = *(DWORD*)&Teb[8]; // Lower side (StackLimit)
 DWORD ThreadStackBottom = *(DWORD*)&Teb[4]; // Higher side (StackBase)

 // Walk down page by page until the first free page below the stack
 MEMORY_BASIC_INFORMATION MemInfo;
 for(;;)
 {
   VirtualQuery((LPCVOID)(ThreadStackTop), &MemInfo, sizeof(MemInfo));
   if(MemInfo.State == MEM_FREE)
     break;

   // Print before moving on, so the address matches the queried page
   printf("Seeking stack size: %.8x: %.8x (%s)\r", (unsigned)ThreadStackTop, (unsigned)MemInfo.RegionSize,
       MemInfo.State == MEM_COMMIT ? "MEM_COMMIT" : "MEM_RESERVE");
   ThreadStackTop -= 0x1000;
 }
 // ThreadStackTop now points at the free page just below the stack
 putchar('\n');
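
For comparison, the second method from the list (subtracting SizeOfStackReserve from the stack "end") could look roughly like the sketch below. It is portable and works on a byte buffer holding the headers of a 32-bit (PE32) image instead of calling any WinAPI. The function names are hypothetical, no signature checks are done, and the caveat from the list still applies: the loader may end up using a different size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read a little-endian 32-bit value from a byte buffer.
inline std::uint32_t read_le32(const std::vector<std::uint8_t>& buf, std::size_t off) {
  assert(off + 4 <= buf.size());
  return static_cast<std::uint32_t>(buf[off]) |
         (static_cast<std::uint32_t>(buf[off + 1]) << 8) |
         (static_cast<std::uint32_t>(buf[off + 2]) << 16) |
         (static_cast<std::uint32_t>(buf[off + 3]) << 24);
}

// SizeOfStackReserve of a PE32 image:
//   e_lfanew is stored at 0x3C and points at the "PE\0\0" signature (4 bytes),
//   which is followed by the 20-byte file header and then the optional header;
//   the field sits at offset 72 inside the PE32 optional header.
inline std::uint32_t pe32_stack_reserve(const std::vector<std::uint8_t>& image) {
  const std::uint32_t e_lfanew = read_le32(image, 0x3C);
  return read_le32(image, static_cast<std::size_t>(e_lfanew) + 4 + 20 + 72);
}

// Lowest address of the stack, given its highest address (the value at TEB+4).
inline std::uint32_t stack_start(std::uint32_t stack_end, std::uint32_t reserve) {
  return stack_end - reserve;
}
```

For example, with a stack "end" of 00130000 and a reserve of 00100000 (both values assumed for illustration), stack_start gives 00030000.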


The rest of the code, which copies the shellcode (the second part's code; I use the somewhat incorrect term shellcode since it's copied as opcodes), calls VirtualProtect and VirtualFree, and jumps to the second part, looks like this:

 // Copy shellcode
 size_t CodeSize = 0;
 uint8_t* Code = FileGetContent("shellcode", &CodeSize);
 printf("Shellcode: %.8x bytes\n", CodeSize);

 puts("Copying shellcode...");
 memcpy(&Teb[0x800], Code, CodeSize);

 // Remove memory protection
 puts("Removing memory protection...");
 DWORD Addr = 0;
 DWORD OldPrivs;
 while(Addr < 0x80000000)
 {    
   VirtualProtect((LPVOID)Addr, 1, PAGE_EXECUTE_READWRITE, &OldPrivs);
   Addr += 0x1000;
 }

 // Free memory excluding the stack
 puts("Freeing memory...");
 Addr = 0;
 while(Addr < 0x80000000)
 {
   if(!(Addr >= ThreadStackTop && Addr <= ThreadStackBottom))
     VirtualFree((LPVOID)Addr, 0, MEM_RELEASE);
   Addr += 0x1000;
 }

 // Call the second part
 ((void(*)())&Teb[0x800])();
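
The stack exclusion test in the freeing loop is easy to get off by one, so here is a small portable model of it. It makes no WinAPI calls and only counts which pages the loop would leave alone; the names and the sample stack range are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPageSize  = 0x1000;
constexpr std::uint32_t kUserLimit = 0x80000000u;  // first address above user space

// Same condition as in the freeing loop: a page is left alone when its
// address falls inside [stack_low, stack_high], both ends included.
inline bool should_free(std::uint32_t addr, std::uint32_t stack_low,
                        std::uint32_t stack_high) {
  return !(addr >= stack_low && addr <= stack_high);
}

// Walk every user-space page and count how many the loop would skip.
inline std::uint32_t count_skipped_pages(std::uint32_t stack_low,
                                         std::uint32_t stack_high) {
  assert(stack_low <= stack_high);
  std::uint32_t skipped = 0;
  for (std::uint32_t addr = 0; addr < kUserLimit; addr += kPageSize) {
    if (!should_free(addr, stack_low, stack_high)) ++skipped;
  }
  return skipped;
}
```

With an assumed stack range of 00030000 to 00130000, the page at 00130000 itself is skipped as well (the upper bound is inclusive), which gives 0x101 skipped pages in total.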


I would like to mention that VirtualFree doesn't really do much here; most of the memory is mapped (MapViewOfFile & co.), so the biggest part of the work is done by the shellcode anyway. The shellcode itself calls UnmapViewOfFile in a loop (using the system call NtUnmapViewOfSection; I've hardcoded its Windows XP number), and then reaches the infinite loop. The code looks like this:

[bits 32]
[org 0deadbabeh]

;
; Macros
;
%macro UnmapViewOfFile 1
 push %1 ; Address
 push -1 ; Current process
 call NtUnmapViewOfSection
%endmacro

;
; Code
;
start:
 ;
 ; Unmap memory from 0 to 0x7FFFFFFF
 ;
 xor ebx, ebx
 __unmap_loop_start:

   ; Call
   UnmapViewOfFile   ebx

   ; Iterate
   add ebx, 1000h
   test ebx, ebx
   jns __unmap_loop_start ; Do not call unmap over 0x80000000

 ;
 ; Do something - now the process contains only:
 ; - TEB (unfreeable)
 ; - PEB (unfreeable)
 ; - Main thread stack
 ; - Kernel memory area (unfreeable)
 ;
 jmp short $

;
; Syscall operations
;
%define SYSCALL_NtUnmapViewOfSection 10Bh ; XP SP0/SP1/SP2/SP3

KiFastSystemCall:
 mov edx, esp
 sysenter
 ret

NtUnmapViewOfSection:
 mov eax, SYSCALL_NtUnmapViewOfSection
 call KiFastSystemCall
 retn 8
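
The loop condition in the shellcode (test ebx, ebx followed by jns) relies on the sign bit: the loop runs for as long as bit 31 of ebx is clear. A minimal C++ model of just that control flow is sketched below; it performs no system calls, and the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct UnmapLoopTrace {
  std::uint32_t calls;         // how many times the unmap call would be issued
  std::uint32_t last_address;  // last address that would be passed to it
};

// Mirrors the register-level flow of the assembler loop.
inline UnmapLoopTrace trace_unmap_loop() {
  UnmapLoopTrace t = {0, 0};
  std::uint32_t ebx = 0;                 // xor ebx, ebx
  do {
    t.last_address = ebx;                // UnmapViewOfFile ebx
    ++t.calls;
    ebx += 0x1000;                       // add ebx, 1000h
  } while ((ebx & 0x80000000u) == 0);    // test ebx, ebx / jns (sign bit clear)
  assert(ebx == 0x80000000u);            // the loop exits right at the 2 GB line
  return t;
}
```

In this model the call is issued 0x80000 times, once per 4 KB page, and the last address passed is 7FFFF000.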


And all would be well, but the app will terminate itself in one place ;>

The problem is related to the sysenter and sysexit mechanism. On Windows XP sysexit does not return to the code directly after sysenter (although it might seem it does); it returns to the address stored at linear address 7ffe0304. That address usually belongs to the ntdll.KiFastSystemCallRet function. So if one wants to use the sysenter/sysexit interface, one cannot free ntdll.dll's memory. But we need/want to free it!
There should be a few ways to do it:
- One can zero out the memory of ntdll.dll, leaving just the C3 aka ret in the KiFastSystemCallRet function. It seems OK, but it destroys the concept, since ntdll is not freed.
- When sysexit returns the CPU to the KiFastSystemCallRet which does not exist, an exception will be raised. So why not create a SEH entry to catch the exception, and use the CONTEXT structure to correct the EIP to some more warm and comfy place? It should work. But it won't ;> The whole idea fails because the kernel, after getting an exception and dealing with it kernel-mode style, doesn't go straight to the handler function stated in the SEH; what it does is call the ntdll.KiUserExceptionDispatcher function, which is located in the ntdll module, which we wanted to get rid of in the first place.
- Let's forget for a moment about the fancy "hi-tech" sysenter, and use the old-school int 0x2E, after handling which the kernel gently returns to the code directly after the int instruction. Bingo.

The fixed version of the shellcode differs from the original only in one place:

;
; Syscall operations
;
%define SYSCALL_NtUnmapViewOfSection 10Bh ; XP SP0/SP1/SP2/SP3

KiSlowSystemCall:
 lea edx, [esp+8]
 int 0x2E
 ret

NtUnmapViewOfSection:
 mov eax, SYSCALL_NtUnmapViewOfSection
 call KiSlowSystemCall
 retn 8
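
A note on the lea edx, [esp+8] line: as far as I understand the int 0x2E convention, edx has to point at the first syscall argument, and at that moment two return addresses lie between esp and the arguments. The sketch below models that stack layout as a plain struct, so the offset can be checked with offsetof. The struct and its member names are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model of the 32-bit stack at the moment "int 0x2E" executes in the stub.
// The first member is the dword at [esp]; each next member lies 4 bytes higher.
struct StackAtInt2E {
  std::uint32_t ret_to_nt_stub;  // [esp+0]  pushed by "call KiSlowSystemCall"
  std::uint32_t ret_to_caller;   // [esp+4]  pushed by "call NtUnmapViewOfSection"
  std::uint32_t process_handle;  // [esp+8]  "push -1" (current process)
  std::uint32_t base_address;    // [esp+12] "push %1" (address to unmap)
};

// Offset from esp to the first syscall argument, i.e. the 8 in "lea edx, [esp+8]".
inline std::size_t first_argument_offset() {
  return offsetof(StackAtInt2E, process_handle);
}

// Self-check of the modelled layout.
inline void check_layout() {
  assert(first_argument_offset() == 8);
  assert(offsetof(StackAtInt2E, base_address) == 12);
}
```

The same layout explains the retn 8 at the end of NtUnmapViewOfSection: the two 4-byte arguments are removed from the stack on return.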




The situation after running the above program is quite good. The only areas left in memory are:
- the stack
- TEB
- PEB
- and the untouchable, read-only, kernel-owned memory page that resides at address 7ffe0000 (that's the area where the time counters and the path to the Windows directory are placed)

Can anything more be done here? Yes, the stack can be removed, and the esp can be moved to TEB or PEB. However I didn't do that in the above example, we wouldn't want the esp to get claustrophobia now would we? ;>

OK, the program works. How does the system react to such a creature? Well, after a few tests (just a few, I'll write about more later) it didn't behave abnormally in any way.
But OllyDbg did! When attaching Olly to a running process with none of the standard libraries in memory, the remote thread created by Olly throws an exception at the first function call. So checking the thread count would be a good way to detect the Olly Debugger.

And that's all for this post. As a "farewell" gift: the source + binary, and some links.

annihilate_memory.zip (7kb, source + exec)

Stuff worth seeing:
System call optimization with the SYSENTER instruction
A catalog of NTDLL kernel mode to user mode callbacks, part 2: KiUserExceptionDispatcher

Comments:

2008-11-27 23:50:32 = halsten
{
Hey,

Interesting stuff my friend. :)

Regards,
halsten
}
2008-11-28 03:47:14 = Gynvael Coldwind
{
Hi halsten ;>

Thanks for viewing/posting ;>

Take care,
Gyn
}
2008-11-28 04:34:42 = omeg
{
Fun stuff.

You can use __readfsdword(x) intrinsic to read fs:[x]. I try to avoid inline assembly where possible, it's not very x64-friendly unless you want to link external OBJs. Well, you'd need totally different code for x64 anyway. ;)

There is also a supported/public way to get PEB using NtQueryProcessInformation, but as for TEB.. go for FS:0 (true for user-mode only though).
}
2008-11-28 05:44:57 = Gynvael Coldwind
{
Hi omeg ;>

I didn't know about __readfsdword(x) intrinsic, thanks ;>
However I'm afraid I couldn't use it in my case anyway, since I'm currently using MinGW GCC which does not support readfsdword().

Thanks for your comment ;>

Take care,
Gyn
}
2009-09-11 04:13:33 = BanMe
{
some great stuff here..I cant believe I didnt find this site earlier.. :d I experimented with something similar to this,but lost the idea before it came to fruition..excellent work :D


}
2011-05-28 22:11:33 = King Of Core
{
cooll as bs, and we can make some intereting thing from this, like a delete self-execuable.....
}
2013-10-20 18:08:54 = freop
{
You might find this link interesting as well
http://www.catch22.net/tuts/self-deleting-executables
}
