Inject-Me: x64 Injection-less Code Injection
Malware authors are always looking for new ways to achieve code injection, as it enables them to run their code in remote processes. Code Injection allows hackers to better hide their presence, gain persistence and leverage other processes’ data and privileges.
Finding and implementing new, stable methods for code injection is becoming more and more challenging as traditional techniques are now widely detected by various security solutions or limited by native OS protections.
Inject-Me is a new method for injecting code into a remote process on x64. Inject-Me is in fact “injection-less”: the remote (target) process is manipulated into reading data from the injecting process, copying it and executing it. The manipulation relies mainly on abusing ReadProcessMemory and the x64 calling convention. In addition to presenting Inject-Me, we describe a generalized approach to copying data inside a remote process, which can be used to recreate shellcode from the injecting process.
Technical Review
This technical review walks through Inject-Me step by step. Because a second technique, copying data inside a remote process, is needed to complete the injection, it is described along the way.
Since the method has many moving parts, the execution flow is summarized at several points to avoid confusion.
Before describing the method, one restriction we set while searching for it should be stated: the method must be new and must not rely on any known injection method. This means functions such as WriteProcessMemory or NtMapViewOfSection can’t be used.
So, what was the idea behind this code injection?
The idea behind this method is that in the Windows x64 calling convention, the first four arguments are passed in the registers RCX, RDX, R8 and R9, and any further parameters are passed on the stack. This is in contrast to x86 WinAPI calls, where all parameters are passed on the stack.
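To make the register/stack split concrete, here is a minimal, purely illustrative Python model of where integer and pointer arguments live at the moment a function is entered under the Windows x64 convention. The function name `argument_locations` and the parameter names are hypothetical; the 32-byte shadow space and the 8-byte return address slot come from the documented convention.

```python
# Illustrative model only: it computes *where* arguments are placed,
# it does not call or inspect any real function.
REGISTER_ARGS = ("RCX", "RDX", "R8", "R9")  # first four integer/pointer arguments
SLOT = 8                                    # every stack slot is 8 bytes on x64
SHADOW_SPACE = 4 * SLOT                     # 32 bytes reserved by the caller


def argument_locations(param_names):
    """Map each parameter name to its location at function entry."""
    locations = {}
    for index, name in enumerate(param_names):
        if index < len(REGISTER_ARGS):
            locations[name] = REGISTER_ARGS[index]
        else:
            # [RSP] holds the return address, followed by the shadow space,
            # followed by the fifth, sixth, ... arguments.
            offset = SLOT + SHADOW_SPACE + (index - len(REGISTER_ARGS)) * SLOT
            locations[name] = f"[RSP+0x{offset:X}]"
    return locations


if __name__ == "__main__":
    # A hypothetical five-parameter function.
    for name, where in argument_locations(("a", "b", "c", "d", "e")).items():
        print(f"{name}: {where}")
```

The point of the sketch is that only the fifth and later arguments depend on the contents of the stack; the first four are fully determined by register values.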
If we can find a function that takes four parameters and reads data from somewhere into a destination in memory, we can use SetThreadContext to change the context of a thread in the remote process, passing the parameters we want so that it reads data from a location we control.
The first candidate was ReadProcessMemory, since we know this function can read data from the injecting process. Let’s look at the definition of ReadProcessMemory: it takes hProcess, lpBaseAddress, lpBuffer, nSize and lpNumberOfBytesRead.
Bummer! This function takes five parameters, not four. But if we read the description of the last parameter, lpNumberOfBytesRead, carefully, we can see that it can be NULL. So all we need is a stack that contains nothing but zeros. This is not hard to achieve, since VirtualAllocEx allocates memory in the remote process and zeroes all of the allocated memory. Nice!
Note that the stack must be big enough for the extra function calls; in our POC we allocated a 2 KB stack. When the term “beginning of the stack” appears later in this blog, the actual address is the allocated stack address + 1 KB. Looking further at the function definition, we can see that ReadProcessMemory takes a handle to the process to read from.
We therefore need a valid handle to the injecting process that exists in the injected process. DuplicateHandle comes to the rescue: it duplicates a handle from one process into another, so we can duplicate a handle to our own process into the remote process.
The other three parameters are easy to set:
lpBaseAddress is the location of the data we want to copy (our shellcode).
lpBuffer is the address of the buffer allocated in the remote process using VirtualAllocEx (or of an RWX code cave, if you have one).
nSize is the size of the data (our shellcode).
The context of a thread executing this function should look like this:
RCX = duplicated handle value in the remote process
RDX = lpBaseAddress
R8 = lpBuffer
R9 = nSize
RSP = Allocated zeroed stack
Execution flow:
- Find the process to inject to
- Duplicate a handle of our process to the remote process using DuplicateHandle
- Allocate RWX memory for our shellcode using VirtualAllocEx
- Allocate RW memory for our dummy stack using VirtualAllocEx
- Put our shellcode in our own memory (create a variable)
- Save the shellcode size in a variable
Let’s discuss another problem we need to solve. If our stack is all zeros, then when ReadProcessMemory returns it will read its return address from the stack; that address will be 0, and we will get an access violation exception. To solve this, we need to write, at the beginning of the stack, the address of a function that exits the thread and never returns. Windows has such a function in ntdll.dll: RtlExitUserThread.
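The stack layout described so far can be modeled with a plain byte buffer. The following is a minimal sketch in Python, not part of the original POC: the 2 KB size and the +1 KB offset are taken from the text, while the helper names and the `PLACEHOLDER_EXIT_ROUTINE` value are made up for illustration. It only shows why a zero-filled stack yields a NULL fifth argument and a zero return address, and how the picture changes once a return address is written into the first slot.

```python
import struct

STACK_SIZE = 2048         # 2 KB region, as in the POC described above
RSP_OFFSET = 1024         # "beginning of the stack" = allocation base + 1 KB
FIFTH_ARG_OFFSET = 0x28   # 8-byte return address + 32-byte shadow space

# Hypothetical value standing in for the address of a thread-exit routine.
PLACEHOLDER_EXIT_ROUTINE = 0x00007FF812345678


def read_qword(memory, offset):
    """Read a little-endian 64-bit value from the buffer."""
    return struct.unpack_from("<Q", memory, offset)[0]


def write_qword(memory, offset, value):
    """Write a little-endian 64-bit value into the buffer."""
    struct.pack_into("<Q", memory, offset, value)


def describe_stack(memory):
    """Return (return_address, fifth_argument) as seen at function entry."""
    return (
        read_qword(memory, RSP_OFFSET),
        read_qword(memory, RSP_OFFSET + FIFTH_ARG_OFFSET),
    )


if __name__ == "__main__":
    stack = bytearray(STACK_SIZE)  # zero-filled, like freshly allocated pages
    print("zeroed stack:", [hex(v) for v in describe_stack(stack)])

    # Only the return-address slot is filled in; the fifth argument stays NULL.
    write_qword(stack, RSP_OFFSET, PLACEHOLDER_EXIT_ROUTINE)
    print("after fix:   ", [hex(v) for v in describe_stack(stack)])
```

Because only the 8 bytes at the beginning of the stack are changed, the slot read as lpNumberOfBytesRead is still zero afterwards.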
As stated earlier, we are not allowed to write to the remote process using one of the known injection methods.
NtQueueApcThread comes to the rescue: this function makes a thread execute code at a given address and passes three parameters to that code. This helps us because, luckily, CopyMemory (or more specifically, RtlCopyMemory) takes three parameters. By creating a remote thread and having it invoke RtlCopyMemory, we can copy data inside the remote process.
So now we can copy data, but where can we find a place that contains the address of RtlExitUserThread?
Kernel32.dll imports RtlExitUserThread from ntdll.dll, so its address is stored in kernel32’s import address table (IAT). Because kernel32.dll and ntdll.dll share the same base address in all processes, we can look up the IAT entry in our own memory and it will be at the same address in the remote process’s memory (we could also scan the remote process, but that is a more tedious and intrusive approach).
Once we have found the entry in our kernel32 IAT, we create an alertable thread in the remote process and call NtQueueApcThread, which invokes RtlCopyMemory in that thread. The source parameter is the address of the RtlExitUserThread entry in the kernel32 IAT, the destination parameter is the beginning of our allocated stack (the return address slot), and the length is the size of one pointer (8 bytes on x64). And voilà! The address of RtlExitUserThread is at the beginning of the allocated stack.
A short comment:
This method of copying data inside a remote process could be used as an injection method in its own right. By searching our own ntdll/kernel32 we can find each byte of our shellcode and make the other process copy it byte by byte. (This means the number of times we call NtQueueApcThread equals the number of bytes in the shellcode.)
Execution flow:
- Find RtlExitUserThread in the kernel32 IAT in our process.
- Create a suspended, alertable thread using CreateRemoteThread (entry point: Sleep, 1 millisecond).
- Find RtlCopyMemory in our ntdll.dll.
- Call NtQueueApcThread to invoke RtlCopyMemory with src = the RtlExitUserThread IAT location and dst = the beginning of our dummy stack.
- ResumeThread
- WaitForSingleObject on thread to be sure it has done its job before we continue.
The next problem is that we don’t want to hijack an existing thread with SetThreadContext, so we need to do it on a new thread. If we change the RIP of a new thread that has not initialized yet (a suspended thread is not yet initialized), we receive exception 0xC000000D STATUS_INVALID_PARAMETER and the process crashes. To wait for the thread to initialize, we need a thread that runs for a long time, or forever, so that we don’t miss our window of opportunity to manipulate it.
If we look at what happens in the remote process after CreateRemoteThread is called, we can see that the new thread starts at RtlUserThreadStart. Let’s look at RtlUserThreadStart.
We can see that it calls a function, LdrpDispatchUserCallTarget, so let’s look at that function as well.
This function runs some checks and then does a jmp to RAX, which holds the entry point we passed to CreateRemoteThread.
As stated earlier, we want a thread that will run forever. We can get one by jumping to a location in memory that contains the opcode jmp RBX. We use jmp RBX because RBX is not used by RtlUserThreadStart in different versions of Windows (tested on Windows 10, 8.1 and 7). We call CreateRemoteThread with an entry point that points to our jmp RBX opcode, then set RBX in the remote thread to point to that same opcode and resume the thread. First the jmp from LdrpDispatchUserCallTarget occurs, and then the thread jumps to the same address over and over, since RBX never changes.
We can place the opcode the same way we copied the RtlExitUserThread address, using NtQueueApcThread.
We look for jmp RBX in ntdll (it is a 2-byte opcode, so if we can’t find the pair we can copy it byte by byte). Since ntdll is at the same location in all processes, we only need to scan our own copy of ntdll for this opcode and then have the remote process copy it.
Execution flow:
- Allocate RWX memory for our jmp RBX opcode using VirtualAllocEx
- Find the jmp RBX opcode 0xffe3 (bytes FF E3) in our version of ntdll
- Copy the jmp RBX opcode from the remote process’s copy of ntdll to the allocated memory using NtQueueApcThread
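The search step is ordinary byte-pattern matching. As a rough illustration, the sketch below runs over an in-memory `bytes` object rather than a loaded module; the function names and the sample bytes are invented for the example, and only the FF E3 encoding of jmp RBX comes from the text. It also shows the fallback mentioned above, where each byte is located separately when the two bytes never occur side by side.

```python
JMP_RBX = bytes([0xFF, 0xE3])  # two-byte encoding of `jmp rbx`


def find_pattern(image, pattern):
    """Return the offset of the first occurrence of pattern, or None."""
    offset = image.find(pattern)
    return None if offset < 0 else offset


def find_bytes_individually(image, pattern):
    """Fallback: locate each byte of pattern on its own.

    Returns one offset per byte, or None if any byte is missing.
    """
    offsets = []
    for value in pattern:
        offset = image.find(bytes([value]))
        if offset < 0:
            return None
        offsets.append(offset)
    return offsets


if __name__ == "__main__":
    # Made-up byte strings standing in for a module image.
    contiguous = bytes.fromhex("4883ec28ffe3c3")
    scattered = bytes.fromhex("e390ff")

    print(find_pattern(contiguous, JMP_RBX))
    print(find_pattern(scattered, JMP_RBX))
    print(find_bytes_individually(scattered, JMP_RBX))
```

Offsets found this way are relative to the start of the buffer; in the scheme described here they would be added to the module base, which the text assumes is identical in both processes.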
We have now reached the last stage, the stage where we execute ReadProcessMemory.
- To do so, we get the address of ReadProcessMemory.
- We call CreateRemoteThread with our copied jmp RBX opcode as the entry point.
- We use SetThreadContext to set RBX to point to our jmp RBX opcode.
- Then we resume the thread.
- This creates a remote thread that spins in an infinite loop. We suspend the thread and use GetThreadContext to check that RIP really points to our jmp RBX opcode (so we know the thread has already initialized).
- Then we use SetThreadContext to set:
RCX = duplicated handle value in the remote process
RDX = lpBaseAddress
R8 = lpBuffer
R9 = nSize
RSP = Allocated zeroed stack
RBX = Address of ReadProcessMemory
And we resume the thread. Because the instruction being executed is jmp RBX, we jump to ReadProcessMemory with the right parameters. It reads the shellcode from our process and then returns to RtlExitUserThread, and the thread terminates. All that is left is to execute the shellcode using one of the many ways to execute code in Windows, for example CreateRemoteThread.
Full Execution Flow
- Find the process to inject to
- Duplicate a handle of our process into the remote process using DuplicateHandle
- Allocate RWX memory for our shellcode using VirtualAllocEx
- Allocate RW memory for our dummy stack using VirtualAllocEx
- Put our shellcode in our own memory (create a variable)
- Save the shellcode size in a variable
- Find RtlExitUserThread in the kernel32 IAT in our process
- Create a suspended, alertable thread using CreateRemoteThread (entry point: Sleep, 1 millisecond)
- Find RtlCopyMemory in our ntdll.dll
- Call NtQueueApcThread to invoke RtlCopyMemory with src = the RtlExitUserThread IAT location and dst = the beginning of our dummy stack
- ResumeThread
- WaitForSingleObject on thread to be sure it has done its job before we continue
- Allocate RWX memory for our jmp RBX opcode using VirtualAllocEx
- Find the jmp RBX opcode 0xffe3 (bytes FF E3) in our version of ntdll
- Copy the jmp RBX opcode from the remote process’s copy of ntdll to the allocated memory using the NtQueueApcThread trick
- Get ReadProcessMemory address
- CreateRemoteThread pointing to our jmp RBX opcode
- Use SetThreadContext to set RBX to point to our jmp RBX opcode
- Resume the thread and let it execute (so it will get to the infinite loop)
- Suspend the thread
- Use GetThreadContext to check that RIP equals our jmp RBX opcode address
- Use SetThreadContext to set the context
RCX = duplicated handle value in the remote process
RDX = lpBaseAddress
R8 = lpBuffer
R9 = nSize
RSP = Allocated zeroed stack
RBX = Address of ReadProcessMemory
- Resume the thread and let ReadProcessMemory execute
- Execute the shellcode