Kagura (神楽) is the sacred dance of the Shinto gods, a choreography so precise that every gesture, every step has meaning. On Windows, the call stack is that same kind of dance. Every function call adds a frame. Every return removes one. And the EDR is watching every move, validating every step against the choreography it expects.
If you want to forge a fake stack that passes inspection, you need to learn the dance first.
This post walks you through the entire journey, from "what is a call stack" to "how do I build a fake one that bypasses CrowdStrike." Along the way, I'll show you why I built Kagura-StackWalker, and how it makes every step of that journey concrete.
Part I: The Call Stack
What happens when you call a function
Every program is a chain of function calls. main() calls DoWork(), which calls ReadFile(), which calls NtReadFile(). At any point during execution, the CPU needs to remember where to go back when the current function finishes.
That's what the call stack is: a region of memory that records the breadcrumb trail of function calls. When function A calls function B, the CPU pushes the return address (the instruction right after the CALL) onto the stack. When B finishes and executes ret, the CPU pops that address and jumps back to A.
Visually, if main calls DoWork which calls ReadFile:
STACK (grows downward in memory)
+---------------------------+
| main's frame | <- main's local variables, saved registers
| return addr → OS loader | <- where main returns to
+---------------------------+
| DoWork's frame |
| return addr → main+0x42 | <- where DoWork returns to (inside main)
+---------------------------+
| ReadFile's frame |
| return addr → DoWork+0x1F| <- where ReadFile returns to (inside DoWork)
+---------------------------+
<- RSP (the Stack Pointer register) points here (top of stack)
This is the call stack (also called the backtrace or stack trace). Every debugger shows it. Every crash dump contains it. And every EDR inspects it.
What's inside a stack frame
Each function doesn't just store a return address. It needs space for its own work. A stack frame is the function's personal workspace on the stack, and it typically contains:
- The return address: where to go back when done (8 bytes on x64)
- Saved registers: the non-volatile registers the function uses (RBX, RBP, RSI, RDI, R12-R15). A non-volatile register must hold the same value after the function returns as it did before the call, so the function saves it on entry and restores it before returning
- Local variables: the function's own data
- Shadow space: 32 bytes reserved by the caller for the callee's use (the Windows x64 calling convention mandates this space so the callee can spill its register arguments to memory for debugging)
On Windows x64, the frame layout is described by the compiler in a structure called UNWIND_INFO, stored in the PE (Portable Executable) file's .pdata section. PE is the binary format used by .exe and .dll files on Windows. More on this later. It's the key to everything.
Why should you care
If your code just calls functions normally, you never think about this. The compiler handles everything.
But if you're doing offensive security (injecting shellcode, performing indirect syscalls, spoofing call stacks), you're building stack frames by hand. And if you get a single offset wrong, the EDR catches you.
A quick note on hexadecimal
Throughout this post (and in Kagura's output), you'll see numbers prefixed with 0x. This means hexadecimal (base 16), not decimal (base 10). If you're not comfortable with hex, here's what you need to know.
Hex uses 16 digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F. After F (15 in decimal) comes 10 (16 in decimal).
To convert hex to decimal, multiply each digit by its place value. For example: 0x60 = 6 x 16 + 0 = 96. Another: 0x78 = 7 x 16 + 8 = 112 + 8 = 120. And 0xB8 = 11 x 16 + 8 = 176 + 8 = 184 (B = 11 in hex).
Common values you'll see in stack frames:
| Hex | Decimal | What it usually means |
|---|---|---|
| 0x8 | 8 | Size of one register or return address (8 bytes on x64) |
| 0x10 | 16 | Two registers, or 16-byte alignment boundary |
| 0x20 | 32 | Shadow space (mandatory 32 bytes in Windows x64 calling convention) |
| 0x28 | 40 | Shadow space (0x20) + 1 pushed register (0x8) |
| 0x30 | 48 | Shadow space + 2 pushed registers, or 48-byte frame |
| 0x38 | 56 | Common return address offset in a 0x40 frame |
| 0x40 | 64 | Common small frame size |
| 0x60 | 96 | Common local allocation size |
| 0x78 | 120 | Return address offset in a 0x80 frame |
| 0x80 | 128 | Common medium frame size |
How to do the math:
When Kagura tells you a function has 3 pushes and a 0x60 allocation:
3 pushes: 3 x 0x8 = 0x18 (decimal: 3 x 8 = 24)
allocation: 0x60 (decimal: 96)
----
total: 0x78 (decimal: 24 + 96 = 120)
+ return addr: 0x08 (decimal: 8)
----
frame size: 0x80 (decimal: 120 + 8 = 128)
The key addition that confuses people: 0x78 + 0x08 = 0x80, not 0x86. In hex, 8 + 8 = 0x10 (sixteen). Write 0, carry 1. 7 + 0 + 1 = 8. Result: 0x80.
A few more examples to build intuition:
0x20 + 0x08 = 0x28 (decimal: 32 + 8 = 40)
0x28 + 0x08 = 0x30 (decimal: 40 + 8 = 48)
0x30 + 0x30 = 0x60 (decimal: 48 + 48 = 96)
0x48 + 0x08 = 0x50 (decimal: 72 + 8 = 80)
0x60 + 0x18 = 0x78 (decimal: 96 + 24 = 120)
When in doubt, open a calculator in programmer mode (Windows: calc.exe, switch to "Programmer") and toggle between HEX and DEC.
One last thing: sizes on x64 Windows are always multiples of 8 (because every stack slot is 8 bytes), and RSP must be 16-byte aligned before a CALL instruction. This means you'll mostly see values like 0x28, 0x30, 0x40, 0x50, 0x60, 0x80. Never 0x37 or 0x43. If your calculation gives a number that isn't a multiple of 8, something is wrong.
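The arithmetic above can be sketched in C. This is a minimal sketch that assumes a simple prologue made only of pushes and a single allocation; the function names are hypothetical and are not part of Kagura.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Distance from RSP (after the prologue) up to the return address:
   8 bytes per pushed register plus the "sub rsp, N" allocation. */
static uint32_t frame_return_offset(uint32_t pushes, uint32_t alloc_bytes) {
    assert(alloc_bytes % 8 == 0);  /* stack allocations are 8-byte multiples */
    return pushes * 8u + alloc_bytes;
}

/* Total frame size: everything below the return address plus the
   8-byte return address itself. */
static uint32_t frame_total_size(uint32_t pushes, uint32_t alloc_bytes) {
    return frame_return_offset(pushes, alloc_bytes) + 8u;
}

/* A function that calls other functions keeps RSP 16-byte aligned at
   its own CALL sites, so its total size (return address included)
   is a multiple of 16. */
static bool frame_keeps_call_alignment(uint32_t total_size) {
    return total_size % 16 == 0;
}
```

With 3 pushes and a 0x60 allocation, `frame_return_offset(3, 0x60)` gives 0x78 and `frame_total_size(3, 0x60)` gives 0x80, matching the worked example.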
Part II: .pdata and UNWIND_INFO
The problem: how does Windows unwind the stack?
Imagine your program crashes deep inside a function call chain. Windows needs to walk back up the stack, frame by frame, to find an exception handler, clean up resources, and generate a crash dump. But here's the problem: the stack is just raw bytes in memory. There's no label saying "this frame is 0x40 bytes" or "RBP was saved at offset +0x20."
On x86 (32-bit), this was solved with frame pointer chains: every function saved EBP at the start, and EBP pointed to the previous frame's saved EBP. You could follow the chain like a linked list. Simple, but it wasted a register (EBP was permanently reserved) and made code slower.
On x64, Microsoft took a different approach: metadata-driven unwinding. Instead of using RBP chains at runtime, the compiler writes a blueprint of every function's stack frame directly into the PE file at compile time. This blueprint lives in the .pdata section.
What is .pdata?
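The .pdata section (short for "procedure data") is an array of RUNTIME_FUNCTION entries. Every non-leaf function in the PE has one. In Microsoft's terms, a leaf function neither calls other functions nor allocates stack space or saves non-volatile registers, so it needs no unwind data. Each entry is just 12 bytes: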
typedef struct _RUNTIME_FUNCTION {
DWORD BeginAddress; // RVA (Relative Virtual Address: offset from the module's base address) of function start
DWORD EndAddress; // RVA where the function ends
DWORD UnwindData; // RVA pointing to the UNWIND_INFO
} RUNTIME_FUNCTION;
That's it. Three fields that say: "there's a function from address A to address B, and here's how to unwind its stack frame." The array is sorted by BeginAddress, so Windows can do a binary search to find any function in O(log n).
A typical system DLL has thousands of entries. ntdll.dll has ~4500. kernelbase.dll has ~3900. kernel32.dll has ~1200. Kagura parses all of them.
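The binary search mentioned above can be sketched as follows. This is an illustrative sketch, not Kagura's code: `RuntimeFunction` is a portable stand-in for the Windows RUNTIME_FUNCTION struct, `lookup_function` is a hypothetical name, and EndAddress is assumed to be exclusive (one past the last byte of the function).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable mirror of the 12-byte x64 RUNTIME_FUNCTION entry.
   On Windows the real definition comes from <windows.h>. */
typedef struct {
    uint32_t BeginAddress;  /* RVA of the function start */
    uint32_t EndAddress;    /* RVA of the function end (assumed exclusive) */
    uint32_t UnwindData;    /* RVA of the UNWIND_INFO */
} RuntimeFunction;

_Static_assert(sizeof(RuntimeFunction) == 12, "entry must be 12 bytes");

/* Binary search a table sorted by BeginAddress for the entry whose
   [BeginAddress, EndAddress) range contains rva. Returns NULL when no
   entry covers it (a leaf function, or a gap between functions). */
static const RuntimeFunction *lookup_function(const RuntimeFunction *table,
                                              size_t count, uint32_t rva) {
    assert(table != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].BeginAddress) {
            hi = mid;                 /* target is in the lower half */
        } else if (rva >= table[mid].EndAddress) {
            lo = mid + 1;             /* target is in the upper half */
        } else {
            return &table[mid];       /* rva falls inside this function */
        }
    }
    return NULL;
}
```

On a live system, RtlLookupFunctionEntry performs this lookup for you; the sketch only shows why a sorted table makes it O(log n).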
What is UNWIND_INFO?
Each RUNTIME_FUNCTION points to an UNWIND_INFO structure, which is the actual blueprint of the stack frame:
typedef struct _UNWIND_INFO {
BYTE Version : 3; // always 1
BYTE Flags : 5; // 0=nothing, 1=exception handler, 2=termination handler, 4=chained
BYTE SizeOfProlog; // how many bytes is the prologue
BYTE CountOfCodes; // how many UNWIND_CODE entries follow
BYTE FrameRegister : 4; // 0=no frame pointer, 5=RBP
BYTE FrameOffset : 4; // offset for frame pointer
UNWIND_CODE UnwindCode[]; // the actual operations
} UNWIND_INFO;
The UnwindCode array is the heart of it. Each entry describes one thing the function's prologue (the setup code at the very beginning of the function) did to set up its stack frame. Think of it as a recipe:
| OpCode | Name | What the prologue did |
|---|---|---|
| 0 | UWOP_PUSH_NONVOL | push reg: saved a register onto the stack (RSP -= 8) |
| 1 | UWOP_ALLOC_LARGE | sub rsp, N: allocated N bytes for local variables |
| 2 | UWOP_ALLOC_SMALL | sub rsp, small: allocated a small amount (up to 128 bytes) |
| 3 | UWOP_SET_FPREG | lea rbp, [rsp+X]: set up a frame pointer register |
| 4 | UWOP_SAVE_NONVOL | mov [rsp+N], reg: saved a register at a specific offset |
| 5 | UWOP_SAVE_NONVOL_FAR | Same, but with a larger offset |
| 8 | UWOP_SAVE_XMM128 | Saved a 128-bit XMM register (used in SIMD code) |
| 9 | UWOP_SAVE_XMM128_FAR | Same, larger offset |
| 10 | UWOP_PUSH_MACHFRAME | Hardware interrupt frame (rare, kernel-level) |
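To make the UNWIND_INFO header layout concrete, here is a minimal sketch that decodes its first four bytes. It assumes the bitfield packing used by MSVC (first-declared field in the low bits); `UnwindHeader` and `decode_unwind_header` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t version;         /* low 3 bits of byte 0 */
    uint8_t flags;           /* high 5 bits of byte 0 */
    uint8_t size_of_prolog;  /* byte 1 */
    uint8_t count_of_codes;  /* byte 2: number of 2-byte UNWIND_CODE slots */
    uint8_t frame_register;  /* low 4 bits of byte 3 (0 = none, 5 = RBP) */
    uint8_t frame_offset;    /* high 4 bits of byte 3, in units of 16 bytes */
} UnwindHeader;

/* Split the four header bytes into their fields. */
static UnwindHeader decode_unwind_header(const uint8_t *raw) {
    assert(raw != NULL);
    UnwindHeader h;
    h.version        = raw[0] & 0x07;
    h.flags          = raw[0] >> 3;
    h.size_of_prolog = raw[1];
    h.count_of_codes = raw[2];
    h.frame_register = raw[3] & 0x0F;
    h.frame_offset   = raw[3] >> 4;
    return h;
}
```

For example, the bytes `01 1F 0C 00` decode to version 1, flags 0, a 31-byte prologue, 12 unwind code slots, and no frame register.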
A concrete example
Let's look at kernelbase!CreateFileA. Its prologue in assembly is:
CreateFileA:
push r15 ; save R15 (RSP -= 8)
push r14 ; save R14 (RSP -= 8)
push rbp ; save RBP (RSP -= 8)
sub rsp, 0x60 ; allocate 96 bytes for locals + shadow
mov [rsp+0x80], rbx ; save RBX at explicit offset
mov [rsp+0x88], rsi ; save RSI at explicit offset
mov [rsp+0x90], rdi ; save RDI at explicit offset
mov [rsp+0x98], r12 ; save R12 at explicit offset
; ... function body ...
The compiler records this prologue in the UNWIND_INFO as:
CountOfCodes: 12
UnwindCodes:
UWOP_SAVE_NONVOL R12, offset=0x98 (4 → mov [rsp+0x98], r12)
UWOP_SAVE_NONVOL RDI, offset=0x90 (4 → mov [rsp+0x90], rdi)
UWOP_SAVE_NONVOL RSI, offset=0x88 (4 → mov [rsp+0x88], rsi)
UWOP_SAVE_NONVOL RBX, offset=0x80 (4 → mov [rsp+0x80], rbx)
UWOP_ALLOC_SMALL 0x60 (2 → sub rsp, 0x60)
UWOP_PUSH_NONVOL RBP (0 → push rbp)
UWOP_PUSH_NONVOL R14 (0 → push r14)
UWOP_PUSH_NONVOL R15 (0 → push r15)
Note: CountOfCodes is 12, not 8, because some operations occupy multiple slots. Each UWOP_SAVE_NONVOL takes 2 slots (the opcode + an extra slot for the offset value). So 4 SAVE_NONVOL x 2 = 8 slots, plus 1 ALLOC_SMALL = 1 slot, plus 3 PUSH_NONVOL = 3 slots. Total: 12.
From this, the unwinder can calculate the complete frame layout:
3 pushes = 3 × 8 = 0x18 bytes
1 allocation = 0x60 bytes
—————
Total = 0x78 bytes → return address at RSP+0x78
total frame = 0x80 bytes
And it knows exactly where every saved register is: RBX at +0x80, RSI at +0x88, RDI at +0x90, R12 at +0x98, RBP at +0x60, R14 at +0x68, R15 at +0x70. No guessing. The blueprint is complete.
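The replay above can be sketched in C. This is a simplified illustration, not Kagura's or Windows' implementation: it treats each UNWIND_CODE slot as a little-endian 16-bit value (low byte = CodeOffset, next 4 bits = opcode, top 4 bits = OpInfo), rejects frame-pointer and machine-frame codes, and uses the hypothetical name `return_address_offset`. The CodeOffset bytes in `example_codes` are made up; only the opcodes and operands follow the CreateFileA listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    UWOP_PUSH_NONVOL = 0, UWOP_ALLOC_LARGE = 1, UWOP_ALLOC_SMALL = 2,
    UWOP_SET_FPREG = 3, UWOP_SAVE_NONVOL = 4, UWOP_SAVE_NONVOL_FAR = 5,
    UWOP_SAVE_XMM128 = 8, UWOP_SAVE_XMM128_FAR = 9
};

/* Hand-encoded slots for the CreateFileA example (12 slots total). */
static const uint16_t example_codes[12] = {
    0xC41F, 0x0013,  /* SAVE_NONVOL R12, 0x98 / 8 */
    0x741B, 0x0012,  /* SAVE_NONVOL RDI, 0x90 / 8 */
    0x6417, 0x0011,  /* SAVE_NONVOL RSI, 0x88 / 8 */
    0x3413, 0x0010,  /* SAVE_NONVOL RBX, 0x80 / 8 */
    0xB20B,          /* ALLOC_SMALL, OpInfo 11 -> 11 * 8 + 8 = 0x60 */
    0x5007,          /* PUSH_NONVOL RBP */
    0xE006,          /* PUSH_NONVOL R14 */
    0xF004           /* PUSH_NONVOL R15 */
};

/* Sum the stack adjustments described by the unwind codes.
   Returns 0 and writes the return address offset on success,
   -1 on truncated input or an opcode this sketch does not handle. */
static int return_address_offset(const uint16_t *codes, size_t count,
                                 uint32_t *offset_out) {
    assert(codes != NULL && offset_out != NULL);
    uint32_t offset = 0;
    size_t i = 0;
    while (i < count) {
        unsigned op = (codes[i] >> 8) & 0xF;
        unsigned info = (codes[i] >> 12) & 0xF;
        size_t slots;
        switch (op) {
        case UWOP_PUSH_NONVOL:     slots = 1; break;
        case UWOP_ALLOC_SMALL:     slots = 1; break;
        case UWOP_ALLOC_LARGE:     slots = (info == 0) ? 2 : 3; break;
        case UWOP_SAVE_NONVOL:
        case UWOP_SAVE_XMM128:     slots = 2; break;
        case UWOP_SAVE_NONVOL_FAR:
        case UWOP_SAVE_XMM128_FAR: slots = 3; break;
        default:                   return -1;  /* SET_FPREG and others */
        }
        if (i + slots > count) return -1;      /* truncated array */
        if (op == UWOP_PUSH_NONVOL) {
            offset += 8;
        } else if (op == UWOP_ALLOC_SMALL) {
            offset += info * 8 + 8;
        } else if (op == UWOP_ALLOC_LARGE) {
            if (info == 0) {
                offset += (uint32_t)codes[i + 1] * 8;   /* scaled size */
            } else if (info == 1) {
                offset += (uint32_t)codes[i + 1] |
                          ((uint32_t)codes[i + 2] << 16); /* raw 32-bit size */
            } else {
                return -1;
            }
        }
        i += slots;  /* register saves do not move RSP */
    }
    *offset_out = offset;
    return 0;
}
```

Running it over `example_codes` with a count of 12 yields 0x78; adding 8 for the return address gives the 0x80 total frame size.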
Why this matters for you
Here's the key insight: the EDR uses the exact same mechanism.
When CrowdStrike Falcon stack-walks your thread during a syscall, it doesn't just check return addresses. It reads the .pdata of every module, looks up the RUNTIME_FUNCTION for each frame, parses the UNWIND_INFO, and replays the unwind codes to verify that the stack layout matches what the compiler described.
If you forge a fake frame for CreateFileA but your frame is 0x70 bytes instead of 0x80, the unwinder's math doesn't add up. The reconstructed RSP points to the wrong place. Detection.
The .pdata section is the source of truth, and the unwind codes define the expected frame layout. Kagura reads all of them for you.
Kagura decodes the UNWIND_INFO of CreateFileA: 3 PUSHes, 4 MOV saves, 1 allocation. Every offset calculated automatically.
Part III: Win32 API vs Indirect Syscalls
The normal path: Win32 API calls
When a legitimate application wants to allocate memory, it calls VirtualAlloc(). Here's what actually happens under the hood:
Your code
│
├→ VirtualAlloc() [kernel32.dll / kernelbase.dll]
│ Win32 wrapper: validates parameters, sets up arguments
│ │
│ └→ NtAllocateVirtualMemory() [ntdll.dll]
│ Syscall stub: mov r10,rcx ; mov eax,0x18 ; syscall ; ret
│ │
│ └→ KERNEL [ring 0 / kernel mode]
│ Does the actual memory allocation
│
└— Return value flows back up the chain
Three layers. Your code never talks to the kernel directly. Each layer adds a frame to the call stack. When the EDR inspects the stack during the syscall, it sees:
ntdll!NtAllocateVirtualMemory+0x14 ← syscall happened here
kernelbase!VirtualAlloc+0x62 ← the Win32 wrapper called ntdll
your_app.exe!main+0x42 ← your code called VirtualAlloc
kernel32!BaseThreadInitThunk+0x14 ← thread entry point
ntdll!RtlUserThreadStart+0x21 ← thread root (return addr = 0)
This looks perfectly normal. Every address points inside a legitimate, Microsoft-signed DLL. Every frame has valid unwind data. The EDR is happy.
The problem: EDR hooks
EDR products (CrowdStrike, Defender, SentinelOne) need to see what your code is doing. The most common approach: they overwrite the first bytes of sensitive Nt* functions in ntdll with a JMP to their own monitoring code.
When your process calls NtAllocateVirtualMemory, it hits the EDR's detour first. The EDR logs the call, checks the parameters, inspects the call stack, and then (if allowed) executes the real syscall.
This means: if you're doing something the EDR doesn't like, it catches you before the syscall even reaches the kernel.
The offensive response: indirect syscalls
The idea: bypass the EDR's hook entirely. Instead of calling NtAllocateVirtualMemory through its hooked prologue, you:
- Resolve the SSN yourself: the System Service Number (the syscall index) can be extracted by sorting Zw* exports by address, completely avoiding the hooked bytes
- Jump directly to the syscall; ret instruction inside ntdll, at offset +0x12 of the target function's stub
Your assembly stub looks like this:
mov r10, rcx ; first arg (kernel reads from R10, not RCX)
mov eax, 0x18 ; SSN for NtAllocateVirtualMemory
jmp [gadget_addr] ; jump to ntdll!NtAllocateVirtualMemory+0x12
; where the "syscall; ret" sequence lives
The hook at the function's start is completely bypassed. The syscall instruction still executes from within ntdll's memory (so it looks legitimate to the kernel). The SSN is correct. The call succeeds.
But there's a catch.
The catch: your call stack is wrong
When the EDR captures the call stack during the syscall, it sees:
ntdll!NtAllocateVirtualMemory+0x14 ← ok, this is ntdll
your_shellcode.bin+0x?? ← ALERT: unknown memory, not backed by any DLL
Your code is in MEM_PRIVATE memory (heap-allocated, not backed by any file on disk). It doesn't have a .pdata section. It doesn't have RUNTIME_FUNCTION entries. The EDR sees an address that doesn't belong to any known module and flags it immediately.
This is where fake stack frames come in. But before we get there, there's another problem.
The limit of indirect syscalls: not everything is a syscall
Indirect syscalls work great for kernel operations: memory allocation (NtAllocateVirtualMemory), memory protection (NtProtectVirtualMemory), thread creation (NtCreateThreadEx). These operations have direct Nt* equivalents because they are pure kernel calls.
But a real implant does more than allocate memory and create threads. It reads files, loads DLLs, resolves functions, opens network connections, queries the registry. These operations use Win32 APIs that have no syscall equivalent:
APIs with NO syscall equivalent (must use Win32):
| Win32 API | Why |
|---|---|
| LoadLibraryA, GetProcAddress | User-mode loader operations. There is no NtLoadLibrary. DLL loading (PE parsing, import resolution, DllMain) happens entirely in user-mode. |
| WinHttpOpen, socket, connect, send | Implemented in user-mode DLLs (winhttp.dll, ws2_32.dll). Multiple internal syscalls, no single Nt* replacement. |
| CreateToolhelp32Snapshot | Snapshot enumeration API with no single Nt* equivalent. Internally uses NtQuerySystemInformation but with complex user-mode logic. |
| GetComputerNameW, GetUserNameW | Read from user-mode structures (registry, token). No direct kernel call. |
APIs where Nt* exists but with different parameter semantics:
| Win32 API | Nt* equivalent | Why you might still use Win32 |
|---|---|---|
| CreateFileW | NtCreateFile | Requires NT path format (\??\C:\...), OBJECT_ATTRIBUTES, IO_STATUS_BLOCK. Doable but verbose. |
| ReadFile, WriteFile | NtReadFile, NtWriteFile | Same structure difference. Works fine for indirect syscall if you handle the setup. |
| RegOpenKeyExW | NtOpenKey | Requires OBJECT_ATTRIBUTES with full NT registry path. |
| OpenProcessToken | NtOpenProcessToken | Direct equivalent, easy to use via syscall. |
You cannot replace LoadLibraryA with a syscall. There is no NtLoadLibrary. The DLL loading logic (parsing PE headers, resolving imports, calling DllMain, walking the loader lock) happens entirely in user-mode inside ntdll's loader. You must call the actual Win32 function.
Same for network. WinHttpSendRequest is implemented in winhttp.dll. It eventually makes syscalls internally, but you can't replicate its behavior with a single syscall instruction.
This creates two distinct problems that require two distinct solutions:
Problem 1: Nt* syscalls (NtAllocateVirtualMemory, NtCreateThreadEx, etc.)
Your indirect syscall stub bypasses the hook, but the call stack shows your unbacked code. Solution: build fake stack frames before the syscall to hide your stub behind legitimate module addresses.
Problem 2: Win32 APIs with no practical syscall equivalent (LoadLibraryA, WinHttpOpen, or CreateFileW when the Nt* route is too verbose)
You must call the real function in the real DLL. The function executes legitimately, but the call stack still shows your unbacked code as the caller. Solution: callstack spoofing, where you build fake frames and then CALL the real Win32 API so that the stack looks like the call came from a legitimate module.
Both techniques forge fake frames on the stack. Both need exact frame layouts from .pdata. But they work differently:
| | Fake stack frames (indirect syscall) | Callstack spoofing (Win32 API) |
|---|---|---|
| Target | Nt* functions with direct syscall | Win32 APIs with no syscall equivalent |
| Execution | JMP to syscall;ret gadget in ntdll | CALL to the real Win32 function |
| Hook bypass | Yes, the hook is completely skipped | No, the function runs normally (hook included) |
| Why spoof | Hide your stub from the stack walk | Hide your code as the caller |
| Return addr | Manually placed on stack before JMP | CPU pushes it automatically on CALL |
| Examples | NtAllocateVirtualMemory, NtProtectVirtualMemory | CreateFileW, LoadLibraryA, WinHttpOpen |
Both techniques need the same thing from Kagura: the exact frame layout of each function in the chain.
Part IV: The EDR's Six Checks
What the EDR validates
This isn't as simple as pushing a few addresses onto the stack. Modern EDRs perform deep validation on every frame. Here's what CrowdStrike Falcon actually checks:
Check 1: Backing verification. Every return address must point inside a MEM_IMAGE region (a module loaded from a file on disk). If it points to MEM_PRIVATE (heap, VirtualAlloc'd memory), it's unbacked code. Flag.
Check 2: RUNTIME_FUNCTION lookup. Every return address must fall within a function that has a valid .pdata entry. The EDR calls RtlLookupFunctionEntry() for each address. No entry = suspicious frame.
Check 3: Call site verification. The instruction before each return address must be a CALL instruction (opcodes E8 for relative call, FF /2 for indirect call (call rax, call [rax], call [rip+disp32])). If the bytes before the return address aren't a CALL, the frame wasn't created by a real function call. Flag.
Check 4: Unwind replay. The EDR replays the function's UNWIND_CODE entries to calculate the expected frame size. If your fake frame is 0x40 bytes but the unwind codes say the function creates a 0x60 frame, the sizes don't match. Flag.
Check 5: RSP alignment. RSP must be 16-byte aligned at each frame boundary. The stack must fall within the thread's [StackLimit, StackBase] range from the TEB (Thread Environment Block, a per-thread structure that stores stack bounds and other thread metadata).
Check 6: Chain termination. The stack walk must end at ntdll!RtlUserThreadStart with a return address of 0x0000000000000000. If it doesn't terminate cleanly, the chain is incomplete. Flag.
Six checks. Every frame. Every syscall. Miss one, and the entire spoof fails.
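From the validator's side, Check 5 is the simplest of the six to express in code. The sketch below is an assumption about how such a check might look, not CrowdStrike's implementation; `rsp_is_plausible` is a hypothetical name, and the bounds follow the closed [StackLimit, StackBase] range given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Check 5 (simplified): an unwound RSP must lie inside the thread's
   stack range from the TEB and be 16-byte aligned. */
static bool rsp_is_plausible(uint64_t rsp, uint64_t stack_limit,
                             uint64_t stack_base) {
    assert(stack_limit < stack_base);  /* limit is the low end of the stack */
    bool in_bounds = rsp >= stack_limit && rsp <= stack_base;
    bool aligned = (rsp & 0xF) == 0;
    return in_bounds && aligned;
}
```

A stack walker would apply this to the RSP value it reconstructs after unwinding each frame, and treat any failure as a reason to flag the thread.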
Why the frame layout matters
Check 4 is the killer. The EDR doesn't just verify that a return address exists. It calculates where that address should be based on the function's unwind metadata.
Here's how: the .pdata section of every PE contains an array of RUNTIME_FUNCTION entries. Each entry points to an UNWIND_INFO structure that describes the function's prologue, specifically how it sets up its stack frame.
RUNTIME_FUNCTION for CreateFileA:
BeginAddress: 0x00045AB0
EndAddress: 0x00045B30
UnwindData: → UNWIND_INFO {
Version: 1
Flags: 0 (no handler)
SizeOfProlog: 31 bytes
CountOfCodes: 12
FrameRegister: 0 (none)
UnwindCodes: [
UWOP_PUSH_NONVOL(RBP) → RSP -= 8
UWOP_PUSH_NONVOL(R14) → RSP -= 8
UWOP_PUSH_NONVOL(R15) → RSP -= 8
UWOP_ALLOC_SMALL(0x60) → RSP -= 0x60
UWOP_SAVE_NONVOL(RBX, 0x80)→ [RSP+0x80] = RBX
UWOP_SAVE_NONVOL(RSI, 0x88)→ [RSP+0x88] = RSI
UWOP_SAVE_NONVOL(RDI, 0x90)→ [RSP+0x90] = RDI
UWOP_SAVE_NONVOL(R12, 0x98)→ [RSP+0x98] = R12
]
}
The unwinder replays these codes:
3 pushes × 8 bytes = 0x18
+ 0x60 allocation = 0x60
——————
return_addr_offset = 0x78
total_frame_size = 0x80 (0x78 + 8 for the return address itself)
If your fake frame for CreateFileA is not exactly 0x80 bytes, with the return address at exactly offset +0x78, the unwind replay produces different RSP values. The cross-frame RSP check fails. Detection.
Knowing the exact frame layout is not optional; it's what separates a working spoof from a dead process.
Part V: Fake Stack Frames
The anatomy of a forged stack
Let's build a complete fake stack for an indirect NtAllocateVirtualMemory syscall. The EDR expects to see this call chain:
ntdll!NtAllocateVirtualMemory ← the gadget (syscall;ret)
kernelbase!VirtualAlloc ← pretend this called the syscall
kernel32!BaseThreadInitThunk ← pretend this started the thread
ntdll!RtlUserThreadStart ← chain terminator
Each of these is a real function with real unwind data. We need to forge a frame for each one that matches what RtlVirtualUnwind would calculate.
This is where I open Kagura.
Kagura at startup: 7 modules loaded, 20,000+ functions indexed and ready to search.
Step 1: NtAllocateVirtualMemory (the syscall stub)
Search: NtAllocateVirtualMemory in ntdll.
Size: 0x8 (8B) | Prolog: 0B | Unwind codes: 0
NtAllocateVirtualMemory: a syscall stub with no frame. Kagura shows the spoofed chain and exact call-site offsets.
A syscall stub. No frame to forge here. It's just 8 bytes (the return address pushed by CALL). Kagura flags it:
NOTE: This is a syscall stub (no prologue, no frame setup).
For stack spoofing, you don't forge THIS frame. You forge the CALLER's frame.
The return address in this frame must point inside the Win32 wrapper. Kagura tells me the call sites:
kernelbase.dll!VirtualAlloc+0x44 (return addr after CALL #1)
So our first frame is 0x8 bytes, with the return address pointing to kernelbase_base + 0xB7C80 + 0x44.
Step 2: VirtualAlloc (the Win32 wrapper)
VirtualAlloc's frame: 0x40 bytes, shadow space only. Kagura found 3 CALL sites via Zydis.
Search: VirtualAlloc in kernelbase.
Size: 0x40 (64B) | Prolog: 4B | Unwind codes: 1
RSP+0x0038 | Return address (@0x38) ← caller RIP
RSP+0x0000 | Local/Shadow (0x38) ← can be zero-filled
Simple frame. 0x40 bytes total. No saved registers, just shadow space. The return address goes at offset +0x38 and must point inside BaseThreadInitThunk.
Step 3: BaseThreadInitThunk (thread entry)
BaseThreadInitThunk: the thread entry frame. One saved register (RBX), 0x30 bytes total.
Search: BaseThreadInitThunk in kernel32.
Size: 0x30 (48B) | Prolog: 6B | Unwind codes: 2
RSP+0x0028 | Return address (@0x28) ← caller RIP
RSP+0x0020 | Saved RBX (PUSH) ← needs a plausible value
RSP+0x0000 | Local/Shadow (0x20) ← zero-fill
One saved register (RBX at +0x20, pushed). The return address at +0x28 points into RtlUserThreadStart. RBX needs a plausible non-zero value (the EDR may validate it during RBP-chain walking).
Step 4: RtlUserThreadStart (chain terminator)
RtlUserThreadStart: the chain terminator. Return address = 0x0000000000000000.
Search: RtlUserThreadStart in ntdll.
Size: 0x50 (80B) | Prolog: 4B | Unwind codes: 1
RSP+0x0048 | Return address (@0x48) ← 0x0000000000000000 (END)
RSP+0x0000 | Local/Shadow (0x48) ← zero-fill
Terminal frame. The return address is zero, which signals the end of the call chain. The unwinder stops here.
The complete forged stack
Now stack all four frames contiguously in memory:
MEMORY LAYOUT (low address → high address)
Offset 0x00 ┌— Frame 0: NtAllocateVirtualMemory (0x08 bytes) —┐
│ +0x00: Return addr → kernelbase!VirtualAlloc+0x44 │
└————————————————————————┘
Offset 0x08 ┌— Frame 1: VirtualAlloc (0x40 bytes) —————┐
│ +0x00: Shadow/locals (0x38 bytes, zero-filled) │
│ +0x38: Return addr → kernel32!BaseThreadInit+0x14 │
└————————————————————————┘
Offset 0x48 ┌— Frame 2: BaseThreadInitThunk (0x30 bytes) ——┐
│ +0x00: Shadow/locals (0x20 bytes, zero-filled) │
│ +0x20: Saved RBX = 0x0000000100000000 (plausible) │
│ +0x28: Return addr → ntdll!RtlUserThreadStart+0x21 │
└————————————————————————┘
Offset 0x78 ┌— Frame 3: RtlUserThreadStart (0x50 bytes) ———┐
│ +0x00: Shadow/locals (0x48 bytes, zero-filled) │
│ +0x48: Return addr → 0x0000000000000000 (END) │
└————————————————————————┘
Total: 0xC8 bytes (200 bytes)
In C:
BYTE fake_stack[0xC8] = {0}; // zero-fill everything
// Frame 0: syscall stub return → VirtualAlloc
*(PVOID*)(fake_stack + 0x00) = kernelbase_base + 0xB7C80 + 0x44;
// Frame 1: VirtualAlloc return → BaseThreadInitThunk
*(PVOID*)(fake_stack + 0x08 + 0x38) = kernel32_base + 0x2E8C0 + 0x14;
// Frame 2: BaseThreadInitThunk
*(ULONG64*)(fake_stack + 0x48 + 0x20) = 0x0000000100000000; // fake RBX
*(PVOID*)(fake_stack + 0x48 + 0x28) = ntdll_base + 0x8C460 + 0x21;
// Frame 3: RtlUserThreadStart (terminal)
*(PVOID*)(fake_stack + 0x78 + 0x48) = (PVOID)0; // return addr = 0 = END
Every number in this code came directly from Kagura: offsets, frame sizes, return addresses. The +0x44 for VirtualAlloc's call site was found by Zydis scanning the function for CALL instructions. The +0x14 and +0x21 are the well-known offsets for BaseThreadInitThunk and RtlUserThreadStart (stable across Windows 10/11 builds).
What the EDR sees when it walks this stack
The unwinder starts at frame 0:
- RIP = ntdll!NtAllocateVirtualMemory+0x14 → MEM_IMAGE, has RUNTIME_FUNCTION → PASS
- Reads return address at RSP+0x00 → kernelbase!VirtualAlloc+0x44 → MEM_IMAGE → PASS
- Checks bytes before +0x44: finds E8 xx xx xx xx (CALL instruction) → PASS
- Looks up VirtualAlloc's UNWIND_INFO → frame size 0x40 → advances RSP by 0x40
- Reads return address at new RSP+0x38 → kernel32!BaseThreadInitThunk+0x14 → PASS
- BaseThreadInitThunk: UNWIND_INFO says frame = 0x30, saved RBX at +0x20. Reads RBX value → non-zero, plausible → PASS
- Advances RSP by 0x30 → ntdll!RtlUserThreadStart+0x21 → PASS
- RtlUserThreadStart: frame = 0x50, return address at +0x48 → 0x0000000000000000 → chain terminated. All frames validated.
Every check passes. The stack walk completes cleanly. Your real code (the indirect syscall stub in unbacked memory) is invisible. It was never part of the forged chain.
Scenario 2: Callstack spoofing for a Win32 API (CreateFileW)
Now let's look at the other case. Your implant needs to read a file from disk. NtCreateFile exists but requires NT path format (\??\C:\file.txt), OBJECT_ATTRIBUTES structs, and IO_STATUS_BLOCK setup. For many implant operations (config files, staged payloads, exfiltration), calling the real CreateFileW through kernelbase is simpler and more reliable.
The problem: when CreateFileW executes, the EDR captures the call stack and sees your unbacked code as the caller:
kernelbase!CreateFileW+0x?? ok, legitimate module
your_implant.exe!read_config+0x?? ALERT: unbacked caller
The solution: build fake frames before calling CreateFileW so the EDR sees a clean caller chain. But the mechanism is different from the indirect syscall case.
The key difference: you CALL the function, you don't JMP to a gadget.
With indirect syscalls, your stub does JMP to a syscall;ret gadget. The function never truly "runs". With callstack spoofing for Win32 APIs, you do a real CALL to CreateFileW. The function runs entirely, with all its internal logic. You're not bypassing anything. You're just hiding who the caller is.
Your stub looks like:
; Build 3 fake RBP-chain frames (same technique as indirect syscall)
push 0 ; Frame 3: RBP = 0 (chain end)
push QWORD PTR [gadget_ntdll] ; return addr in ntdll
lea rax, [rsp + 8]
push rax ; Frame 2: RBP -> Frame 3
push QWORD PTR [gadget_kernel32] ; return addr in kernel32
lea rax, [rsp + 8]
push rax ; Frame 1: RBP -> Frame 2
push QWORD PTR [gadget_kernelbase] ; return addr in kernelbase
lea rbp, [rsp + 8] ; link RBP chain
sub rsp, 0x38 ; shadow space + alignment
; Arguments for CreateFileW (7 args: RCX, RDX, R8, R9, + 3 on stack)
; ... set up RCX, RDX, R8, R9 ...
; ... place args 5,6,7 on stack ...
call QWORD PTR [pCreateFileW] ; CALL the real function
; CPU pushes return addr automatically
; CreateFileW runs normally
; returns here with result in RAX
; cleanup: restore RSP, RBP
Notice: CALL, not JMP. The CPU pushes the return address automatically. CreateFileW executes its full code path (including any EDR hooks on lower-level functions it calls internally). When the EDR stack-walks during CreateFileW's execution, it sees the fake frames above your code.
Important caveat: the stub shown above uses RBP-chain linking, which is a simplified illustration. In practice, a .pdata-based unwinder (like CrowdStrike's) does not follow RBP chains. It looks up RUNTIME_FUNCTION entries for each return address. For this technique to pass .pdata validation, the gadget return addresses must point into functions with valid, simple UNWIND_INFO (few unwind codes, no exception handlers, no frame pointer). The gadget selection process described in the Shinkirō post covers this in detail.
What chain do we need to forge? Open Kagura, search CreateFileW in kernelbase:
CreateFileW: 0x80 frame, 7 saved registers (4 MOV + 3 PUSH). The forge diagram shows exactly what to put where.
kernelbase!CreateFileW (0x00025A10)
Size: 0x80 (128B) | Prolog: 31B | Unwind codes: 12
RSP+0x0098 | Saved R12 (MOV @0x98)
RSP+0x0090 | Saved RDI (MOV @0x90)
RSP+0x0088 | Saved RSI (MOV @0x88)
RSP+0x0080 | Saved RBX (MOV @0x80)
RSP+0x0078 | Return address (@0x78)
RSP+0x0070 | Saved R15 (PUSH)
RSP+0x0068 | Saved R14 (PUSH)
RSP+0x0060 | Saved RBP (PUSH)
RSP+0x0000 | Local/Shadow (0x60)
This is a bigger frame than VirtualAlloc. More registers saved, more offsets to get right. But Kagura gives you everything.
The critical difference in what you forge:
| | Indirect syscall (NtAllocateVirtualMemory) | Callstack spoofing (CreateFileW) |
|---|---|---|
| Frame you forge | The Win32 wrapper's frame (VirtualAlloc) | The caller's frame (whoever "called" CreateFileW) |
| Why | The syscall stub has no frame, so you fake the wrapper above it | CreateFileW has its own real frame, you fake who called it |
| Return addresses | Point to call sites inside real functions (Zydis finds them) | Same: point to call sites inside real functions |
| Function execution | Never runs (you JMP past the hook to syscall;ret) | Runs fully (you CALL it, it does its job) |
Both techniques need the same data from Kagura: frame sizes, return address offsets, saved register locations, and valid CALL sites. The .pdata layout is the common foundation.
Part VI: Why I Built Kagura
The manual process is broken
Before Kagura, building the fake stack above required:
- Open WinDbg, attach to a process
- Run .fnent kernelbase!VirtualAlloc to get the UNWIND_INFO
- Mentally decode each UNWIND_CODE to calculate the frame size
- Run dumpbin /unwindinfo kernelbase.dll to verify
- Disassemble VirtualAlloc to find a CALL instruction for the return address
- Repeat for BaseThreadInitThunk
- Repeat for RtlUserThreadStart
- Hope you didn't make an arithmetic error somewhere
For one syscall wrapper. Now multiply by every Nt* function your implant uses. NtProtectVirtualMemory, NtWriteVirtualMemory, NtCreateThreadEx, NtMapViewOfSection, NtQueueApcThread... each one needs its own forged chain. Each chain needs its own validated frame layouts.
This manual process doesn't scale. It's error-prone. It's slow. And when you get an offset wrong, the debugging experience is miserable: CrowdStrike doesn't tell you which check failed, it just kills your process.
I built Kagura primarily as a research and visualization tool. When I was developing Shinkirō's fake stack frames, I needed a clear picture of what the target stack was supposed to look like before writing the assembly to construct it. Kagura gave me that picture. In a real engagement, you'll still want to parse .pdata and resolve gadgets programmatically at runtime (hardcoded offsets break across Windows builds). But having a tool that shows you the exact frame layout, register positions, and valid call sites before you write a single line of code makes the difference between guessing and knowing what you're building toward.
What Kagura does
Kagura-StackWalker loads the 7 most commonly used system DLLs at startup:
| Module | Why it matters |
|---|---|
| ntdll.dll | All Nt* syscall stubs, Rtl* utilities |
| kernel32.dll | BaseThreadInitThunk (terminal frame), Win32 forwarders |
| kernelbase.dll | Actual Win32 implementations (VirtualAlloc, CreateFile, etc.) |
| user32.dll | Window/message functions |
| win32u.dll | Win32k syscall stubs |
| advapi32.dll | Security, registry, service functions |
| msvcrt.dll | C runtime (common in legitimate stacks) |
For each module, Kagura:
- Parses every `RUNTIME_FUNCTION` from the `.pdata` section
- Decodes every `UNWIND_INFO` and replays the `UNWIND_CODE` array
- Reconstructs the complete stack frame layout: every slot, every offset, every saved register
- Scans every function with Zydis to find CALL instruction sites
- Indexes everything for instant search
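The decode-and-replay step can be sketched in a few lines. This is not Kagura's implementation, just a minimal illustration of the documented UNWIND_INFO layout under stated assumptions: it handles version 1 only, rejects functions that use a frame pointer, and supports only the push, alloc, and save-nonvolatile opcodes. The `replay_unwind_info` name is hypothetical, and the `BLOB` bytes are synthetic, built to match the CreateFileW layout shown earlier rather than dumped from a real DLL.

```python
import struct

REGS = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
        "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"]


def replay_unwind_info(blob: bytes) -> dict:
    """Replay an UNWIND_CODE array into offsets from the post-prolog RSP."""
    ver_flags, prolog_size, count, frame = struct.unpack_from("<4B", blob, 0)
    if ver_flags & 0x07 != 1:
        raise NotImplementedError("sketch handles UNWIND_INFO version 1 only")
    if frame & 0x0F:
        raise NotImplementedError("frame-pointer functions are not handled")
    # Each slot is 2 bytes: CodeOffset (low byte), UnwindOp | OpInfo << 4 (high).
    codes = struct.unpack_from(f"<{count}H", blob, 4)

    saved, rsp_off, i = {}, 0, 0
    while i < count:
        op, info = (codes[i] >> 8) & 0x0F, codes[i] >> 12
        if op == 0:                      # UWOP_PUSH_NONVOL, 1 slot
            saved[REGS[info]] = (rsp_off, "PUSH")
            rsp_off += 8
            i += 1
        elif op == 1 and info == 0:      # UWOP_ALLOC_LARGE, size / 8 in next slot
            rsp_off += codes[i + 1] * 8
            i += 2
        elif op == 1:                    # UWOP_ALLOC_LARGE, 32-bit size
            rsp_off += codes[i + 1] | (codes[i + 2] << 16)
            i += 3
        elif op == 2:                    # UWOP_ALLOC_SMALL, 1 slot
            rsp_off += info * 8 + 8
            i += 1
        elif op == 4:                    # UWOP_SAVE_NONVOL, offset / 8 in next slot
            saved[REGS[info]] = (codes[i + 1] * 8, "MOV")
            i += 2
        elif op == 5:                    # UWOP_SAVE_NONVOL_FAR, 32-bit offset
            saved[REGS[info]] = (codes[i + 1] | (codes[i + 2] << 16), "MOV")
            i += 3
        else:
            raise NotImplementedError(f"unwind op {op} not handled in this sketch")

    return {"prolog_size": prolog_size, "code_slots": count, "saved": saved,
            "return_addr_off": rsp_off, "frame_size": rsp_off + 8}


# Synthetic bytes (code offsets are made up): header, 4 SAVE_NONVOL,
# ALLOC_SMALL 0x60, then PUSH RBP, R14, R15 in unwind order.
BLOB = bytes.fromhex(
    "011F0C00"
    "1FC41300" "1F741200" "1F641100" "1F341000"
    "1FB2" "1B50" "19E0" "17F0"
)

layout = replay_unwind_info(BLOB)
```

The array is stored in reverse prolog order, so walking it front to back mirrors what the unwinder undoes: allocation first, then the pushes from lowest address upward.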
Over 20,000 functions, all available in an interactive TUI. Type /NtAllocate, and in milliseconds you see the complete frame layout with forge instructions and a suggested spoofed call stack. No WinDbg. No dumpbin. No arithmetic.
Searching "NtAllo" in ntdll: 4 matches including NtAllocateVirtualMemory. RVA and frame size visible at a glance.
PUSH vs MOV: the detail that Kagura makes visible
There are two ways a function saves registers, and getting them confused will break your spoof:
PUSH (UWOP_PUSH_NONVOL): the register is saved by a push instruction. The offset is sequential, determined by the order of pushes.
MOV (UWOP_SAVE_NONVOL): the register is saved by mov [rsp+offset], reg. The offset is explicit; it can be anywhere in the frame.
CreateFileW: 4 registers saved by MOV (explicit offsets), 3 by PUSH (sequential). The distinction matters when forging.
Kagura shows this for every register:
+-- Registers saved --------------------+
| R12 @ RSP+0x0098 (MOV) | ← explicit offset, anywhere in frame
| RDI @ RSP+0x0090 (MOV) |
| RSI @ RSP+0x0088 (MOV) |
| RBX @ RSP+0x0080 (MOV) |
| RBP @ RSP+0x0060 (PUSH) | ← sequential, order matters
| R14 @ RSP+0x0068 (PUSH) |
| R15 @ RSP+0x0070 (PUSH) |
+----------------------------------------+
When you forge this frame, the MOV-saved registers need plausible values at their exact offsets. The PUSH-saved registers occupy consecutive slots directly below the return address, in reverse push order. Get the method wrong and the unwinder reads each register from the wrong slot and reconstructs wrong values. Some EDRs validate this.
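To see why a wrong offset corrupts the walk, it helps to look at the frame from the unwinder's side. The sketch below is a simplified model, not RtlVirtualUnwind itself: `virtual_unwind`, the `MEMORY` dictionary, and every value in it are hypothetical, and the layout constants are the CreateFileW offsets listed above.

```python
# Offsets from the post-prolog RSP, taken from the register table above.
SAVED = {"R12": 0x98, "RDI": 0x90, "RSI": 0x88, "RBX": 0x80,
         "R15": 0x70, "R14": 0x68, "RBP": 0x60}
RETURN_ADDR_OFF = 0x78


def virtual_unwind(rsp, memory, saved=SAVED, ret_off=RETURN_ADDR_OFF):
    """Restore the caller's registers, RIP and RSP from one frame."""
    regs = {reg: memory[rsp + off] for reg, off in saved.items()}
    caller_rip = memory[rsp + ret_off]   # the return address slot
    caller_rsp = rsp + ret_off + 8       # RSP after the callee's RET
    return regs, caller_rip, caller_rsp


# Hypothetical stack snapshot: each qword slot holds a distinct marker value.
RSP = 0x1000
MEMORY = {RSP + off: 0xAA00 + off for off in range(0x60, 0xA0, 8)}

regs, caller_rip, caller_rsp = virtual_unwind(RSP, MEMORY)

# A layout that wrongly treats RBX as the last pushed register reads RBP's slot.
wrong_layout = dict(SAVED, RBX=0x60)
wrong_regs, _, _ = virtual_unwind(RSP, MEMORY, saved=wrong_layout)
```

With the correct layout RBX comes back as the value stored at RSP+0x80; with the mistaken one it silently takes whatever was written to RBP's slot, and every later frame inherits that error.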
One more thing: sleeping thread stacks
Everything above covers active call stacks (during a syscall or API call). But there's another scenario that matters just as much in 2026: sleeping threads.
When your implant sleeps between C2 callbacks (WaitForSingleObject, Sleep, NtDelayExecution), the thread's stack sits in memory for seconds or minutes. EDR products like CrowdStrike periodically scan sleeping thread stacks looking for unbacked return addresses. If your thread is sleeping with your shellcode's address visible in the stack, you get caught even though you're not actively doing anything.
The solution is thread stack spoofing: before sleeping, replace the real stack with legitimate-looking frames (the same kind Kagura shows you), encrypt your real code in memory, then restore everything when the sleep ends. Techniques like Ekko, Zilean, and Nighthawk-style stack masking all need the exact same frame layout data that Kagura provides. The sleep chain typically looks like WaitForSingleObjectEx → BaseThreadInitThunk → RtlUserThreadStart. Kagura gives you each of those frames.
Part VII: What's Next
Kagura v1 covers the essentials: frame layout visualization and call-site resolution. But the vision is bigger.
Gadget finder: not every function is a good candidate for stack spoofing. Ideal gadget functions have simple unwind metadata (few codes, no exception handlers, no frame pointer). Kagura already has all the data to score every function. A future version will rank them.
Chain builder: given a target syscall, automatically assemble the complete multi-frame fake stack with validated RSP alignment and cross-frame consistency. No manual calculation.
Walk simulator: replay the EDR's stack walk logic against your forged stack before you deploy. Catch errors at development time, not in a live engagement.
Export to C structs: generate ready-to-compile FORGED_STACK structures with pre-calculated offsets. Copy, paste, compile, run.
Each of these builds on the 20,000+ frame layouts Kagura already collects. The foundation is solid.
Closing thoughts
The offensive security community has excellent research on syscall evasion: Hell's Gate, Halo's Gate, SysWhispers, SilentMoonwalk, Shinkirō. Each one pushed the state of the art forward. But every technique, no matter how sophisticated, starts with the same question:
What does this function's stack frame look like?
The answer lives in the .pdata section. In the UNWIND_INFO structures. In the UNWIND_CODE arrays that describe, byte by byte, how each function sets up its workspace on the stack.
Kagura makes that answer immediate. It decodes what the EDR expects to see and shows it to you, frame by frame, slot by slot.
Clone it, run it against your target DLLs, and start building. The frame layouts are waiting.
Source code: github.com/En3nr4/Kagura-StackWalker
Find me on LinkedIn // Enenra
References
- Microsoft, x64 exception handling, MSDN
- Microsoft, RUNTIME_FUNCTION structure, MSDN
- Microsoft, UNWIND_INFO structure, MSDN
- CrowdStrike, Stack Walking and RtlVirtualUnwind internals
- Doubling Down: ETW Callstacks, Elastic Security Labs
- Hell's Gate, am0nsec & smelly__vx, 2020
- Halo's Gate, Sektor7, 2021
- SysWhispers3, klezVirus, 2022
- SilentMoonwalk, kleiton0x00, 2023
- LoudSunRun, susMdT, 2024
- Shinkirō: Matched-Gadget Indirect Syscalls, Enenra, 2026
- Zydis: Fast x86/x64 Disassembler Library, Zyantific