Shinkiro: Matched-Gadget Indirect Syscalls With APC Execution

Prerequisites

This post assumes familiarity with:

x86-64 assembly and the Windows x64 calling convention
PE file format basics (exports, sections, .pdata)
Windows process memory layout (PEB, TEB, ntdll mapping)
General understanding of how EDR products instrument user-mode code

What is a syscall?

On Windows, every operation that touches the kernel (allocating memory, creating threads, modifying page protections) goes through a system call (syscall). User-mode code can't access kernel resources directly. Instead, it calls functions in ntdll.dll, which act as thin wrappers around the syscall instruction.

Each kernel function has a System Service Number (SSN), an integer index into the System Service Dispatch Table (SSDT). When ntdll!NtAllocateVirtualMemory executes, it places the SSN in the EAX register, copies RCX to R10 (the kernel reads the first argument from R10, not RCX, because syscall overwrites RCX with the return address), and executes the syscall instruction:

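NtAllocateVirtualMemory:
    mov r10, rcx                      ; +0x00  R10 = first argument
    mov eax, 0x18                     ; +0x03  SSN for NtAllocateVirtualMemory (varies by Windows build)
    test byte ptr [0x7FFE0308], 1     ; +0x08  KUSER_SHARED_DATA flag: take the legacy int 2e path?
    jne legacy_path                   ; +0x10  int 2e; ret sits right after the ret below
    syscall                           ; +0x12  transition to kernel
    ret                               ; +0x14  return to caller

Through the ret, this stub spans 21 bytes (+0x00 to +0x14). The syscall sits at offset +0x12, and the ret at +0x14. On current Windows 10 and 11 builds, the Nt* stubs in ntdll share this layout; older builds without the test/jne pair place syscall at +0x08 instead.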

Why EDRs hook ntdll

Endpoint Detection and Response products need visibility into what processes are doing. The most common approach is to overwrite the first bytes of Nt* functions in ntdll with a JMP to the EDR's own code. When your process calls NtAllocateVirtualMemory, it hits the JMP first, goes to the EDR's detour function which logs the call, then (if allowed) jumps back to execute the real syscall.

This is called userland hooking. It works because ntdll is mapped into every process's address space, and the EDR can modify it before your code runs.

What is an indirect syscall?

The idea is simple: if the EDR hooks the function prologue, don't use it. Instead:

Resolve the SSN yourself (by reading it from the stub, or by other means)
Execute the syscall instruction from a different location, either from your own code (direct syscall) or by jumping to the syscall;ret sequence inside ntdll at a known offset (indirect syscall)

The "indirect" part means the syscall instruction is still physically located inside ntdll's .text section, not in your own module's code. This matters because the kernel captures the instruction pointer when syscall executes. If it points to ntdll, it looks legitimate. If it points to your unsigned executable's memory, it's immediately suspicious.

The four detection layers

Over the past few years, EDR vendors have developed four distinct detection mechanisms for indirect syscalls. Understanding all four is essential because Shinkiro was designed to address each one simultaneously.

Layer 1: Userland hook detection

If you read the SSN from a hooked function's prologue, you get the JMP opcode instead of mov eax, SSN. Hell's Gate (2020) detects this by checking if the first bytes are 4C 8B D1 B8 (the expected prologue). If hooked, Halo's Gate searches neighboring function stubs. The weakness: if many functions are hooked, the search range may not be sufficient.
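The prologue check described above can be sketched on a plain byte buffer. This is a minimal illustration, not Hell's Gate's actual code: the names classify_stub and stub_state are hypothetical, and the two JMP encodings tested (E9 and FF 25) are an assumption about what a typical inline hook looks like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Expected first four bytes of an unmodified x64 stub:
   4C 8B D1 = mov r10, rcx ; B8 = opcode of mov eax, imm32 */
static const uint8_t kExpectedPrologue[4] = { 0x4C, 0x8B, 0xD1, 0xB8 };

typedef enum { STUB_INTACT, STUB_JMP_PATCHED, STUB_UNKNOWN } stub_state;

/* Classify the first bytes of a stub copied into `stub` (hypothetical helper). */
stub_state classify_stub(const uint8_t *stub, size_t len) {
    assert(stub != NULL);
    if (len < sizeof kExpectedPrologue)
        return STUB_UNKNOWN;
    if (memcmp(stub, kExpectedPrologue, sizeof kExpectedPrologue) == 0)
        return STUB_INTACT;
    if (stub[0] == 0xE9)                       /* jmp rel32 */
        return STUB_JMP_PATCHED;
    if (stub[0] == 0xFF && stub[1] == 0x25)    /* jmp [rip+disp32] */
        return STUB_JMP_PATCHED;
    return STUB_UNKNOWN;                       /* neither clean nor a known patch */
}
```

The same check works as an integrity test for anything that wants to know whether a stub has been modified, which is why it is shown on a byte array rather than on live ntdll memory.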

Layer 2: Call stack symbol analysis (Elastic Security)

When a syscall occurs, the kernel captures the thread's call stack. EDRs inspect the symbol info field, a string like ntdll.dll!NtAllocateVirtualMemory+0x14 that identifies which function contains the instruction that was executing.

Elastic Security maintains whitelists. Their rule defense_evasion_windows_api_call_via_indirect_random_syscall checks: if the call stack contains ntdll!<function>+0x14 and that function name is NOT in the expected list for the API being called, kill the process.

This is the rule that burns gadget rotation techniques (SysWhispers3, etc.): using a syscall;ret from NtQueryEvent to execute NtAllocateVirtualMemory produces ntdll!NtQueryEvent+0x14 in the stack. Not whitelisted. Process killed.
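To make the name-matching idea concrete, here is a toy model of the check from the defender's side. It is not Elastic's rule logic (the real rule works against whitelists of several acceptable names per API); symbol_matches_api is a hypothetical helper that accepts exactly one expected function name and the +0x14 offset discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when `symbol` has the form "<module>!<expected_fn>+0x14".
   Toy model: one expected name instead of a per-API whitelist. */
bool symbol_matches_api(const char *symbol, const char *expected_fn) {
    assert(symbol != NULL && expected_fn != NULL);

    const char *bang = strchr(symbol, '!');
    if (bang == NULL)
        return false;                      /* no module!function separator */

    const char *name = bang + 1;
    const char *plus = strchr(name, '+');
    if (plus == NULL)
        return false;                      /* no +offset suffix */

    size_t name_len = (size_t)(plus - name);
    if (name_len != strlen(expected_fn))
        return false;
    if (strncmp(name, expected_fn, name_len) != 0)
        return false;                      /* function name mismatch */

    return strcmp(plus, "+0x14") == 0;     /* offset of ret after syscall */
}
```

With this model, ntdll.dll!NtQueryEvent+0x14 checked against NtAllocateVirtualMemory fails on the name comparison, which is the mismatch that gadget rotation produces.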

Layer 3: Stack walking and unwind validation (CrowdStrike)

CrowdStrike Falcon goes deeper than symbol name matching. On every sensitive syscall, Falcon captures the full call stack and performs RtlVirtualUnwind-based validation. This means it doesn't just check the return addresses; it verifies the entire frame chain using the .pdata section (Runtime Function Table) of each module.

Each legitimate function in a PE has a RUNTIME_FUNCTION entry that describes how to unwind its stack frame (how much stack space it uses, which registers it saves, etc.). CrowdStrike's stack walker verifies that every return address in the chain falls within a function that has valid unwind data. If a return address points to an address in ntdll or kernelbase that has no corresponding RUNTIME_FUNCTION entry, the frame is flagged as synthetic.
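The "does this return address belong to a function with unwind data" test reduces to a range lookup in the RUNTIME_FUNCTION table, which the PE format keeps sorted by begin address. The sketch below shows that lookup as a binary search. The struct and function names are hypothetical stand-ins (the real structure is RUNTIME_FUNCTION from the Windows SDK), and this illustrates the general idea only, not CrowdStrike's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the x64 RUNTIME_FUNCTION entry (all fields are RVAs). */
typedef struct {
    uint32_t begin_rva;   /* first byte of the function */
    uint32_t end_rva;     /* one past the last byte */
    uint32_t unwind_rva;  /* location of the UNWIND_INFO */
} runtime_function;

/* Binary search a table sorted by begin_rva with non-overlapping ranges.
   Returns the entry whose [begin_rva, end_rva) contains `rva`, or NULL. */
const runtime_function *lookup_function(const runtime_function *table,
                                        size_t count, uint32_t rva) {
    assert(table != NULL || count == 0);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (rva < table[mid].begin_rva)
            hi = mid;                 /* target is to the left */
        else if (rva >= table[mid].end_rva)
            lo = mid + 1;             /* target is to the right */
        else
            return &table[mid];       /* rva falls inside this function */
    }
    return NULL;                      /* no unwind data covers this address */
}
```

A NULL result is the "no corresponding RUNTIME_FUNCTION entry" case described above: the return address sits in a gap between functions or outside the table entirely.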

Additionally, CrowdStrike monitors EtwTiLogSetContextThread events. Any technique that modifies debug registers (DR0-DR7) via NtSetContextThread fires this kernel-level telemetry event. This is how VEH + HWBP techniques get caught: the hardware breakpoint setup itself generates the alert, regardless of how clean the subsequent syscall looks.

This is why 3-level fake stack frames in Shinkiro use call reg (FF D0-D7) gadget addresses that are naturally preceded by a CALL instruction inside real DLL code. The return address passes both the "call-before-return" byte check and the RtlVirtualUnwind validation because it falls within a real function's code range.

Layer 4: Behavioral injection pattern

EDRs correlate the sequence of syscalls: NtAllocateVirtualMemory → NtWriteVirtualMemory → NtProtectVirtualMemory(RX) → NtCreateThreadEx is the canonical shellcode injection pattern. Each syscall individually looks fine, but the sequence triggers behavioral rules.
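A behavioral rule of this kind can be modeled as an ordered-subsequence match over a per-process event stream. The sketch below is a simplified illustration with hypothetical names (syscall_event, matches_injection_chain); real products key on far more than event order, such as target process, memory region, and timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified event vocabulary for one process (hypothetical). */
typedef enum {
    EV_ALLOC,          /* NtAllocateVirtualMemory */
    EV_WRITE,          /* NtWriteVirtualMemory */
    EV_PROTECT_RX,     /* NtProtectVirtualMemory to RX */
    EV_CREATE_THREAD,  /* NtCreateThreadEx */
    EV_OTHER           /* anything else */
} syscall_event;

/* True when the canonical chain appears in order, with any number of
   unrelated events interleaved between its steps. */
bool matches_injection_chain(const syscall_event *events, size_t count) {
    static const syscall_event chain[] = {
        EV_ALLOC, EV_WRITE, EV_PROTECT_RX, EV_CREATE_THREAD
    };
    const size_t chain_len = sizeof chain / sizeof chain[0];

    assert(events != NULL || count == 0);
    size_t next = 0;                       /* index of the next step to match */
    for (size_t i = 0; i < count && next < chain_len; i++) {
        if (events[i] == chain[next])
            next++;
    }
    return next == chain_len;
}
```

Because the match tolerates interleaved events, padding the sequence with unrelated calls does not defeat it; only changing one of the steps does.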

Existing techniques and their failure points

Technique	Year	Layer 1 (Hooks)	Layer 2 (Elastic)	Layer 3 (CrowdStrike)	Layer 4 (Behavior)
Hell's Gate	2020	Partial	Fail	Fail	Fail
Halo's Gate	2021	Better	Fail	Fail	Fail
FreshyCalls	2022	Pass	Fail	Fail	Fail
SysWhispers3	2022	Partial	Fail	Fail	Fail
HWSyscalls	2022	Partial	Pass	Fail	Fail
Tartarus' Gate	2023	Good	Fail	Fail	Fail
RecycledGate	2023	Good	Fail	Fail	Fail
Acheron (Go)	2023	Pass	Fail	Fail	Fail
SilentMoonwalk	2023	Partial	Fail	Pass	Fail
VEH + HWBP (LayeredSyscall)	2024	Partial	Pass	Fail	Fail
LoudSunRun	2024	Partial	Partial	Pass	Fail
BeaconGate	2024	Partial	Partial	Pass	Fail
Unwinder (Rust)	2024	Partial	Fail	Pass	Fail
Call Gadgets (Almond)	2025	Partial	Pass	Partial	Fail
Shinkiro	2026	Pass	Pass	Pass	Pass

No previously published technique passes all four layers. Shinkiro addresses each one through a dedicated component: Zw* sorting (L1), matched gadgets (L2), .pdata-aware fake frames (L3), and APC execution (L4). The closest are BeaconGate and LoudSunRun, which handle Layer 3 (CrowdStrike stack walking) through synthetic frames, but neither implements matched-function gadgets (Layer 2) or alternative execution primitives (Layer 4).

Shinkiro: design

I developed Shinkiro (蜃気楼, "mirage" in Japanese) to address all four layers simultaneously. The technique emerged from months of research into how each major EDR product actually validates syscall behavior at the kernel level, and from the observation that no existing public technique attempted to solve all four detection surfaces in a unified design.

The architecture is built around five components.

Component 1: SSN resolution via Zw* export sorting

The FreshyCalls technique remains the strongest approach for SSN resolution. Shinkiro enumerates all Zw* exports from ntdll's Export Address Table, sorts them by address, and uses the sorted index as the SSN.

This works because ntdll lays out its stubs in ascending SSN order, and because inline hooks patch function prologues while typically leaving the Export Address Table untouched, so the exported addresses still point at the original stub locations. Even with 100% of Nt* functions hooked, the Zw* addresses remain valid.

// Enumerate Zw* exports, sort by address, index = SSN
for (DWORD i = 0; i < numNames; i++) {
    const char* name = (const char*)(base + nameRVAs[i]);
    if (name[0] == 'Z' && name[1] == 'w') {
        table.entries[count].hash = djb2(name);
        table.entries[count].addr = (PVOID)(base + funcRVAs[ordinals[i]]);
        count++;
    }
}
insertion_sort(table.entries, count);
for (DWORD i = 0; i < count; i++)
    table.entries[i].ssn = i;

Component 2: Matched-function gadget resolution

This is the core insight of Shinkiro and where it diverges from every public technique I've analyzed.

The standard approach (SysWhispers3, LayeredSyscall) uses a syscall;ret gadget from a different ntdll stub for each call. The rationale: gadget diversity prevents correlation. In practice, this is exactly what Elastic's rules detect.

Elastic's rule checks whether the function name in the call stack matches the expected function for the API being called. Using NtQueryEvent's gadget to execute NtAllocateVirtualMemory creates a mismatch. Alert. Process killed.

Shinkiro uses the opposite approach: for each wrapped Nt* function, find the syscall;ret at offset +0x12 in that function's own stub:

static PVOID FindMatchedGadget(PVOID ntBase, DWORD funcHash) {
    for (DWORD j = 0; j < numNames; j++) {
        if (djb2(name) != funcHash) continue;
        PBYTE func = base + funcRVAs[ordinals[j]];
        // Verify syscall;ret at +0x12 (prologue may be hooked, we don't care)
        if (func[0x12] == 0x0F && func[0x13] == 0x05 && func[0x14] == 0xC3) {
            return (PVOID)(func + 0x12);
        }
        return NULL;
    }
    return NULL;
}

The return address after execution is NtAllocateVirtualMemory+0x14 for an allocation call, NtProtectVirtualMemory+0x14 for a protect call. This is exactly what Elastic's whitelist expects. No mismatch, no alert.

Component 3: 3-level fake stack frames

Even with the correct gadget, the call stack must look complete. A legitimate VirtualAlloc call shows:

ntdll!NtAllocateVirtualMemory+0x14
kernelbase!VirtualAlloc+0x...
kernel32!BaseThreadInitThunk+0x...
ntdll!RtlUserThreadStart+0x...

Shinkiro constructs three fake frames before jumping to the gadget. The frames use CALL instruction gadget addresses found in kernelbase.dll, kernel32.dll, and ntdll.dll. These addresses are naturally preceded by a CALL instruction in the DLL's executable code, which satisfies Elastic's call-before-return verification.

But satisfying CrowdStrike requires more. CrowdStrike's RtlVirtualUnwind-based stack walker doesn't just check that the return address is inside a legitimate module. It reads the RUNTIME_FUNCTION entry for the function containing that address, parses the UNWIND_INFO structure, and uses the unwind codes to calculate the expected frame size. If the calculated frame size doesn't match the actual stack layout, the frame is flagged as synthetic.

The naive approach (picking the first call reg gadget in the module's .text section) puts the gadget in an arbitrary function. If that function has a complex unwind (multiple saved registers, large stack allocation, frame pointer), CrowdStrike's unwinder expects a specific frame layout that our synthetic frame doesn't provide.

Shinkiro solves this with .pdata-aware gadget selection. Instead of blindly scanning .text, the gadget finder:

Parses the module's .pdata section (the RUNTIME_FUNCTION table)
Skips chained UnwindData entries (bit 0 set), which are not real unwind info
Reads the UNWIND_INFO header and applies three hard filters: CountOfCodes <= 2 (trivial unwind), no frame register, and Flags == 0 (UNW_FLAG_NHANDLER, which rejects functions with exception handlers, termination handlers, or chained unwind info)
Scans each qualifying function for CALL instruction patterns: call reg (FF D0-D7), call [rip+disp32] (FF 15), and call [reg] (FF 10-13, FF 16-17)
Uses multi-pass scanning with progressive size limits (256, 1024, 4096 bytes) — preferring small, simple functions first without ever relaxing the unwind criteria

PVOID FindGadget(PVOID moduleBase) {
    BYTE* base = (BYTE*)moduleBase;
    // ... PE header validation, .pdata lookup ...

    SK_RUNTIME_FUNCTION* rfTable = (SK_RUNTIME_FUNCTION*)(base + pdataRVA);
    DWORD numEntries = pdataSize / sizeof(SK_RUNTIME_FUNCTION);

    // Multi-pass: relax size progressively, NEVER relax unwind criteria
    DWORD maxSizes[] = { 256, 1024, 4096 };

    for (int pass = 0; pass < 3; pass++) {
        DWORD maxSize = maxSizes[pass];

        for (DWORD i = 0; i < numEntries; i++) {
            DWORD unwindRVA = rfTable[i].UnwindData;

            // Skip chained UnwindData (bit 0 = 1)
            if (unwindRVA & 1) continue;

            DWORD funcSize = rfTable[i].EndAddress - rfTable[i].BeginAddress;
            if (funcSize < 8 || funcSize > maxSize) continue;

            SK_UNWIND_INFO* uwi = (SK_UNWIND_INFO*)(base + unwindRVA);

            // HARD FILTER 1: CountOfCodes <= 2 (trivial unwind)
            if (uwi->CountOfCodes > 2) continue;
            // HARD FILTER 2: no frame register
            if ((uwi->FrameRegAndOffset & 0x0F) != 0) continue;
            // HARD FILTER 3: flags == 0 (UNW_FLAG_NHANDLER only)
            if ((uwi->VersionAndFlags >> 3) != 0) continue;

            BYTE* funcCode = base + rfTable[i].BeginAddress;
            for (DWORD j = 0; j + 2 < funcSize; j++) {
                if (funcCode[j] != 0xFF) continue;
                BYTE modrm = funcCode[j + 1];

                // call reg (FF D0-D7)
                if (modrm >= 0xD0 && modrm <= 0xD7)
                    return (PVOID)(funcCode + j + 2);
                // call [rip+disp32] (FF 15 XX XX XX XX)
                if (modrm == 0x15 && j + 6 <= funcSize)
                    return (PVOID)(funcCode + j + 6);
                // call [reg] (FF 10-13, FF 16-17)
                if ((modrm >= 0x10 && modrm <= 0x13) ||
                    modrm == 0x16 || modrm == 0x17)
                    return (PVOID)(funcCode + j + 2);
            }
        }
    }
    return NULL;  // no fallback — better to fail than use a detectable gadget
}

The three hard filters are never relaxed across passes. Only the function size limit increases (256 → 1024 → 4096), preferring the smallest possible functions first. This matters because smaller functions have simpler stack layouts, which are easier to satisfy with synthetic frames.

The flags check (VersionAndFlags >> 3 == 0) is critical and often overlooked. Functions with UNW_FLAG_EHANDLER or UNW_FLAG_UHANDLER have exception handling metadata after their unwind codes. CrowdStrike's unwinder processes this metadata during validation — a synthetic frame pointing into such a function would need to account for the handler data, which our minimal frame layout doesn't provide.

A module like kernel32.dll contains roughly 3,000 RUNTIME_FUNCTION entries. After filtering for trivial unwind (no frame register, 0-2 unwind codes, no handlers), approximately 100-200 functions remain. Scanning for CALL patterns in the smallest of these first makes the gadget search reliable and deterministic.

The assembly stub pattern is the same for every wrapped function; only the sub rsp size changes:

; Save caller context
mov [savedRSP], rsp
mov [savedRBP], rbp

; FRAME 3 (ntdll, chain terminator)
push 0
push QWORD PTR [gadget_ntdll]

; FRAME 2 (kernel32)
lea rax, [rsp + 8]
push rax
push QWORD PTR [gadget_kernel32]

; FRAME 1 (kernelbase)
lea rax, [rsp + 8]
push rax
push QWORD PTR [gadget_kernelbase]

lea rbp, [rsp + 8]
sub rsp, <size>         ; exact size per function

lea rax, [cleanup]
mov [rsp], rax
mov r10, rcx
mov eax, [SSN]
jmp [matched_gadget]    ; ntdll!NtXxx+0x12

cleanup:
    mov rbp, [savedRBP]
    mov rsp, [savedRSP]
    ret

Critical detail: each Nt* function gets its own dedicated stub with the exact sub rsp size for its argument count. NtAllocateVirtualMemory (6 args, 2 on stack) uses sub rsp, 0x38. NtCreateThreadEx (11 args, 7 on stack) uses sub rsp, 0x60. No generic stub, no register shifting, no off-by-one stack errors.

Component 4: Per-function stubs

Shinkiro generates 7 separate assembly stubs:

Stub	NT Function	Args	Stack Args	sub rsp
stubAlloc	NtAllocateVirtualMemory	6	2	0x38
stubProtect	NtProtectVirtualMemory	5	1	0x30
stubWrite	NtWriteVirtualMemory	5	1	0x30
stubThread	NtCreateThreadEx	11	7	0x60
stubWait	NtWaitForSingleObject	3	0	0x28
stubApc	NtQueueApcThread	5	1	0x30
stubAlert	NtTestAlert	0	0	0x28
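The sub rsp values in the table are consistent with a simple rule from the Windows x64 calling convention: one 8-byte slot for the return address, 0x20 bytes of shadow space, and 8 bytes per argument beyond the first four. That formula is inferred from the table rather than stated by the stub generator, and stub_frame_size is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

enum {
    RETURN_SLOT   = 8,     /* slot at [rsp] holding the return address */
    SHADOW_SPACE  = 0x20,  /* home space for the four register arguments */
    REGISTER_ARGS = 4,     /* RCX, RDX, R8, R9 */
    SLOT_SIZE     = 8      /* each stack argument occupies one qword */
};

/* Frame size implied by the table for a function taking `arg_count` args. */
size_t stub_frame_size(size_t arg_count) {
    assert(arg_count <= 32);  /* sanity bound for this sketch */
    size_t stack_args =
        arg_count > REGISTER_ARGS ? arg_count - REGISTER_ARGS : 0;
    return RETURN_SLOT + SHADOW_SPACE + stack_args * SLOT_SIZE;
}
```

Checking it against the rows above: 6 arguments give 0x38, 5 give 0x30, 11 give 0x60, and anything with four or fewer arguments gives 0x28.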

Each wrapper sets the SSN and gadget address in global variables, then calls its dedicated stub with the real Nt* arguments directly. No intermediate parameters to shift.

In a real-world implementation that wraps more NT functions (credential access, file operations, process manipulation), you would add one stub per additional function following the same pattern. The stub count scales linearly and the generator can produce them automatically based on a function signature table.

Component 5: APC-based execution

This is where Shinkiro breaks the behavioral injection pattern. Instead of NtCreateThreadEx (the most monitored syscall in the injection chain), Shinkiro uses APC injection on the current thread:

NtQueueApcThread(NtCurrentThread, shellcode_addr, NULL, NULL, NULL) queues the shellcode as an Asynchronous Procedure Call on the current thread
NtTestAlert() triggers execution of pending APCs

The EDR sees: Alloc → Write → Protect → QueueApc → TestAlert. This is NOT the Alloc → Write → Protect → CreateThread pattern that behavioral rules match. No new thread is created. The shellcode executes in the context of the existing thread via the APC dispatcher.

It's worth noting that APC-based injection is not a new concept. It has been used in process injection (cross-process APC to remote threads) for years. However, using it as a same-thread execution primitive combined with Shinkiro's indirect syscall infrastructure (matched gadgets, fake frames) is what makes it effective here. The APC call itself goes through the same fake-frame stub as every other syscall, so the call stack looks identical to a legitimate NtQueueApcThread call from kernelbase.

NtCreateThreadEx is kept as a fallback if APC resolution fails, but the primary execution path avoids it entirely.

CrowdStrike bypass

Shinkiro was tested against CrowdStrike Falcon in its Policy Prevention 3 (Aggressive) configuration. The combination of matched gadgets (correct function names in the call stack), fake frames (valid return addresses in legitimate DLLs), and zero hardware breakpoint usage (no EtwTiLogSetContextThread events) resulted in successful execution without behavioral alerts on the syscall layer.

The key factors:

No NtSetContextThread calls (no DR modifications, no kernel telemetry)
No VEH registration (no RtlAddVectoredExceptionHandler)
Call stack shows legitimate function names at expected offsets (NtXxx+0x14 for each matched gadget)
Fake frame gadgets are in functions with trivial UNWIND_INFO (.pdata-aware selection), so RtlVirtualUnwind calculates minimal expected frame sizes that match our synthetic layout
The injection sequence uses APC instead of CreateThread, avoiding the monitored pattern

The .pdata-aware gadget selection is what differentiates Shinkiro's fake frames from earlier techniques like SilentMoonwalk or LoudSunRun. Those implementations pick gadgets without considering unwind metadata. In testing, CrowdStrike's stack walker accepted Shinkiro's frames because the unwind codes for the selected gadget functions describe frames simple enough that our synthetic layout satisfies the validation. This is not a guarantee against future updates to CrowdStrike's unwinder, but it significantly raises the bar compared to arbitrary gadget selection.

Figure: Shinkiro executing against CrowdStrike Falcon, Policy Prevention 3 (Aggressive)

What Shinkiro does not solve

Unsigned module detection. The PE is not code-signed. EDR rules checking call_stack_final_user_module.code_signature.trusted == false will still fire.

Unbacked memory execution. The shellcode runs from VirtualAlloc-allocated memory not backed by any file on disk. Module stomping mitigates this.

PEB walking pattern. Shinkiro uses PEB walking with XOR-obfuscated offsets to find ntdll. The memory access pattern through PEB structures remains a behavioral signal for EDRs with deep instrumentation.

Hooked function resilience. An earlier version of Shinkiro verified the full stub prologue (4C 8B D1 B8) before extracting the matched gadget at +0x12. When the prologue was hooked (overwritten with a JMP by the EDR), the lookup failed entirely and fell back to a generic .text scan, triggering the "Unusual NTDLL Offset" alert.

The current implementation removes the prologue check. The key observation: EDR hooks overwrite the first 5-14 bytes of the stub (the prologue), but leave the syscall;ret sequence at offset +0x12 intact. The EDR needs that syscall instruction for its own execution chain. Since Shinkiro resolves SSNs via Zw* export sorting (not from the prologue), the hooked bytes are irrelevant.

The matched gadget resolution now only verifies three bytes: func[0x12] == 0x0F, func[0x13] == 0x05, func[0x14] == 0xC3. This works even when 100% of Nt* prologues are hooked. The only failure case is an EDR that patches the syscall instruction itself, a scenario not observed in any current product (CrowdStrike, Elastic, SentinelOne, MDE) as of March 2026.

For environments where even this level of resilience is insufficient, advanced ntdll unhooking techniques (mapping a fresh copy from disk, restoring the .text section from a suspended process) can be combined with Shinkiro to guarantee clean stubs. This will be the subject of a future post.

Building an evasive loader around Shinkiro

Shinkiro handles the syscall layer. But a production-grade loader needs more than clean syscalls. The syscall technique is one piece of a pipeline, and the other pieces matter just as much.

An example of a winning combination around Shinkiro:

Ntdll unhooking to restore a clean ntdll before anything else (NtCreateSection + NtMapViewOfSection from disk)
ETW bypass to reduce kernel telemetry (patchless approach via NtContinue, coexists with Shinkiro's zero-DR design)
API hashing + PEB walking with per-build randomized DJB2 salt and XOR-obfuscated PEB offsets to eliminate all static strings
Timing obfuscation with decoy syscalls and random delays between each operation to break temporal correlation

For operators who want to push it further, module stomping (overwriting a sacrificial DLL's .text section with the shellcode) eliminates the "unbacked memory" detection vector entirely. I tested this with Shinkiro's indirect syscalls and it works. The key detail: use DLLs from outside C:\Windows\System32\ to avoid Elastic's "Suspicious System Module Image Hollowing" rule.

One limitation of Shinkiro's current scope is that it only covers Nt* syscalls. Win32 API functions that have no direct syscall equivalent (CreateToolhelp32Snapshot, WinHttpOpen, CryptDecrypt, etc.) still go through standard DLL call chains and remain subject to call stack inspection. The same .pdata-aware fake frame construction used in Shinkiro can be adapted to spoof the call stack for these Win32 calls, ensuring that every frame in the chain passes both symbol validation and unwind verification. I'm currently working on this extension and will cover it in a dedicated post.

Each of these topics deserves its own deep dive. Future posts will cover ntdll unhooking internals, ETW bypass techniques, module stomping implementation details, and Win32 call stack spoofing.

Conclusion

When I started building Shinkiro, the core insight was straightforward: modern EDR detection logic is more precise than the offensive community often assumes. Elastic Security doesn't detect "indirect syscalls" as a category. It detects mismatches between the function name in the call stack and the API being called. CrowdStrike doesn't just check return addresses. It validates entire frame chains against unwind tables. Understanding these specifics changes the entire design: the goal is conformance, not randomization.

The APC execution component addresses the behavioral layer separately. By replacing NtCreateThreadEx with NtQueueApcThread + NtTestAlert, the injection chain pattern is broken without sacrificing functionality. APC-based execution is starting to appear on EDR vendor radars, but when every syscall in the chain goes through Shinkiro's fake-frame stubs with matched gadgets, the behavioral correlation between the APC queue and the preceding memory operations is significantly harder to establish.

The real value is not in any single component. It's in understanding what each EDR actually checks and engineering a solution that conforms to their expectations at every layer, rather than trying to hide from them.

Find me on LinkedIn // Enenra

References

FreshyCalls, Crummie5, 2022
Hell's Gate, am0nsec & smelly__vx, 2020
SysWhispers3, klezVirus, 2022
Elastic Protection Artifacts, behavioral rule source
Doubling Down: ETW Callstacks, Elastic Security Labs
SilentMoonwalk, kleiton0x00, 2023
Kagura-StackWalker: The Stack Is a Dance, Enenra, 2026
Kagami: The Three Layers of Full Sleep Obfuscation, Enenra, 2026