Kagami: The Three Layers of Full Sleep Obfuscation

Kagami (鏡) is the sacred mirror of Shinto shrines, the object that reflects the kami, the divine spirit, back at the observer. In Japanese folklore, a mirror reveals what is truly there, beneath the surface. In offensive security, the opposite is needed: during sleep, you want the mirror to reflect nothing. No code. No stack. No data. Just emptiness where a beacon used to be.

A modern beacon spends more than 95 percent of its life asleep. Sixty seconds of check-in time, fifty-nine minutes and forty seconds of waiting. During that waiting period, the beacon sits frozen in memory, and memory scanners have all the time they need to dismantle it. Every byte of code, every frame of the thread stack, every allocation on the heap is a potential indicator of compromise.

The best-known sleep obfuscation techniques (Ekko, Foliage, Zilean) are showing their age. EDRs have caught up: a simple code encryption pass no longer buys you a clean sleep. The stack still attributes you, the heap still betrays your config, and the protection-change telemetry lights up every correlation engine on the market. A new generation of sleep obfuscation has to address all three surfaces at once.

This post walks through the complete architecture of such a system. Three layers, three distinct problems, three solutions that only work when combined. I will show you what each layer closes, what it leaves open, and why skipping any single one collapses the entire evasion.

This is a theoretical deep dive. I will not release the source of my implant. But I will show enough of the mechanics, with partial code where necessary, to make the architecture reproducible by anyone who understands the Windows memory manager.

Part I: Why Sleep Is the Critical Moment

What actually happens when a beacon sleeps

When a beacon calls Sleep(60000) or any of its siblings, the following chain executes:

Sleep(60000)
  ├→ kernelbase!SleepEx
        ├→ ntdll!NtDelayExecution(FALSE, &interval)
              └→ syscall → KiSystemCall64 → nt!NtDelayExecution
                    └→ thread state = Waiting:DelayExecution

The kernel places the thread in a wait state. Its register context is saved. Its stack is frozen in place, exactly as it was at the moment of the call. For the next sixty seconds, nothing executes in that thread. But everything is still visible: the code section, the stack, the heap, the thread environment block, the module list.

The three attack surfaces

A process in memory has three distinct regions that can attribute it to a malicious beacon:

Surface A, the code section. The .text of the implant contains all the interesting logic: the syscall resolver, the C2 routines, the encryption primitives, the beacon loop. It sits in a MEM_PRIVATE region (if the beacon is shellcode) or MEM_IMAGE (if it is an EXE or uses module stomping). Any memory scanner with YARA rules will find signature matches. Any scanner that knows the Moneta checks will flag a MEM_PRIVATE region with executable permissions as "Private Executable Memory" or "Unbacked Shellcode" with high confidence.

Surface B, the thread stack. Every thread has a stack. When a thread sleeps, whether in NtDelayExecution or NtWaitForSingleObject, its stack contains the chain of return addresses that led it there. A scanner walks this stack. At least one frame points into your module, in unsigned private memory. That pointer alone is enough to attribute the beacon. Hunt-Sleeping-Beacons does exactly this: it enumerates waiting threads (in its original form, those in Waiting:DelayExecution) and walks each stack looking for return addresses inside MEM_PRIVATE.

Surface C, the heap. The heap contains the runtime data: the C2 configuration (URL, User-Agent, sleep delay, jitter), session keys, buffered command output, captured credentials. Even if the code is encrypted, the heap remains readable. PE-sieve dumps the entire process. Every string, every allocated block, every piece of config is there in plain text.

A memory scan at any moment during sleep reads all three surfaces simultaneously. Each is sufficient, on its own, to burn the beacon. A real sleep obfuscation system has to close all three.
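To make Surface B concrete, here is a minimal sketch of that check from the scanner's side. It is a simplified model, not the code of Hunt-Sleeping-Beacons or of any EDR: the region table, the type names, and first_suspicious_frame are all hypothetical, and a real scanner would get this data from VirtualQueryEx and a proper stack walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a memory region as a scanner models it. */
typedef enum { REGION_IMAGE, REGION_MAPPED, REGION_PRIVATE } region_type;

typedef struct {
    uint64_t    base;   /* first address of the region */
    uint64_t    size;   /* length in bytes */
    region_type type;
} region;

/* Return the region containing addr, or NULL if no region does. */
static const region *find_region(const region *regions, size_t count, uint64_t addr) {
    for (size_t i = 0; i < count; i++) {
        if (addr >= regions[i].base && addr - regions[i].base < regions[i].size)
            return &regions[i];
    }
    return NULL;
}

/* Walk a list of return addresses (innermost frame first) and return the
 * index of the first one that lands in private or unmapped memory,
 * or -1 if every frame falls inside an image or mapped region. */
int first_suspicious_frame(const region *regions, size_t region_count,
                           const uint64_t *return_addrs, size_t frame_count) {
    assert(regions != NULL && return_addrs != NULL);
    for (size_t i = 0; i < frame_count; i++) {
        const region *r = find_region(regions, region_count, return_addrs[i]);
        if (r == NULL || r->type == REGION_PRIVATE)
            return (int)i;
    }
    return -1;
}
```

Note that the check never disassembles anything and never reads the code bytes. It only asks where each return address lives, which is why encrypting the code does nothing for this surface.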

Why one layer is not enough

It is tempting to think of sleep obfuscation as a single technique. Encrypt the code, done. But that only closes Surface A. The stack still attributes you. The heap still betrays your config. A thorough scanner sees all three, and the weakest layer is always the one that burns you.

The three layers are not additive. They are multiplicative. Each one closes a surface the others cannot reach. Skipping any single layer collapses the entire evasion because the scanner simply pivots to the remaining open surface.

The rest of this post walks through each layer in order: what it does, what it closes, what it leaves open, and how the next layer picks up where the previous one stops.

Part II: Layer 1, Code Encryption

The principle

The first layer encrypts the .text section of the beacon before sleeping, and decrypts it after waking. While sleeping, the code is unreadable gibberish. A memory scanner that reads the region finds high-entropy bytes instead of x86 instructions.

In pseudocode:

VirtualProtect(.text, size, PAGE_READWRITE)
XOR/RC4/AES the content with a key
NtWaitForSingleObject(timer)
XOR/RC4/AES to decrypt (same key)
VirtualProtect(.text, size, PAGE_EXECUTE_READ)

This is what Ekko, Foliage, Zilean, and most first-generation sleep obfuscators implement. It is the foundational layer, and every serious implant needs it.

The real cost: kernel callbacks on every protection change

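The pseudocode above shows two protection changes. A more careful variant, which also marks the pages PAGE_NOACCESS for the duration of the wait, makes four calls to NtProtectVirtualMemory per sleep cycle: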

RX        → RW         (so we can write the XOR)
RW        → NOACCESS   (hide the pages during sleep)
NOACCESS  → RW         (on wake, reopen for XOR)
RW        → RX         (restore executable)

Every one of these calls is visible from the kernel. Not through a userland hook, but from inside the kernel itself. Strictly speaking, Windows has no registrable callback for protection changes: ObRegisterCallbacks only covers handle operations on process, thread, and desktop objects. The signal comes from the Threat-Intelligence ETW instrumentation built into the memory manager, which this post loosely calls a kernel callback. An EDR consuming that telemetry sees:

The calling process and the target process
The base address of the region
The old protection and the new protection
A captured stack trace of who called NtProtectVirtualMemory, where stack capture is enabled

The events surface on the Microsoft-Windows-Kernel-Memory and Microsoft-Windows-Threat-Intelligence ETW providers. The latter can only be consumed by a protected anti-malware (PPL) service, which is exactly what an EDR agent runs as. Even with a perfect indirect syscall that hides the userland call stack, the protection change itself is visible from the kernel side.

The pattern detectors built around this telemetry are specific and brutal:

Elastic Security flags the sequence RX → RW → RX on a MEM_PRIVATE region within a short time window as "Memory Protection Flip-Flop" with high confidence
CrowdStrike Falcon correlates the protection change with the return address of the caller. If the return address points to a non-image region, the threat score jumps significantly
Microsoft Defender for Endpoint classifies the PAGE_EXECUTE_READ → PAGE_READWRITE → PAGE_EXECUTE_READ pattern as "self-modifying code" by default

Four kernel callbacks per cycle. On a sleep of thirty seconds, that is eight callbacks per minute, or 480 per hour, on the same memory region. The correlation window closes, and the alert fires.
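The arithmetic is worth writing down, because it shows how the count scales with the sleep interval. A minimal sketch, where protect_calls_per_hour is a hypothetical helper that ignores check-in time and jitter:

```c
#include <assert.h>

/* Number of protection changes per hour on one region, given the sleep
 * interval in seconds and the number of NtProtectVirtualMemory calls
 * made per sleep cycle. Check-in time and jitter are ignored. */
unsigned protect_calls_per_hour(unsigned sleep_seconds, unsigned calls_per_cycle) {
    assert(sleep_seconds > 0);
    unsigned cycles_per_hour = 3600u / sleep_seconds;
    return cycles_per_hour * calls_per_cycle;
}
```

With a thirty-second sleep and four calls per cycle this gives 120 cycles and 480 calls per hour. Doubling the sleep to sixty seconds only halves the count to 240. The pattern gets slower, not quieter.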

Who wakes up the sleeping beacon

A question most articles do not address. If the code is encrypted, what decrypts it? The encryption logic itself must live somewhere executable. This "something" is the wake-up mechanism, and it has to satisfy three constraints:

It must be executable at wake time (since the beacon's main code is encrypted)
It must contain the decryption routine
It must be triggered by something outside the encrypted code (because encrypted code cannot wake itself up)

Different approaches solve this differently:

Ekko uses timer queue callbacks. CreateTimerQueueTimer schedules a series of timers on the Windows thread pool. When each timer fires, a worker thread calls NtContinue with a prepared CONTEXT, which redirects it into a function inside a signed DLL. Chained together, these contexts behave like a ROP chain and perform the decryption.
Foliage uses an APC queued through NtQueueApcThread. When the beacon thread is alertable, the APC fires and executes the decryption ROP chain.
Waitable timer + APC routine: combines NtCreateTimer with an APC callback.

In every case, the wake-up mechanism is external to the encrypted code, because encrypted code cannot execute. This external trigger is always the residual, the part that cannot be hidden. It becomes the new target for the scanner.

What Layer 1 alone leaves open

With only Layer 1, several failures remain:

Failure 1: the stack still attributes you. Your code is encrypted, but the stack of your thread still contains return addresses that point to your module. The scanner does not need to disassemble. It just verifies that the address falls inside a MEM_PRIVATE region. It does. Detection.

Failure 2: the heap is naked. Config, keys, tokens, output buffers, all sit in plain text on the heap. Any dump reads them.

Failure 3: the wake-up trampoline is an indicator by itself. A timer queue callback that runs every thirty seconds, calling NtProtectVirtualMemory on the same region, is a textbook beacon signature.

Failure 4: the flip-flop telemetry. Four NtProtectVirtualMemory calls per cycle equals 480 per hour on the same region. No cryptography hides that from kernel-side observation.

Layer 1 is necessary. It is not sufficient. The next layer picks up the stack problem.

Part III: Layer 2, Thread Stack Spoofing

The principle

While the thread sleeps, rewrite the saved RBP and return addresses on its stack. Replace them with addresses that point to legitimate system functions. After the rewrite, a stack walk reads something like this (the offsets are illustrative and vary by Windows build):

Frame 1: ntdll!NtWaitForSingleObject+0x14
Frame 2: kernelbase!WaitForSingleObjectEx+0x8E
Frame 3: kernel32!BaseThreadInitThunk+0x14
Frame 4: ntdll!RtlUserThreadStart+0x21

Every frame points inside a Microsoft-signed DLL loaded legitimately. None of them points into your module. The thread looks like a system worker waiting on a handle.

I covered the full theory of stack frame construction and the six EDR validation checks in the Kagura-StackWalker article. If you have not read it, the summary is this: you cannot just push random return addresses. Each address must pass RtlLookupFunctionEntry, the instruction before the return address must be a real CALL, the frame size must match the UNWIND_INFO recorded in the PE's .pdata section, and the whole chain must terminate at RtlUserThreadStart with a return address of zero.

Targeted resolution beats generic gadget hunting

Most public stack spoofers (CallStackMasker, generic .pdata gadget scanners) follow the same shortcut: scan every function in kernelbase/kernel32/ntdll, score them by unwind simplicity (low CountOfCodes, no FrameRegister, no handler), pick the smallest matches, and call them "gadgets." The fake chain ends up looking like:

kernelbase!RandomTinyHelper+0x12
kernel32!AnotherShortFunc+0x08
ntdll!ObscureWrapper+0x1A

This passes RtlVirtualUnwind's arithmetic, but the function names are nonsense in the context of a sleeping thread. A defender who builds a baseline of real sleep-chain frames sees garbage names that no real worker thread would ever produce. The spoof is geometrically valid and semantically wrong.

The Shinkiro approach is the opposite: do not search for gadgets. Resolve the actual functions that a real sleeping thread would have on its stack, and parse their real UNWIND_INFO at runtime.

A genuine thread sleeping on WaitForSingleObjectEx has this exact chain:

ntdll!NtWaitForSingleObject+0x14    ← the syscall return point
kernelbase!WaitForSingleObjectEx+0x5f
kernel32!BaseThreadInitThunk+0x20
ntdll!RtlUserThreadStart+0x2c

Each of those return addresses is the byte immediately after a specific CALL instruction inside a specific function. They are not interchangeable with random gadgets. A defender who reads frame names recognizes this chain instantly. A defender who replays the unwind sees frame sizes that match the real layouts recorded in each module's .pdata.

Resolving the four CALL sites

The resolver walks each target function's body looking for the specific CALL that leads to the next link in the chain. Four x64 instruction patterns cover essentially everything in modern Windows:

Pattern	Bytes	Where it shows up
`call [rip+disp32]`	FF 15 ?? ?? ?? ??	IAT calls, e.g. `WaitForSingleObjectEx → NtWaitForSingleObject`
`call rel32`	E8 ?? ?? ?? ??	Direct intra-module calls
`mov reg, [rip+disp32]` then `call`	48 8B ?? ?? ?? ?? ?? then E8 / FF Dx	CFG dispatch, e.g. `RtlUserThreadStart → guard_dispatch_icall → BaseThreadInitThunk`
`call reg`	FF D0..D7	Register-indirect, e.g. `BaseThreadInitThunk → thread_entry`

Each match is verified by resolving the target pointer and confirming it lands in the expected module. No hardcoded offsets, no version-specific magic numbers. The resolver is meant to work across Windows 10 and 11 builds because it locates the CALL by structure, not by address.

Parsing the real UNWIND_INFO

Once the resolver has identified the four target return addresses, it parses each function's actual UNWIND_INFO from .pdata to compute the exact frame layout. This is the same logic Kagura applies, but at runtime, inside the implant, with no precomputed metadata:

// Skeleton, not the full implementation
typedef struct {
    DWORD allocSize;       // total bytes of stack reserved by the prolog
    DWORD numPushed;       // number of PUSH_NONVOL operations
    BOOL  usesFramePtr;    // does the function set a frame pointer?
    BYTE  frameReg;
    DWORD frameOffset;
} TS_FRAME_INFO;

BOOL ParseFrameLayout(PVOID moduleBase, PVOID funcAddr, TS_FRAME_INFO* info) {
    RUNTIME_FUNCTION* rf = FindRuntimeFunction(moduleBase, funcAddr);
    if (rf == NULL) return FALSE;   // no .pdata entry (leaf function) or lookup failure
    UNWIND_INFO* uwi = (UNWIND_INFO*)((BYTE*)moduleBase + rf->UnwindData);
    // The unwind code array follows the fixed 4-byte UNWIND_INFO header
    UNWIND_CODE* codes = (UNWIND_CODE*)((BYTE*)uwi + 4);

    memset(info, 0, sizeof(*info));   // the counters below accumulate, so start from zero
    for (DWORD i = 0; i < uwi->CountOfCodes; ) {
        BYTE op     = codes[i].UnwindOp_OpInfo & 0x0F;
        BYTE opInfo = (codes[i].UnwindOp_OpInfo >> 4) & 0x0F;
        switch (op) {
            case UWOP_PUSH_NONVOL:  info->numPushed++;               i += 1; break;
            case UWOP_ALLOC_SMALL:  info->allocSize += opInfo*8 + 8; i += 1; break;
            case UWOP_ALLOC_LARGE:
                if (opInfo == 0) {  // next slot holds size / 8
                    info->allocSize += *(USHORT*)&codes[i + 1] * 8;  i += 2;
                } else {            // next two slots hold the full 32-bit size
                    info->allocSize += *(DWORD*)&codes[i + 1];       i += 3;
                }
                break;
            case UWOP_SET_FPREG:    info->usesFramePtr = TRUE; /* ... */ i += 1; break;
            case UWOP_SAVE_NONVOL:
            case UWOP_SAVE_XMM128:      i += 2; break;   // one extra slot each
            case UWOP_SAVE_NONVOL_FAR:
            case UWOP_SAVE_XMM128_FAR:  i += 3; break;   // two extra slots each
            default: i += 1; break;
        }
    }
    return TRUE;
}

The output is the exact frame size, push count, and frame-pointer status of the real function as the compiler emitted it. The fake stack uses these numbers verbatim. When the EDR replays the unwind, the arithmetic matches because the layout was copied from the real function's metadata, not approximated from a generic gadget.

SET_FPREG: bridging the stub into the fake chain

One subtle problem remains: how does the unwinder transition from the spoofing stub's own frame into the fabricated chain? The trick is UWOP_SET_FPREG, declared in the stub's prolog as .setframe rbp, 0. During unwind, the operations replay in this order:

1. UWOP_ALLOC_SMALL: RSP += 0x40 (the stub's local allocation, ignored because SET_FPREG overrides it)
2. UWOP_SET_FPREG: RSP = RBP, which the stub previously set to point into the fake-frame zone
3. UWOP_PUSH_NONVOL: pop RBP, RSP += 8
4. Return address = [RSP] = WaitForSingleObjectEx+0x5f

From step 4 onwards, the unwinder is inside real kernelbase code, walking real .pdata. Every subsequent frame is geometrically and semantically perfect, because the unwinder is no longer being tricked, it is following genuine metadata that just happens to point to addresses we deliberately seeded in our fabricated stack zone.

The result, validated against WinDbg's kv command, is indistinguishable from a real sleeping thread.

The chicken and egg of the spoofer

To spoof the stack, code must execute. To encrypt code, code must execute. Who runs first?

The sequence is fixed:

1. Encrypt the heap            (Layer 3, covered next)
2. Save the real stack context  (so we can restore it on wake)
3. Write the fake stack frames on top of the real ones
4. Encrypt the .text           (Layer 1)
5. Enter the wait
                 (sleep)
6. Wake-up mechanism fires
7. Decrypt the .text           (Layer 1, reverse)
8. Restore the real stack       (Layer 2, reverse)
9. Decrypt the heap             (Layer 3, reverse)
10. Resume normal execution

The wake-up trampoline must live somewhere in memory that can execute while everything else is encrypted. It must also restore the saved stack context and call the decryption routines in order. This is the core residue of Layer 2. No matter how clever you are, something has to stay in clear and be callable externally.

What Layer 2 still leaves open

Failure 1: the heap is still naked. Same problem as before. Now that the code and the stack are both protected, the heap becomes the single largest remaining surface, and all the interesting data lives there.

Failure 2: the spoofer's own code is visible. The spoofing routine, the targeted resolver, the unwind parser, and the XOR primitive all need to execute during the sleep transition. Where does this code live, and what protects it?

Failure 3: MEM_PRIVATE does not change. Layer 2 hides your code from the stack, but the region itself is still MEM_PRIVATE with executable permissions. Moneta still flags it as "Private Executable Memory." The scanner just has less context to act on.

The last one, Failure 3, is the one most articles skip. Encrypting the content does not change the memory type. The next layer, Layer 3, does not fix it either. There is a fourth trick, and I will cover it in the final section.

Part IV: Layer 3, Heap Encryption

The principle

Before sleeping, encrypt every tracked heap allocation. On wake, decrypt. While sleeping, the heap contains only ciphertext. YARA rules looking for the C2 URL find nothing. String scrapers find nothing. A full process dump yields a blob of high-entropy bytes where the configuration used to be.

Simple on the surface. Three non-trivial constraints in practice.

Constraint 1: you cannot encrypt the process default heap

If the beacon is injected into explorer.exe or any host process, the default heap belongs to the host. It is used constantly by system threads. If you encrypt it, you break the host process. Crash. Game over.

The solution is a private heap created via RtlCreateHeap. Every allocation used by the beacon goes through a wrapper that allocates on this private heap and tracks the block in a linked list:

// Private heap creation
g_privateHeap = RtlCreateHeap(HEAP_GROWABLE, NULL, 0, 0, NULL, NULL);

// Tracked allocation
void* kg_alloc(size_t sz) {
    void* p = RtlAllocateHeap(g_privateHeap, 0, sz);
    if (p != NULL)
        kg_track(p, sz);    // add to tracked-blocks list (only on success)
    return p;
}

The benefit is twofold. First, encrypting only the private heap leaves the host untouched. Second, a scanner walking the default heap of the host process finds nothing of yours. Another surface removed.

Constraint 2: where the encryption routine executes

This is the constraint that most implementations get wrong. If your XOR or AES routine lives in the beacon's .text, then during the encryption loop:

Your crypto code must be in clear (otherwise it cannot execute)
The thread stack during the encryption points into your module
Moneta observes a MEM_PRIVATE region running tight loops over heap memory. This is, itself, a behavioral signature

The elegant solution is to use a crypto primitive inside a Microsoft-signed DLL. advapi32!SystemFunction032 is RC4 exposed from a signed module. bcrypt!BCryptEncrypt is AES exposed from another signed module. When you call these functions, the encryption runs from MEM_IMAGE memory, not yours.

Wrap the call in a call stack spoof from Layer 2, and during encryption the EDR sees:

advapi32!SystemFunction032+0x??
kernelbase!SomeGadget+0x??          ← fake frame
kernel32!BaseThreadInitThunk+0x14   ← fake frame
ntdll!RtlUserThreadStart+0x21       ← terminator

No frame of the stack points to your beacon. The encryption looks like legitimate use of a system crypto primitive.

Constraint 3: ephemeral keys and clean free

The encryption key is itself a secret. Store it in a global variable and it lives in your BSS segment in clear while sleeping. That is a key-recovery vector for anyone who snapshots the process at any point.

Two rules fix this:

Rule A: derive a fresh key every sleep cycle. Source of entropy: __rdtsc(), optionally mixed with QueryPerformanceCounter(). The key exists from the moment encryption begins until decryption completes, then it is zeroed.

// Fresh 16-byte key per cycle
for (int i = 0; i < 16; i++)
    key[i] = (BYTE)(__rdtsc() >> (i * 4));

Rule B: zero with volatile to defeat the compiler. A plain memset(key, 0, 16) with no later read of key is a dead store, and an optimizing compiler is allowed to remove it because the bytes are never read again. The fix:

// Volatile zero, survives optimization
for (int i = 0; i < 16; i++)
    ((volatile BYTE*)key)[i] = 0;
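The same idea can be packaged as a small helper. This is a generic sketch, not code from the implant, and secure_zero is a hypothetical name. On Windows, SecureZeroMemory gives the same guarantee, and C11's optional memset_s is the portable equivalent where it is available.

```c
#include <assert.h>
#include <stddef.h>

/* Overwrite len bytes at buf with zeros. Writing through a volatile
 * pointer forces the compiler to emit every store, even when the
 * buffer is never read again afterwards. */
void secure_zero(void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len--)
        *p++ = 0;
}
```

The call site then reads secure_zero(key, sizeof key), which is harder to get wrong than repeating the cast-and-loop at every place a secret goes out of scope.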

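Once the key has been zeroed, nothing of it persists from one cycle to the next. One caveat: XOR and RC4 need the same key to decrypt, so the key has to survive the wait itself. If it sits in a stack local for the duration of the sleep, a snapshot taken during that window can still recover it from the thread stack.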

Same principle applies to free. A freed block sits in the private heap's free list with its contents intact until it is reallocated. If you free a buffer containing a credential and then sleep, the credential survives in the freed block. Volatile-zero the block before calling RtlFreeHeap and the residue disappears.

What Layer 3 still leaves open

After three layers, here is what a memory scan during sleep sees:

Code: ciphertext, high entropy, unrecognizable (Layer 1)
Stack: legitimate-looking chain of system frames (Layer 2)
Heap: ciphertext, no readable strings, no config (Layer 3)

But the region types have not changed. The beacon's .text is still MEM_PRIVATE + RX even if its contents are encrypted. Moneta still flags "Private Executable Memory." The content is opaque but the region itself remains suspicious.
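Seen from the scanner's side, the reason is simple: a type-based check never reads the bytes. A minimal sketch of such a check, with hypothetical names (scanned_region, is_private_executable) and a far simpler model than Moneta actually uses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of what a scanner records about one region. */
typedef enum { MT_IMAGE, MT_MAPPED, MT_PRIVATE } mem_type;

typedef struct {
    mem_type       type;        /* how the region was created */
    bool           executable;  /* any execute permission set */
    const uint8_t *bytes;       /* contents, unused by the check below */
    size_t         length;
} scanned_region;

/* Flag a region purely on its type and protection.
 * The contents are deliberately not inspected. */
bool is_private_executable(const scanned_region *r) {
    assert(r != NULL);
    return r->type == MT_PRIVATE && r->executable;
}
```

A plaintext region and a ciphertext region with the same type and protection produce the same verdict, which is exactly the gap the three content layers leave behind.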

The wake-up trampoline is still a residue. It must stay in clear. It is small and signed-looking if you used ROP through advapi32, but it is there.

Every protection change, if you used Layer 1 with NtProtectVirtualMemory, still triggers kernel callbacks.

Three layers have closed the content-based detections. But memory-type detections and behavioral detections remain. This is where the dual-mapping technique comes in.

Part V: The Fourth Trick, Dual Mapping

The unsolvable contradiction

Layer 1 as described above relies on NtProtectVirtualMemory to switch between executable and writable, because XOR encryption needs write access and normal execution needs execute access. These two permissions cannot coexist on the same page. Or can they?

The contradiction is actually a constraint of a single view. One virtual address range with one set of permissions. If you had two views, each with different permissions, both pointing at the same physical memory, the problem disappears.

This is exactly what the Windows memory manager lets you do.

The primitive

A section object in Windows, created with NtCreateSection, can be mapped into a process multiple times. Each mapping is a view, and each view has its own set of Page Table Entries with its own permission bits. All views of the same section share the same physical pages.

              Physical RAM
              +--------------+
              | beacon code  |
              |  (plaintext  |
              |   or cipher) |
              +--------------+
                 ^        ^
                 | same   | same
                 | PFN    | PFN
      View_RW ---+        +--- View_RX
      (PAGE_READWRITE)    (PAGE_EXECUTE_READ)
      random addr         original .text addr

Any write to View_RW lands on the same physical page that View_RX reads from. There is no Copy-on-Write, because the section is pagefile-backed (FileHandle = NULL at creation) and both views use compatible protections (not WRITECOPY). The write is instantly visible through View_RX.

This is not a new Windows feature. Multiple views of the same section are the mechanism behind shared memory, inter-process communication, and Perun's Fart-style ntdll unhooking (where two views of ntdll.dll are mapped, one of them naturally unhooked). The innovation in the context of sleep obfuscation is the specific application: map View_RX at the exact virtual address the beacon's .text was originally loaded at.

The workflow

At beacon startup, before the first sleep, the remap runs once:

Step 1: NtCreateSection(PAGE_EXECUTE_READWRITE, SEC_COMMIT,
                        size = beaconTextSize, file = NULL)

Step 2: NtMapViewOfSection(section, addr = NULL, PAGE_READWRITE)
         → View_RW at random address

Step 3: memcpy(View_RW, beacon.text, size)   ← copy live code
                                                into the section pages

Step 4: NtFreeVirtualMemory(beacon_allocation, MEM_RELEASE)
         → original .text and everything around it is released

Step 5: NtMapViewOfSection(section, addr = original_text_addr,
                           PAGE_EXECUTE_READ)
         → View_RX at the EXACT address the .text used to be

Step 6: reallocate and restore the surrounding regions (PE headers,
         .data, .rdata) from a prior backup

After this sequence, the beacon has two views of the same physical pages. View_RX sits at the exact address the code was originally linked against, so every internal pointer, every relocation, every vtable, every function pointer still resolves correctly. View_RW sits at a random address that only the sleep routine knows about.

The sleep loop becomes trivial

With the dual-mapping in place, the actual sleep cycle collapses to:

void ss_sleep(DWORD sleepMs) {
    // Generate a fresh 16-byte key
    BYTE key[16];
    for (int i = 0; i < 16; i++) key[i] = (BYTE)(__rdtsc() >> (i * 4));

    // Encrypt via View_RW (PAGE_READWRITE)
    // Instantly visible as ciphertext through View_RX (PAGE_EXECUTE_READ)
    BYTE* p = (BYTE*)g_viewRW;
    for (size_t i = 0; i < g_textSize; i++) p[i] ^= key[i & 0xF];

    // Wait (indirect syscall via matched gadget, fake RBP chain)
    LARGE_INTEGER t;
    t.QuadPart = -(LONGLONG)sleepMs * 10000LL;
    NtWaitForSingleObject(g_hSleepEvent, FALSE, &t);

    // Decrypt (XOR is symmetric)
    for (size_t i = 0; i < g_textSize; i++) p[i] ^= key[i & 0xF];

    // Volatile zero of the key
    for (int i = 0; i < 16; i++) ((volatile BYTE*)key)[i] = 0;
}

Zero calls to NtProtectVirtualMemory per sleep cycle. No kernel callbacks on protection changes. No ETW protection-modification events. No flip-flop pattern. The View_RX permissions never change. Only the bytes underneath flip between plaintext and ciphertext.

Bonus: MEM_MAPPED instead of MEM_PRIVATE

Pages created through NtMapViewOfSection are of type MEM_MAPPED, not MEM_PRIVATE. This is a significant OPSEC improvement:

Detection	MEM_PRIVATE + RX	MEM_MAPPED + RX
Moneta "Unbacked Shellcode"	Flagged	Not triggered by type
Elastic "Shellcode Thread"	High confidence	Not triggered
MDE "Anomalous Executable Memory"	Classified suspicious	Classified low severity
SentinelOne "Suspicious Memory Region"	Detected	Not detected in default configs

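MEM_MAPPED regions are common in every Windows process: memory-mapped files, IPC regions, shared memory between processes. Executable MEM_MAPPED views with no file behind them are rarer, though, and some scanners, Moneta among them, check for executable mapped memory separately. The dual-mapped region sidesteps the MEM_PRIVATE detections, but it should not be assumed invisible.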

The tricky part: self-contained remap

Step 4 of the workflow releases the beacon's own allocation. If the remap function is inside that allocation, the instant NtFreeVirtualMemory succeeds, the code executing it disappears. Crash.

The solution is to copy the remap function into a separate RWX buffer and execute it from there:

// Measure the function size via a sentinel symbol
size_t funcSize = (size_t)((BYTE*)&_DoRemapEnd - (BYTE*)&_DoRemap);

// Allocate a buffer outside the beacon's main allocation
void* funcBuf = VirtualAlloc(NULL, funcSize,
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);
memcpy(funcBuf, (void*)&_DoRemap, funcSize);

// Execute from the separate buffer
typedef NTSTATUS (*RemapFn)(REMAP_CTX*);
RemapFn fn = (RemapFn)funcBuf;
NTSTATUS st = fn(ctx);

The buffer survives the release. The function copied inside it does too. But it has to be written under strict rules:

No references to global variables, because globals live in .data which is about to be freed. Every [rip + global] becomes a dangling pointer
No memcpy or memset, because these resolve to RIP-relative calls into the freed .text. Byte-by-byte copies with volatile pointers replace them
No debugging macros, no logger calls, no runtime dependencies
All the ntdll pointers it needs (NtCreateSection, NtMapViewOfSection, NtAllocateVirtualMemory, etc.) must be passed in through a context pointer, not resolved from globals

The discipline is similar to writing position-independent shellcode, because for a few dozen microseconds, the function is exactly that.

The critical-zone sentinel

Not all of the beacon's .text can be encrypted during sleep. The sleep routine itself, the indirect syscall wrappers, the call stack spoofing stubs, all have to remain executable while the rest of the code is ciphertext. If they were encrypted, the beacon could not even perform NtWaitForSingleObject to enter the sleep.

The trick is to split .text in two by placing a sentinel symbol at the boundary:

__attribute__((noinline))
void SleepSentinel(void) { asm volatile("ret\n"); }

Link order is arranged so that every offensive module (agent, commander, crypto, C2) sits before the sentinel, and every infrastructure module (syscall wrappers, stack spoofing, sleep logic) sits after:

[  Encryptable zone            |  SleepSentinel  |  Critical zone        ]
[  Agent, Commander, Crypt,... |                 |  Syscalls, CallSpoof, ]
[                               |                 |  WaitMask, SleepLogic ]
[  encrypted during sleep       |                 |  stays in clear       ]
[  80 to 90 percent of .text    |                 |  10 to 20 percent     ]

The sleep routine computes the encryptable size as &SleepSentinel - g_textBase, and only encrypts up to that point. The critical zone remains plaintext and serves as the permanent wake-up trampoline. No timer queue, no APC, no ROP chain. The beacon's own thread performs the encryption, waits, and performs the decryption, all from code that was never encrypted.

The complete sleep cycle, end to end

One diagram, the full theoretical wiring of a sleep cycle once the dual-mapping is in place. Entry, dark phase, exit. No source code, just the order of operations.

+--------------------------------------------------------------------+
|                       SLEEP ENTRY (active thread)                   |
+--------------------------------------------------------------------+

  Beacon main loop calls Sleep(ms)
            |
            v
  [1] Generate ephemeral 16-byte key
       - RtlGenRandom + __rdtsc mix for entropy
       - Key lives only in a stack local
            |
            v
  [2] Encrypt the heap (private heap only)
       - Iterate tracked allocations
       - Call advapi32!SystemFunction032 (RC4)
         wrapped in CallStack Spoofing
       ===> crypto runs from MEM_IMAGE signed DLL
            stack shows advapi32 -> ntdll, not the beacon
            |
            v
  [3] Spoof the thread stack
       - Save real RBP/RSP context to a hidden slot
       - Write fake frames pointing to:
              ntdll!NtWaitForSingleObject+0x14
              kernelbase!WaitForSingleObjectEx+0x5f
              kernel32!BaseThreadInitThunk+0x20
              ntdll!RtlUserThreadStart+0x2c
       - SET_FPREG bridges the spoof stub into the real
         .pdata chain so RtlVirtualUnwind walks cleanly
            |
            v
  [4] Encrypt the .text via View_RW (PAGE_READWRITE)
       for i in 0..encryptable_size:
           View_RW[i] ^= key[i & 0xF]
       ===> instantly visible as ciphertext through View_RX
            because both views share the same physical pages
            zero NtProtectVirtualMemory calls involved
            |
            v
  [5] Issue the wait
       NtWaitForSingleObject(sleepEvent, FALSE, timeout)
       via Shinkirō indirect syscall:
           - matched syscall;ret gadget at +0x12 of the real stub
           - 3 fake RBP-linked frames built before the syscall
            |
            v
+--------------------------------------------------------------------+
|                    SLEEP DARK PHASE (thread waiting)                |
+--------------------------------------------------------------------+

       ==========================================================
       ===   .text body   :  ciphertext (View_RX, MEM_MAPPED)  ===
       ===   private heap :  ciphertext (RC4)                  ===
       ===   thread stack :  legitimate sleep chain            ===
       ===   region type  :  MEM_MAPPED + RX (no Unbacked SC)  ===
       ===   key          :  zeroed before wait                ===
       ===   protections  :  ZERO NtProtectVirtualMemory calls ===
       ==========================================================

       Memory scanner observation during this window:
       - YARA on .text     : no match (ciphertext)
       - Moneta            : MEM_MAPPED, no Unbacked alert
       - Hunt-Sleeping     : stack walks to RtlUserThreadStart
       - PE-sieve dump     : heap blocks are RC4 ciphertext
       - ETW protect-flip  : no events emitted
       - Stack-walk EDR    : every frame in a signed module

            |
            v
+--------------------------------------------------------------------+
|                     SLEEP EXIT (wait returns)                       |
+--------------------------------------------------------------------+

  [6] Wait returns (timeout reached or event signaled)
            |
            v
  [7] Decrypt the .text via View_RW (XOR symmetric, same key)
       ===> code is executable again through View_RX
            |
            v
  [8] Restore the real thread stack from saved context
       - RBP/RSP back to the pre-spoof values
       - Fake frames overwritten (zeroed)
            |
            v
  [9] Decrypt the heap (SystemFunction032 again, RC4 symmetric)
       Same CallStack Spoofing wrapper
            |
            v
  [10] Volatile-zero the key
        for i in 0..16: ((volatile BYTE*)key)[i] = 0
        ===> defeats the optimizer that would otherwise
             remove the dead store
            |
            v
  Back to main loop -- the beacon is fully reconstituted
  in memory as if nothing happened. Ready for the next
  C2 check-in.

Five mechanisms stacked on a single execution path: a dual-mapped section for the code, targeted-resolution stack spoofing for the thread, signed-DLL crypto wrapped in call stack spoofing for the heap, ephemeral keys with volatile zeroing, and a matched-gadget indirect syscall for the wait itself. No external trampoline. No timer queue. No APC. The thread that goes to sleep is the same thread that wakes up, and during the sleep window every observable surface of the beacon is either encrypted, spoofed, or backed by a signed module.
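
Two of the steps in the diagram are plain C idioms that are easy to get subtly wrong: the symmetric XOR of steps [4] and [7], and the volatile wipe of step [10]. The following is a minimal sketch on an ordinary byte buffer, not the implant's code. The names xor_repeating, wipe_key, and KEY_LEN are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEY_LEN 16u /* assumption: matches the 16-byte key of step [1] */

/* XOR buf in place with a repeating 16-byte key.
 * Calling it twice with the same key restores the original bytes,
 * which is why steps [4] and [7] can share one routine. */
void xor_repeating(uint8_t *buf, size_t len, const uint8_t key[KEY_LEN])
{
    assert(buf != NULL && key != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key[i & (KEY_LEN - 1u)]; /* same index as key[i & 0xF] */
}

/* Zero the key through a volatile-qualified pointer. In practice this
 * keeps the compiler from discarding the stores as dead writes when the
 * key is never read again. */
void wipe_key(uint8_t key[KEY_LEN])
{
    assert(key != NULL);
    volatile uint8_t *p = key;
    for (size_t i = 0; i < KEY_LEN; i++)
        p[i] = 0;
}
```

The mask i & (KEY_LEN - 1u) only equals i % KEY_LEN because the key length is a power of two. Where the platform offers one, a dedicated primitive such as SecureZeroMemory on Windows or explicit_bzero on glibc and the BSDs expresses the same intent as the hand-written volatile loop.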

Part VI: The Full Picture

What each layer closes, what remains

Detection vector	Layer that closes it
YARA on .text (static content)	Layer 1 (code encryption)
Moneta "Private Executable Memory" by content	Layer 1 (content is ciphertext)
Moneta "Unbacked Shellcode" by memory type	Dual mapping (MEM_PRIVATE becomes MEM_MAPPED)
Hunt-Sleeping-Beacons stack walk	Layer 2 (stack spoofing)
CrowdStrike unwind validation	Layer 2 (targeted resolution + real UNWIND_INFO)
Elastic RtlVirtualUnwind check	Layer 2 (SET_FPREG bridge into real .pdata chain)
Protection change flip-flop telemetry	Dual mapping (zero NtProtectVirtualMemory calls)
ETW Microsoft-Windows-Kernel-Memory	Dual mapping (no protection changes emitted)
YARA on heap strings (URL, config)	Layer 3 (heap encryption)
PE-sieve process dump of heap	Layer 3 (heap encryption)
Key recovery from memory snapshot	Ephemeral keys + volatile zero
Default-heap fingerprint of beacon data	Private heap (RtlCreateHeap)

What still remains visible

Honesty matters here. The following signals are still detectable:

The critical zone. Between 10 and 30 kilobytes of MEM_PRIVATE + RX code remains in the clear after the sentinel. It contains the syscall wrappers, the call stack spoofing stubs, and the sleep routine itself. Moneta still flags this small residue as "Private Executable Memory." The mitigation requires a custom linker script that places the critical zone in a 64KB-aligned section suitable for a second dual-mapping, with the associated globals relocated to the heap. That is future work.

Behavioral patterns. The beacon wakes up, makes a DNS query, issues an HTTP request, sleeps again. This rhythm is detectable independently of all the memory-level protections. Modern defense is moving away from content inspection and toward behavioral correlation. No amount of encryption hides an obvious periodic callback.
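
To make the point concrete from the defender's side, here is a minimal sketch of how little it takes to flag a regular callback rhythm from network timestamps alone. It is an illustration under stated assumptions, not a description of any product: the function name looks_periodic, the threshold parameter max_cv, and the idea of using the coefficient of variation of the gaps between connections are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the gaps between consecutive timestamps are nearly
 * constant, 0 otherwise. "Nearly constant" means the coefficient of
 * variation (stddev / mean) of the gaps is at most max_cv.
 * ts must be sorted ascending, in seconds; fewer than 3 samples is
 * treated as "not enough data" and returns 0. */
int looks_periodic(const double *ts, size_t n, double max_cv)
{
    assert(ts != NULL || n == 0);
    if (n < 3 || max_cv < 0.0)
        return 0;

    size_t gaps = n - 1;
    double mean = (ts[n - 1] - ts[0]) / (double)gaps;
    if (mean <= 0.0)
        return 0;

    /* Population variance of the gaps around their mean. */
    double var = 0.0;
    for (size_t i = 1; i < n; i++) {
        double d = (ts[i] - ts[i - 1]) - mean;
        var += d * d;
    }
    var /= (double)gaps;

    /* Compare squared quantities to avoid needing sqrt():
     * stddev / mean <= max_cv  <=>  var <= (max_cv * mean)^2 */
    double limit = max_cv * mean;
    return var <= limit * limit;
}
```

A check-in every 3600 seconds produces gaps with zero variance and is flagged at any threshold, while ordinary browsing traffic with irregular gaps is not. Nothing in this check looks at process memory, which is why none of the three layers affects it.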

The wake-up micro-window. For a few microseconds between the encryption XOR and the NtWaitForSingleObject call, and again between NtWaitForSingleObject returning and the decryption XOR, the full state of the beacon is in the clear. An EDR that triggers memory scans on specific events (thread alert, protection change, APC queue) can, in theory, land inside this window. The probability is low but not zero.

Full MEM_MAPPED auditing. A theoretically possible detection correlates every MEM_MAPPED region with its backing file. A pagefile-backed section without any corresponding file on disk is distinguishable from a legitimate memory-mapped file. I have not seen an EDR product implement this correlation in real time, likely because it is expensive. It is on the horizon.

Closing thoughts

A beacon that bypasses CrowdStrike Falcon, Microsoft Defender for Endpoint, SentinelOne, and Elastic Security during sleep is not the product of a single technique. It is the intersection of three layers that together close every content-based detection surface, plus a fourth trick, the dual mapping, that closes the memory-type and protection-change telemetry that the three layers leave exposed.

If you skip Layer 1, your code is a YARA target. If you skip Layer 2, Hunt-Sleeping-Beacons finds you in ten seconds. If you skip Layer 3, your C2 URL sits in plain text on the heap for anyone to dump. If you skip the dual-mapping, your MEM_PRIVATE + RX region and your 480 NtProtectVirtualMemory calls per hour are both independent high-confidence signals.

This full sleep obfuscation system is now integrated into Kagemusha, my custom C2 framework built from scratch in C and x86-64 assembly. With Kagemusha, the beacon sleeps undetected under Elastic Security with full detection rules enabled, CrowdStrike Falcon at Prevention level 3, Microsoft Defender for Endpoint, and SentinelOne. The architecture described in this post is not theoretical anymore. It runs in production on real engagements.

The tooling behind this architecture (Shinkirō for matched-gadget indirect syscalls, Kagura for stack frame metadata, the dual-mapping remap, the private heap with signed-DLL encryption) came out of two years of studying C2 frameworks and the last eight months spent grinding through implant and malware development. I built it because off-the-shelf beacons burn on every top-tier engagement. I will not release the source, because that would accelerate the exact detection research I am trying to stay ahead of. But the architecture is not a secret. Every idea in this post is implementable by anyone who reads the Windows memory-manager documentation carefully and has patience for assembly-level detail.

The mirror reflects nothing during sleep. That is the goal. Three layers of obfuscation, a dual-mapped section, a signed crypto primitive, a private heap, an ephemeral key, a sentinel-bounded critical zone, and an indirect-syscall engine underneath. Subtract any one, and the reflection returns.

Find me on LinkedIn // Enenra

References

Microsoft, NtCreateSection and NtMapViewOfSection, MSDN
Microsoft, Section Objects and Views, Windows Internals 7th ed., Russinovich et al.
Microsoft, ObRegisterCallbacks kernel memory callbacks, MSDN
Upping the Ante: Detecting In-Memory Threats with Kernel Call Stacks, Elastic Security Labs
Doubling Down: ETW Callstacks, Elastic Security Labs
Moneta, Forrest Orr, Live Memory Usermode IOC Scanner
PE-sieve, hasherezade, Process Memory Scanner
Hunt-Sleeping-Beacons, thefLink
Ekko, Cracked5pider, Sleep Obfuscation via Timer Queue
Foliage, Sleep Obfuscation via APC
Zilean, Sleep Obfuscation via Thread Hijacking
AceLdr, Kyle Avery, Cobalt Strike UDRL with Sleep Mask
DeathSleep, CONTEXT manipulation sleep obfuscation
Kagura-StackWalker: The Stack Is a Dance, Enenra, 2026
Shinkirō: Matched-Gadget Indirect Syscalls, Enenra, 2026