This guide is a detailed, self-referential resource for enhancing the x86-16 real-mode decompiler in the angr-platforms project. It draws from an exhaustive analysis of the project's source code (primarily the X86_16 module), angr's integration lifecycle (from tutorials), practical usage in root files like decompile.py and test_decompile.py, identified incompletenesses via code searches, and a thorough investigation of the angr core source code installed via pip in the venv (at /home/xor/vextest/venv/lib/python3.12/site-packages/angr). The angr investigation focused on key modules for decompiler/CFG integration, revealing how x86-16's custom Lifter feeds into angr's VEX-based analyses. The focus is on expanding instruction coverage, refining real-mode features (segmentation, interrupts, I/O), boosting VEX IR accuracy for symbolic execution, and enabling reliable pseudocode generation.
The x86-16 extension allows angr to lift 16-bit real-mode binaries (e.g., DOS .COM/.EXE) to VEX IR for decompilation, without a complete angr patch (upstream integration recommended for register/VEX support). Core strengths: Gymrat-inspired lifter for VEX, custom ModR/M parsing, blob loading for raw bytes. Limitations: ~70% instruction coverage, stubbed hardware, sparse tests, symbolic gaps.
angr (installed via pip in venv/lib/python3.12/site-packages/angr) is a binary analysis framework using VEX IR for symbolic execution. Key components from source:
-
Project (
project.py) : Central class (lines 56-847) initializing CLE loader (line 150: cle.Loader for blob/arch), arch/simos, hooks SimProcedures for unresolved symbols (lines 300-444: _register_object scans imports, replaces with SIM_PROCEDURES like stubs/ReturnUnconstrained for unknown calls), factory for states (line 233: AngrObjectFactory), analyses hub (lines 290-295: AnalysesHub with presets for CFG/Decompiler). For x86-16, Project loads Blob, sets Arch, registers Lifter, enabling IRSB lifting. -
Decompiler (
analyses/decompiler/decompiler.py) : Analysis (lines 44-734) for Function to pseudocode. Pipeline: Clinic (VEX IRSB to AIL, lines 264-290), RegionIdentifier (control regions, lines 324-425), RecursiveStructurer/Phoenix (loop/if structuring, lines 352-359), RegionSimplifier (simplification, lines 360-370), CStructuredCodeGenerator (C output, lines 383-401). Relies on accurate IRSB; x86-16 gaps (unknown ops) cause unoptimized AIL (e.g., NoOp sequences, lines 152-413: _decompile with resilience/fallback to basic preset). -
CFG (
analyses/cfg/cfg.py) : Wrapper for CFGFast (lines 13-74: compatibility, raises OutdatedError for emulated params). Builds graph from IRSB lifting (fast mode); x86-16 incomplete IRSB leads to missed edges/nodes (no context-sensitivity by default). -
SimOS (
simos/simos.py) : Base for OS config (lines 25-444): state_blank for init (registers/memory, lines 90-230), syscall handling (syscall_from_number dummy, lines 333-340), prepare_call_state for ABI (lines 277-302), GDT for x86 segments (generate_gdt/setup_gdt, lines 352-444). x86-16's simos_86_16 extends for DOS INT (AX syscall) but lacks full SimProcedures, causing unresolved calls (e.g., INT 21h stubs).
VEX engine (engines/vex/ subdirs like claripy/heavy/light): Uses pyvex.lift for IRSB; custom x86-16 Lifter integrates via register, but gaps propagate to CFG (missed blocks) and Decompiler (garbled code).
angr's modularity allows x86-16 extension without core changes, but upstream patches (e.g., for 16-bit VEX) would improve stability.
angr's pipeline: CLE loading → Arch identification → VEX lifting → SimEngineVEX execution → SimOS simulation. x86-16 fits as follows:
- Loading (CLE): Blob backend for raw binaries (e.g.,
test2.bin), entry at 0x0. No native EXE support; extend for MZ headers if needed. - Architecture (archinfo.Arch):
arch_86_16.pysubclasses Arch with 16-bit bits