Skip to content

Latest commit

 

History

History
28 lines (16 loc) · 4.4 KB

File metadata and controls

28 lines (16 loc) · 4.4 KB

Comprehensive Guide to Improving the x86-16 Bit Decompiler in angr-platforms

This guide is a detailed, self-referential resource for enhancing the x86-16 real-mode decompiler in the angr-platforms project. It draws from an exhaustive analysis of the project's source code (primarily the X86_16 module), angr's integration lifecycle (from tutorials), practical usage in root files like decompile.py and test_decompile.py, identified incompletenesses via code searches, and a thorough investigation of the angr core source code installed via pip in the venv (at /home/xor/vextest/venv/lib/python3.12/site-packages/angr). The angr investigation focused on key modules for decompiler/CFG integration, revealing how x86-16's custom Lifter feeds into angr's VEX-based analyses. The focus is on expanding instruction coverage, refining real-mode features (segmentation, interrupts, I/O), boosting VEX IR accuracy for symbolic execution, and enabling reliable pseudocode generation.

The x86-16 extension allows angr to lift 16-bit real-mode binaries (e.g., DOS .COM/.EXE) to VEX IR for decompilation, without a complete angr patch (upstream integration recommended for register/VEX support). Core strengths: Gymrat-inspired lifter for VEX, custom ModR/M parsing, blob loading for raw bytes. Limitations: ~70% instruction coverage, stubbed hardware, sparse tests, symbolic gaps.

1. Project Overview and angr Integration

angr Core Investigation

angr (installed via pip in venv/lib/python3.12/site-packages/angr) is a binary analysis framework using VEX IR for symbolic execution. Key components from source:

  • Project (project.py) : Central class (lines 56-847) initializing CLE loader (line 150: cle.Loader for blob/arch), arch/simos, hooks SimProcedures for unresolved symbols (lines 300-444: _register_object scans imports, replaces with SIM_PROCEDURES like stubs/ReturnUnconstrained for unknown calls), factory for states (line 233: AngrObjectFactory), analyses hub (lines 290-295: AnalysesHub with presets for CFG/Decompiler). For x86-16, Project loads Blob, sets Arch, registers Lifter, enabling IRSB lifting.

  • Decompiler (analyses/decompiler/decompiler.py) : Analysis (lines 44-734) for Function to pseudocode. Pipeline: Clinic (VEX IRSB to AIL, lines 264-290), RegionIdentifier (control regions, lines 324-425), RecursiveStructurer/Phoenix (loop/if structuring, lines 352-359), RegionSimplifier (simplification, lines 360-370), CStructuredCodeGenerator (C output, lines 383-401). Relies on accurate IRSB; x86-16 gaps (unknown ops) cause unoptimized AIL (e.g., NoOp sequences, lines 152-413: _decompile with resilience/fallback to basic preset).

  • CFG (analyses/cfg/cfg.py) : Wrapper for CFGFast (lines 13-74: compatibility, raises OutdatedError for emulated params). Builds graph from IRSB lifting (fast mode); x86-16 incomplete IRSB leads to missed edges/nodes (no context-sensitivity by default).

  • SimOS (simos/simos.py) : Base for OS config (lines 25-444): state_blank for init (registers/memory, lines 90-230), syscall handling (syscall_from_number dummy, lines 333-340), prepare_call_state for ABI (lines 277-302), GDT for x86 segments (generate_gdt/setup_gdt, lines 352-444). x86-16's simos_86_16 extends for DOS INT (AX syscall) but lacks full SimProcedures, causing unresolved calls (e.g., INT 21h stubs).

VEX engine (engines/vex/ subdirs like claripy/heavy/light): Uses pyvex.lift for IRSB; custom x86-16 Lifter integrates via register, but gaps propagate to CFG (missed blocks) and Decompiler (garbled code).

angr's modularity allows x86-16 extension without core changes, but upstream patches (e.g., for 16-bit VEX) would improve stability.

angr Lifecycle Alignment

angr's pipeline: CLE loading → Arch identification → VEX lifting → SimEngineVEX execution → SimOS simulation. x86-16 fits as follows:

  • Loading (CLE): Blob backend for raw binaries (e.g., test2.bin), entry at 0x0. No native EXE support; extend for MZ headers if needed.
  • Architecture (archinfo.Arch): arch_86_16.py subclasses Arch with 16-bit bits