Commit cbf8a9e

local: add feat diff-test
add disasm_diff.py
1 parent a1f2002 commit cbf8a9e

10 files changed

Lines changed: 3621 additions & 0 deletions

diff-test/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
1+
reports
2+
samples

diff-test/README.md

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@
1+
# M68K Differential Test — Capstone vs objdump
2+
3+
Automated differential testing pipeline that compares Capstone's M68K disassembly output against `m68k-linux-gnu-objdump` as a reference implementation.
4+
5+
## Prerequisites
6+
7+
1. **Capstone build**: `libcapstone.so` must exist in `../build/` (the library path is auto-detected)
8+
2. **Python venv**: `../.venv/` with the Capstone Python bindings installed
9+
3. **m68k-linux-gnu-objdump** — M68K cross-binutils on `PATH`
10+
11+
```Shell
12+
# from the repo root
13+
which m68k-linux-gnu-objdump # must be on PATH
14+
ls build/libcapstone.so.6 # must exist
15+
ls .venv/bin/activate # must exist
16+
```
17+
18+
## Quick Start
19+
20+
```Shell
21+
cd diff-test/
22+
23+
# full pipeline: download 5 ROMs + test + report
24+
./run.sh
25+
26+
# test-only (skip download, use existing ROMs)
27+
./run.sh --test-only --max-bytes 8192
28+
29+
# download-only
30+
./run.sh --download-only --num-samples 10
31+
32+
# test a specific m68k sub-architecture
33+
./run.sh --test-only --max-bytes 8192 --arch m68k:68000
34+
35+
# test ColdFire variant
36+
./run.sh --test-only --max-bytes 8192 --arch m68k:cfv4e
37+
38+
# list all 42 supported architecture variants
39+
./run.sh --list-archs
40+
```
41+
42+
## File Structure
43+
44+
```
45+
diff-test/
46+
├── run.sh # entry point — sets up env vars, calls pipeline.py
47+
├── pipeline.py # orchestrator — download → test → aggregate reports
48+
├── diff_test.py # per-ROM diff test runner
49+
├── normalize.py # normalization engine + comparison logic (13 rules)
50+
├── download_samples.py # downloads ROMs from Myrient (No-Intro Genesis)
51+
├── test_normalize.py # 144 unit tests
52+
├── samples/ # downloaded ROMs (auto-created)
53+
│ ├── zips/
54+
│ └── roms/
55+
├── reports/ # generated reports (auto-created)
56+
│ ├── aggregate_summary.<arch>.json
57+
│ ├── aggregate_summary.<arch>.txt
58+
│ └── <rom_name>.<arch>.report.{json,txt}
59+
└── README.md
60+
```
61+
62+
## Usage
63+
64+
### `run.sh` (recommended)
65+
66+
Wrapper that activates the venv, sets `LIBCAPSTONE_PATH` / `LD_LIBRARY_PATH`, then calls `pipeline.py`.
67+
68+
```Shell
69+
./run.sh [pipeline.py args...]
70+
```
71+
72+
### `pipeline.py`
73+
74+
```
75+
pipeline.py [-h] [--download-only] [--test-only]
76+
[--samples-dir DIR] [--num-samples N] [--seed N]
77+
[--max-bytes N] [--arch ARCH] [--list-archs]
78+
[--report-dir DIR] [--report-format {json,text,both}] [-v]
79+
```
80+
81+
| Flag | Default | Description |
82+
| ----------------- | ---------- | ------------------------------------------------- |
83+
| `--download-only` | | Download samples only, skip testing |
84+
| `--test-only` | | Test existing samples, skip download |
85+
| `--samples-dir` | `samples/` | ROM storage directory (relative to script) |
86+
| `--num-samples` | `5` | Number of ROMs to download |
87+
| `--seed` | `42` | Random seed for sample selection |
88+
| `--max-bytes` | `0` (all) | Max bytes per ROM to disassemble |
89+
| `--arch` | `m68k` | Architecture variant to test (see `--list-archs`) |
90+
| `--list-archs` | | Print all 42 supported arch variants and exit |
91+
| `--report-dir` | `reports/` | Report output directory (relative to script) |
92+
| `--report-format` | `both` | `json`, `text`, or `both` |
93+
| `-v` | | Verbose output |
94+
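The `--num-samples` and `--seed` flags imply a reproducible random selection of ROMs. `download_samples.py` is not shown here, so the following is only a minimal sketch of how such a selection could work; `pick_samples` and its arguments are hypothetical names, and only the defaults `5` and `42` come from the table above.

```python
import random


def pick_samples(candidates, num_samples=5, seed=42):
    """Return a reproducible random subset of candidate ROM names.

    Hypothetical helper for illustration; not taken from download_samples.py.
    """
    # Sort first so the result does not depend on the order of the listing.
    pool = sorted(candidates)
    # A private Random instance keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(pool, min(num_samples, len(pool)))


if __name__ == "__main__":
    names = [f"rom_{i:02d}.zip" for i in range(20)]
    print(pick_samples(names))
```

With a fixed seed the same candidates always yield the same subset, which is what makes two pipeline runs comparable.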
95+
### `diff_test.py` (single ROM)
96+
97+
> **Note:** When running Python scripts directly (without `run.sh`), always `cd` into `diff-test/` first. The default paths (`./samples`, `./reports`) resolve relative to the script directory.
98+
99+
```Shell
100+
cd diff-test/
101+
source ../.venv/bin/activate
102+
export LIBCAPSTONE_PATH=../build LD_LIBRARY_PATH=../build
103+
104+
python diff_test.py samples/roms/some_rom.md --max-bytes 8192
105+
106+
# diff_test.py is invoked programmatically by pipeline.py; use pipeline.py --arch for arch selection
107+
```
108+
109+
## Architecture Support
110+
111+
The pipeline supports all 42 m68k sub-architecture variants via the `--arch` flag.
112+
Each variant is mapped to the closest available Capstone mode: ISA-A variants use
113+
M68K 000, ISA-A+ variants use M68K 020, and ISA-B, ISA-C, and numbered ColdFire variants use M68K 040.
114+
115+
```Shell
116+
./run.sh --list-archs
117+
```
118+
119+
| Category | Variants |
120+
| ------------- | ------------------------------------------------------------------------ |
121+
| Classic M68K | `m68k` (default), `m68k:68000`, `m68k:68008`, `m68k:68010`, `m68k:68020`, `m68k:68030`, `m68k:68040`, `m68k:68060`, `m68k:cpu32`, `m68k:fido` |
122+
| ISA-A | `m68k:isa-a`, `m68k:isa-a:nodiv`, `m68k:isa-a:mac`, `m68k:isa-a:emac` |
123+
| ISA-A+ | `m68k:isa-aplus`, `m68k:isa-aplus:mac`, `m68k:isa-aplus:emac` |
124+
| ISA-B | `m68k:isa-b`, `m68k:isa-b:nousp`, `m68k:isa-b:nousp:mac`, `m68k:isa-b:nousp:emac`, `m68k:isa-b:mac`, `m68k:isa-b:emac`, `m68k:isa-b:float`, `m68k:isa-b:float:mac`, `m68k:isa-b:float:emac` |
125+
| ISA-C | `m68k:isa-c`, `m68k:isa-c:mac`, `m68k:isa-c:emac`, `m68k:isa-c:nodiv`, `m68k:isa-c:nodiv:mac`, `m68k:isa-c:nodiv:emac` |
126+
| ColdFire | `m68k:5200`, `m68k:5206e`, `m68k:5307`, `m68k:5407`, `m68k:528x`, `m68k:521x`, `m68k:5249`, `m68k:547x`, `m68k:548x`, `m68k:cfv4e` |
127+
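Each variant's Capstone mode is stored as a plain integer bit mask. The sketch below decodes such a mask into readable flag names, using the bit values defined in this commit's `arch_config.py`. `MODE_BITS` and `describe_mode` are hypothetical helper names and are not part of the pipeline.

```python
# Bit values copied from the constants in this commit's arch_config.py.
MODE_BITS = {
    1 << 31: "BIG_ENDIAN",
    1 << 1: "M68K_000",
    1 << 2: "M68K_010",
    1 << 3: "M68K_020",
    1 << 4: "M68K_030",
    1 << 5: "M68K_040",
    1 << 6: "M68K_060",
    1 << 7: "M68K_CPU32",
}


def describe_mode(cs_mode: int) -> str:
    """Render a cs_mode bit mask as a '|'-joined list of flag names."""
    names = [name for bit, name in MODE_BITS.items() if cs_mode & bit]
    # Report any bits this table does not know about instead of hiding them.
    leftover = cs_mode & ~sum(MODE_BITS)
    if leftover:
        names.append(f"0x{leftover:x}")
    return "|".join(names) if names else "0"


if __name__ == "__main__":
    print(describe_mode(0x80000020))
```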
128+
## Running Tests
129+
130+
```Shell
131+
cd diff-test/
132+
source ../.venv/bin/activate
133+
python -m pytest test_normalize.py -v
134+
```
135+
136+
## How It Works
137+
138+
1. **Download**: `download_samples.py` fetches Sega Genesis ROMs from [Myrient No-Intro](https://myrient.erista.me/files/No-Intro/Sega%20-%20Mega%20Drive%20-%20Genesis/). The Genesis is a cartridge-based system, so its dumps are catalogued by No-Intro rather than by the disc-oriented Redump.
139+
140+
2. **Disassemble** — Each ROM is fed to both:
141+
* Capstone Python bindings (`CS_ARCH_M68K`, `CS_MODE_M68K_040`, skipdata enabled)
142+
* `m68k-linux-gnu-objdump -b binary -m m68k -M motorola -D`
143+
144+
3. **Normalize** — 13 rules transform both outputs into a canonical form:
145+
* Case, mnemonic format (`movel` → `move.l`), register prefix (`%d0` → `d0`)
146+
* Register aliases (`sp` → `a7`), hex prefix (`$1a` → `0x1a`)
147+
* Addressing mode syntax, immediate values, operand size suffixes
148+
* Register list compaction (`d0/d1/d2` → `d0-d2`)
149+
* Branch suffix normalization (`.b` → `.s` for Bcc only)
150+
* Implied-size stripping (`lea.l` → `lea`), mnemonic aliases (`dbra` → `dbf`)
151+
152+
4. **Compare** — Instruction-by-instruction with 7 classification categories:
153+
154+
| Category | Meaning |
155+
| ----------------------- | -------------------------------------------------------------- |
156+
| **match** | Identical normalized text and size |
157+
| **cosmetic\_diff** | Same size, different text after normalization |
158+
| **size\_mismatch** | Same text, different consumed bytes |
159+
| **known\_decode\_diff** | One side `dc.w`, other side ISA extension mnemonic |
160+
| **cascade** | Artifact in shadow range of prior size disagreement |
161+
| **tail\_data** | One-sided entries beyond the other disassembler's last address |
162+
| **failure** | Genuine disagreement |
163+
164+
5. **Report**: Per-ROM and aggregate reports in JSON + text.
165+
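A few of the normalization rules from step 3 can be sketched as follows. This is not the code in `normalize.py`, which has 13 rules. It is a minimal illustration covering only case folding, the `%` register prefix, the `$` hex prefix, the `sp` alias, and three mnemonic rewrites taken from the examples above. The lookup tables and the `normalize` function name are assumptions made for this example.

```python
import re

# Hypothetical lookup tables built from the examples in the rule list.
MNEMONIC_ALIASES = {"movel": "move.l", "lea.l": "lea", "dbra": "dbf"}
REGISTER_ALIASES = {"sp": "a7"}

_REGISTER_RE = re.compile(r"\b(" + "|".join(REGISTER_ALIASES) + r")\b")


def normalize(line: str) -> str:
    """Reduce one disassembly line to a canonical 'mnemonic operands' form."""
    parts = line.strip().lower().split(None, 1)
    if not parts:
        return ""
    mnemonic = MNEMONIC_ALIASES.get(parts[0], parts[0])
    operands = parts[1] if len(parts) > 1 else ""
    operands = operands.replace("%", "")                      # %d0 -> d0
    operands = re.sub(r"\$([0-9a-f]+)", r"0x\1", operands)    # $1a -> 0x1a
    operands = _REGISTER_RE.sub(
        lambda m: REGISTER_ALIASES[m.group(1)], operands      # sp -> a7
    )
    operands = re.sub(r"\s*,\s*", ",", operands)              # tidy separators
    return f"{mnemonic} {operands}".strip()


if __name__ == "__main__":
    print(normalize("MOVEL %D0, %D1"))
    print(normalize("lea.l (sp), a0"))
```

Two lines count as a textual match only if both sides normalize to the same string, so every rule has to be applied identically to the Capstone and objdump output.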
166+
## Metrics
167+
168+
* **match\_rate** = `matches / total`
169+
* **effective\_match\_rate** = `(matches + cosmetic_diffs + size_mismatches + known_decode_diffs) / total`
170+
* **tail\_data\_rate** = `tail_data / total` (reported separately, not in effective rate)
171+
* **failure\_rate** = `failures / total`
172+
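The four rates can be computed from per-category counts as in the sketch below. It assumes that `total` counts every compared entry across all seven categories, which the formulas above do not state explicitly. `compute_metrics` is a hypothetical helper and is not taken from the pipeline.

```python
# Categories that count toward the effective match rate, per the formula above.
EFFECTIVE_CATEGORIES = (
    "match",
    "cosmetic_diff",
    "size_mismatch",
    "known_decode_diff",
)


def compute_metrics(counts):
    """Turn a {category: count} mapping into the four reported rates."""
    # Assumption: total is the sum over all categories, including cascade.
    total = sum(counts.values())

    def rate(*categories):
        if total == 0:
            return 0.0
        return sum(counts.get(c, 0) for c in categories) / total

    return {
        "match_rate": rate("match"),
        "effective_match_rate": rate(*EFFECTIVE_CATEGORIES),
        "tail_data_rate": rate("tail_data"),
        "failure_rate": rate("failure"),
    }


if __name__ == "__main__":
    example = {"match": 80, "cosmetic_diff": 5, "size_mismatch": 2,
               "known_decode_diff": 1, "cascade": 2, "tail_data": 6,
               "failure": 4}
    for name, value in compute_metrics(example).items():
        print(f"{name}: {value:.2%}")
```

Under this assumption the effective, tail data, and failure rates do not sum to 100%; the remainder is the share of `cascade` entries.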
173+
## Example Output
174+
175+
```
176+
================================================================================
177+
Aggregate Differential Test Summary
178+
================================================================================
179+
Total ROMs tested: 5
180+
Total instructions compared: 12494
181+
Overall match rate: 82.19%
182+
Overall effective rate: 88.60%
183+
Overall tail data rate: 6.49%
184+
Overall failure rate: 2.49%
185+
```
186+

diff-test/arch_config.py

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
1+
"""Architecture configuration registry for diff-test.
2+
3+
Defines ArchConfig dataclass and a complete registry of all m68k sub-architecture
4+
variants, mapping each to its Capstone mode and objdump machine string.
5+
6+
No capstone imports at module level — cs_arch and cs_mode are stored as plain
7+
integers so this module can be imported without capstone installed.
8+
"""
9+
10+
from __future__ import annotations
11+
12+
from dataclasses import dataclass, field
13+
14+
# ---------------------------------------------------------------------------
15+
# Capstone mode constants (integer literals — no imports needed)
16+
# ---------------------------------------------------------------------------
17+
_BE = 1 << 31 # CS_MODE_BIG_ENDIAN = 0x80000000
18+
_M000 = 1 << 1 # CS_MODE_M68K_000
19+
_M010 = 1 << 2 # CS_MODE_M68K_010
20+
_M020 = 1 << 3 # CS_MODE_M68K_020
21+
_M030 = 1 << 4 # CS_MODE_M68K_030
22+
_M040 = 1 << 5 # CS_MODE_M68K_040
23+
_M060 = 1 << 6 # CS_MODE_M68K_060
24+
_MCPU32 = 1 << 7 # CS_MODE_M68K_CPU32
25+
26+
_CS_ARCH_M68K = 8 # CS_ARCH_M68K
27+
28+
29+
# ---------------------------------------------------------------------------
30+
# ArchConfig dataclass
31+
# ---------------------------------------------------------------------------
32+
@dataclass(frozen=True)
33+
class ArchConfig:
34+
"""Configuration for a single architecture variant."""
35+
36+
name: str
37+
cs_arch: int
38+
cs_mode: int
39+
objdump_machine: str
40+
objdump_mflags: list[str] = field(default_factory=lambda: ["motorola"])
41+
display_name: str = ""
42+
capstone_note: str = ""
43+
44+
def __post_init__(self) -> None:
45+
# Default display_name to name if not provided.
46+
if not self.display_name:
47+
object.__setattr__(self, "display_name", self.name)
48+
49+
50+
# ---------------------------------------------------------------------------
51+
# Helpers for building the registry
52+
# ---------------------------------------------------------------------------
53+
def _cfg(
54+
name: str,
55+
cs_mode: int,
56+
*,
57+
objdump_machine: str | None = None,
58+
display_name: str = "",
59+
capstone_note: str = "",
60+
) -> ArchConfig:
61+
"""Shorthand factory — all entries share cs_arch=8, mflags=["motorola"]."""
62+
return ArchConfig(
63+
name=name,
64+
cs_arch=_CS_ARCH_M68K,
65+
cs_mode=cs_mode,
66+
objdump_machine=objdump_machine if objdump_machine is not None else name,
67+
display_name=display_name or name,
68+
capstone_note=capstone_note,
69+
)
70+
71+
72+
_CF_NOTE_040 = "ColdFire: using CS_MODE_M68K_040 as closest match"
73+
_CF_NOTE_000 = "ColdFire ISA-A: using CS_MODE_M68K_000 as closest match"
74+
_CF_NOTE_020 = "ColdFire ISA-A+: using CS_MODE_M68K_020 as closest match"
75+
76+
77+
# ---------------------------------------------------------------------------
78+
# Architecture registry
79+
# ---------------------------------------------------------------------------
80+
ARCH_REGISTRY: dict[str, ArchConfig] = {}
81+
82+
83+
def _register(*configs: ArchConfig) -> None:
84+
for c in configs:
85+
ARCH_REGISTRY[c.name] = c
86+
87+
88+
# --- Classic 68k family (exact matches) -----------------------------------
89+
_register(
90+
_cfg("m68k", _BE | _M040, display_name="M68K (default/68040)"),
91+
_cfg("m68k:68000", _BE | _M000, display_name="M68K 68000"),
92+
_cfg("m68k:68008", _BE | _M000, display_name="M68K 68008",
93+
capstone_note="68008 is 68000-class"),
94+
_cfg("m68k:68010", _BE | _M010, display_name="M68K 68010"),
95+
_cfg("m68k:68020", _BE | _M020, display_name="M68K 68020"),
96+
_cfg("m68k:68030", _BE | _M030, display_name="M68K 68030"),
97+
_cfg("m68k:68040", _BE | _M040, display_name="M68K 68040"),
98+
_cfg("m68k:68060", _BE | _M060, display_name="M68K 68060"),
99+
_cfg("m68k:cpu32", _BE | _MCPU32, display_name="M68K CPU32"),
100+
_cfg("m68k:fido", _BE | _MCPU32, display_name="M68K Fido",
101+
capstone_note="Fido: using CS_MODE_M68K_CPU32 as closest match"),
102+
)
103+
104+
# --- ISA-A variants (68000-class ISA A) -----------------------------------
105+
_register(
106+
_cfg("m68k:isa-a:nodiv", _BE | _M000, capstone_note=_CF_NOTE_000),
107+
_cfg("m68k:isa-a", _BE | _M000, capstone_note=_CF_NOTE_000),
108+
_cfg("m68k:isa-a:mac", _BE | _M000, capstone_note=_CF_NOTE_000),
109+
_cfg("m68k:isa-a:emac", _BE | _M000, capstone_note=_CF_NOTE_000),
110+
)
111+
112+
# --- ISA-A+ variants (some 020 features) ----------------------------------
113+
_register(
114+
_cfg("m68k:isa-aplus", _BE | _M020, capstone_note=_CF_NOTE_020),
115+
_cfg("m68k:isa-aplus:mac", _BE | _M020, capstone_note=_CF_NOTE_020),
116+
_cfg("m68k:isa-aplus:emac", _BE | _M020, capstone_note=_CF_NOTE_020),
117+
)
118+
119+
# --- ISA-B variants (ColdFire ISA B, use 040 as closest) ------------------
120+
_register(
121+
_cfg("m68k:isa-b:nousp", _BE | _M040, capstone_note=_CF_NOTE_040),
122+
_cfg("m68k:isa-b:nousp:mac", _BE | _M040, capstone_note=_CF_NOTE_040),
123+
_cfg("m68k:isa-b:nousp:emac", _BE | _M040, capstone_note=_CF_NOTE_040),
124+
_cfg("m68k:isa-b", _BE | _M040, capstone_note=_CF_NOTE_040),
125+
_cfg("m68k:isa-b:mac", _BE | _M040, capstone_note=_CF_NOTE_040),
126+
_cfg("m68k:isa-b:emac", _BE | _M040, capstone_note=_CF_NOTE_040),
127+
_cfg("m68k:isa-b:float", _BE | _M040, capstone_note=_CF_NOTE_040),
128+
_cfg("m68k:isa-b:float:mac", _BE | _M040, capstone_note=_CF_NOTE_040),
129+
_cfg("m68k:isa-b:float:emac", _BE | _M040, capstone_note=_CF_NOTE_040),
130+
)
131+
132+
# --- ISA-C variants (ColdFire ISA C, use 040 as closest) ------------------
133+
_register(
134+
_cfg("m68k:isa-c", _BE | _M040, capstone_note=_CF_NOTE_040),
135+
_cfg("m68k:isa-c:mac", _BE | _M040, capstone_note=_CF_NOTE_040),
136+
_cfg("m68k:isa-c:emac", _BE | _M040, capstone_note=_CF_NOTE_040),
137+
_cfg("m68k:isa-c:nodiv", _BE | _M040, capstone_note=_CF_NOTE_040),
138+
_cfg("m68k:isa-c:nodiv:mac", _BE | _M040, capstone_note=_CF_NOTE_040),
139+
_cfg("m68k:isa-c:nodiv:emac", _BE | _M040, capstone_note=_CF_NOTE_040),
140+
)
141+
142+
# --- ColdFire numeric variants (all use 040 as closest) -------------------
143+
_register(
144+
_cfg("m68k:5200", _BE | _M040, capstone_note=_CF_NOTE_040),
145+
_cfg("m68k:5206e", _BE | _M040, capstone_note=_CF_NOTE_040),
146+
_cfg("m68k:5307", _BE | _M040, capstone_note=_CF_NOTE_040),
147+
_cfg("m68k:5407", _BE | _M040, capstone_note=_CF_NOTE_040),
148+
_cfg("m68k:528x", _BE | _M040, capstone_note=_CF_NOTE_040),
149+
_cfg("m68k:521x", _BE | _M040, capstone_note=_CF_NOTE_040),
150+
_cfg("m68k:5249", _BE | _M040, capstone_note=_CF_NOTE_040),
151+
_cfg("m68k:547x", _BE | _M040, capstone_note=_CF_NOTE_040),
152+
_cfg("m68k:548x", _BE | _M040, capstone_note=_CF_NOTE_040),
153+
_cfg("m68k:cfv4e", _BE | _M040, capstone_note=_CF_NOTE_040),
154+
)
155+
156+
157+
# ---------------------------------------------------------------------------
158+
# Default architecture
159+
# ---------------------------------------------------------------------------
160+
DEFAULT_ARCH = "m68k"
161+
162+
163+
# ---------------------------------------------------------------------------
164+
# Public API
165+
# ---------------------------------------------------------------------------
166+
def get_arch(name: str) -> ArchConfig:
167+
"""Look up an architecture by name.
168+
169+
Raises ``ValueError`` if *name* is not in the registry.
170+
"""
171+
try:
172+
return ARCH_REGISTRY[name]
173+
except KeyError:
174+
raise ValueError(
175+
f"Unknown architecture: {name!r}. "
176+
"Use list_archs() to see supported architectures."
177+
) from None
178+
179+
180+
def list_archs() -> list[str]:
181+
"""Return all supported architecture names, sorted."""
182+
return sorted(ARCH_REGISTRY)
183+
184+
185+
# ---------------------------------------------------------------------------
186+
# CLI: print a nice table when run directly
187+
# ---------------------------------------------------------------------------
188+
if __name__ == "__main__":
189+
hdr_name = "Architecture"
190+
hdr_mode = "cs_mode"
191+
hdr_mach = "objdump -m"
192+
hdr_note = "Note"
193+
194+
entries = [ARCH_REGISTRY[k] for k in list_archs()]
195+
196+
w_name = max(len(hdr_name), *(len(e.name) for e in entries))
197+
w_mode = max(len(hdr_mode), 12) # "0x80000020" is 10 chars, pad a bit
198+
w_mach = max(len(hdr_mach), *(len(e.objdump_machine) for e in entries))
199+
w_note = max(len(hdr_note), *(len(e.capstone_note) for e in entries))
200+
201+
fmt = f" {{:<{w_name}}} {{:<{w_mode}}} {{:<{w_mach}}} {{}}"
202+
sep = fmt.format("-" * w_name, "-" * w_mode, "-" * w_mach, "-" * w_note)
203+
204+
print(f"\nRegistered architectures ({len(entries)}):\n")
205+
print(fmt.format(hdr_name, hdr_mode, hdr_mach, hdr_note))
206+
print(sep)
207+
for e in entries:
208+
print(fmt.format(e.name, f"0x{e.cs_mode:08x}", e.objdump_machine, e.capstone_note))
209+
print()
