feat: remove memory access adapters for address space `!= 4` by jonathanpwang · Pull Request #2382 · openvm-org/openvm

jonathanpwang · 2026-01-30T04:43:38Z

To be rebase merged.

Address space 4 will be removed after OpenVM 2.0.

We remove access adapters because, although they give the theoretically lowest trace sizes, they slow down execution (metered execution in particular) and significantly complicate the codebase.

…alize memory (#2318) Closes INT-5723, INT-5724, INT-5726 --------- Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>

Resolves INT-5728, INT-5727, INT-5729.

Summary of changes:
- Everywhere the code used `Rv32HeapAdapterAir`, we switch to `Rv32VecHeapAdapterAir`.
- Everywhere the code used `Rv32HeapBranchAdapterAir`, we switch to a new `Rv32HeapBranchAdapterAir`, which accesses memory in the same way as `Rv32VecHeapAdapterAir` but is compatible with the branch CoreAirs.
- No other code uses `Rv32HeapAdapterAir` and `Rv32HeapBranchAdapterAir`, so the `heap.rs` and `heap_branch.rs` files were deleted.
- The interfaces of `Rv32VecHeapAdapterAir` and `Rv32HeapBranchAdapterAir` now differ from what the CoreAirs expect, so wrappers in `vec_to_heap.rs` convert between them.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Resolves INT-5950. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

Remove CUDA code from the algebra and ecc extensions, since the production code currently uses hybrid by default. CUDA tests are switched to hybrid chips instead of GPU chips.

Resolves INT-5949 INT-5952 INT-5951 INT-5948. --------- Co-authored-by: Paul Chen <chenpaul.pc@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>

codspeed-hq · 2026-01-30T04:59:33Z

CodSpeed Performance Report

Merging this PR will degrade performance by 83.36%

Comparing `feat/access-adapter-removal` (764f676) with `develop-v1.6.0` (6dc3800)¹

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

❌ 10 regressed benchmarks
✅ 14 untouched benchmarks
⏩ 36 skipped benchmarks²

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | `benchmark_execute_metered[fibonacci_recursive]` | 16.9 ms | 53.7 ms | -68.44% |
| ❌ | WallTime | `benchmark_execute_metered[revm_transfer]` | 24.9 ms | 61.9 ms | -59.79% |
| ❌ | WallTime | `benchmark_execute_metered[sha256]` | 9.3 ms | 45.1 ms | -79.34% |
| ❌ | WallTime | `benchmark_execute_metered[keccak256_iter]` | 71.7 ms | 192.1 ms | -62.7% |
| ❌ | WallTime | `benchmark_execute_metered[revm_snailtracer]` | 7.3 ms | 44 ms | -83.36% |
| ❌ | WallTime | `benchmark_execute_metered[keccak256]` | 11.1 ms | 49.9 ms | -77.73% |
| ❌ | WallTime | `benchmark_execute_metered[factorial_iterative_u256]` | 74.4 ms | 150.9 ms | -50.71% |
| ❌ | WallTime | `benchmark_execute_metered[fibonacci_iterative]` | 14.3 ms | 50 ms | -71.35% |
| ❌ | WallTime | `benchmark_execute_metered[quicksort]` | 9 ms | 45.8 ms | -80.31% |
| ❌ | WallTime | `benchmark_execute_metered[bubblesort]` | 11.3 ms | 47.8 ms | -76.44% |
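For readers checking the table by hand: the Efficiency column appears consistent with `BASE / HEAD - 1` applied to the rounded timings. That formula is inferred from the numbers shown here, not taken from CodSpeed's documentation, and the function name below is hypothetical.

```rust
/// Relative change in percent, computed as (base / head - 1) * 100.
/// Assumption: this is how the Efficiency column is derived; it reproduces
/// the reported values to within the rounding of the displayed timings.
fn efficiency_percent(base_ms: f64, head_ms: f64) -> f64 {
    (base_ms / head_ms - 1.0) * 100.0
}
```

With the `revm_snailtracer` row (7.3 ms to 44 ms) this gives roughly -83.4%, in line with the headline figure of -83.36%.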

¹ No successful run was found on develop-v1.6.0 (5a93213) during the generation of this report, so 95fdcd5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

² 36 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Resolves INT-6010. Currently, `AccessAdapterAir` is excluded from the metered execution checks iff `access_adapters_enabled` is true.

Cast usize pointers to u32 before to_le_bytes() to produce 4-byte arrays compatible with BLOCK_SIZE=4 memory configuration.
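A minimal sketch of why the cast matters: on 64-bit hosts `usize::to_le_bytes()` returns 8 bytes, while a `u32` yields the 4 bytes that a block size of 4 expects. The helper name is hypothetical, and it uses a checked `u32::try_from` rather than the plain `as u32` cast the commit describes, so an out-of-range pointer is reported instead of truncated.

```rust
use std::convert::TryFrom;

// Mirrors the BLOCK_SIZE=4 memory configuration named in the commit message.
const BLOCK_SIZE: usize = 4;

/// Hypothetical helper: encode a host-side pointer as 4 little-endian bytes.
/// Returns `None` if the pointer does not fit in 32 bits.
fn ptr_to_block_bytes(ptr: usize) -> Option<[u8; BLOCK_SIZE]> {
    // Narrow first: calling `to_le_bytes` on the `usize` directly would
    // produce an 8-byte array on 64-bit targets.
    let narrow = u32::try_from(ptr).ok()?;
    Some(narrow.to_le_bytes())
}
```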

Resolves INT-5725.

## Background: PersistentBoundaryAir Update (not in this PR)

The `feat/access-adapter-removal` branch removes memory access adapters from the persistent memory path. Previously, separate `AccessAdapterAir` circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the granularity of memory bus interactions) and `CHUNK=8` (the granularity of Merkle tree hashing/Poseidon2 digests).

The updated `PersistentBoundaryAir` eliminates this by operating on 8-byte chunks directly while tracking **per-sub-block timestamps**:

```
Old PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
                                                                        ^^^^^^^^^ single timestamp

New PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
                                                                        ^^^^^^^^^^^^^ one per 4-byte block
```

Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE = 2` sub-blocks. The boundary AIR emits **two** memory bus interactions per row (one per 4-byte sub-block), each with its own timestamp. Untouched sub-blocks within a touched chunk keep `timestamp=0`, which naturally balances against the initial-state row (also at `t=0`).

---

## This PR: GPU Trace Generation for the Updated Boundary Chip

This PR adapts the GPU trace generation pipeline to the new per-sub-block-timestamp design. The core challenge: the CPU-side "touched memory" partition arrives as sorted **4-byte records**, but the boundary chip and Merkle tree need **8-byte records** with per-block timestamps and initial-memory fill for untouched sub-blocks.

### New CUDA Kernel: `inventory.cu` (Merge Records on GPU)

A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4, 1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:

1. **`cukernel_build_candidates`**: Each thread inspects one input record. If it starts a new 8-byte output chunk (different `ptr/8` than the previous record), it:
   - Reads the full 8-byte initial memory from device
   - Overwrites the touched 4-byte sub-block with final values + timestamp
   - If the next input record belongs to the same output chunk, also patches the other sub-block
   - Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same chunk)
2. **`cukernel_scatter_compact`**: CUB `ExclusiveSum` on flags produces output positions; flagged records are scattered into a compact output array.

### Updated `boundary.cu`: Per-Block Timestamps

The `BoundaryRecord` struct is parameterized on `BLOCKS`:

```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
    uint32_t address_space;
    uint32_t ptr;
    uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp;
    uint32_t values[CHUNK];
};
```

The persistent trace gen kernel writes `timestamps=[0,0]` for initial rows and the actual per-block timestamps for final rows.

### Updated `memory.rs`: Host-Side Orchestration

The Rust side:

1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip (`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp = max(timestamps[0], timestamps[1])` for the tree update

---

## Walkthrough: Sample Trace

Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4` granularity:

| addr_space | ptr | timestamp | values       |
|------------|-----|-----------|--------------|
| 1          | 0   | 5         | [1, 2, 3, 4] |
| 1          | 4   | 10        | [5, 6, 7, 8] |
| 1          | 16  | 3         | [9, 0, 0, 0] |

Initial memory is all zeros.
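The two-phase merge described above can be mirrored by a sequential CPU reference, which is handy for checking the sample trace by hand. This is only a sketch: `InRec`, `OutRec`, `merge_records` and the `initial` callback are simplified stand-ins rather than the PR's actual types, and the code additionally compares `address_space` when deciding whether two records share a chunk, which is an assumption.

```rust
const BLOCK: usize = 4; // CONST_BLOCK_SIZE in the PR
const CHUNK: usize = 8; // Merkle/Poseidon2 chunk size in the PR
const BLOCKS_PER_CHUNK: usize = CHUNK / BLOCK;

#[derive(Clone, Copy, Debug, PartialEq)]
struct InRec { address_space: u32, ptr: u32, timestamp: u32, values: [u32; BLOCK] }

#[derive(Clone, Copy, Debug, PartialEq)]
struct OutRec { address_space: u32, ptr: u32, timestamps: [u32; BLOCKS_PER_CHUNK], values: [u32; CHUNK] }

/// Identifies the 8-value output chunk a 4-value record falls into.
fn chunk_of(r: &InRec) -> (u32, u32) {
    (r.address_space, r.ptr / (CHUNK as u32))
}

/// Sequential reference for the merge. `input` must be sorted by
/// (address_space, ptr); `initial(address_space, ptr)` returns the initial
/// values of the chunk starting at `ptr`.
fn merge_records(input: &[InRec], initial: impl Fn(u32, u32) -> [u32; CHUNK]) -> Vec<OutRec> {
    // Phase 1: one flag and (for flagged records) one candidate per input.
    let mut flags = vec![0usize; input.len()];
    let mut candidates: Vec<Option<OutRec>> = vec![None; input.len()];
    for i in 0..input.len() {
        if i > 0 && chunk_of(&input[i - 1]) == chunk_of(&input[i]) {
            continue; // duplicate within the same chunk, flag stays 0
        }
        let base = input[i].ptr / (CHUNK as u32) * (CHUNK as u32);
        let mut cand = OutRec {
            address_space: input[i].address_space,
            ptr: base,
            timestamps: [0; BLOCKS_PER_CHUNK], // untouched sub-blocks keep t=0
            values: initial(input[i].address_space, base),
        };
        // Patch this record, plus the next one if it shares the chunk.
        for rec in input[i..].iter().take(BLOCKS_PER_CHUNK) {
            if chunk_of(rec) != chunk_of(&input[i]) {
                break;
            }
            let b = ((rec.ptr as usize) % CHUNK) / BLOCK;
            cand.values[b * BLOCK..(b + 1) * BLOCK].copy_from_slice(&rec.values);
            cand.timestamps[b] = rec.timestamp;
        }
        flags[i] = 1;
        candidates[i] = Some(cand);
    }
    // Phase 2: an exclusive prefix sum of the flags gives each output slot.
    let mut positions = Vec::with_capacity(flags.len());
    let mut running = 0usize;
    for &f in &flags {
        positions.push(running);
        running += f;
    }
    let mut out: Vec<Option<OutRec>> = vec![None; running];
    for i in 0..input.len() {
        if flags[i] == 1 {
            out[positions[i]] = candidates[i];
        }
    }
    out.into_iter().map(|r| r.expect("every slot is written exactly once")).collect()
}
```

On the GPU the first loop runs one thread per input record and the prefix sum is done by CUB; the sequential version keeps the same flag/position structure so the intermediate arrays can be compared directly.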
### Step 1: Convert to `InRec<4, 1>`

```
InRec[0]: { as=1, ptr=0,  timestamps=[5],  values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4,  timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3],  values=[9,0,0,0] }
```

### Step 2: GPU `cukernel_build_candidates`

Each thread computes `output_chunk = ptr / 8`:

| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|--------------|---------------|------|
| 0   | 0   | 0            | N/A (first)   | 1    |
| 1   | 4   | 0            | yes           | 0    |
| 2   | 16  | 2            | no            | 1    |

**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`, `timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` → patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`

**Thread 1** (flag=0): Skipped (same output chunk as thread 0).

**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`, `timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`

### Step 3: Prefix sum + scatter compact

```
flags     = [1, 0, 1]
positions = [0, 1, 1]   (exclusive prefix sum)
out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```

### Step 4: Boundary chip trace (2 rows per record = 4 active rows)

| Row | expand_dir | as | leaf_label | values            | hash         | timestamps |
|-----|------------|----|------------|-------------------|--------------|------------|
| 0   | +1 (init)  | 1  | 0          | [0,0,0,0,0,0,0,0] | H([0,..0])   | [0, 0]     |
| 1   | -1 (final) | 1  | 0          | [1,2,3,4,5,6,7,8] | H([1,..,8])  | [5, 10]    |
| 2   | +1 (init)  | 1  | 2          | [0,0,0,0,0,0,0,0] | H([0,..0])   | [0, 0]     |
| 3   | -1 (final) | 1  | 2          | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0]     |

Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4, values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1, ptr=20, values=[0,0,0,0], ts=0)`

The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks of the final rows for untouched memory, eliminating the need for access adapters.

### Step 5: Merkle tree records

For the Merkle tree, each record uses a single timestamp = `max(timestamps[0], timestamps[1])`:

| as | ptr | merkle_ts | values            |
|----|-----|-----------|-------------------|
| 1  | 0   | 10        | [1,2,3,4,5,6,7,8] |
| 1  | 16  | 3         | [9,0,0,0,0,0,0,0] |

These feed into the existing `update_with_touched_blocks` for Merkle root computation.

---

## Other Changes

- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 + DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 + CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records directly. Also fixed a potential `ilog2(0)` panic in `calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values` validates that the empty-partition edge case correctly reads initial memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root vs CPU.

---------

Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
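Steps 4 and 5 of the walkthrough reduce to two small derivations from a merged record: the per-sub-block bus messages and the single Merkle timestamp. The sketch below illustrates them under simplified assumptions; `MergedRecord`, `BusMessage`, `bus_messages` and `merkle_timestamp` are hypothetical names, not identifiers from the PR.

```rust
const BLOCK: usize = 4; // CONST_BLOCK_SIZE in the PR
const CHUNK: usize = 8; // chunk size in the PR
const BLOCKS_PER_CHUNK: usize = CHUNK / BLOCK;

/// Simplified stand-in for a merged 8-value record.
struct MergedRecord {
    address_space: u32,
    ptr: u32,
    timestamps: [u32; BLOCKS_PER_CHUNK],
    values: [u32; CHUNK],
}

/// (address_space, ptr, values, timestamp) for one 4-value sub-block.
type BusMessage = (u32, u32, [u32; BLOCK], u32);

/// Split one final row into its per-sub-block memory bus messages.
fn bus_messages(rec: &MergedRecord) -> [BusMessage; BLOCKS_PER_CHUNK] {
    std::array::from_fn(|b| {
        let mut values = [0u32; BLOCK];
        values.copy_from_slice(&rec.values[b * BLOCK..(b + 1) * BLOCK]);
        // Sub-block b starts b * BLOCK cells after the chunk pointer.
        (rec.address_space, rec.ptr + (b * BLOCK) as u32, values, rec.timestamps[b])
    })
}

/// Single timestamp used for the Merkle record: the max over sub-blocks.
fn merkle_timestamp(rec: &MergedRecord) -> u32 {
    rec.timestamps.iter().copied().max().unwrap_or(0)
}
```

For the chunk at `ptr=16` this yields the messages `(1, 16, [9,0,0,0], 3)` and `(1, 20, [0,0,0,0], 0)`, matching Row 3 of the boundary table, and a Merkle timestamp of 3.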

Maillew and others added 5 commits January 30, 2026 04:24

feat: remove memory access adapters: CPU Boundary AIR, initialize/fin…

8e64789

…alize memory (#2318) Closes INT-5723, INT-5724, INT-5726 --------- Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>

feat: access adapter removal sha2 (#2375)

ffe57e6

Resolves INT-5950. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>

feat: remove cuda from algebra and ecc extensions (#2379)

4ff9530

Remove CUDA code from the algebra and ecc extensions, since the production code currently uses hybrid by default. CUDA tests are switched to hybrid chips instead of GPU chips.

876pol and others added 4 commits January 30, 2026 11:05

feat: verify metered execution works properly in tests (#2376)

e96991a

Resolves INT-6010. Currently, `AccessAdapterAir` is excluded from the metered execution checks iff `access_adapters_enabled` is true.

fix: exclude VolatileBoundaryAir from metering tester (#2385)

309a675

feat: access adapter removal keccak256 (#2384)

ac1ac39

Cast usize pointers to u32 before to_le_bytes() to produce 4-byte arrays compatible with BLOCK_SIZE=4 memory configuration.

jonathanpwang wants to merge 9 commits into develop-v1.6.0 from feat/access-adapter-removal

4 participants
