feat: remove memory access adapters for address space != 4 #2382

Open
jonathanpwang wants to merge 9 commits into develop-v1.6.0 from feat/access-adapter-removal
@jonathanpwang

To be rebase merged.

Address space 4 will be removed after OpenVM 2.0.

We remove access adapters because, while they give the theoretically smallest trace sizes, they slow down execution (metered execution in particular) and significantly complicate the codebase.

Maillew and others added 5 commits January 30, 2026 04:24
…alize memory (#2318)

Closes INT-5723, INT-5724, INT-5726

---------

Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Resolves INT-5728, INT-5727, INT-5729.

Summary of changes:

- Everywhere the code used `Rv32HeapAdapterAir`, we now use
`Rv32VecHeapAdapterAir`.
- Everywhere the code used `Rv32HeapBranchAdapterAir`, we now use
a new `Rv32HeapBranchAdapterAir`, which accesses memory in the same way
as `Rv32VecHeapAdapterAir`, but is compatible with the branch CoreAirs.
- No other code uses `Rv32HeapAdapterAir` and
`Rv32HeapBranchAdapterAir`, so the `heap.rs` and `heap_branch.rs` files
were deleted.
- The interface of `Rv32VecHeapAdapterAir` and
`Rv32HeapBranchAdapterAir` now differs from what the CoreAirs
expect, so wrappers in `vec_to_heap.rs` are used to convert between
them.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Resolves INT-5950.

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Remove CUDA code from the algebra and ecc extensions, since the production
code currently uses hybrid by default. CUDA tests are switched to using
hybrid chips instead of GPU chips.
Resolves INT-5949 INT-5952 INT-5951 INT-5948.

---------

Co-authored-by: Paul Chen <chenpaul.pc@gmail.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
@codspeed-hq
codspeed-hq bot commented Jan 30, 2026

CodSpeed Performance Report

Merging this PR will degrade performance by 83.36%

Comparing feat/access-adapter-removal (764f676) with develop-v1.6.0 (6dc3800) [1]

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

Summary

❌ 10 regressed benchmarks
✅ 14 untouched benchmarks
⏩ 36 skipped benchmarks [2]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode     | Benchmark                                          | BASE    | HEAD     | Efficiency |
|----------|----------------------------------------------------|---------|----------|------------|
| WallTime | benchmark_execute_metered[fibonacci_recursive]     | 16.9 ms | 53.7 ms  | -68.44%    |
| WallTime | benchmark_execute_metered[revm_transfer]           | 24.9 ms | 61.9 ms  | -59.79%    |
| WallTime | benchmark_execute_metered[sha256]                  | 9.3 ms  | 45.1 ms  | -79.34%    |
| WallTime | benchmark_execute_metered[keccak256_iter]          | 71.7 ms | 192.1 ms | -62.7%     |
| WallTime | benchmark_execute_metered[revm_snailtracer]        | 7.3 ms  | 44 ms    | -83.36%    |
| WallTime | benchmark_execute_metered[keccak256]               | 11.1 ms | 49.9 ms  | -77.73%    |
| WallTime | benchmark_execute_metered[factorial_iterative_u256] | 74.4 ms | 150.9 ms | -50.71%    |
| WallTime | benchmark_execute_metered[fibonacci_iterative]     | 14.3 ms | 50 ms    | -71.35%    |
| WallTime | benchmark_execute_metered[quicksort]               | 9 ms    | 45.8 ms  | -80.31%    |
| WallTime | benchmark_execute_metered[bubblesort]              | 11.3 ms | 47.8 ms  | -76.44%    |

Footnotes

  1. No successful run was found on develop-v1.6.0 (5a93213) during the generation of this report, so 95fdcd5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

  2. 36 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

876pol and others added 4 commits January 30, 2026 11:05
Resolves INT-6010.

Currently, `AccessAdapterAir` is excluded from the metered execution
checks iff `access_adapters_enabled` is true.
Cast usize pointers to u32 before to_le_bytes() to produce 4-byte arrays
compatible with BLOCK_SIZE=4 memory configuration.
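
The cast matters because `usize::to_le_bytes()` yields an 8-byte array on 64-bit targets, whereas `u32::to_le_bytes()` always yields 4 bytes. A minimal sketch of the idea follows; the helper name `ptr_to_block_bytes` is hypothetical and not taken from the PR, and it uses a checked conversion rather than a plain `as u32` cast so that an oversized pointer is reported instead of silently truncated.

```rust
/// Assumed block size, mirroring the `BLOCK_SIZE=4` configuration named above.
const BLOCK_SIZE: usize = 4;

/// Hypothetical helper (not from the PR): encode a pointer as one 4-byte
/// little-endian block. Returns `None` if the pointer does not fit in 32 bits.
fn ptr_to_block_bytes(ptr: usize) -> Option<[u8; BLOCK_SIZE]> {
    // `usize::to_le_bytes` would give `[u8; 8]` on 64-bit hosts, so narrow first.
    let ptr32 = u32::try_from(ptr).ok()?;
    Some(ptr32.to_le_bytes())
}
```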
Resolves INT-5725.

## Background: PersistentBoundaryAir Update (not in this PR)

The `feat/access-adapter-removal` branch removes memory access adapters
from the persistent memory path. Previously, separate `AccessAdapterAir`
circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the
granularity of memory bus interactions) and `CHUNK=8` (the granularity
of Merkle tree hashing/Poseidon2 digests).

The updated `PersistentBoundaryAir` eliminates this by operating on
8-byte chunks directly while tracking **per-sub-block timestamps**:

```
Old PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
                                                                         ^^^^^^^^^
                                                                      single timestamp

New PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
                                                                         ^^^^^^^^^^^^^
                                                                      one per 4-byte block
```

Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE
= 2` sub-blocks. The boundary AIR emits **two** memory bus interactions
per row (one per 4-byte sub-block), each with its own timestamp.
Untouched sub-blocks within a touched chunk keep `timestamp=0`, which
naturally balances against the initial-state row (also at `t=0`).
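
To make the per-sub-block interactions concrete, here is a rough Rust sketch of splitting one 8-value row into `BLOCKS_PER_CHUNK` bus messages. `BusMessage` and `sub_block_messages` are hypothetical names for illustration only, and values are modelled as plain `u32` rather than field elements.

```rust
const CHUNK: usize = 8;
const CONST_BLOCK_SIZE: usize = 4;
const BLOCKS_PER_CHUNK: usize = CHUNK / CONST_BLOCK_SIZE;

/// Hypothetical, simplified view of one memory bus message (not the PR's type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct BusMessage {
    address_space: u32,
    ptr: u32,
    values: [u32; CONST_BLOCK_SIZE],
    timestamp: u32,
}

/// Split one 8-value boundary row into one message per 4-value sub-block,
/// each carrying its own timestamp.
fn sub_block_messages(
    address_space: u32,
    chunk_ptr: u32,
    values: [u32; CHUNK],
    timestamps: [u32; BLOCKS_PER_CHUNK],
) -> Vec<BusMessage> {
    (0..BLOCKS_PER_CHUNK)
        .map(|b| {
            let start = b * CONST_BLOCK_SIZE;
            let mut block = [0u32; CONST_BLOCK_SIZE];
            block.copy_from_slice(&values[start..start + CONST_BLOCK_SIZE]);
            BusMessage {
                address_space,
                // Sub-block `b` lives `b * CONST_BLOCK_SIZE` past the chunk start.
                ptr: chunk_ptr + start as u32,
                values: block,
                timestamp: timestamps[b],
            }
        })
        .collect()
}
```

An untouched sub-block simply shows up as a message with `timestamp == 0` and the initial values.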

---

## This PR: GPU Trace Generation for the Updated Boundary Chip

This PR adapts the GPU trace generation pipeline to the new
per-sub-block-timestamp design. The core challenge: the CPU-side
"touched memory" partition arrives as sorted **4-byte records**, but the
boundary chip and Merkle tree need **8-byte records** with per-block
timestamps and initial-memory fill for untouched sub-blocks.

### New CUDA Kernel: `inventory.cu` — Merge Records on GPU

A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4,
1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:

1. **`cukernel_build_candidates`**: Each thread inspects one input
   record. If it starts a new 8-byte output chunk (different `ptr/8` than
   the previous record), it:
   - Reads the full 8-byte initial memory from device
   - Overwrites the touched 4-byte sub-block with final values + timestamp
   - If the next input record belongs to the same output chunk, also
     patches the other sub-block
   - Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same
     chunk)

2. **`cukernel_scatter_compact`**: CUB `ExclusiveSum` on flags produces
   output positions; flagged records are scattered into a compact output
   array.
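
The two phases can be sketched as a sequential CPU reference in Rust. This is not the CUDA kernel itself: the struct layouts, the `initial` callback standing in for the device-side initial-memory read, and the inclusion of the address space in the chunk comparison are all assumptions made for illustration. It also assumes the input is sorted by `(address_space, ptr)`.

```rust
const IN_BLOCK: usize = 4;
const OUT_CHUNK: usize = 8;
const BLOCKS: usize = OUT_CHUNK / IN_BLOCK;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InRec { address_space: u32, ptr: u32, timestamp: u32, values: [u32; IN_BLOCK] }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OutRec { address_space: u32, ptr: u32, timestamps: [u32; BLOCKS], values: [u32; OUT_CHUNK] }

/// Identifies the 8-value output chunk a 4-value record falls into.
fn chunk_key(r: &InRec) -> (u32, u32) {
    (r.address_space, r.ptr / OUT_CHUNK as u32)
}

/// Overwrite the sub-block touched by `r` with its final values and timestamp.
fn patch(out: &mut OutRec, r: &InRec) {
    let block = (r.ptr as usize % OUT_CHUNK) / IN_BLOCK;
    out.values[block * IN_BLOCK..(block + 1) * IN_BLOCK].copy_from_slice(&r.values);
    out.timestamps[block] = r.timestamp;
}

/// Sequential reference; `initial(address_space, chunk_ptr)` is a hypothetical
/// callback returning the 8 initial values of a chunk.
fn merge_records(
    input: &[InRec],
    initial: impl Fn(u32, u32) -> [u32; OUT_CHUNK],
) -> Vec<OutRec> {
    // Phase 1: one candidate slot and one flag per input record.
    let mut flags = vec![0usize; input.len()];
    let mut candidates: Vec<Option<OutRec>> = vec![None; input.len()];
    for i in 0..input.len() {
        let r = &input[i];
        if i > 0 && chunk_key(&input[i - 1]) == chunk_key(r) {
            continue; // duplicate within the same chunk, flag stays 0
        }
        let chunk_ptr = r.ptr - r.ptr % OUT_CHUNK as u32;
        let mut out = OutRec {
            address_space: r.address_space,
            ptr: chunk_ptr,
            timestamps: [0; BLOCKS],
            values: initial(r.address_space, chunk_ptr),
        };
        patch(&mut out, r);
        if let Some(next) = input.get(i + 1) {
            if chunk_key(next) == chunk_key(r) {
                patch(&mut out, next);
            }
        }
        candidates[i] = Some(out);
        flags[i] = 1;
    }

    // Phase 2: exclusive prefix sum over flags gives each output position.
    let mut positions = vec![0usize; input.len()];
    let mut running = 0;
    for i in 0..input.len() {
        positions[i] = running;
        running += flags[i];
    }
    let mut out = vec![None; running];
    for i in 0..input.len() {
        if flags[i] == 1 {
            out[positions[i]] = candidates[i];
        }
    }
    out.into_iter().flatten().collect()
}
```

On the GPU each loop iteration of phase 1 corresponds to one thread, which is why a thread only ever looks at its immediate neighbours.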

### Updated `boundary.cu` — Per-Block Timestamps

The `BoundaryRecord` struct is parameterized on `BLOCKS`:

```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
    uint32_t address_space;
    uint32_t ptr;
    uint32_t timestamps[BLOCKS];   // was: uint32_t timestamp;
    uint32_t values[CHUNK];
};
```

The persistent trace gen kernel writes `timestamps=[0,0]` for initial
rows and the actual per-block timestamps for final rows.
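
As an illustration of that row layout, the following Rust sketch builds the initial and final rows for one merged record. `BoundaryRow` and `boundary_rows` are hypothetical names; the hash column is omitted, and `leaf_label = ptr / CHUNK` and the `+1`/`-1` direction values are inferred from the walkthrough table in this description rather than from the kernel source.

```rust
const CHUNK: usize = 8;
const BLOCKS: usize = 2;

/// Hypothetical, simplified boundary row (hash column omitted).
#[derive(Debug, Clone, PartialEq, Eq)]
struct BoundaryRow {
    expand_direction: i8,
    address_space: u32,
    leaf_label: u32,
    values: [u32; CHUNK],
    timestamps: [u32; BLOCKS],
}

/// Build the (initial, final) row pair for one touched 8-value chunk.
fn boundary_rows(
    address_space: u32,
    ptr: u32,
    initial_values: [u32; CHUNK],
    final_values: [u32; CHUNK],
    final_timestamps: [u32; BLOCKS],
) -> [BoundaryRow; 2] {
    // Assumption: the leaf label is the chunk index within the address space.
    let leaf_label = ptr / CHUNK as u32;
    [
        BoundaryRow {
            expand_direction: 1, // initial row: all timestamps are zero
            address_space,
            leaf_label,
            values: initial_values,
            timestamps: [0; BLOCKS],
        },
        BoundaryRow {
            expand_direction: -1, // final row: actual per-block timestamps
            address_space,
            leaf_label,
            values: final_values,
            timestamps: final_timestamps,
        },
    ]
}
```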

### Updated `memory.rs` — Host-Side Orchestration

The Rust side:
1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into
GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip
(`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp =
max(timestamps[0], timestamps[1])` for the tree update
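
Step 4 amounts to collapsing the per-block timestamps into one. A minimal sketch, with hypothetical `MergedRecord` and `MerkleRecord` types standing in for the real GPU-compatible structs:

```rust
const CHUNK: usize = 8;
const BLOCKS_PER_CHUNK: usize = 2;

/// Hypothetical stand-in for a merged 8-value record with per-block timestamps.
#[derive(Debug, Clone, PartialEq, Eq)]
struct MergedRecord {
    address_space: u32,
    ptr: u32,
    timestamps: [u32; BLOCKS_PER_CHUNK],
    values: [u32; CHUNK],
}

/// Hypothetical stand-in for the record consumed by the Merkle tree update.
#[derive(Debug, Clone, PartialEq, Eq)]
struct MerkleRecord {
    address_space: u32,
    ptr: u32,
    timestamp: u32,
    values: [u32; CHUNK],
}

/// Collapse per-block timestamps into the single latest timestamp.
fn to_merkle_record(r: &MergedRecord) -> MerkleRecord {
    MerkleRecord {
        address_space: r.address_space,
        ptr: r.ptr,
        // Equivalent to max(timestamps[0], timestamps[1]) for two blocks.
        timestamp: r.timestamps.iter().copied().max().unwrap_or(0),
        values: r.values,
    }
}
```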

---

## Walkthrough: Sample Trace

Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4`
granularity:

| addr_space | ptr | timestamp | values          |
|------------|-----|-----------|-----------------|
| 1          | 0   | 5         | [1, 2, 3, 4]   |
| 1          | 4   | 10        | [5, 6, 7, 8]   |
| 1          | 16  | 3         | [9, 0, 0, 0]   |

Initial memory is all zeros.

### Step 1 — Convert to `InRec<4, 1>`

```
InRec[0]: { as=1, ptr=0,  timestamps=[5],  values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4,  timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3],  values=[9,0,0,0] }
```

### Step 2 — GPU `cukernel_build_candidates`

Each thread computes `output_chunk = ptr / 8`:

| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|-------------|---------------|------|
| 0   | 0   | 0           | N/A (first)   | 1    |
| 1   | 4   | 0           | yes           | 0    |
| 2   | 16  | 2           | no            | 1    |

**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`,
`timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` →
patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`

**Thread 1** (flag=0): Skipped (same output chunk as thread 0).

**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`,
`timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from
initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`

### Step 3 — Prefix sum + scatter compact

```
flags     = [1, 0, 1]
positions = [0, 1, 1]  (exclusive prefix sum)

out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```

### Step 4 — Boundary chip trace (2 rows per record = 4 active rows)

| Row | expand_dir | as | leaf_label | values | hash | timestamps |
|-----|------------|----|------------|---------------------------|---------------|------------|
| 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] |
| 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] |

Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4,
values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1,
ptr=20, values=[0,0,0,0], ts=0)`

The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks
of the final rows for untouched memory, eliminating the need for access
adapters.

### Step 5 — Merkle tree records

For the Merkle tree, each record uses a single timestamp =
`max(timestamps[0], timestamps[1])`:

| as | ptr | merkle_ts | values                    |
|----|-----|-----------|---------------------------|
| 1  | 0   | 10        | [1,2,3,4,5,6,7,8]        |
| 1  | 16  | 3         | [9,0,0,0,0,0,0,0]        |

These feed into the existing `update_with_touched_blocks` for Merkle
root computation.

---

## Other Changes

- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 +
DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 +
CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records
directly. Also fixed a potential `ilog2(0)` panic in
`calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values`
validates that the empty-partition edge case correctly reads initial
memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root
vs CPU.
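
For context on the `ilog2(0)` fix: Rust's integer `ilog2` panics when called on zero, while `checked_ilog2` returns `None` instead. The snippet below shows one general way to guard against that. The helper name is hypothetical, the choice of `0` as the fallback is an assumption, and this is not necessarily how `calculate_unpadded_height` was changed.

```rust
/// Hypothetical helper: floor(log2(n)), returning 0 for n == 0 instead of
/// panicking as `usize::ilog2` would.
fn ilog2_or_zero(n: usize) -> u32 {
    n.checked_ilog2().unwrap_or(0)
}
```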

---------

Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
4 participants