feat: remove memory access adapters for address space != 4 #2382
jonathanpwang wants to merge 9 commits into develop-v1.6.0
…alize memory (#2318) Closes INT-5723, INT-5724, INT-5726 --------- Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Resolves INT-5728, INT-5727, INT-5729.

Summary of changes:
- Everywhere the code used `Rv32HeapAdapterAir`, we switch to `Rv32VecHeapAdapterAir`.
- Everywhere the code used `Rv32HeapBranchAdapterAir`, we switch to a new `Rv32HeapBranchAdapterAir`, which accesses memory in the same way as `Rv32VecHeapAdapterAir` but is compatible with the branch CoreAirs.
- No other code uses `Rv32HeapAdapterAir` and `Rv32HeapBranchAdapterAir`, so the `heap.rs` and `heap_branch.rs` files were deleted.
- The interface of `Rv32VecHeapAdapterAir` and `Rv32HeapBranchAdapterAir` now differs from what the CoreAirs expect, so wrappers in `vec_to_heap.rs` convert between them.

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Resolves INT-5950. Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Remove CUDA code from the algebra and ecc extensions, since the production code currently uses hybrid by default. CUDA tests are switched to use hybrid chips instead of GPU chips.
Resolves INT-5949 INT-5952 INT-5951 INT-5948. --------- Co-authored-by: Paul Chen <chenpaul.pc@gmail.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
## CodSpeed Performance Report

Merging this PR will degrade performance by 83.36%.

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | WallTime | benchmark_execute_metered[fibonacci_recursive] | 16.9 ms | 53.7 ms | -68.44% |
| ❌ | WallTime | benchmark_execute_metered[revm_transfer] | 24.9 ms | 61.9 ms | -59.79% |
| ❌ | WallTime | benchmark_execute_metered[sha256] | 9.3 ms | 45.1 ms | -79.34% |
| ❌ | WallTime | benchmark_execute_metered[keccak256_iter] | 71.7 ms | 192.1 ms | -62.7% |
| ❌ | WallTime | benchmark_execute_metered[revm_snailtracer] | 7.3 ms | 44 ms | -83.36% |
| ❌ | WallTime | benchmark_execute_metered[keccak256] | 11.1 ms | 49.9 ms | -77.73% |
| ❌ | WallTime | benchmark_execute_metered[factorial_iterative_u256] | 74.4 ms | 150.9 ms | -50.71% |
| ❌ | WallTime | benchmark_execute_metered[fibonacci_iterative] | 14.3 ms | 50 ms | -71.35% |
| ❌ | WallTime | benchmark_execute_metered[quicksort] | 9 ms | 45.8 ms | -80.31% |
| ❌ | WallTime | benchmark_execute_metered[bubblesort] | 11.3 ms | 47.8 ms | -76.44% |

Footnotes:
1. No successful run was found on `develop-v1.6.0` (5a93213) during the generation of this report, so 95fdcd5 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
2. 36 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.
Resolves INT-6010. Currently, `AccessAdapterAir` is excluded from the metered execution checks iff `access_adapters_enabled` is true.
Cast usize pointers to u32 before to_le_bytes() to produce 4-byte arrays compatible with BLOCK_SIZE=4 memory configuration.
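
A minimal sketch of the cast described above. The helper name `ptr_to_block` is hypothetical and not taken from the PR. On a 64-bit host `usize::to_le_bytes()` returns 8 bytes, so the pointer is narrowed to `u32` first. The sketch assumes guest pointers always fit in 32 bits, since `as u32` truncates silently otherwise.

```rust
/// Hypothetical helper (illustrative name, not from the PR).
/// Encodes a pointer as a 4-byte little-endian array, matching a
/// BLOCK_SIZE = 4 memory configuration.
fn ptr_to_block(ptr: usize) -> [u8; 4] {
    // `usize::to_le_bytes()` is 8 bytes wide on 64-bit hosts, so narrow first.
    // Assumption: `ptr` fits in 32 bits; `as u32` would truncate otherwise.
    (ptr as u32).to_le_bytes()
}
```

If truncation must be detected rather than assumed away, `u32::try_from(ptr)` is the checked alternative.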
Resolves INT-5725.
## Background: PersistentBoundaryAir Update (not in this PR)
The `feat/access-adapter-removal` branch removes memory access adapters
from the persistent memory path. Previously, separate `AccessAdapterAir`
circuits handled the conversion between `CONST_BLOCK_SIZE=4` (the
granularity of memory bus interactions) and `CHUNK=8` (the granularity
of Merkle tree hashing/Poseidon2 digests).
The updated `PersistentBoundaryAir` eliminates this by operating on
8-byte chunks directly while tracking **per-sub-block timestamps**:
```
Old PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamp
                                                                        ^^^^^^^^^
                                                                        single timestamp
New PersistentBoundaryCols:
  expand_direction | address_space | leaf_label | values[8] | hash[8] | timestamps[2]
                                                                        ^^^^^^^^^^^^^
                                                                        one per 4-byte block
```
Each 8-byte chunk contains `BLOCKS_PER_CHUNK = CHUNK / CONST_BLOCK_SIZE
= 2` sub-blocks. The boundary AIR emits **two** memory bus interactions
per row (one per 4-byte sub-block), each with its own timestamp.
Untouched sub-blocks within a touched chunk keep `timestamp=0`, which
naturally balances against the initial-state row (also at `t=0`).
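
The per-row split into sub-block interactions can be sketched in plain Rust. This is an illustration only: `BusMessage` and `sub_block_messages` are hypothetical names, and values are modelled as `u32` rather than field elements.

```rust
const CONST_BLOCK_SIZE: usize = 4; // granularity of memory bus interactions
const CHUNK: usize = 8; // granularity of Merkle tree hashing
const BLOCKS_PER_CHUNK: usize = CHUNK / CONST_BLOCK_SIZE; // = 2

/// Hypothetical plain-data view of one memory bus interaction.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct BusMessage {
    address_space: u32,
    ptr: u32,
    values: [u32; CONST_BLOCK_SIZE],
    timestamp: u32,
}

/// Splits one boundary row (an 8-value chunk) into one message per
/// 4-value sub-block, each carrying its own timestamp.
fn sub_block_messages(
    address_space: u32,
    chunk_ptr: u32,
    values: [u32; CHUNK],
    timestamps: [u32; BLOCKS_PER_CHUNK],
) -> [BusMessage; BLOCKS_PER_CHUNK] {
    std::array::from_fn(|block| {
        let start = block * CONST_BLOCK_SIZE;
        let mut block_values = [0u32; CONST_BLOCK_SIZE];
        block_values.copy_from_slice(&values[start..start + CONST_BLOCK_SIZE]);
        BusMessage {
            address_space,
            ptr: chunk_ptr + start as u32,
            values: block_values,
            // An untouched sub-block simply carries timestamp 0.
            timestamp: timestamps[block],
        }
    })
}
```

For a chunk at `ptr=16` with `timestamps=[3, 0]`, this yields one message at `ptr=16` with `timestamp=3` and one at `ptr=20` with `timestamp=0`.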
---
## This PR: GPU Trace Generation for the Updated Boundary Chip
This PR adapts the GPU trace generation pipeline to the new
per-sub-block-timestamp design. The core challenge: the CPU-side
"touched memory" partition arrives as sorted **4-byte records**, but the
boundary chip and Merkle tree need **8-byte records** with per-block
timestamps and initial-memory fill for untouched sub-blocks.
### New CUDA Kernel: `inventory.cu` — Merge Records on GPU
A new `merge_records` kernel converts `InRec = MemoryInventoryRecord<4,
1>` into `OutRec = MemoryInventoryRecord<8, 2>` in two phases:
1. **`cukernel_build_candidates`**: Each thread inspects one input
   record. If it starts a new 8-byte output chunk (different `ptr/8` than
   the previous record), it:
   - Reads the full 8-byte initial memory from device
   - Overwrites the touched 4-byte sub-block with final values + timestamp
   - If the next input record belongs to the same output chunk, also
     patches the other sub-block
   - Sets `flag[i] = 1`; otherwise `flag[i] = 0` (duplicate within same
     chunk)
2. **`cukernel_scatter_compact`**: CUB `ExclusiveSum` on flags produces
   output positions; flagged records are scattered into a compact output
   array.
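
As a rough CPU reference for phase 1, the per-thread logic can be written as a sequential loop. This is a sketch under stated assumptions, not the kernel itself: `InRec`, `OutRec`, `build_candidates`, and the `initial` callback are hypothetical stand-ins, values are `u32`, and the chunk comparison also checks the address space, which the description above does not spell out.

```rust
const IN_BLOCK: usize = 4;
const OUT_CHUNK: usize = 8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct InRec { address_space: u32, ptr: u32, timestamp: u32, values: [u32; IN_BLOCK] }

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
struct OutRec { address_space: u32, ptr: u32, timestamps: [u32; 2], values: [u32; OUT_CHUNK] }

/// Assumption: two records share an output chunk only within one address space.
fn same_chunk(a: &InRec, b: &InRec) -> bool {
    a.address_space == b.address_space && a.ptr / OUT_CHUNK as u32 == b.ptr / OUT_CHUNK as u32
}

/// Overwrites the sub-block touched by `rec` with its final values and timestamp.
fn patch(out: &mut OutRec, rec: &InRec) {
    let block = (rec.ptr as usize % OUT_CHUNK) / IN_BLOCK;
    out.values[block * IN_BLOCK..(block + 1) * IN_BLOCK].copy_from_slice(&rec.values);
    out.timestamps[block] = rec.timestamp;
}

/// Sequential stand-in for phase 1; iteration `i` plays the role of thread `i`.
/// `initial(address_space, chunk_ptr)` is a hypothetical lookup of initial memory.
fn build_candidates(
    input: &[InRec],
    initial: impl Fn(u32, u32) -> [u32; OUT_CHUNK],
) -> (Vec<OutRec>, Vec<u32>) {
    let mut candidates = vec![OutRec::default(); input.len()];
    let mut flags = vec![0u32; input.len()];
    for (i, rec) in input.iter().enumerate() {
        if i > 0 && same_chunk(&input[i - 1], rec) {
            continue; // duplicate within the same chunk: flag stays 0
        }
        let chunk_ptr = rec.ptr - rec.ptr % OUT_CHUNK as u32;
        let mut out = OutRec {
            address_space: rec.address_space,
            ptr: chunk_ptr,
            timestamps: [0; 2], // untouched sub-blocks keep timestamp 0
            values: initial(rec.address_space, chunk_ptr),
        };
        patch(&mut out, rec);
        if let Some(next) = input.get(i + 1) {
            if same_chunk(rec, next) {
                patch(&mut out, next);
            }
        }
        candidates[i] = out;
        flags[i] = 1;
    }
    (candidates, flags)
}
```

Because each iteration reads only its own record and its two neighbours, the loop body maps directly onto one GPU thread per input record.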
### Updated `boundary.cu` — Per-Block Timestamps
The `BoundaryRecord` struct is parameterized on `BLOCKS`:
```c++
template <size_t CHUNK, size_t BLOCKS> struct BoundaryRecord {
    uint32_t address_space;
    uint32_t ptr;
    uint32_t timestamps[BLOCKS]; // was: uint32_t timestamp;
    uint32_t values[CHUNK];
};
```
The persistent trace gen kernel writes `timestamps=[0,0]` for initial
rows and the actual per-block timestamps for final rows.
### Updated `memory.rs` — Host-Side Orchestration
The Rust side:
1. Converts `TimestampedEquipartition<F, CONST_BLOCK_SIZE>` into
GPU-compatible `MemoryInventoryRecord<4,1>` structs
2. Uploads to device and calls the merge kernel
3. Sends merged records to the boundary chip
(`finalize_records_persistent_device`)
4. Converts merged records into Merkle records with `timestamp =
max(timestamps[0], timestamps[1])` for the tree update
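
Step 4 reduces the two per-block timestamps to a single one. A small sketch of that conversion, with hypothetical struct names (`MergedRecord`, `MerkleRecord`) and `u32` values standing in for the real types:

```rust
/// Hypothetical plain-data stand-in for a merged 8-value record.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MergedRecord {
    address_space: u32,
    ptr: u32,
    timestamps: [u32; 2],
    values: [u32; 8],
}

/// Hypothetical stand-in for the record consumed by the Merkle tree update.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MerkleRecord {
    address_space: u32,
    ptr: u32,
    timestamp: u32,
    values: [u32; 8],
}

/// Collapses the per-block timestamps into the later of the two.
fn to_merkle_record(rec: &MergedRecord) -> MerkleRecord {
    MerkleRecord {
        address_space: rec.address_space,
        ptr: rec.ptr,
        timestamp: rec.timestamps[0].max(rec.timestamps[1]),
        values: rec.values,
    }
}
```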
---
## Walkthrough: Sample Trace
Suppose the VM touches 3 memory cells at `CONST_BLOCK_SIZE=4`
granularity:
| addr_space | ptr | timestamp | values |
|------------|-----|-----------|-----------------|
| 1 | 0 | 5 | [1, 2, 3, 4] |
| 1 | 4 | 10 | [5, 6, 7, 8] |
| 1 | 16 | 3 | [9, 0, 0, 0] |
Initial memory is all zeros.
### Step 1 — Convert to `InRec<4, 1>`
```
InRec[0]: { as=1, ptr=0, timestamps=[5], values=[1,2,3,4] }
InRec[1]: { as=1, ptr=4, timestamps=[10], values=[5,6,7,8] }
InRec[2]: { as=1, ptr=16, timestamps=[3], values=[9,0,0,0] }
```
### Step 2 — GPU `cukernel_build_candidates`
Each thread computes `output_chunk = ptr / 8`:
| idx | ptr | output_chunk | same as prev? | flag |
|-----|-----|-------------|---------------|------|
| 0 | 0 | 0 | N/A (first) | 1 |
| 1 | 4 | 0 | yes | 0 |
| 2 | 16 | 2 | no | 1 |
**Thread 0** (flag=1): Builds an `OutRec` for chunk `ptr=0`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (0 % 8) / 4 = 0` → patches `values[0..4] = [1,2,3,4]`,
`timestamps[0] = 5`
- Next record (idx=1) is same chunk: `block_idx2 = (4 % 8) / 4 = 1` →
patches `values[4..8] = [5,6,7,8]`, `timestamps[1] = 10`
- Result: `{ as=1, ptr=0, timestamps=[5,10], values=[1,2,3,4,5,6,7,8] }`
**Thread 1** (flag=0): Skipped (same output chunk as thread 0).
**Thread 2** (flag=1): Builds an `OutRec` for chunk `ptr=16`:
- Reads initial memory `[0,0,0,0,0,0,0,0]` from device
- `block_idx = (16 % 8) / 4 = 0` → patches `values[0..4] = [9,0,0,0]`,
`timestamps[0] = 3`
- No next record → `timestamps[1] = 0`, `values[4..8] = [0,0,0,0]` (from
initial memory)
- Result: `{ as=1, ptr=16, timestamps=[3,0], values=[9,0,0,0,0,0,0,0] }`
### Step 3 — Prefix sum + scatter compact
```
flags = [1, 0, 1]
positions = [0, 1, 1] (exclusive prefix sum)
out[0] = thread 0's record
out[1] = thread 2's record
out_num_records = 2
```
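
The scan-then-scatter step can be mirrored on the CPU as follows. This is a sequential sketch, not the CUB-based kernel: `exclusive_sum` and `scatter_compact` are hypothetical helper names, and flags are assumed to be exactly 0 or 1.

```rust
/// Exclusive prefix sum: positions[i] = flags[0] + ... + flags[i - 1].
fn exclusive_sum(flags: &[u32]) -> Vec<u32> {
    let mut running = 0u32;
    flags
        .iter()
        .map(|&f| {
            let pos = running;
            running += f;
            pos
        })
        .collect()
}

/// Copies every flagged item to its scanned position, producing a dense array.
/// Assumption: each flag is 0 or 1, so every output slot is written exactly once.
fn scatter_compact<T: Clone>(items: &[T], flags: &[u32]) -> Vec<T> {
    assert_eq!(items.len(), flags.len());
    let positions = exclusive_sum(flags);
    let total: u32 = flags.iter().sum();
    let mut out: Vec<Option<T>> = vec![None; total as usize];
    for i in 0..items.len() {
        if flags[i] == 1 {
            out[positions[i] as usize] = Some(items[i].clone());
        }
    }
    out.into_iter()
        .map(|slot| slot.expect("every output slot is written once"))
        .collect()
}
```

With `flags = [1, 0, 1]`, the scan gives `[0, 1, 1]`, so items 0 and 2 land in output slots 0 and 1.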
### Step 4 — Boundary chip trace (2 rows per record = 4 active rows)
| Row | expand_dir | as | leaf_label | values | hash | timestamps |
|-----|------------|----|------------|---------------------------|---------------|------------|
| 0 | +1 (init) | 1 | 0 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 1 | -1 (final) | 1 | 0 | [1,2,3,4,5,6,7,8] | H([1,..,8]) | [5, 10] |
| 2 | +1 (init) | 1 | 2 | [0,0,0,0,0,0,0,0] | H([0,..0]) | [0, 0] |
| 3 | -1 (final) | 1 | 2 | [9,0,0,0,0,0,0,0] | H([9,0,..0]) | [3, 0] |
Each final row generates **two** memory bus sends:
- Row 1: send `(as=1, ptr=0, values=[1,2,3,4], ts=5)` and `(as=1, ptr=4,
values=[5,6,7,8], ts=10)`
- Row 3: send `(as=1, ptr=16, values=[9,0,0,0], ts=3)` and `(as=1,
ptr=20, values=[0,0,0,0], ts=0)`
The `ts=0` sends from initial rows balance against the `ts=0` sub-blocks
of the final rows for untouched memory, eliminating the need for access
adapters.
### Step 5 — Merkle tree records
For the Merkle tree, each record uses a single timestamp =
`max(timestamps[0], timestamps[1])`:
| as | ptr | merkle_ts | values |
|----|-----|-----------|---------------------------|
| 1 | 0 | 10 | [1,2,3,4,5,6,7,8] |
| 1 | 16 | 3 | [9,0,0,0,0,0,0,0] |
These feed into the existing `update_with_touched_blocks` for Merkle
root computation.
---
## Other Changes
- **`merkle_tree/mod.rs`**: Added `MERKLE_TOUCHED_BLOCK_WIDTH = 3 +
DIGEST_WIDTH` constant (distinct from `TIMESTAMPED_BLOCK_WIDTH = 3 +
CONST_BLOCK_SIZE`) since the Merkle tree now consumes 8-value records
directly. Also fixed a potential `ilog2(0)` panic in
`calculate_unpadded_height`.
- **New test** `test_empty_touched_memory_uses_full_chunk_values`
validates that the empty-partition edge case correctly reads initial
memory at `DIGEST_WIDTH` granularity and produces a matching Merkle root
vs CPU.
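
For context on the `ilog2(0)` fix: Rust's integer `ilog2` panics on zero, while `checked_ilog2` returns `None`. The helper below is a hypothetical illustration of one way to guard the call. It is not the actual change to `calculate_unpadded_height`, and the fallback value of 0 is an assumption.

```rust
/// Hypothetical guard (not the PR's code): floor(log2(n)) that tolerates zero.
fn log2_or_zero(n: usize) -> u32 {
    // `usize::ilog2` panics when `n == 0`; `checked_ilog2` returns `None` instead.
    // Assumption: treating the empty case as 0 is acceptable to the caller.
    n.checked_ilog2().unwrap_or(0)
}
```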
---------
Co-authored-by: Jonathan Wang <31040440+jonathanpwang@users.noreply.github.com>
Co-authored-by: Golovanov399 <Golovanov12345@gmail.com>
To be rebase merged.
Address space 4 will be removed after OpenVM 2.0.
We remove access adapters because, although they provide the theoretically lowest trace sizes, they incur execution penalties (specifically in metered execution) and significantly complicate the codebase.