
Hasher chiplet redesign #2927

Open

Al-Kindi-0 wants to merge 12 commits into next from al-hasher-chiplet-redesign

Conversation

@Al-Kindi-0
Contributor

This PR bundles several closely related changes in the hasher / chiplets area:

  • Hasher controller/permutation split -- the hasher trace is split into a compact controller region and a separate permutation segment, enabling permutation deduplication.
  • Packed 16-row Poseidon2 permutation segment -- the 31-step Poseidon2 schedule is packed from 32 rows down to 16 rows per unique permutation.
  • Sibling table soundness fix (#2220) -- a new mrupdate_id column domain-separates sibling-table entries, preventing cross-operation sibling swapping.
  • Memory address range checks (#1614) -- the memory chiplet gets w0/w1 address-limb columns, with 16-bit range-check lookups routed through the wiring bus.

These changes all touch the chiplets trace layout, bus plumbing, and AIR structure, so landing them together keeps the transition coherent.

Why

1. Deduplicate repeated permutations

The old monolithic hasher consumed 32 rows per permutation request, even if the same input state appeared repeatedly.

With the new design:

  • the controller records each request as a 2-row (input, output) pair,
  • the permutation segment executes one packed 16-row cycle per unique input state,
  • a multiplicity counter records how many controller pairs map to the same cycle.

For M requests with U unique input states, the rough cost changes from:

  • old: 32M
  • new: 2M + pad_to_16 + 16U

This is a clear win whenever states repeat (Merkle workloads, identical MAST roots, ...).
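The cost comparison above can be sketched as a small model. This is illustrative only: `pad_to_16` is interpreted here as padding the controller region up to a multiple of 16 so the permutation segment starts cycle-aligned, which is an assumption about the layout, not confirmed by the PR text.

```rust
/// Rough row-cost model from the PR description (illustrative only).
/// `m` = total permutation requests, `u` = unique input states.
fn old_rows(m: u64) -> u64 {
    // Old monolithic hasher: 32 rows per request, repeats included.
    32 * m
}

fn new_rows(m: u64, u: u64) -> u64 {
    // Controller: one 2-row (input, output) pair per request.
    let controller = 2 * m;
    // Assumption: controller region padded to a 16-row boundary so the
    // permutation segment starts cycle-aligned.
    let padded = controller.next_multiple_of(16);
    // Permutation segment: one packed 16-row cycle per unique state.
    padded + 16 * u
}

fn main() {
    // Merkle-heavy workload: many repeated sibling states.
    let (m, u) = (1_000, 100);
    assert!(new_rows(m, u) < old_rows(m));
    println!("old = {}, new = {}", old_rows(m), new_rows(m, u));
}
```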

2. Fix sibling-table soundness

The old sibling-table encoding was vulnerable to cross-operation sibling reuse. Adding mrupdate_id domain-separates entries so sibling-table balance is enforced per MRUPDATE instance, not globally across unrelated operations.

3. Add memory address decomposition checks

The memory chiplet now decomposes word addresses into two 16-bit limbs and proves the decomposition using range-check lookups. This closes an important missing piece in memory soundness while reusing the existing wiring-bus infrastructure.
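The limb decomposition itself is straightforward; a minimal sketch (function names are illustrative, not the actual trace-building code):

```rust
/// Split a word address into two 16-bit limbs, as the new w0/w1
/// columns do. In the AIR, the fact that each limb fits in 16 bits is
/// what the range-check lookups on the wiring bus enforce.
fn decompose(addr: u32) -> (u16, u16) {
    let w0 = (addr & 0xFFFF) as u16; // low 16 bits
    let w1 = (addr >> 16) as u16;    // high 16 bits
    (w0, w1)
}

fn recompose(w0: u16, w1: u16) -> u32 {
    (w0 as u32) | ((w1 as u32) << 16)
}

fn main() {
    let addr = 0xDEAD_BEEF;
    let (w0, w1) = decompose(addr);
    // Both limbs are u16 by construction; recomposition is exact.
    assert_eq!(recompose(w0, w1), addr);
}
```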

Design

Hasher: two-region trace

The hasher trace is split into two contiguous regions:

  • Controller (perm_seg = 0)
    Compact input/output row pairs, one pair per permutation request.

  • Permutation segment (perm_seg = 1)
    One packed 16-row Poseidon2 cycle per unique input state.

A LogUp permutation-link on the shared V_WIRING auxiliary column ties controller requests to the corresponding permutation cycles.
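The balance the LogUp link enforces can be illustrated with a toy example. Everything here is a stand-in: the field (2^61 - 1 is not Miden's field), the challenge, and the scalar "fingerprints" (the real bus encodes full 12-element states under random challenges).

```rust
// Toy LogUp balance check: repeated controller requests on one side,
// unique states weighted by multiplicity on the other, accumulated as
// sums of mult / (alpha - fingerprint) over F_p.
const P: u128 = (1 << 61) - 1; // a Mersenne prime (illustrative field)

fn pow_mod(mut b: u128, mut e: u128) -> u128 {
    b %= P;
    let mut r = 1;
    while e > 0 {
        if e & 1 == 1 {
            r = r * b % P;
        }
        b = b * b % P;
        e >>= 1;
    }
    r
}

// Modular inverse via Fermat's little theorem (P is prime).
fn inv(x: u128) -> u128 {
    pow_mod(x, P - 2)
}

fn main() {
    let alpha: u128 = 123_456_789; // stand-in for a random verifier challenge
    // Controller side: one fingerprint per request, repeats included.
    let requests = [7u128, 7, 42, 7, 42];
    // Permutation side: unique fingerprints with their multiplicities.
    let unique = [(7u128, 3u128), (42, 2)];

    let lhs = requests
        .iter()
        .fold(0u128, |acc, &s| (acc + inv((alpha + P - s) % P)) % P);
    let rhs = unique
        .iter()
        .fold(0u128, |acc, &(s, m)| (acc + m * inv((alpha + P - s) % P) % P) % P);

    // Equality of the two running sums is what the shared auxiliary
    // column enforces at the last row.
    assert_eq!(lhs, rhs);
}
```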

Packed 16-row Poseidon2 schedule

The 31-step Poseidon2 schedule is packed as:

  • row 0: init + ext1
  • rows 1-3: ext2..ext4
  • rows 4-10: 7 × (3 packed internal rounds)
  • row 11: int22 + ext5
  • rows 12-14: ext6..ext8
  • row 15: boundary / final state
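The schedule above can be sanity-checked by counting steps per row (counting init as one step, with row 15 carrying none); the per-row counts sum to the 31-step total:

```rust
/// Steps executed on each of the 16 packed rows, per the schedule above.
fn steps_on_row(row: usize) -> usize {
    match row {
        0 => 2,      // init + ext1
        1..=3 => 1,  // ext2..ext4, one external round each
        4..=10 => 3, // three packed internal rounds each
        11 => 2,     // int22 + ext5
        12..=14 => 1, // ext6..ext8
        15 => 0,     // boundary / final state, no rounds
        _ => unreachable!(),
    }
}

fn main() {
    let total: usize = (0..16).map(steps_on_row).sum();
    // init + 8 external rounds + 22 internal rounds = 31 steps
    assert_eq!(total, 31);
}
```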

Packed internal rows reuse s0/s1/s2 as witness columns to keep the constraint degree bounded. Unused witness slots are explicitly zero-constrained (out of caution), though this could be relaxed.

Column layout

Hasher: 16 -> 20

s0 s1 s2 | h0..h11 | node_index | mrupdate_id | is_boundary | direction_bit | perm_seg
   3          12          1             1             1              1             1      = 20

New / newly significant columns:

  • mrupdate_id -- domain separator for sibling-table entries
  • is_boundary -- marks first controller input / last controller output
  • direction_bit -- propagated Merkle routing bit on controller rows
  • perm_seg -- explicit controller vs permutation-region flag
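One way to read the 20-column layout is as cumulative index constants. These names and groupings are hypothetical; the real code's constants may differ.

```rust
// Hypothetical column-index constants for the 20-column hasher layout
// described above: 3 selectors, 12 state columns, then the scalar columns.
const SELECTOR_COLS: usize = 3; // s0, s1, s2
const STATE_COLS: usize = 12; // h0..h11
const NODE_INDEX_COL: usize = SELECTOR_COLS + STATE_COLS; // 15
const MRUPDATE_ID_COL: usize = NODE_INDEX_COL + 1; // 16
const IS_BOUNDARY_COL: usize = MRUPDATE_ID_COL + 1; // 17
const DIRECTION_BIT_COL: usize = IS_BOUNDARY_COL + 1; // 18
const PERM_SEG_COL: usize = DIRECTION_BIT_COL + 1; // 19
const HASHER_TRACE_WIDTH: usize = PERM_SEG_COL + 1; // 20

fn main() {
    // The widths in the ASCII layout above sum to the stated trace width.
    assert_eq!(HASHER_TRACE_WIDTH, 20);
}
```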

Memory: 15 -> 17

Two new columns:

  • w0
  • w1

These decompose the word address into 16-bit limbs. The wiring bus carries the corresponding range-check lookups.

Constraints

Hasher constraints now total 100.

Constraint group breakdown
| Group | Count | Purpose |
| --- | --- | --- |
| Selector booleanity | 3 | s0, s1, s2 binary on controller rows |
| Perm segment | 7 | perm_seg confinement, booleanity, monotonicity, cycle alignment |
| Structural | 7 | Confine is_boundary / direction_bit to valid row types |
| Lifecycle | 2 | Operation lifecycle invariants |
| Controller adjacency | 2 | Input row must be followed by output row |
| Controller pairing | 4 | First-row constraint, output non-adjacency, padding stability |
| Perm witness-shape | 3 | Zero witness slots when unused |
| Perm init+ext | 12 | Row 0 packed transition |
| Perm external | 12 | External-round transitions |
| Perm packed internal | 15 | 3 witness checks + 12 next-state constraints |
| Perm int+ext | 13 | 1 witness check + 12 next-state constraints |
| MRUPDATE ID | 2 | Increment / zero-on-perm rules |
| Sponge capacity | 4 | Preserve capacity across continuations |
| Output index | 1 | Output-row node_index rule |
| Merkle index | 4 | Index decomposition / continuity / direction bit |
| Merkle input state | 4 | Zero capacity on Merkle input rows |
| Merkle routing | 5 | Route digest into correct rate half |
| **Total** | **100** | |

Trace width impact

| Chiplet | Before | After | Delta |
| --- | --- | --- | --- |
| Hasher | 16 | 20 | +4 |
| Memory | 15 | 17 | +2 |
| Net main trace impact | | | +1 |

The new main trace width is 72.

No new auxiliary columns were added:

  • the permutation-link bus shares V_WIRING
  • memory address range checks also use the existing wiring-bus path

@Al-Kindi-0
Contributor Author

To compare against the numbers in #2869 for the recursive verifier (verifying a program executing in 2^20 cycles):

  ┌────────────────────────────┬──────────────┬──────────────┬───────────┬──────────┐
  │         Component          │   Old        │   New        │  Change   │ Savings  │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ Core trace (decoder+stack) │ 41,652       │ 41,516       │ -136      │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ Range checker              │ 5,129        │ 5,217        │ +88       │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ Chiplets total             │ 273,769      │ 118,657      │ -155,112  │ -57%     │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ - Hasher                   │ 250,816      │ 96,256       │ -154,560  │ -62%     │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ - Bitwise                  │ 3,104        │ 3,104        │ 0         │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ - Memory                   │ 13,758       │ 13,406       │ -352      │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ - ACE                      │ 6,090        │ 5,890        │ -200      │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ - Kernel ROM               │ 0            │ 0            │ 0         │          │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ Padded trace length        │524,288 (2^19)│131,072 (2^17)│           │ -4x      │
  ├────────────────────────────┼──────────────┼──────────────┼───────────┼──────────┤
  │ Padding                    │ 47%          │ 9%           │           │          │
  └────────────────────────────┴──────────────┴──────────────┴───────────┴──────────┘

This is a 4x improvement in padded trace length for the above case.
(Note that the change in the decoder+stack row count is due to a constraint change that affects ACE circuit loading.)

@Al-Kindi-0 Al-Kindi-0 force-pushed the al-hasher-chiplet-redesign branch from da140ac to 77773d9 Compare March 27, 2026 15:19
@Al-Kindi-0 Al-Kindi-0 changed the title Al hasher chiplet redesign Hasher chiplet redesign Mar 27, 2026
@adr1anh adr1anh self-requested a review March 28, 2026 09:28
let w1: AB::Expr = local.chiplets[MEMORY_WORD_ADDR_HI_COL_IDX - CHIPLETS_OFFSET].clone().into();
let w1_mul4: AB::Expr = w1.clone() * AB::Expr::from_u16(4);

let den0: AB::ExprEF = alpha.clone() + Into::<AB::ExprEF>::into(w0);
Contributor


Shouldn't this add protocol-level domain separation before v_wiring can safely carry ACE wires, raw memory range-check values, and the new hasher perm-link messages together? Right now the memory side uses plain alpha + w0/w1/4*w1, ACE uses encode([clk, ctx, id, ...]), and the perm-link uses encode([0|1, h0..h11]) on the same LogUp column.

If any of those encodings were to alias, could one subsystem cancel another on the shared sum? #1614 explicitly called out adding an op-label when reusing the wiring bus, and I don't see that namespace implemented here yet.

Contributor

@Nashtare Nashtare left a comment


I would need to do another pass because this is pretty dense, but I left a couple of comments while familiarizing myself with it.

Comment on lines +45 to +46
let hs0 = main_trace.chiplet_selector_1(row);
let hs1 = main_trace.chiplet_selector_2(row);
Contributor


nit: the indexing shift is a bit confusing when looking below; maybe rename to

Suggested change
- let hs0 = main_trace.chiplet_selector_1(row);
- let hs1 = main_trace.chiplet_selector_2(row);
+ let hs1 = main_trace.chiplet_selector_1(row);
+ let hs2 = main_trace.chiplet_selector_2(row);

with corresponding updates later in the code. Would that be clearer?

Comment on lines +20 to +23
/// TODO: These naive labels (0 and 1) risk collisions with other messages on the shared
/// v_wiring column (ACE wiring and memory range checks). Revisit when refactoring the buses.
const LABEL_IN: Felt = Felt::ZERO;
const LABEL_OUT: Felt = Felt::ONE;
Contributor


See my comment below in this file, but I was just wondering if we should not treat this now rather than deferring it?

}
} else {
// Permutation segment.
// This works because the hasher is always the first chiplet (rows start at 0)
Contributor


minor: do we have an invariant check that this is actually the case? Just out of precaution

Contributor


I’m pretty sure we always have 1 hash, and the chiplet selectors make sure that it comes before any other chiplet

Comment on lines +70 to +72
/// Maps input state -> multiplicity for permutation deduplication.
/// During finalize_trace(), one 16-row perm cycle is emitted per entry.
perm_request_map: BTreeMap<StateKey, u64>,
Contributor


nit: does ordering matter? Perf-wise, I think a HashMap may be more efficient here.

Contributor


I think this is mainly because of no-std, but I don't think the ordering matters.

Contributor


HashMap can be pulled from alloc / hashbrown though (if perf matters here)
