Rebase/cache refactoring onto main by Aquaticfuller · Pull Request #20 · pulp-platform/ManyRVData

Aquaticfuller · 2026-05-18T20:38:39Z

No description provided.

- wire DataPartSplit/folded params through cluster/group/tile
- implement skewed folded data SRAM mapping in cachepool_tile
- adjust cluster wrapper tb for the new configuration

- add scalar cache tests that run basic and stress patterns without crossing 128-bit parts
- add vector cache tests that use RVV loads/stores on 128-bit chunks and verify data integrity
- integrate both tests into the test CMake and keep patterns aligned to folded-cache part size

Select a single part per column/bank per cycle and prioritize write parts over read parts to avoid clobbering bank signals.

- cachepool_tile: use EffectiveCoalFactor=1 in folded mode; pass to cache ctrl.
- cachepool_cc: size Spatz response FIFO with NumSpatzOutstandingLoads; add overflow assert.
- tcdm_cache_interco: add non-synthesis outstanding scoreboard/asserts for req/rsp matching.

…he access interleaving.

Bump insitu-cache to the folded/hash-way revision, thread UseHashWaySelect through cluster/tile, and queue Spatz memory responses through the local response FIFO instead of bypassing write acks.

cachepool_cc: per-port sb_q[user.req_id] slot table for out-of-order rsp matching; watchdog dumps stuck ids. Gated by parameter (default off, +define+ENABLE_SPATZ_REQ_SCOREBOARD to enable).
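The id-indexed slot table described above can be modelled in a few lines of C. This is only a behavioural sketch of the idea (one slot per request id, responses matched in any order, a watchdog that reports long-outstanding ids). It is not the RTL, and `sb_t`, `sb_issue`, `sb_retire`, `sb_stuck` and `SB_SLOTS` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SB_SLOTS 16u  /* hypothetical: number of distinct req_id values */

/* One slot per request id; a slot is busy from issue until its response. */
typedef struct {
    bool     busy[SB_SLOTS];
    uint32_t issue_cycle[SB_SLOTS];
} sb_t;

/* Record a request. Returns false if the id is already in flight. */
static bool sb_issue(sb_t *sb, uint32_t req_id, uint32_t cycle) {
    assert(req_id < SB_SLOTS);
    if (sb->busy[req_id]) return false;   /* id reused before its response */
    sb->busy[req_id] = true;
    sb->issue_cycle[req_id] = cycle;
    return true;
}

/* Match a response by id, in any order. Returns false for an orphan. */
static bool sb_retire(sb_t *sb, uint32_t rsp_id) {
    assert(rsp_id < SB_SLOTS);
    if (!sb->busy[rsp_id]) return false;  /* response without a request */
    sb->busy[rsp_id] = false;
    return true;
}

/* Watchdog: count ids that have been outstanding for more than `limit`. */
static unsigned sb_stuck(const sb_t *sb, uint32_t now, uint32_t limit) {
    unsigned n = 0;
    for (uint32_t id = 0; id < SB_SLOTS; ++id)
        if (sb->busy[id] && now - sb->issue_cycle[id] > limit) ++n;
    return n;
}
```

Because the table is indexed by id rather than kept as a FIFO, a response for id 5 can retire before a response for id 3 without being flagged.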

The skew-bank arbiter at (col, bank_sel) picks writes over reads without exposing the loser; a hardwired l1_data_bank_gnt=1 caused the upstream to consume stale rdata when another way wrote the same column. Compute any_other_write_in_col (loop-free, depends only on part_we) and gate gnt by it: writes always granted, reads granted iff no OTHER way writes the same (col, bank_sel). Excludes own way's writes so own idle words aren't spuriously stalled. Fixes multi-core coherence in rlc-mimic and unlocks AllowReadDuringWrite=1 on data banks.
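The grant rule above (writes always granted, reads granted only if no other way writes the same (col, bank_sel)) can be sketched as a small C model. The struct layout, `NUM_WAYS` and the function signatures are assumptions made for illustration; the RTL computes the same condition combinationally from `part_we`, and "loop-free" there refers to combinational loops, not to the `for` loop used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_WAYS 4u  /* hypothetical way count */

/* Per-way access in one cycle: we == false means read (or idle). */
typedef struct {
    bool     we;
    uint32_t col;
    uint32_t bank_sel;
} way_req_t;

/* True if a way other than `self` writes the same (col, bank_sel). */
static bool any_other_write_in_col(const way_req_t req[NUM_WAYS], uint32_t self) {
    assert(self < NUM_WAYS);
    for (uint32_t w = 0; w < NUM_WAYS; ++w) {
        if (w == self) continue;          /* own writes are excluded */
        if (req[w].we && req[w].col == req[self].col &&
            req[w].bank_sel == req[self].bank_sel)
            return true;
    }
    return false;
}

/* Writes are always granted; reads only when no other way wins the bank. */
static bool bank_gnt(const way_req_t req[NUM_WAYS], uint32_t self) {
    if (req[self].we) return true;
    return !any_other_write_in_col(req, self);
}
```

With this gating, a read that loses arbitration sees `gnt == 0` and is retried instead of consuming the read data of a bank that was actually written that cycle.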

- l1cache: flush+wait before xbar commit so the reconfig doesn't leave dirty lines bound to the old hash layout.
- mcs-lock: move cluster barrier before the non-zero-core spin loop (otherwise cores 1+ never barrier and core 0 deadlocks).
- load-store: print the correct buffer name (B/C, not A) in the B/C error messages; add c_ptr to the pointer dump.
- idotp-32b: include got/expected in Check Failed! print.

…rw} tests

Register five new cache-focused tests in CMakeLists.txt:
- cache-line-rw-smoke: single-core line-granular RW smoke
- cache-rlc-mimic: RLC traffic mimic (vector load/store)
- cache-vector-rw: multi-iteration vector load-store kernel
- cache-coverage: 12-phase multi-core cache stress / coverage
- cache-coverage-min: minimal phase-06 writeback-loss repro

- Bender.lock: bump insitu-cache to the rev with the wrapper/coalescer SBs and the SYNC_CTRL_CHECK_PEND fix.
- Makefile: define ENABLE_SPATZ_REQ_SCOREBOARD so the in-RTL Spatz req/rsp watchdog is on by default.
- cachepool_tile.sv: per-port pre-strip TCDM req tracer (+sb_pretrace_addr_lo/hi) and byte-granular shadow-memory model (+mm_enable) that $errors on DATA / TYPE / ORPHAN_RSP mismatches. Both passive, off by default, sim-only.

- config.mk: derive axi_user_width as base + 2*(idx_width(num_tiles)-1). Previous widths truncated bank_id MSB on the AXI loopback, routing cache_ctrl refill responses to the icache bypass slot.
- cachepool_group.sv: use the source tile id `t` (not target_tile) for the request destination slot, so the response (routed by user.tile_id mod NumRemotePortCore) lands on the same xbar mst port as the request.
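As a worked example of the width formula `base + 2*(idx_width(num_tiles)-1)`, the sketch below evaluates it in C. It assumes `idx_width(n)` means "bits needed to index n entries, with a minimum of 1", which is the convention of `idx_width` in pulp-platform common_cells; whether config.mk uses exactly this definition, and the value of `base`, are not stated in the PR, so both are assumptions here.

```c
#include <assert.h>

/* Assumed definition: ceil(log2(n)) with a floor of 1 bit. */
static unsigned idx_width(unsigned n) {
    assert(n > 0);
    unsigned bits = 0;
    while ((1u << bits) < n) ++bits;   /* smallest bits with 2^bits >= n */
    return bits > 0 ? bits : 1u;
}

/* Width formula from the commit message; `base` is a placeholder value. */
static unsigned axi_user_width(unsigned base, unsigned num_tiles) {
    return base + 2u * (idx_width(num_tiles) - 1u);
}
```

Under these assumptions, one or two tiles add no extra bits, four tiles add 2 bits, and sixteen tiles add 6 bits on top of `base`.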

The `win` offset combined `it * 64u + cid * 7u`. The `cid * 7u` term is not a multiple of 4 for cid = 1, 2, 3 (7, 14, 21), so `wp = (base + win + j * 4U)` ended up unaligned for any non-zero core. Snitch raises a misaligned load/store exception for unaligned uint32_t accesses, and this runtime has no exception handler installed, so cores 1+ entered a trap loop at PC 0x800005fc while cores 0/2/3 stalled at the next sync_all. Result: the test always timed out without printing UART. Change the per-core stride to `cid * 28u` (= 7 * 4) so the offset stays varied per core but is always 4-aligned, restoring the original "varied window" intent. Test now passes with retval=0.
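The alignment argument can be checked with a few lines of C. This is an illustrative sketch, not the test source: `win_offset` and `word_is_aligned` are hypothetical helpers, and the base address is assumed to be 4-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offset of the per-core window. `stride` is the per-core multiplier:
 * 7 before the fix, 28 after. */
static uint32_t win_offset(uint32_t it, uint32_t cid, uint32_t stride) {
    return it * 64u + cid * stride;
}

/* True if word j of the window is 4-byte aligned, given a 4-aligned base. */
static bool word_is_aligned(uint32_t base, uint32_t it, uint32_t cid,
                            uint32_t j, uint32_t stride) {
    assert(base % 4u == 0);
    uint32_t addr = base + win_offset(it, cid, stride) + j * 4u;
    return addr % 4u == 0;
}
```

Since `it * 64u` and `j * 4u` are always multiples of 4, alignment depends only on `cid * stride`: a stride of 7 misaligns cores 1 to 3, while a stride of 28 keeps every core aligned.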

After the per-core stats printf, non-zero cores entered `while(1){}` and were never able to reach the second `snrt_cluster_hw_barrier()` below. Core 0 then waited forever at that barrier for cores 1+. Result: the kernel never reached `return 0`, _snrt_exit was never called, and EOC was never asserted -- the sim always timed out. Removing the if/while-loop (and the now-pointless second barrier) lets every core return cleanly; _snrt_exit only fires set_eoc on core 0 anyway, and the other cores halt naturally. mcs-lock now reaches EOC retval=0 cleanly.

The existing fft-32b_M1024_N16 test is parameterized for 16 cores -- data_1024_16.h has active_cores=16 baked in and the kernel slices the work by active_cores. On a 4-core config only 4/16 of the FFT actually executes, so the output is uniformly wrong (r:1024,i:1024) and the test self-fails with retval=1. Add a 1024_4 variant alongside, generated via gen_data.py from a new fft_1024_4.json config. Both variants now coexist; the N16 variant is appropriate for 4t/16c and the N4 variant for 1t/4c. The new 1024_4 test passes cleanly (r:0, i:0, retval=0).
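To illustrate why a 16-core data set fails on 4 cores, the sketch below counts how much work gets done when the work is divided into `active_cores` equal slices but fewer cores actually run. It assumes a simple equal-slice partitioning; the real kernel's slicing may differ, and `items_processed` is a hypothetical helper.

```c
#include <assert.h>

/* Items actually processed when work is split into `active_cores` equal
 * slices and only `running_cores` cores exist to pick slices up. */
static unsigned items_processed(unsigned n_items, unsigned active_cores,
                                unsigned running_cores) {
    assert(active_cores > 0 && n_items % active_cores == 0);
    unsigned per_core = n_items / active_cores;
    unsigned workers  = running_cores < active_cores ? running_cores
                                                     : active_cores;
    return per_core * workers;
}
```

With 1024 points and `active_cores = 16`, four running cores cover only 256 points (4/16 of the work); the 1024_4 variant sets `active_cores = 4`, so the same four cores cover all 1024.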

Midpoint between cachepool_fpu_512 (1t/4c — passes) and cachepool_4t_fpu_512 (4t/16c — broken). Used to isolate whether the multi-tile cache failures are specific to 4 tiles or to any configuration with NumTiles > 1. cache-line-rw-smoke fails at 2t/8c with the same DATA-MISMATCH signature seen at 4t/16c, confirming the bug is in the inter-tile / group-xbar path itself, not a 4-tile-only artefact.

Reduces cache-line-rw-smoke to the smallest pattern that still triggers the multi-tile cache bug:
* only core 0 does work (1 store + 1 load to one cache line)
* all other cores immediately return 0
* no printf, no library calls
* 16 words written + read

On cachepool_fpu_512 (1 tile) this passes cleanly. On cachepool_2t_fpu_512 and cachepool_4t_fpu_512 the SB still flags RESP DATA MISMATCH on cache lines touched by the startup/exit runtime path (not by the test data). The test's own data check PASSES because the corrupted line is not the test's buf line, but the underlying cache-state bug is reproduced.

Conclusion from this repro: the bug fires the moment NumTiles > 1 even on purely single-core local activity -- it is NOT a coherence problem (no cross-tile sharing happens here) and NOT a remote-port routing problem (no real remote traffic from cores 1+). The suspect surface narrows to the multi-tile-conditional rotation math in tcdm_cache_interco/cachepool_tile (bits_to_rotate widens from CacheBankBits to CacheBankBits+TileBits at NumTiles>1) or the remote-port muxing inside the local cache_ctrl when those ports are wired in even though they carry no traffic.

Use for future waveform-level debug:

    make vsim config=cachepool_2t_fpu_512 -B ./sim/bin/cachepool_cluster.vsim \
      software/build/CachePoolTests/test-cachepool-minimal-tile0-repro

The multi-tile (NumTiles>1) cluster instantiates cachepool_group, which then instantiates cachepool_tile -- but cachepool_group did not forward UseHashWaySelect, so the tile fell back to its own 1'b0 default. This silently disabled hash-way select on every multi-tile build and triggered the forwarding-buffer / skewed-fold data-corruption path. Add the missing parameter wiring; default to 1'b1 to match cachepool_cluster.sv.

…probes

…tives

- cachepool_cluster.sv: zero-extend refill_user_t to the AxiUserWidth user_i port of reqrsp_to_axi + ASSERT_INIT(AxiUserWidth >= $bits(refill_user_t)). Behavior-preserving (was an implicit zero-extend); closes the refill-misroute hazard at elaboration.
- config.mk: correct stale axi_user_width comment (cache_info_t has no tile_id; the tile term is over-provisioned headroom, not an exact fit).
- lint.tcl: DU-scoped W123 waivers for cachepool_cache_ctrl (coalescer_resp/bypass_resp driven via i_bypass_xbar slv_rsp_o aggregate) + spatz_decoder.

Mark vtNext/pduWithoutPoll/byteWithoutPoll _Atomic and use atomic_fetch_add (unique per-PDU SN + shared stats) instead of plain +=; add a release fence before each lock unlock. Fixes the multi-consumer data race; K100 passes eoc_clean with 0 scoreboard mismatches on 2t/4t x rp1/rp2.
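The pattern described above (atomic counters plus a release fence before unlock) looks roughly like this in C11. It is a sketch under stated assumptions: the variable names follow the commit message, but their types, the helper functions and the spinlock-style `release_lock` are hypothetical and do not reproduce the test's actual lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared state, named after the fields in the commit message. */
static _Atomic uint32_t vtNext;          /* next sequence number to hand out */
static _Atomic uint32_t pduWithoutPoll;  /* shared statistics counter */

/* fetch-add is one indivisible read-modify-write, so concurrent callers
 * each receive a distinct SN (a plain `+=` could hand out duplicates). */
static uint32_t take_sn(void) {
    return atomic_fetch_add(&vtNext, 1u);
}

/* Shared statistics are bumped the same way so no increment is lost. */
static void count_pdu(void) {
    atomic_fetch_add(&pduWithoutPoll, 1u);
}

/* The release fence orders all earlier writes in the critical section
 * before the store that frees the lock. */
static void release_lock(atomic_int *lock) {
    assert(lock != NULL);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(lock, 0, memory_order_relaxed);
}
```

A consumer that later acquires the lock with acquire semantics is then guaranteed to observe the writes made before `release_lock`.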

…ys_ff/rst_ni

- Move per-tile TCDM tracer + memory-model VIP out of cachepool_tile.sv into hardware/src/verif/cachepool_tile_tcdm_checker.sv (bind-attached); keeps the RTL body synthesis-clean.
- tcdm_cache_interco: put rst_ni in the Probe-D always_ff sensitivity list.
- Drop the cachepool_cache_ctrl W123 waiver: root-caused to '{}-as-lvalue on the bypass_xbar output ports (fixed in insitu-cache to concatenation {}); lint confirms the cache_ctrl W123 is gone.
- Bump insitu-cache lock to 65940a3 (the concatenation fix).

- Add l1d_use_folded / l1d_fold_way_group / l1d_use_hash_way / l1d_use_fwd_buf knobs to cachepool_512.mk + cachepool_fpu_512.mk (default = production: folded+hash+fwd); emit them as VLOG_DEFS macros.
- Thread UseForwardingBuffer through cluster->group->tile->cache_ctrl, and macro-default UseFoldedDataBanks/FoldWayGroup/UseHashWaySelect/UseForwardingBuffer at the wrapper from those macros.
- Fix dropped param paths: forward UseFoldedDataBanks/FoldWayGroup through the cluster->group instantiation, and UseHashWaySelect/UseForwardingBuffer through wrapper->cluster (previously stuck at module defaults).
- Bump insitu-cache lock to fbabd6a (fwd-buffer param + PartSplit=1 fixes).

Verified: folded default bit-identical pass; unfolded conventional passes.

DiyouS

LGTM

DiyouS requested changes May 22, 2026

Comment thread hardware/src/tcdm_cache_interco.sv Outdated

Comment thread hardware/src/tcdm_cache_interco.sv Outdated

Comment thread hardware/src/cachepool_cc.sv

Comment thread hardware/tb/cachepool_cluster_wrapper.sv Outdated

Aquaticfuller added 21 commits May 27, 2026 23:30

[RTL] Add folded data bank plumbing

3446bba

- wire DataPartSplit/folded params through cluster/group/tile
- implement skewed folded data SRAM mapping in cachepool_tile
- adjust cluster wrapper tb for the new configuration

[RTL] prioritize writes in skewed folded bank selection

9c4d0f2

Select a single part per column/bank per cycle and prioritize write parts over read parts to avoid clobbering bank signals.

[SW] Add cache mix smoke and pressure tests for scalar and vector cac…

d420210

…he access interleaving.

[RTL] plumb hash-way folded cache integration

6e65cc1

Bump insitu-cache to the folded/hash-way revision, thread UseHashWaySelect through cluster/tile, and queue Spatz memory responses through the local response FIFO instead of bypassing write acks.

[RTL] Bump insitu cache dep.

df0b966

[RTL] Add Spatz<->TCDM id-indexed req/rsp scoreboard for debug

ca7a9b0

cachepool_cc: per-port sb_q[user.req_id] slot table for out-of-order rsp matching; watchdog dumps stuck ids. Gated by parameter (default off, +define+ENABLE_SPATZ_REQ_SCOREBOARD to enable).

[Bender] bump insitu-cache to zexin/sync-flush-fixes

33f3914

[VERIF] cc/interco: add plusarg-gated write-ack + addr-watch probes

8dc21a5

Aquaticfuller force-pushed the rebase/cache-refactoring-onto-main branch from 37bee71 to 8dc21a5 Compare May 27, 2026 21:42

Aquaticfuller added 6 commits May 28, 2026 12:08

[VERIF] cc: demote benign EOC write-ack FIFO tail to info

ce53511

[VERIF] cc/tile: guard debug probes with ifndef TARGET_SYNTHESIS

b43f2b4

[Bender] bump insitu-cache lock to tcdm_wrapper comb-loop fix (2710920)

824a6ea

[VERIF] cc/tile/interco: wrap long lines + verible waivers for debug …

3f7f8d8

…probes

DiyouS requested changes Jun 1, 2026

Aquaticfuller added 2 commits June 1, 2026 17:16

DiyouS approved these changes Jun 2, 2026

Aquaticfuller merged commit f5c3ef4 into main Jun 2, 2026
9 of 10 checks passed

Aquaticfuller merged 29 commits into main from rebase/cache-refactoring-onto-main

2 participants
