Description
Problem
When a compute kernel has multiple sequential compute operations that read from different CB sources, the order of inputs passed to init_sfpu affects correctness. This was discovered while implementing the three-compute pipeline test in test_two_computes.py.
Background
init_sfpu(icb, ocb) configures the unpacker based on the first input CB. The SFPU unpacker cannot be configured for a second operand - it assumes all input CBs have the same underlying data type. If subsequent copy_tile operations use different input CBs, copy_tile_init(cb_index) must be called to reconfigure the unpacker.
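The contract described above can be illustrated with a small host-side toy model. This is only a sketch: `ToyUnpacker`, `DataFormat`, `initSfpu`, `copyTileInit`, and `copyTileOk` are hypothetical names invented for illustration, not tt-metal APIs, and the model captures only the documented behavior (one active unpacker configuration at a time), not the failure reported in this issue.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy host-side model, NOT the tt-metal API. All names are hypothetical.
enum class DataFormat { Float16_b, Float32 };

struct ToyUnpacker {
    std::map<uint32_t, DataFormat> cbFormat;  // assumed data format of each CB
    uint32_t configuredCb = 0;                // CB the unpacker is set up for
    bool configured = false;

    // Models init_sfpu(icb, ocb): configures from the first input CB only.
    void initSfpu(uint32_t icb) { configuredCb = icb; configured = true; }

    // Models copy_tile_init(cb): reconfigures for a different input CB.
    void copyTileInit(uint32_t cb) { configuredCb = cb; configured = true; }

    // Models copy_tile(cb, ...): true when the active configuration
    // matches the data format of the CB being read.
    bool copyTileOk(uint32_t cb) const {
        assert(cbFormat.count(cb) && "unknown CB");
        return configured && cbFormat.at(configuredCb) == cbFormat.at(cb);
    }
};
```

In this model, initializing with CB3 and then reading CB0 succeeds only if both CBs share a format or `copyTileInit(0)` is called first, which is the assumption the Background paragraph describes.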
Current Behavior
In the generated C++ for a three-stage pipeline:
// Third compute - reads from CB3 (intermediate) and CB0 (persistent input)
cb_wait_front(get_compile_time_arg_val(3), v1); // wait for CB3
init_sfpu(get_compile_time_arg_val(3), get_compile_time_arg_val(4)); // init with CB3
tile_regs_acquire();
copy_tile_init(get_compile_time_arg_val(3));
copy_tile(get_compile_time_arg_val(3), v3, v3); // CB3 -> DST[0]
copy_tile_init(get_compile_time_arg_val(0));
copy_tile(get_compile_time_arg_val(0), v3, v2); // CB0 -> DST[1] - may fail!

When init_sfpu is called with CB3 (just written by the previous compute) and we then try to read from CB0 (persistent from the beginning), the unpacker state may not be correct for CB0, even with copy_tile_init.
Current Workaround
The MLIR input order matters. Placing the persistent CB first in the ttl.compute inputs works:
// Works: persistent CB first
%result = ttl.compute ins(%a_ready, %r1_cb : ...) // CB0 first, then CB3
// Fails: intermediate CB first
%result = ttl.compute ins(%r1_cb, %a_ready : ...) // CB3 first, then CB0

With the persistent CB first, init_sfpu(CB0, CB4) is generated, which properly configures the unpacker for the persistent input.
Proposed Fix
The compiler should be more robust about CB configuration:
- Option A: Always call init_sfpu with the first persistent/reader CB, not the first input CB
- Option B: Ensure the unpacker is properly reconfigured when switching between CB sources that may have different initialization states
- Option C: Document this ordering requirement if it's intentional hardware behavior
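Option A could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the compiler's actual code: `InputCb`, its `isPersistent` flag, and `pickInitCb` are hypothetical, and how the lowering pass would actually learn that a CB is persistent is left open.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical description of one compute input; not a type from the compiler.
struct InputCb {
    uint32_t index;     // CB index, e.g. 0 or 3
    bool isPersistent;  // assumed flag: filled by the reader and kept alive
};

// Option A: prefer the first persistent/reader CB as the init_sfpu input,
// falling back to the first input when no persistent CB is present.
inline uint32_t pickInitCb(const std::vector<InputCb>& inputs) {
    assert(!inputs.empty() && "a compute op needs at least one input");
    for (const InputCb& in : inputs) {
        if (in.isPersistent) {
            return in.index;
        }
    }
    return inputs.front().index;
}
```

With this selection, both `ins(%a_ready, %r1_cb)` and `ins(%r1_cb, %a_ready)` would initialize from CB0, so the result would no longer depend on the MLIR input order.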
References
- DeepWiki tt-metal: init_sfpu configures unpacker for first input only
- Test: test/me2e/compute/test_two_computes.py::TestThreeComputePipeline
- Related files:
  - lib/Dialect/TTL/Transforms/TTLInsertTileRegsSync.cpp
  - lib/Dialect/TTL/Transforms/ConvertTTLComputeToSCF.cpp
Impact
Without the workaround, multi-stage compute pipelines that reuse persistent input CBs later in the pipeline produce incorrect results.