Ensure copy_tile_init is called before each copy_tile from different CB sources #274

@brnorris03

Problem

When a compute kernel has multiple sequential compute operations that read from different circular buffer (CB) sources, the order of the compute inputs affects correctness, because it determines which CB is passed to init_sfpu. This was discovered while implementing the three-compute pipeline test in test_two_computes.py.

Background

init_sfpu(icb, ocb) configures the unpacker based on the first input CB. The SFPU unpacker cannot be configured for a second operand; it assumes all input CBs have the same underlying data type. If subsequent copy_tile operations read from a different input CB, copy_tile_init(cb_index) must be called first to reconfigure the unpacker for that CB.
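To make the rule above concrete, here is a toy host-side model. It is not the tt-metal API: `ToyUnpacker`, `DataFormat`, and `read_cb3_then_cb0` are hypothetical names, and the assumption that CB0 and CB3 hold different data formats is made up for illustration. It models only the rule stated in this section (one active unpacker configuration, replaced by each init call) and does not attempt to reproduce real hardware behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model only: these types mimic the rule described above and are
// NOT part of tt-metal.
enum class DataFormat { Float32, BFloat16 };

struct ToyUnpacker {
    std::map<uint32_t, DataFormat> cb_formats;    // assumed format of each CB
    DataFormat configured = DataFormat::Float32;  // single active configuration

    DataFormat format_of(uint32_t cb) const {
        auto it = cb_formats.find(cb);
        assert(it != cb_formats.end() && "unknown CB index");
        return it->second;
    }

    // Models init_sfpu(icb, ocb): configures for the input CB only.
    void init_sfpu(uint32_t icb, uint32_t /*ocb*/) { configured = format_of(icb); }

    // Models copy_tile_init(cb): reconfigures for a new source CB.
    void copy_tile_init(uint32_t cb) { configured = format_of(cb); }

    // Models copy_tile: the read is valid only if the active configuration
    // matches the source CB's format.
    bool copy_tile(uint32_t cb) const { return configured == format_of(cb); }
};

// Reads CB3 and then CB0, optionally reconfiguring between the two reads.
// Returns true only if both reads saw a matching configuration.
bool read_cb3_then_cb0(bool reinit_between) {
    ToyUnpacker u;
    // Assumption for illustration: the two CBs use different formats.
    u.cb_formats = {{0, DataFormat::Float32}, {3, DataFormat::BFloat16}};
    u.init_sfpu(3, 4);
    bool first_ok = u.copy_tile(3);
    if (reinit_between) u.copy_tile_init(0);
    bool second_ok = u.copy_tile(0);
    return first_ok && second_ok;
}
```

In this model, `read_cb3_then_cb0(true)` succeeds and `read_cb3_then_cb0(false)` does not, which is the behavior the paragraph above describes.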

Current Behavior

In the generated C++ for a three-stage pipeline:

// Third compute - reads from CB3 (intermediate) and CB0 (persistent input)
cb_wait_front(get_compile_time_arg_val(3), v1);  // wait for CB3
init_sfpu(get_compile_time_arg_val(3), get_compile_time_arg_val(4));  // init with CB3
tile_regs_acquire();
copy_tile_init(get_compile_time_arg_val(3));
copy_tile(get_compile_time_arg_val(3), v3, v3);  // CB3 -> DST[0]
copy_tile_init(get_compile_time_arg_val(0));
copy_tile(get_compile_time_arg_val(0), v3, v2);  // CB0 -> DST[1] - may fail!

Here init_sfpu is called with CB3 (just written by the previous compute stage), and the kernel then reads from CB0 (persistent since the start of the pipeline). The unpacker state may not be correct for CB0, even though copy_tile_init is called for CB0 before the read.

Current Workaround

The MLIR input order matters. Placing the persistent CB first in the ttl.compute inputs works:

// Works: persistent CB first
%result = ttl.compute ins(%a_ready, %r1_cb : ...)  // CB0 first, then CB3

// Fails: intermediate CB first  
%result = ttl.compute ins(%r1_cb, %a_ready : ...)  // CB3 first, then CB0

With the persistent CB first, the compiler generates init_sfpu(CB0, CB4), which properly configures the unpacker for the persistent input.

Proposed Fix

The compiler should be more robust about CB configuration:

  1. Option A: Always call init_sfpu with the first persistent/reader CB, not the first input CB
  2. Option B: Ensure the unpacker is properly reconfigured when switching between CB sources that may have different initialization states
  3. Option C: Document this ordering requirement if it's intentional hardware behavior
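Options A and B could be sketched as below. This is a standalone illustration, not compiler code: the real passes operate on MLIR values, and `InputCb`, its `persistent` flag, `pick_init_sfpu_cb`, and `emit_copies` are all hypothetical names. The Option B sketch also assumes that copy_tile_init is sufficient to reconfigure the unpacker, which this issue leaves open.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one compute input.
struct InputCb {
    uint32_t index;
    bool persistent;  // assumption: true if the CB is held across stages
};

// Option A sketch: pass the first persistent CB to init_sfpu, falling back
// to the first input when no input is persistent.
uint32_t pick_init_sfpu_cb(const std::vector<InputCb>& inputs) {
    assert(!inputs.empty() && "compute needs at least one input CB");
    for (const InputCb& in : inputs) {
        if (in.persistent) return in.index;
    }
    return inputs.front().index;
}

// Option B sketch: track the CB the unpacker is configured for and emit
// copy_tile_init whenever the next copy_tile reads from a different CB.
std::vector<std::string> emit_copies(uint32_t init_cb,
                                     const std::vector<uint32_t>& sources) {
    std::vector<std::string> ops;
    uint32_t configured = init_cb;
    for (uint32_t cb : sources) {
        if (cb != configured) {
            ops.push_back("copy_tile_init(" + std::to_string(cb) + ")");
            configured = cb;
        }
        ops.push_back("copy_tile(" + std::to_string(cb) + ")");
    }
    return ops;
}
```

For the failing input order (CB3 first, then persistent CB0), `pick_init_sfpu_cb` returns 0, matching the init_sfpu(CB0, CB4) call that the workaround produces.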

References

  • DeepWiki tt-metal: init_sfpu configures unpacker for first input only
  • Test: test/me2e/compute/test_two_computes.py::TestThreeComputePipeline
  • Related files:
    • lib/Dialect/TTL/Transforms/TTLInsertTileRegsSync.cpp
    • lib/Dialect/TTL/Transforms/ConvertTTLComputeToSCF.cpp

Impact

Without the workaround, multi-stage compute pipelines that reuse persistent input CBs later in the pipeline produce incorrect results.
