Description
Summary
Use the FPU (Floating Point Unit / Matrix Engine) for binary operations instead of SFPU. FPU operations are faster because they work directly with circular buffers through dedicated SrcA/SrcB registers, avoiding explicit copy_tile operations.
Motivation
Current limitation: Binary operations (ttl.tile_add, ttl.tile_sub, ttl.tile_mul) currently lower to SFPU operations, which require explicit data movement to DST registers.
Target: Use FPU operations that work directly with circular buffers, reducing data movement and DST register pressure.
Current (SFPU - Suboptimal)
// ❌ SFPU: Requires explicit copy to DST registers
copy_tile(cb_in0, 0, 0); // CB → DST[0]
copy_tile(cb_in1, 0, 1); // CB → DST[1]
add_binary_tile(0, 1, 2); // DST[0] + DST[1] → DST[2]
// Uses 3 DST registers
Target (FPU - Optimal)
// ✅ FPU: Works directly with circular buffers
add_tiles(cb_in0, cb_in1, 0, 0, 0); // CB[in0][0] + CB[in1][0] → DST[0]
// Uses 1 DST register
Performance Impact
Expected improvements:
- Reduced data movement: No copy_tile operations needed
- Lower DST register pressure: Only 1 DST register instead of 3
- Faster execution: FPU operations ~2x faster than SFPU for binary ops
- Better unrolling: Lower register pressure enables higher unroll factors
Implementation
Update TTL → TTKernel Lowering
Location: lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp
Pattern: TTLTileBinaryOpLowering
Decision logic: Choose FPU vs SFPU based on operand provenance:
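bool shouldUseFPU(Operation *op, Value lhs, Value rhs) {
  // Use FPU only if both operands come directly from circular buffers.
  // getDefiningOp() returns null for block arguments, so use the null-safe isa.
  bool lhsFromCB = llvm::isa_and_nonnull<TensorExtractOp>(lhs.getDefiningOp());
  bool rhsFromCB = llvm::isa_and_nonnull<TensorExtractOp>(rhs.getDefiningOp());
  // op is the binary tile op being lowered; it must also have an FPU variant.
  return lhsFromCB && rhsFromCB && isFPUBinaryOp(op);
}
When to use SFPU instead: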
- Operands are intermediate DST values (not from CBs)
- Operation not supported by FPU (e.g., div, complex math)
- Operands require pre-processing (type conversion, layout transform)
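The selection rules above can be modeled outside the compiler as a small truth table. This is only a sketch: `Engine`, `OperandSource`, `BinaryKind`, `hasFpuVariant`, and `selectEngine` are hypothetical names for illustration and are not part of TTL or TTKernel. It assumes, per this issue, that only add, sub, and mul have FPU tile operations.

```cpp
// Hypothetical stand-ins for the compiler's real types (illustration only).
enum class Engine { FPU, SFPU };
enum class OperandSource { CircularBuffer, DstRegister };
enum class BinaryKind { Add, Sub, Mul, Div };

// Assumption from this issue: only add, sub and mul have FPU tile ops.
constexpr bool hasFpuVariant(BinaryKind kind) {
  return kind == BinaryKind::Add || kind == BinaryKind::Sub ||
         kind == BinaryKind::Mul;
}

// FPU only when both operands come from circular buffers, the op has an FPU
// variant, and no operand needs a type conversion or layout transform first.
// Everything else falls back to SFPU.
constexpr Engine selectEngine(BinaryKind kind, OperandSource lhs,
                              OperandSource rhs,
                              bool needsPreprocessing = false) {
  return (lhs == OperandSource::CircularBuffer &&
          rhs == OperandSource::CircularBuffer && hasFpuVariant(kind) &&
          !needsPreprocessing)
             ? Engine::FPU
             : Engine::SFPU;
}

// Compile-time usage examples.
static_assert(selectEngine(BinaryKind::Add, OperandSource::CircularBuffer,
                           OperandSource::CircularBuffer) == Engine::FPU,
              "CB + CB add takes the FPU path");
static_assert(selectEngine(BinaryKind::Add, OperandSource::CircularBuffer,
                           OperandSource::DstRegister) == Engine::SFPU,
              "an intermediate DST operand forces the SFPU fallback");
```

Defaulting to SFPU keeps the fallback conservative: any case the rules do not explicitly allow keeps the existing lowering.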
FPU Binary Operations API
// CB[in0][idx0] + CB[in1][idx1] → DST[odst]
// idx0 and idx1 are tile indices within the input CBs; odst is the DST index.
add_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);
sub_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);
mul_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);
Update DST Allocation
Location: lib/Dialect/TTL/Transforms/TTLAssignDST.cpp
- Recognize FPU operations use fewer DST registers
- Adjust register pressure calculation for unroll factor computation
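To illustrate how the per-operation register count feeds into the unroll factor, here is a minimal sketch. The constants come from the two snippets in this issue (3 DST registers for the SFPU sequence, 1 for the FPU call). `maxUnrollFactor` is a hypothetical helper, not the actual logic in TTLAssignDST.cpp, and the DST capacity of 8 tiles in the usage lines is an assumed value chosen only for the arithmetic.

```cpp
#include <cstdint>

// Per-iteration DST usage, taken from the SFPU and FPU snippets in this issue.
constexpr uint32_t kSfpuBinaryDstRegs = 3;  // two copied inputs + one result
constexpr uint32_t kFpuBinaryDstRegs = 1;   // result only

// Hypothetical helper: the largest unroll factor whose tiles all fit in DST
// at once. Returns 0 if a single iteration does not fit or uses no registers.
constexpr uint32_t maxUnrollFactor(uint32_t dstCapacity,
                                   uint32_t regsPerIteration) {
  return regsPerIteration == 0 ? 0 : dstCapacity / regsPerIteration;
}

// With an assumed DST capacity of 8 tiles:
static_assert(maxUnrollFactor(8, kSfpuBinaryDstRegs) == 2, "8 / 3 rounds down to 2");
static_assert(maxUnrollFactor(8, kFpuBinaryDstRegs) == 8, "8 / 1 = 8");
```

Under that assumed capacity, switching a binary op from the SFPU sequence to the FPU call raises the attainable unroll factor from 2 to 8. The real pass would also need to account for other values live in DST at the same time.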
Testing
- Unit test: Binary ops with CB operands → FPU operations
- Unit test: Binary ops with DST operands → SFPU operations (fallback)
- Integration test: Compare generated C++ (FPU vs SFPU)
- Performance test: Measure speedup on real hardware
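For the integration test that compares generated C++, one possible check is a plain text scan of the emitted kernel source. The sketch below is an assumption about how such a check could look: `countOccurrences` and `looksLikeFpuLowering` are hypothetical helpers, and a FileCheck-style test may fit the project's existing test setup better.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts non-overlapping occurrences of `needle` in `text`.
inline std::size_t countOccurrences(const std::string& text,
                                    const std::string& needle) {
  assert(!needle.empty() && "needle must be non-empty");
  std::size_t count = 0;
  for (std::size_t pos = text.find(needle); pos != std::string::npos;
       pos = text.find(needle, pos + needle.size())) {
    ++count;
  }
  return count;
}

// Hypothetical check: the FPU path emits at least one add/sub/mul_tiles call
// and no copy_tile call. The trailing "(" keeps *_init calls from matching.
inline bool looksLikeFpuLowering(const std::string& kernelSource) {
  const std::size_t fpuCalls = countOccurrences(kernelSource, "add_tiles(") +
                               countOccurrences(kernelSource, "sub_tiles(") +
                               countOccurrences(kernelSource, "mul_tiles(");
  return fpuCalls > 0 && countOccurrences(kernelSource, "copy_tile(") == 0;
}
```

Run against the two snippets from the Motivation section, the FPU form passes this check and the SFPU form (two `copy_tile` calls followed by `add_binary_tile`) does not.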
Dependencies
- TTKernel support for FPU binary operations: [d2m] enable FPU lowering for binary add, sub, mul tt-mlir#6540 ✅ Merged
References
- Design doc: plans/DST_unrolling.md (Future Work: FPU Utilization section)
- tt-metal eltwise binary example: METALIUM_GUIDE.md
- FPU compute APIs: docs/source/tt-metalium/tt_metal/apis/kernel_apis/compute/compute.rst
Parent Issue
Part of #243 (DST/Loop optimizations)