FPU utilization for binary operations (add, sub, mul) #245

@brnorris03

Description

Summary

Use the FPU (Floating Point Unit / Matrix Engine) for binary operations instead of the SFPU. FPU operations are faster because they read operands directly from circular buffers through the dedicated SrcA/SrcB registers, avoiding explicit copy_tile operations into DST.

Motivation

Current limitation: Binary operations (ttl.tile_add, ttl.tile_sub, ttl.tile_mul) lower to SFPU operations, which require explicit data movement into DST registers via copy_tile.

Target: Use FPU operations that work directly with circular buffers, reducing data movement and DST register pressure.

Current (SFPU - Suboptimal)

// ❌ SFPU: Requires explicit copy to DST registers
copy_tile(cb_in0, 0, 0);      // CB → DST[0]
copy_tile(cb_in1, 0, 1);      // CB → DST[1]
add_binary_tile(0, 1, 2);     // DST[0] + DST[1] → DST[2]
// Uses 3 DST registers

Target (FPU - Optimal)

// ✅ FPU: Works directly with circular buffers
add_tiles(cb_in0, cb_in1, 0, 0, 0);  // CB[in0][0] + CB[in1][0] → DST[0]
// Uses 1 DST register

Performance Impact

Expected improvements:

  • Reduced data movement: No copy_tile operations needed
  • Lower DST register pressure: Only 1 DST register instead of 3
  • Faster execution: FPU binary ops are expected to be roughly 2x faster than the SFPU path (an estimate, to be validated on hardware)
  • Better unrolling: Lower register pressure enables higher unroll factors

Implementation

Update TTL → TTKernel Lowering

Location: lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp

Pattern: TTLTileBinaryOpLowering

Decision logic: Choose FPU vs SFPU based on operand provenance:

bool shouldUseFPU(Operation *op, Value lhs, Value rhs) {
  // Use the FPU only if both operands come directly from circular buffers.
  // isa_and_nonnull guards against block arguments, which have no defining op.
  bool lhsFromCB = isa_and_nonnull<TensorExtractOp>(lhs.getDefiningOp());
  bool rhsFromCB = isa_and_nonnull<TensorExtractOp>(rhs.getDefiningOp());
  return lhsFromCB && rhsFromCB && isFPUBinaryOp(op);
}

When to use SFPU instead:

  • Operands are intermediate DST values (not from CBs)
  • Operation not supported by FPU (e.g., div, complex math)
  • Operands require pre-processing (type conversion, layout transform)
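
The decision table above can be modeled as a tiny standalone function (a hypothetical model using a `Src` enum in place of real MLIR Values, for illustration only; the third argument stands in for isFPUBinaryOp):

```cpp
#include <cassert>

enum class Src { CB, DST };  // where an operand's tile currently lives

// Hypothetical stand-in for the lowering predicate: choose the FPU only when
// both operands come straight from circular buffers and the op is FPU-capable.
bool shouldUseFPU(Src lhs, Src rhs, bool fpuSupportsOp) {
  return lhs == Src::CB && rhs == Src::CB && fpuSupportsOp;
}
```

Any other combination falls through to the SFPU path listed above.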

FPU Binary Operations API

// CB[icb0][itile0] <op> CB[icb1][itile1] → DST[idst]
add_tiles(uint32_t icb0, uint32_t icb1, uint32_t itile0, uint32_t itile1, uint32_t idst);
sub_tiles(uint32_t icb0, uint32_t icb1, uint32_t itile0, uint32_t itile1, uint32_t idst);
mul_tiles(uint32_t icb0, uint32_t icb1, uint32_t itile0, uint32_t itile1, uint32_t idst);
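
For context, a typical tt-metal compute kernel invokes these APIs in a wait/compute/pack sequence roughly like the following. This is a non-compilable sketch: the init calls (binary_op_init_common, add_tiles_init) and the tile_regs_* synchronization helpers follow recent tt-metal versions and may differ across releases.

```cpp
// Sketch of the FPU path inside a compute kernel (tt-metal-style APIs).
binary_op_init_common(cb_in0, cb_in1, cb_out);
add_tiles_init(cb_in0, cb_in1);

cb_wait_front(cb_in0, 1);            // wait for one input tile per CB
cb_wait_front(cb_in1, 1);
cb_reserve_back(cb_out, 1);

tile_regs_acquire();                 // math thread acquires DST
add_tiles(cb_in0, cb_in1, 0, 0, 0);  // CB[in0][0] + CB[in1][0] → DST[0]
tile_regs_commit();

tile_regs_wait();                    // pack thread waits on DST
pack_tile(0, cb_out);
tile_regs_release();

cb_pop_front(cb_in0, 1);
cb_pop_front(cb_in1, 1);
cb_push_back(cb_out, 1);
```

Note that no copy_tile appears anywhere: the FPU consumes the CB tiles directly and only the result occupies DST.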

Update DST Allocation

Location: lib/Dialect/TTL/Transforms/TTLAssignDST.cpp

  • Recognize FPU operations use fewer DST registers
  • Adjust register pressure calculation for unroll factor computation
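
As a back-of-the-envelope model of the second bullet (assuming a DST half-bank of 8 tile slots, which is typical under half-sync mode but configuration-dependent), the achievable unroll factor is bounded by available slots divided by registers consumed per iteration:

```cpp
#include <algorithm>

// Hypothetical pressure model: unroll factor is bounded by available DST tile
// slots over DST registers consumed per unrolled iteration.
constexpr int maxUnroll(int dstSlots, int regsPerIter) {
  return std::max(1, dstSlots / regsPerIter);
}
// SFPU path: 3 regs/iter → maxUnroll(8, 3) == 2
// FPU  path: 1 reg/iter  → maxUnroll(8, 1) == 8
```

Under these assumptions the FPU path quadruples the feasible unroll factor, which is the "Better unrolling" benefit claimed above.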

Testing

  • Unit test: Binary ops with CB operands → FPU operations
  • Unit test: Binary ops with DST operands → SFPU operations (fallback)
  • Integration test: Compare generated C++ (FPU vs SFPU)
  • Performance test: Measure speedup on real hardware
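
A lit test for the first unit-test bullet might look like the sketch below. The ttl-opt tool name, pass flag, and ttkernel op spellings are assumptions inferred from the pass location above, not confirmed names.

```mlir
// RUN: ttl-opt --convert-ttl-to-ttkernel %s | FileCheck %s

// Binary add with both operands extracted straight from CBs should lower to
// the FPU op and emit no copy_tile.
// CHECK-LABEL: func @add_from_cbs
// CHECK:       ttkernel.add_tiles
// CHECK-NOT:   ttkernel.copy_tile
```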

Dependencies

References

  • Design doc: plans/DST_unrolling.md (Future Work: FPU Utilization section)
  • tt-metal eltwise binary example: METALIUM_GUIDE.md
  • FPU compute APIs: docs/source/tt-metalium/tt_metal/apis/kernel_apis/compute/compute.rst

Parent Issue

Part of #243 (DST/Loop optimizations)

Labels

enhancement (New feature or request)
