FPU utilization for binary operations (add, sub, mul)

## Summary

Lower binary operations (add, sub, mul) to the FPU (Floating Point Unit / Matrix Engine) instead of the SFPU. FPU operations are expected to be faster because they read their operands directly from circular buffers through the dedicated SrcA/SrcB registers, so the explicit `copy_tile` operations that the SFPU path needs are avoided.

## Motivation

**Current limitation**: Binary operations (`ttl.tile_add`, `ttl.tile_sub`, `ttl.tile_mul`) currently lower to **SFPU** operations, which require explicit data movement to DST registers.

**Target**: Use **FPU** operations that work directly with circular buffers, reducing data movement and DST register pressure.

### Current (SFPU - Suboptimal)

```cpp
// ❌ SFPU: Requires explicit copy to DST registers
copy_tile(cb_in0, 0, 0);      // CB → DST[0]
copy_tile(cb_in1, 0, 1);      // CB → DST[1]
add_binary_tile(0, 1, 2);     // DST[0] + DST[1] → DST[2]
// Uses 3 DST registers
```

### Target (FPU - Optimal)

```cpp
// ✅ FPU: Works directly with circular buffers
add_tiles(cb_in0, cb_in1, 0, 0, 0);  // CB[in0][0] + CB[in1][0] → DST[0]
// Uses 1 DST register
```
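To make the DST-usage difference between the two snippets concrete, here is a small host-side mock. It is only an illustration: `MockDst`, `mockSfpuAdd`, and `mockFpuAdd` are hypothetical names, a single `float` stands in for a whole tile, and nothing here calls the real tt-metal compute API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <set>

// Host-side mock only: none of these types or functions are tt-metal APIs.
struct MockDst {
  std::array<float, 16> slot{};  // one float stands in for a whole tile
  std::set<uint32_t> touched;    // DST indices written so far

  void write(uint32_t index, float value) {
    assert(index < slot.size() && "DST index out of range");
    slot[index] = value;
    touched.insert(index);
  }
};

// SFPU-style add: both operands are staged in DST, result goes to a third slot.
inline float mockSfpuAdd(MockDst &dst, float in0, float in1) {
  dst.write(0, in0);                        // models copy_tile(cb_in0, 0, 0)
  dst.write(1, in1);                        // models copy_tile(cb_in1, 0, 1)
  dst.write(2, dst.slot[0] + dst.slot[1]);  // models add_binary_tile(0, 1, 2)
  return dst.slot[2];
}

// FPU-style add: operands come from the CBs, only the result lands in DST.
inline float mockFpuAdd(MockDst &dst, float in0, float in1) {
  dst.write(0, in0 + in1);                  // models add_tiles(cb_in0, cb_in1, 0, 0, 0)
  return dst.slot[0];
}
```

Running both mocks on the same inputs gives the same result, but `touched.size()` is 3 for the SFPU-style path and 1 for the FPU-style path, which is the register-pressure difference this issue is about.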

## Performance Impact

**Expected improvements**:
- Reduced data movement: No `copy_tile` operations needed
- Lower DST register pressure: Only 1 DST register instead of 3
- Faster execution: FPU operations ~2x faster than SFPU for binary ops
- Better unrolling: Lower register pressure enables higher unroll factors

## Implementation

### Update TTL → TTKernel Lowering

**Location**: `lib/Dialect/TTL/Transforms/ConvertTTLToTTKernel.cpp`

**Pattern**: `TTLTileBinaryOpLowering`

**Decision logic**: Choose FPU vs SFPU based on operand provenance:

```cpp
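// `op` is the binary op being lowered; `lhs` and `rhs` are its operands.
bool shouldUseFPU(Operation *op, Value lhs, Value rhs) {
  // Use FPU only if both operands come directly from circular buffers.
  // isa_and_nonnull handles block arguments, whose defining op is null.
  bool lhsFromCB = llvm::isa_and_nonnull<TensorExtractOp>(lhs.getDefiningOp());
  bool rhsFromCB = llvm::isa_and_nonnull<TensorExtractOp>(rhs.getDefiningOp());
  return lhsFromCB && rhsFromCB && isFPUBinaryOp(op);
}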
```

**When to use SFPU instead**:
- Operands are intermediate DST values (not from CBs)
- Operation not supported by FPU (e.g., div, complex math)
- Operands require pre-processing (type conversion, layout transform)
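The FPU/SFPU rules above can be sketched as a standalone function, independent of MLIR. This is a minimal model under stated assumptions: `BinaryKind`, `Provenance`, `Operand`, `Engine`, and `selectEngine` are hypothetical names introduced for illustration, not types from the TTL dialect or the actual lowering pass.

```cpp
#include <cassert>

// Hypothetical stand-ins for the compiler's real types.
enum class BinaryKind { Add, Sub, Mul, Div };
enum class Provenance { CircularBuffer, DstIntermediate };
enum class Engine { FPU, SFPU };

struct Operand {
  Provenance provenance;
  bool needsPreprocessing;  // e.g. type conversion or layout transform
};

// add, sub and mul are the FPU binary ops this issue targets.
inline bool isFpuBinaryKind(BinaryKind kind) {
  return kind == BinaryKind::Add || kind == BinaryKind::Sub ||
         kind == BinaryKind::Mul;
}

// An operand can feed the FPU only if it comes straight from a CB.
inline bool isFpuEligible(const Operand &operand) {
  return operand.provenance == Provenance::CircularBuffer &&
         !operand.needsPreprocessing;
}

// FPU when the op is supported and both operands qualify; SFPU otherwise.
inline Engine selectEngine(BinaryKind kind, const Operand &lhs,
                           const Operand &rhs) {
  bool useFpu = isFpuBinaryKind(kind) && isFpuEligible(lhs) && isFpuEligible(rhs);
  return useFpu ? Engine::FPU : Engine::SFPU;
}

// Usage: one FPU case and the three fallback cases from the list above.
inline void exampleUsage() {
  Operand fromCb{Provenance::CircularBuffer, false};
  Operand fromDst{Provenance::DstIntermediate, false};
  Operand needsCast{Provenance::CircularBuffer, true};
  assert(selectEngine(BinaryKind::Add, fromCb, fromCb) == Engine::FPU);
  assert(selectEngine(BinaryKind::Add, fromCb, fromDst) == Engine::SFPU);
  assert(selectEngine(BinaryKind::Div, fromCb, fromCb) == Engine::SFPU);
  assert(selectEngine(BinaryKind::Mul, needsCast, fromCb) == Engine::SFPU);
}
```

Defaulting to SFPU whenever any condition fails keeps the existing lowering as the safe fallback, so the FPU path is purely an opt-in optimization.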

### FPU Binary Operations API

```cpp
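// CB[cb_in0][idx0] + CB[cb_in1][idx1] → DST[odst]
// idx0/idx1 are tile indices within each circular buffer; odst is the DST register index.
add_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);
sub_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);
mul_tiles(uint32_t cb_in0, uint32_t cb_in1, uint32_t idx0, uint32_t idx1, uint32_t odst);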
```

### Update DST Allocation

**Location**: `lib/Dialect/TTL/Transforms/TTLAssignDST.cpp`

- Recognize that FPU operations use fewer DST registers (1 instead of 3 per binary op)
- Adjust register pressure calculation for unroll factor computation
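As a rough sketch of how the per-op DST cost could feed the unroll factor: the function below divides the available DST slots by the slots one loop iteration needs. The names `maxUnrollFactor`, `kSfpuSlotsPerBinaryOp`, and `kFpuSlotsPerBinaryOp` are hypothetical, the slot counts come from the examples in this issue, and the DST capacity is left as a parameter because the actual value used by `TTLAssignDST.cpp` is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// DST slots one binary op occupies, taken from the examples in this issue.
constexpr uint32_t kSfpuSlotsPerBinaryOp = 3;  // two staged operands + result
constexpr uint32_t kFpuSlotsPerBinaryOp = 1;   // result only

// Largest unroll factor whose iterations all fit in DST at once.
// dstCapacity is a parameter because the real value depends on the
// hardware configuration and is not specified in this issue.
inline uint32_t maxUnrollFactor(uint32_t dstCapacity,
                                uint32_t slotsPerIteration) {
  assert(slotsPerIteration > 0 && "each iteration must use at least one DST slot");
  return dstCapacity / slotsPerIteration;  // integer division rounds down
}
```

For an assumed capacity of 8 slots, `maxUnrollFactor(8, kSfpuSlotsPerBinaryOp)` is 2 while `maxUnrollFactor(8, kFpuSlotsPerBinaryOp)` is 8, which illustrates why lower register pressure allows higher unroll factors.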

## Testing

- Unit test: Binary ops with CB operands → FPU operations
- Unit test: Binary ops with DST operands → SFPU operations (fallback)
- Integration test: Compare generated C++ (FPU vs SFPU)
- Performance test: Measure speedup on real hardware

## Dependencies

- TTKernel support for FPU binary operations: https://github.com/tenstorrent/tt-mlir/pull/6540 ✅ Merged

## References

- Design doc: `plans/DST_unrolling.md` (Future Work: FPU Utilization section)
- tt-metal eltwise binary example: `METALIUM_GUIDE.md`
- FPU compute APIs: `docs/source/tt-metalium/tt_metal/apis/kernel_apis/compute/compute.rst`

## Parent Issue

Part of #243 (DST/Loop optimizations)
