This page covers compiler options, print debugging, performance tools, the simulator, and examples for tt-lang kernel development.
Kernels accept compiler options that control code generation (e.g., --no-ttl-maximize-dst, --no-ttl-fpu-binary-ops). These can be passed as command-line arguments, via the @ttl.kernel decorator's options= parameter, or the TTLANG_COMPILER_OPTIONS environment variable. Command-line arguments take highest priority.
# List available options
python examples/tutorial/multicore_grid_auto.py --ttl-help
# Run a kernel with options
python examples/tutorial/multicore_grid_auto.py --no-ttl-maximize-dst
See the full compiler options reference for all decorator parameters, CompilerOptions flags with their MLIR pass mappings, environment variables, and ttlang-opt pass options.
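The precedence rule for option sources can be illustrated with a small stand-alone sketch. This is not tt-lang's implementation. The helper names flag_key and merge_options are hypothetical, and several details are assumptions: that options= and TTLANG_COMPILER_OPTIONS carry space-separated flag strings, that a --no- prefix negates a flag, that a positive form such as --ttl-maximize-dst exists, and that the decorator outranks the environment variable. The text above only states that the command line wins.

```python
import shlex


def flag_key(flag):
    """Split a flag into (name, enabled); a --no- prefix is assumed to mean 'disabled'."""
    if flag.startswith("--no-"):
        return flag[len("--no-"):], False
    return flag.lstrip("-"), True


def merge_options(env_value="", decorator_options="", argv=()):
    """Merge three option sources; sources listed later override earlier ones.

    Assumed order (lowest to highest priority): environment variable,
    decorator options, command-line arguments.
    """
    merged = {}
    sources = (shlex.split(env_value), shlex.split(decorator_options), list(argv))
    for source in sources:
        for flag in source:
            name, enabled = flag_key(flag)
            merged[name] = enabled
    return merged


resolved = merge_options(
    env_value="--no-ttl-fpu-binary-ops",        # stands in for TTLANG_COMPILER_OPTIONS
    decorator_options="--ttl-maximize-dst",     # hypothetical positive form of the flag
    argv=["--no-ttl-maximize-dst"],             # command line overrides the decorator
)
print(resolved)
```

In this sketch the command-line flag disables ttl-maximize-dst even though the decorator string enabled it, while the environment setting for ttl-fpu-binary-ops survives because no higher-priority source mentions it.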
Use print() inside kernel code to emit device debug prints. Enable at runtime with TT_METAL_DPRINT_CORES:
export TT_METAL_DPRINT_CORES=0,0 # core to capture
python my_kernel.py > output.txt 2>&1 # capture stdout and stderr in the file

@ttl.compute()
def compute():
    with inp_dfb.wait() as tile, out_dfb.reserve() as o:
        print("hello")  # auto: math thread
        print(tile)  # auto: pack thread
        result = ttl.exp(tile)
        print(_dump_dst_registers=True, label="after exp")  # auto: math thread
        o.store(result)

@ttl.datamovement()
def dm_write():
    print(out_dfb)  # CB metadata
    with out_dfb.wait() as blk:
        print(blk, num_pages=1)  # raw tensor page
        tx = ttl.copy(blk, out[0, 0])
        tx.wait()

- Prints can be extremely large and slow; redirect output to a file and use grep.
- In compute kernels, guard prints with thread="math", thread="pack", or thread="unpack" to avoid overlapping output from the three TRISC threads.
- When using multi-tile block sizes (CB shape > 1x1), prints inside the generated loop will dump all tiles in the block.
See the full print debugging reference for all supported modes (scalars, tiles, tensor pages, CB details, DST registers, thread conditioning).
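The capture-then-filter workflow described above can also be scripted. The following is a minimal sketch using only the Python standard library. The helper names run_with_dprint and matching_lines are hypothetical and not part of tt-lang; only the TT_METAL_DPRINT_CORES variable comes from the text above, and the format of the device output is not assumed.

```python
import os
import subprocess
import sys


def run_with_dprint(command, log_path, cores="0,0"):
    """Run command with TT_METAL_DPRINT_CORES set; write stdout and stderr to log_path."""
    env = dict(os.environ, TT_METAL_DPRINT_CORES=cores)
    with open(log_path, "w") as log:
        # stderr=STDOUT merges both streams into the same file.
        completed = subprocess.run(command, env=env, stdout=log, stderr=subprocess.STDOUT)
    return completed.returncode


def matching_lines(log_path, needle):
    """Return the log lines containing needle (plain substring match, like grep -F)."""
    with open(log_path, errors="replace") as log:
        return [line.rstrip("\n") for line in log if needle in line]
```

A possible use is run_with_dprint([sys.executable, "my_kernel.py"], "output.txt") followed by matching_lines("output.txt", "after exp"), assuming the label text passed to print() appears verbatim in the captured output.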
TT-Lang includes built-in performance analysis tools for profiling kernels on hardware:
- Perf Summary (TTLANG_PERF_DUMP=1): NOC traffic and per-thread wall time breakdown
- Auto-Profiling (TTLANG_AUTO_PROFILE=1): automatic per-line cycle count instrumentation
- User-Defined Signposts (TTLANG_SIGNPOST_PROFILE=1): targeted cycle counts for ttl.signpost() regions
- Perfetto Trace Server (TTLANG_PERF_SERV=1): visualize profiler data in the Perfetto UI
Performance tracing (Tracy) is enabled by default at build time. To disable it, configure with -DTTLANG_ENABLE_PERF_TRACE=OFF.
See the full performance tools reference for environment variable details, valid combinations, and sample output.
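When launching kernels from a script, the variables listed above can be set per run instead of exported globally. The sketch below assumes each tool is switched on by setting its variable to "1", as the list suggests. The mode names and the perf_env helper are hypothetical, and the sketch does not check which combinations are valid; the performance tools reference covers that.

```python
import os

# Short mode names (hypothetical) mapped to the environment variables listed above.
PERF_ENV_VARS = {
    "summary": "TTLANG_PERF_DUMP",
    "auto_profile": "TTLANG_AUTO_PROFILE",
    "signpost": "TTLANG_SIGNPOST_PROFILE",
    "perfetto": "TTLANG_PERF_SERV",
}


def perf_env(*modes, base=None):
    """Return a copy of the environment with the chosen perf variables set to "1".

    Raises KeyError for an unknown mode name. Combinations are not validated.
    """
    env = dict(os.environ if base is None else base)
    for mode in modes:
        env[PERF_ENV_VARS[mode]] = "1"
    return env


print(perf_env("summary", base={}))  # prints {'TTLANG_PERF_DUMP': '1'}
```

The returned dictionary can be passed as the env argument of subprocess.run, which leaves the calling process's own environment unchanged.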
See the Functional Simulator page for running kernels without hardware, debugging setup, and test commands.
See the examples/ and test/ directories for complete working examples, including:
- test/python/simple_add.py
- test/python/simple_fused.py
The tutorial provides step-by-step examples from single-tile to multinode kernels.