Hey all!
NVIDIA recently open-sourced their CudaTile dialect on GitHub (https://github.com/NVIDIA/cuda-tile)! This is super exciting since we can now see the guts of their cuTile DSL. Namely, I can foresee a path that lowers from the cuTile DSL into the CudaTile IR, and then lowers this kernel-specific IR into the D2M dialect (specifically `d2m.generic` for the kernels).
Considering the massive industry momentum behind NVIDIA and the plethora of example kernels in https://github.com/NVIDIA/TileGym, I think this would be an exciting path to increasing the number of supported entry points into the TT Stack.
I'm curious to hear whether this is on the roadmap / planning for TT-Forge, what the approach might look like (i.e., what would the entry point into the TT-MLIR Stack be?), and what the team thinks of the situation. It seems like a new "tile-based parallel DSL" is released every month; hopefully this project has the chance to create unity between the largest "moat" in the industry and TT.
Below I've added some interesting code snippets from the various NVIDIA projects:
```python
# Write MLIR module to file
if CUDA_TILE_DUMP_TILEIR is not None:
    try:
        from cuda.tile_internal._internal_cext import bytecode_to_mlir_text

        mlir_text = bytecode_to_mlir_text(bytecode_buf)
        if not os.path.exists(CUDA_TILE_DUMP_TILEIR):
            os.makedirs(CUDA_TILE_DUMP_TILEIR)
        base_filename = os.path.basename(func_ir.loc.filename.split(".")[0])
        path = os.path.join(
            CUDA_TILE_DUMP_TILEIR, f"{base_filename}.ln{func_ir.loc.line}.cuda_tile.mlir"
        )
        print(f"Dumping TILEIR MLIR module to file: {path}", file=sys.stderr)
        with open(path, "w") as f:
            print(mlir_text, file=f)
    except ImportError:
        print(
            "Can't print MLIR because the internal extension is missing. "
            "This is currently not a public feature.",
            file=sys.stderr,
        )
```
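For reference, the dump-path construction in that snippet can be reproduced standalone. This is just a sketch: `Loc` and `tileir_dump_path` are hypothetical stand-ins for the compiler's real source-location object and internal logic, but the filename scheme mirrors the code above.

```python
import os

# Hypothetical stand-in for the compiler's source-location object.
class Loc:
    filename = "kernels/matmul.py"
    line = 42

def tileir_dump_path(dump_dir, loc):
    # Mirrors the snippet above: strip the extension, keep the basename,
    # and encode the source line number in the dump filename.
    base = os.path.basename(loc.filename.split(".")[0])
    return os.path.join(dump_dir, f"{base}.ln{loc.line}.cuda_tile.mlir")

print(tileir_dump_path("/tmp/tileir", Loc))  # → /tmp/tileir/matmul.ln42.cuda_tile.mlir
```

So for a kernel defined at `kernels/matmul.py:42`, setting `CUDA_TILE_DUMP_TILEIR` would produce a file named `matmul.ln42.cuda_tile.mlir` in the dump directory.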
I find it interesting that the cuTile code still seems to use an internal version of what I can only assume is the CudaTile dialect in its compilation flow. This file was last updated two weeks ago; I wonder whether development between the two will unify, or whether there will be a lag between what cuTile users are exposed to and what's open in the dialect.
Their structure for data and the tiles themselves is pretty cool. The abstraction makes sense given the architecture differences, but I think it generalizes pretty well to the TT-MLIR Stack. It will be interesting to see how these would mesh together.
In general, this is pretty exciting! Thought I would probe around and ask questions in my fav open source AI compiler project 😆