README > CUTLASS 3: Building with SYCL support
SYCL[1] is a royalty-free, cross-platform abstraction layer that enables code for heterogeneous and offload processors to be written with modern ISO C++, and provides API and abstractions to find devices and manage resources for GPUs.
The CUTLASS-SYCL supports running on Intel GPUs. Currently, Intel Data Center Max 1550 and 1100 (a.k.a Ponte Vecchio - PVC) along with Intel Arc B580 (a.k.a BattleMage - BMG) are supported.
The examples directory shows a number of GEMM algorithms and examples of
CUTLASS-SYCL running on PVC and BMG, including flash attention V2.
Only Linux platforms are supported.
To build CUTLASS SYCL support for Intel GPUs, you need the DPC++ compiler; you can use the latest open source [nightly build] or a oneAPI toolkit from 2025.1 onwards. Intel Compute Runtime 25.13 (with Intel Graphics Compiler 2.10.10) is required. At the time of the release it can be installed from intel-graphics-staging. Installation from intel-graphics is recommended when it is available there.
Building the tests and the examples requires oneMKL for random number generation.
The following instructions show how to use the nightly build to build the cutlass examples
Download oneAPI basekit
$ . /opt/intel/oneapi/setvars.sh
$ CC=icx CXX=icpx cmake .. -G Ninja \
-DCUTLASS_ENABLE_SYCL=ON \
-DDPCPP_SYCL_TARGET=intel_gpu_pvc \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_CXX_FLAGS="-ftemplate-backtrace-limit=0 -fdiagnostics-color=always"
# Download the nightly of DPCPP compiler
$ wget https://github.com/intel/llvm/releases/tag/nightly-2025-01-31
# Setup the environment variables
$ export PATH_TO_DPCPP=/path/to/your/dpcpp/installation
$ export PATH=${PATH_TO_DPCPP}/bin/:$PATH
$ export LD_LIBRARY_PATH=${PATH_TO_DPCPP}/lib/:$LD_LIBRARY_PATH
$ export RPATH=${PATH_TO_DPCPP}/lib/:$RPATH
# Create the build directory and configure CMake
# mkdir build_intel; cd build_intel
$ CC=clang CXX=clang++ cmake .. -G Ninja \
-DCUTLASS_ENABLE_SYCL=ON \
-DDPCPP_SYCL_TARGET=intel_gpu_pvc \
-DCMAKE_EXPORT_COMPILE_COMMANDS=ON \
-DCMAKE_CXX_FLAGS="-ftemplate-backtrace-limit=0 -fdiagnostics-color=always"
CMake will check that DPC++ compiler is available in the system, and it will download the MKL library if it cannot find it. To get better performance result we require the following combinations of the environment variables flags to provide better performance hints for generating optimised code. For ahead of time (AOT) compilation, the following options have to be set during compilation:
export SYCL_PROGRAM_COMPILE_OPTIONS="-ze-opt-large-register-file"
export IGC_VISAOptions="-perfmodel"
export IGC_VectorAliasBBThreshold=10000
export IGC_ExtraOCLOptions="-cl-intel-256-GRF-per-thread"
To build and run a simple PVC/BMG gemm example run the commands below.
$ ninja 00_bmg_gemm
$ ./examples/00_bmg_gemm/00_bmg_gemm
Disposition: Passed
Problem Size: 5120x4096x4096x1
Cutlass GEMM Performance: [247.159]TFlop/s (0.6951)ms
The SYCL backend supports compilation for NVIDIA GPUs using the oneAPI NVIDIA plugin. This support is only for testing and validation purposes and not intended for production.
To build CUTLASS SYCL support you need the latest version of DPC++ compiler. You can either use a recent nightly build or build the compiler from source as described in oneAPI DPC++ guideline.
Once you have your compiler installed, you need to point the
CMAKE_CUDA_HOST_COMPILER flag to the clang++ provided by it.
This enables the compilation of SYCL sources without altering the current NVCC path. For example, to build SYCL support for SM 80
GPUs, you can use the following command:
cmake -G Ninja \
-DCUTLASS_ENABLE_SYCL=ON \
-DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda \
-DDPCPP_SYCL_ARCH=sm_80
Currently, you can build the CuTe Tutorial using the following command:
ninja [EXAMPLE_NAME]_sycl
You can run it like this from your build directory
LD_LIBRARY_PATH=/path/to/sycl/install/lib ./examples/cute/tutorial/[EXAMPLE_NAME]_sycl
Currently, the example 14_amper_tf32_tensorop_gemm has been implemented for SYCL on Nvidia Ampere architecture. You can build this from your build directory by running :
ninja 14_ampere_tf32_tensorop_gemm_cute
You can run it like this from your build directory
LD_LIBRARY_PATH=/path/to/sycl/install/lib ./examples/14_ampere_tf32_tensorop_gemm/14_ampere_tf32_tensorop_gemm_cute