---
title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-07-31
layout: update
tags:
  - cuda
  - sycl
  - oneapi
  - porting
---

## Introduction

The rapid advancement of LLMs can be attributed to their ability to effectively tackle complex problems, such as those
encountered in chatbots, virtual assistants, content generation, and language translation. Their performance, which
rivals human capabilities on many language tasks, places LLMs at the forefront of AI models.

Classical general-purpose graph frameworks such as PyTorch and TensorFlow cover a very wide range of machine
learning domains, including image and video classification, semantic segmentation, object detection, and natural
language processing for general-purpose language generation, through several neural network (NN) architectures such as
convolutional neural networks, recurrent neural networks, and various Transformer-based architectures for
generative AI.

While such all-encompassing frameworks can cover almost all training and inference aspects of the AI models in use
today, in some scenarios a particular inference-only NN architecture is required for specific targets such as edge
devices or systems without a network connection. Such targets may have hardware limitations, e.g. only a single GPU or a
single CPU with limited memory and cache sizes and restricted operating system support, so developers may struggle to
use these frameworks.
| 29 | + |
| 30 | +With the popularity of large language models, there are several lightweight frameworks, such as Meta’s llama models, |
| 31 | +llama.cpp, and vllm are provided to target only transformer-based architectures for inference models. Among |
| 32 | +them, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp is a C++-based open source library</a> that can be used |
| 33 | +with the llama model amongst others. This is written using pure C/C++ and that enables LLM inference with minimal |
| 34 | +dependency to any third party libraries, while providing a state-of-the-art performance on a wide variety of local and |
| 35 | +cloud based hardware. |

[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on
devices with limited resources, such as laptops or desktop PCs with GPUs. The C++-based implementation makes llama.cpp
highly performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core
of llama.cpp is quantization. llama.cpp uses custom quantization types that drastically reduce model sizes, which in
turn enables models to run on devices with limited memory. The challenge is to find a quantization
scheme that limits precision loss without causing hallucinations in the output; hence, much of the effort of tuning
the models goes into finding the right quantization parameters, and the code performs several custom matrix
multiplication operations to reduce precision loss for its custom quantization schemes.
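
To make the idea concrete, here is a minimal, hypothetical sketch of block quantization in the spirit of llama.cpp’s
4-bit formats: each block of 32 weights shares a single scale, and each weight is stored as a 4-bit integer. This is not
llama.cpp’s actual code or exact format, only an illustration of where the memory savings come from.

```cpp
// Minimal sketch of 4-bit block quantization, for illustration only. The names
// and layout here are hypothetical; llama.cpp's real Q4_K format is more
// elaborate (super-blocks, per-block minima, half-precision scales).
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                           // one scale shared by 32 weights
    std::array<uint8_t, kBlockSize / 2> q; // two 4-bit values packed per byte
};

BlockQ4 quantize_block(const float* x) {
    // Derive the scale from the largest magnitude in the block.
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    BlockQ4 b{};
    b.scale = amax / 7.0f;                 // map [-amax, amax] onto [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; i += 2) {
        // Shift signed [-7, 7] codes into the unsigned 4-bit range, with 8 meaning zero.
        auto lo = static_cast<uint8_t>(std::lround(x[i]     * inv) + 8);
        auto hi = static_cast<uint8_t>(std::lround(x[i + 1] * inv) + 8);
        b.q[i / 2] = static_cast<uint8_t>(lo | (hi << 4));
    }
    return b;
}

float dequantize(const BlockQ4& b, int i) {
    const uint8_t byte = b.q[i / 2];
    const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return static_cast<float>(q - 8) * b.scale;
}
```

With this layout, 32 float32 weights (128 bytes) shrink to a 4-byte scale plus 16 bytes of packed nibbles, roughly a 6x
reduction; the real formats go further, for example by storing the scale in half precision.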

## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)

This article will now describe how to migrate the existing llama.cpp CUDA backend to
SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can
then be run both on an NVIDIA system and on a system with Intel Data Center GPU Max devices, demonstrating truly
portable, single-source code.

Spoiler alert: we don’t really need to do this migration, because llama.cpp already has a SYCL backend upstream, thanks
to the work of the Intel and Codeplay teams. The work started with a SYCLomatic conversion back in December 2023. The
feedback from that conversion led to a lot of improvements in SYCLomatic. The SYCL upstream support is now maintained by
Codeplay and Intel on both NVIDIA and Intel GPUs.

A key benefit of SYCLomatic is that it is a whole-project migration tool. This means it does not focus on migrating
individual kernels or files, but instead migrates the entire project, giving you a starting
point for your SYCL multi-target application.
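
As a rough, hand-written illustration of the kind of rewrite SYCLomatic automates (this is not actual SYCLomatic output,
which also pulls in its dpct helper headers and preserves the project structure), a simple CUDA element-wise kernel and
its launch map onto a SYCL parallel_for along these lines:

```cpp
// Hand-written sketch of a CUDA-to-SYCL mapping, not actual SYCLomatic output.
//
// Original CUDA version, shown for reference:
//   __global__ void scale_kernel(float* x, float alpha, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) x[i] *= alpha;
//   }
//   scale_kernel<<<(n + 255) / 256, 256>>>(d_x, alpha, n);

#include <sycl/sycl.hpp>

void scale(sycl::queue& q, float* x, float alpha, int n) {
    // The nd_range mirrors the CUDA grid/block decomposition:
    // 256 work-items per work-group, enough groups to cover n elements.
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    q.parallel_for(
         sycl::nd_range<1>(sycl::range<1>(grid * block), sycl::range<1>(block)),
         [=](sycl::nd_item<1> item) {
             const int i = static_cast<int>(item.get_global_id(0));
             if (i < n) x[i] *= alpha; // x is expected to be a USM device/shared allocation
         })
        .wait();
}
```

The resulting single-source SYCL code can then be compiled for NVIDIA or Intel GPUs by selecting the appropriate DPC++
target, which is exactly what we will do in part two.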

## Preparation

For this exercise, I am going to use two distinct machines: my local desktop PC with an integrated NVIDIA GPU, and a
remote system with an Intel Data Center GPU Max series 1110.

I have installed the latest CUDA toolkit on both systems, as well as the Intel oneAPI Base Toolkit version 2024.2.

Remember to set your environment variables so that all the tools we are going to use are in your path (replace the
first path with the location of your Intel oneAPI Base Toolkit installation):

```shell
$ cd /path/to/intel/oneAPI/Toolkit
$ . setvars.sh ~/intel/oneapi
$ dpct --version
Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
```

Before we can run our model, we have to download it. There are many models supported
by llama.cpp, and the list keeps growing! In this example we are going to download the Llama 2 7B chat model, already
quantized in GGUF format to save some steps, so you can just wget it from your prompt. In this case, I have opted for
creating a models directory in my home folder.

```shell
$ mkdir $HOME/models/ ; cd $HOME/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```

On your NVIDIA system, you need a local copy of oneMKL for NVIDIA GPUs. This is currently not available as a
download, so you must build it as follows:

```shell
$ git clone https://github.com/oneapi-src/oneMKL.git
$ cd oneMKL/; mkdir build; cd build
$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
$ ninja install
```

This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs them in the soft/mkl
directory within your home folder.
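
The reason we need this library is that the SYCL code relies on oneMKL GEMM routines for its large matrix
multiplications, and on NVIDIA hardware those calls are routed to cuBLAS through the backend we just built. As a rough,
illustrative sketch (not actual llama.cpp code), such a call looks like this:

```cpp
// Illustrative sketch of a single-precision GEMM through the oneMKL interfaces;
// not code taken from llama.cpp. On the NVIDIA system this call is served by
// the cuBLAS backend built and installed above.
#include <cstdint>
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

// Computes C = A * B with A (m x k), B (k x n) and C (m x n), all column-major
// and stored in USM memory accessible from the given queue.
void sgemm(sycl::queue& q, std::int64_t m, std::int64_t n, std::int64_t k,
           const float* a, const float* b, float* c) {
    namespace blas = oneapi::mkl::blas::column_major;
    blas::gemm(q,
               oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
               m, n, k,
               1.0f,   // alpha
               a, m,   // A and its leading dimension
               b, k,   // B and its leading dimension
               0.0f,   // beta
               c, m)   // C and its leading dimension
        .wait();
}
```

On the Intel system, the same call is served by the oneMKL that ships with the oneAPI Base Toolkit, which is why no
extra build step is needed there.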

## Steps for the conversion

The first step is to clone the llama.cpp repository and configure CMake as usual for NVIDIA GPUs, as shown below.

```shell
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

In this example we are using an earlier version of the llama.cpp repository, close to the one we used for the initial
port. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight
out of the box with SYCLomatic.

Now, here is the first change: prepend “intercept-build” to the make command you would normally run, as below:

```shell
$ intercept-build make
```

intercept-build is a really useful tool, distributed with SYCLomatic, that collects all the compilation commands issued
while building into a compilation database (a compile_commands.json file) that SYCLomatic can then use to generate new
build system files to compile the SYCL version of the application.

Now we are going to use the information collected by intercept-build to generate a SYCL version of the project in a new
directory by running the dpct command itself:

```shell
$ cd ../.. && mkdir dpct_out
```

```shell
$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
```

When using the `-p` option, dpct finds the compilation database and uses it to convert all project files. In this
case, we have also enabled profiling (which adds profiling information to the generated SYCL code), and we opt in
to all experimental features (more on this later). We are also migrating the build script to CMake and telling dpct to
process all files.

## Next Part

Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on
NVIDIA and Intel GPUs.

[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)