---
title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-07-31
layout: update
tags:
  - cuda
  - sycl
  - oneapi
  - porting
---

## Introduction

The rapid advancement of LLMs can be attributed to their ability to effectively tackle complex problems, such as those
encountered in chatbots, virtual assistants, content generation, and language translation. Their performance, which
rivals human capabilities on many language tasks, places LLMs at the forefront of AI models.

Classical general-purpose graph frameworks such as PyTorch and TensorFlow cover a very wide range of machine
learning domains, including image and video classification, semantic segmentation, object detection, and natural
language processing for general-purpose language generation, through several neural network (NN) architectures such as
convolutional neural networks, recurrent neural networks, and various Transformer-based architectures for
generative AI.

While such all-encompassing frameworks can cover almost all training and inference aspects of the AI models in use
today, in some scenarios a particular inference-only NN architecture is required for specific targets such as edge
devices or systems without a network connection. Such targets may have hardware limitations, e.g. only a single GPU or a
single CPU with limited memory and cache sizes and restricted operating system support, so developers may struggle to
use these frameworks.
| 29 | + |
| 30 | +With the popularity of large language models, there are several lightweight frameworks, such as Meta’s llama models, |
| 31 | +llama.cpp, and vllm are provided to target only transformer-based architectures for inference models. Among |
| 32 | +them, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp is a C++-based open source library</a> that can be used |
| 33 | +with the llama model amongst others. This is written using pure C/C++ and that enables LLM inference with minimal |
| 34 | +dependency to any third party libraries, while providing a state-of-the-art performance on a wide variety of local and |
| 35 | +cloud based hardware. |

[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on
devices with limited resources, such as laptops or desktop PCs with GPUs. The C++-based implementation makes llama.cpp
highly performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core
of llama.cpp is quantization. llama.cpp uses custom quantization types that drastically reduce model sizes, which in
turn enables models to run on devices with limited memory. The challenge is to find a quantization
scheme that limits precision loss without causing hallucinations in the output; hence, much of the effort of tuning
the models goes into finding the right quantization parameters, and the code performs several custom matrix
multiplication operations to reduce precision loss for its custom quantization schemes.
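
To make the idea concrete, here is a minimal, hypothetical sketch of block quantization in the spirit of llama.cpp’s
4-bit formats: each block of 32 weights shares a single scale, and each weight is stored as a 4-bit integer. This is not
llama.cpp’s actual code or exact format, only an illustration of where the memory savings come from.

```cpp
// Minimal sketch of 4-bit block quantization, for illustration only. The names
// and layout here are hypothetical; llama.cpp's real Q4_K format is more
// elaborate (super-blocks, per-block minima, half-precision scales).
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

struct BlockQ4 {
    float scale;                           // one scale shared by 32 weights
    std::array<uint8_t, kBlockSize / 2> q; // two 4-bit values packed per byte
};

BlockQ4 quantize_block(const float* x) {
    // Derive the scale from the largest magnitude in the block.
    float amax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    BlockQ4 b{};
    b.scale = amax / 7.0f;                 // map [-amax, amax] onto [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlockSize; i += 2) {
        // Shift signed [-7, 7] codes into the unsigned 4-bit range, with 8 meaning zero.
        auto lo = static_cast<uint8_t>(std::lround(x[i]     * inv) + 8);
        auto hi = static_cast<uint8_t>(std::lround(x[i + 1] * inv) + 8);
        b.q[i / 2] = static_cast<uint8_t>(lo | (hi << 4));
    }
    return b;
}

float dequantize(const BlockQ4& b, int i) {
    const uint8_t byte = b.q[i / 2];
    const int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
    return static_cast<float>(q - 8) * b.scale;
}
```

With this layout, 32 float32 weights (128 bytes) shrink to a 4-byte scale plus 16 bytes of packed nibbles, roughly a 6x
reduction; the real formats go further, for example by storing the scale in half precision.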

## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)

This article will now describe how to migrate the existing llama.cpp CUDA backend to
SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can
then be run both on an NVIDIA system and on a system with Intel Data Center GPU Max devices, demonstrating truly
portable, single-source code.

Spoiler alert: we don’t really need to do this migration, because llama.cpp already has a SYCL backend upstream, thanks
to the work of the Intel and Codeplay teams. The work started with a SYCLomatic conversion back in December 2023. The
feedback from that conversion led to a lot of improvements in SYCLomatic. The SYCL upstream support is now maintained by
Codeplay and Intel on both NVIDIA and Intel GPUs.

A key benefit of SYCLomatic is that it is a whole-project migration tool. This means it does not focus on migrating
individual kernels or files, but instead migrates the entire project, giving you a starting
point for your SYCL multi-target application.
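
As a rough, hand-written illustration of the kind of rewrite SYCLomatic automates (this is not actual SYCLomatic output,
which also pulls in its dpct helper headers and preserves the project structure), a simple CUDA element-wise kernel and
its launch map onto a SYCL parallel_for along these lines:

```cpp
// Hand-written sketch of a CUDA-to-SYCL mapping, not actual SYCLomatic output.
//
// Original CUDA version, shown for reference:
//   __global__ void scale_kernel(float* x, float alpha, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) x[i] *= alpha;
//   }
//   scale_kernel<<<(n + 255) / 256, 256>>>(d_x, alpha, n);

#include <sycl/sycl.hpp>

void scale(sycl::queue& q, float* x, float alpha, int n) {
    // The nd_range mirrors the CUDA grid/block decomposition:
    // 256 work-items per work-group, enough groups to cover n elements.
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    q.parallel_for(
         sycl::nd_range<1>(sycl::range<1>(grid * block), sycl::range<1>(block)),
         [=](sycl::nd_item<1> item) {
             const int i = static_cast<int>(item.get_global_id(0));
             if (i < n) x[i] *= alpha; // x is expected to be a USM device/shared allocation
         })
        .wait();
}
```

The resulting single-source SYCL code can then be compiled for NVIDIA or Intel GPUs by selecting the appropriate DPC++
target, which is exactly what we will do in part two.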

## Preparation

For this exercise, I am going to use two distinct machines: my local desktop PC with an integrated NVIDIA GPU, and a
remote system with an Intel Data Center GPU Max series 1110.

I have installed the latest CUDA toolkit on both systems, as well as the Intel oneAPI Base Toolkit version 2024.2.

Remember to set your environment variables so that all the tools we are going to use are in your path (replace the
first path with the location of your Intel oneAPI Base Toolkit installation):

```shell
$ cd /path/to/intel/oneAPI/Toolkit
$ . setvars.sh ~/intel/oneapi
$ dpct --version
Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
```

Before we can run our model, we have to download it. There are many models supported
by llama.cpp, and the list keeps growing! In this example we are going to download the Llama 2 7B chat model, already
quantized in GGUF format to save some steps, so you can just wget it from your prompt. In this case, I have opted for
creating a models directory in my home folder.

```shell
$ mkdir $HOME/models/ ; cd $HOME/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```

On your NVIDIA system, you need a local copy of oneMKL for NVIDIA GPUs. This is currently not available as a
download, so you must build it as follows:

```shell
$ git clone https://github.com/oneapi-src/oneMKL.git
$ cd oneMKL/; mkdir build; cd build
$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
$ ninja install
```

This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs them in the soft/mkl
directory within your home folder.
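
The reason we need this library is that the SYCL code relies on oneMKL GEMM routines for its large matrix
multiplications, and on NVIDIA hardware those calls are routed to cuBLAS through the backend we just built. As a rough,
illustrative sketch (not actual llama.cpp code), such a call looks like this:

```cpp
// Illustrative sketch of a single-precision GEMM through the oneMKL interfaces;
// not code taken from llama.cpp. On the NVIDIA system this call is served by
// the cuBLAS backend built and installed above.
#include <cstdint>
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

// Computes C = A * B with A (m x k), B (k x n) and C (m x n), all column-major
// and stored in USM memory accessible from the given queue.
void sgemm(sycl::queue& q, std::int64_t m, std::int64_t n, std::int64_t k,
           const float* a, const float* b, float* c) {
    namespace blas = oneapi::mkl::blas::column_major;
    blas::gemm(q,
               oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
               m, n, k,
               1.0f,   // alpha
               a, m,   // A and its leading dimension
               b, k,   // B and its leading dimension
               0.0f,   // beta
               c, m)   // C and its leading dimension
        .wait();
}
```

On the Intel system, the same call is served by the oneMKL that ships with the oneAPI Base Toolkit, which is why no
extra build step is needed there.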

## Steps for the conversion

The first step is to clone the llama.cpp repository and configure CMake as usual for NVIDIA GPUs, as shown below.

```shell
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

In this example we are using an earlier version of the llama.cpp repository, close to the one we used for the initial
port. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight
out of the box with SYCLomatic.

Now, here is the first change: prepend “intercept-build” to the make command you would normally run, as below:

```shell
$ intercept-build make
```

intercept-build is a really useful tool, distributed with SYCLomatic, that collects all the compilation commands issued
while building into a compilation database (a compile_commands.json file) that SYCLomatic can then use to generate new
build system files to compile the SYCL version of the application.

Now we are going to use the information collected by intercept-build to generate a SYCL version of the project in a new
directory by running the dpct command itself:

```shell
$ cd ../.. && mkdir dpct_out
```

```shell
$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
```

When using the `-p` option, dpct finds the compilation database and uses it to convert all project files. In this
case, we have also enabled profiling (which adds profiling information to the generated SYCL code), and we opt in
to all experimental features (more on this later). We are also migrating the build script to CMake and telling dpct to
process all files.

## Next Part

Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on
NVIDIA and Intel GPUs.

[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)