
Commit 921fbfb

Merge pull request #30 from codeplaysoftware/add-llama-updates
Add llama updates
2 parents 2b97ae7 + 855a409 commit 921fbfb

File tree

3 files changed (+263 -0 lines changed)

Lines changed: 149 additions & 0 deletions

---
title: "Part One - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-07-31
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---

## Introduction

The rapid advancement of LLMs can be attributed to their ability to tackle complex problems effectively, such as those encountered in chatbots, virtual assistants, content generation, and language translation. Their performance, which rivals human capabilities on many language tasks, places LLMs at the forefront of AI models.

Classical general-purpose graph frameworks like PyTorch and TensorFlow cover a very wide range of machine learning domains, such as image and video classification, semantic segmentation, object detection, and natural language processing for general-purpose language generation. They do so through several neural network (NN) architectures, such as convolutional neural networks, recurrent neural networks, and various types of Transformer-based architectures for generative AI.

While such all-encompassing frameworks can cover almost all training and inference aspects of the AI models in use today, some scenarios require an inference-only NN architecture for specific devices, such as edge computing or systems without a network connection. Such devices may have hardware limitations, e.g. a single GPU or a single CPU with limited memory and cache sizes and restricted operating system support, so developers may struggle to use these frameworks on them.

With the popularity of large language models, several lightweight frameworks, such as Meta's llama models, llama.cpp, and vLLM, have been created to target only transformer-based architectures for inference. Among them, <a href="https://github.com/ggerganov/llama.cpp">llama.cpp is a C++-based open source library</a> that can be used with the llama model, amongst others. It is written in pure C/C++, which enables LLM inference with minimal dependencies on third-party libraries while providing state-of-the-art performance on a wide variety of local and cloud-based hardware.

[llama.cpp](https://github.com/ggerganov/llama.cpp) is designed to run large language models efficiently on devices with limited resources, such as laptops or desktop PCs with GPUs. The C++-based implementation makes llama.cpp highly performant and portable, ideal for scenarios where computational power and memory are at a premium. At the core of llama.cpp is quantization. llama.cpp uses custom quantization types that drastically reduce model sizes, which in turn enables models to run on devices with limited memory. The challenge is to find the right quantization scheme, one that limits precision loss without causing hallucinations in the output; hence a lot of model-tuning effort goes into finding the right quantization parameters, and the code performs several custom matrix multiplication operations to reduce precision loss on the custom quantization schemes.
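
To make the idea concrete, here is a minimal, hand-written sketch of block-wise 4-bit quantization. It is illustrative only and not llama.cpp's actual Q4_K implementation (the real formats store additional per-block metadata), but it shows how a per-block scale lets 32-bit weights shrink to roughly 4 bits each:

```cpp
#include <cmath>
#include <cstdint>

// Toy block quantization: one fp32 scale per 32 weights, each weight packed
// into a 4-bit value. Real llama.cpp quantization types are more elaborate.
struct BlockQ4 {
    float scale;        // per-block scale factor
    uint8_t packed[16]; // 32 x 4-bit values, two per byte
};

BlockQ4 quantize_block(const float* w /* 32 weights */) {
    BlockQ4 out{};
    float max_abs = 0.0f;
    for (int i = 0; i < 32; ++i) max_abs = std::fmax(max_abs, std::fabs(w[i]));
    out.scale = max_abs / 7.0f; // map [-max_abs, max_abs] onto [-7, 7]
    for (int i = 0; i < 32; ++i) {
        int q = out.scale > 0.0f ? static_cast<int>(std::round(w[i] / out.scale)) : 0;
        uint8_t u = static_cast<uint8_t>(q + 8); // bias into the 4-bit range
        out.packed[i / 2] |= (i % 2 == 0) ? (u & 0x0F) : (u << 4);
    }
    return out;
}

float dequantize(const BlockQ4& b, int i) {
    uint8_t u = (i % 2 == 0) ? (b.packed[i / 2] & 0x0F) : (b.packed[i / 2] >> 4);
    return (static_cast<int>(u) - 8) * b.scale; // approximate original weight
}
```
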
## [SYCLomatic](https://github.com/oneapi-src/SYCLomatic)

This article will now describe how to migrate the existing llama.cpp CUDA backend to SYCL [using the SYCLomatic open source tool](https://github.com/oneapi-src/SYCLomatic). The migrated code can then be run on an NVIDIA system and on another system with Intel Data Center Max GPUs - demonstrating truly portable, single-source code.

Spoiler alert: we don't really need to do this migration - llama.cpp already has SYCL support upstream, thanks to the work of the Intel and Codeplay teams. That work started with a SYCLomatic conversion back in December 2023, and the feedback from that conversion led to a lot of improvements in SYCLomatic. The SYCL upstream support is now maintained by Codeplay and Intel on both NVIDIA and Intel GPUs.

A key benefit of SYCLomatic is that it is a whole-project migration tool. This means it does not focus on migrating individual kernels or files, but instead migrates the entire project, giving you a starting point for your SYCL multi-target application.
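
To give a feel for the kind of transformation involved, below is a hand-written sketch of a trivial CUDA kernel and a SYCL equivalent. This is illustrative only: SYCLomatic's real output keeps the original grid/block structure and relies on its dpct helper headers, but the overall shape of the change is similar.

```cpp
#include <sycl/sycl.hpp>

// Original CUDA (for reference):
//   __global__ void scale(float* x, float a, int n) {
//     int i = blockIdx.x * blockDim.x + threadIdx.x;
//     if (i < n) x[i] *= a;
//   }
//   scale<<<grid, block>>>(d_x, a, n);

// A SYCL version of the same operation, submitted to a queue instead of
// launched with the <<<...>>> syntax.
void scale(sycl::queue& q, float* x, float a, int n) {
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        x[i] *= a; // x must be device-accessible (e.g. USM) memory
    }).wait();
}

int main() {
    sycl::queue q{sycl::default_selector_v};
    const int n = 1024;
    float* x = sycl::malloc_shared<float>(n, q); // shared USM allocation
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    scale(q, x, 2.0f, n);
    sycl::free(x, q);
}
```
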
## Preparation

For this exercise, I am going to use two distinct machines: my local desktop PC with an integrated NVIDIA GPU, and a remote system with an Intel Data Center GPU Max series 1110.

I have installed the latest CUDA toolkit on both systems, as well as the Intel oneAPI Base Toolkit version 2024.2.

Remember to set your environment variables so that all the tools we are going to use are in your path (replace the first path below with the location of your Intel oneAPI Base Toolkit):

```shell
$ cd /path/to/intel/oneAPI/Toolkit
$ . setvars.sh ~/intel/oneapi
$ dpct --version
Intel(R) DPC++ Compatibility Tool version 2024.2.0. Codebase:(55a3f034030e4bd0f36d7c37f24f8366079a639b). clang version 19.0.0
```

Before we can run our model, we have to download it. There are many models supported by llama.cpp, and the list keeps growing! In this example we are going to download the Llama 2 7B model, already quantized in 'gguf' format to save some steps, so you can just wget it from your prompt. In this case, I have opted for creating a models directory in my home folder.

```shell
$ mkdir $HOME/models/ ; cd $HOME/models/
$ wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf
```

On your NVIDIA system, you need to have a local copy of oneMKL for NVIDIA GPUs; this is currently not available as a download, so you must build it as follows:

```shell
$ git clone https://github.com/oneapi-src/oneMKL.git
$ cd oneMKL/; mkdir build; cd build
$ cmake ../ -GNinja -DCMAKE_CXX_COMPILER=icpx -DCMAKE_C_COMPILER=icx -DENABLE_MKLGPU_BACKEND=False -DENABLE_MKLCPU_BACKEND=False -DENABLE_CUFFT_BACKEND=True -DENABLE_CUBLAS_BACKEND=True -DENABLE_CUSOLVER_BACKEND=True -DENABLE_CURAND_BACKEND=True -DBUILD_FUNCTIONAL_TESTS=False -DCMAKE_INSTALL_PREFIX=${HOME}/soft/mkl/
$ ninja install
```

This builds the [oneMKL interfaces for NVIDIA](https://github.com/oneapi-src/oneMKL) and installs them in the soft/mkl directory within your home folder.
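
The reason the migrated code needs these interfaces is that BLAS-style operations such as GEMM are routed through oneMKL, which dispatches to the cuBLAS backend we just enabled on NVIDIA hardware and to the MKL GPU backend on Intel hardware. As a rough, illustrative sketch (not llama.cpp's actual code), such a call looks like this:

```cpp
#include <cstdint>
#include <oneapi/mkl.hpp>
#include <sycl/sycl.hpp>

// Illustrative single-precision GEMM through the oneMKL interfaces:
// C (m x n) = A (m x k) * B (k x n), all column-major, with USM pointers.
// The backend selected at build time performs the actual work.
void sgemm_example(sycl::queue& q, std::int64_t m, std::int64_t n, std::int64_t k,
                   const float* a, const float* b, float* c) {
    namespace blas = oneapi::mkl::blas::column_major;
    blas::gemm(q,
               oneapi::mkl::transpose::nontrans, oneapi::mkl::transpose::nontrans,
               m, n, k,
               1.0f, a, m,  // alpha, A, leading dimension of A
               b, k,        // B, leading dimension of B
               0.0f, c, m)  // beta, C, leading dimension of C
        .wait();
}
```
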
## Steps for the conversion

The first step is to clone the llama.cpp repository and configure CMake as usual for NVIDIA GPUs, as shown below.

```shell
$ git clone https://github.com/ggerganov/llama.cpp.git
$ cd llama.cpp
$ git checkout 3c04bf6da89eaf4c7d317e0518f0687dfcbf2de7
$ mkdir build && cd build
$ cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

In this example we are using an earlier version of the llama.cpp repository, closer to the one we used to do the initial porting. The llama.cpp project moves really fast, and some of the latest versions of the project may not work straight out of the box with SYCLomatic.

Now, here is the first change: prepend `intercept-build` to the make command you would normally run, as below:

```shell
$ intercept-build make
```

intercept-build is a really useful tool, distributed with SYCLomatic, that collects all the compilation commands issued during the build into a YAML file. SYCLomatic can then use this file to generate new build system files to compile the SYCL version of your application.

Now we are going to use the information collected by intercept-build to generate a SYCL build directory by running the dpct command itself:

```shell
$ cd ../.. && mkdir dpct_out
```

```shell
$ dpct -p ./llama.cpp/build --enable-profiling --use-experimental-features=all --in-root=./llama.cpp --out-root=./dpct_out --migrate-build-script=CMake --process-all
```

When using the `-p` option, dpct finds the compilation database and uses it to convert all project files. In this case, we have also enabled profiling (which adds profiling information to the generated SYCL code), and we have opted in to all experimental features (more on this later). We are also migrating the build script to CMake and telling the tool to process all files.

## Next Part

Now, we have successfully converted our llama.cpp project from CUDA to SYCL. In part two, we will build and run this on NVIDIA and Intel GPUs.

[Click here to view part two.](/updates/2024/08/13/part-two-porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time)
Lines changed: 96 additions & 0 deletions

---
title: "Part Two - Porting AI codes from CUDA to SYCL and oneAPI, one llama at a time"
date: 2024-08-13
layout: update
tags:
- cuda
- sycl
- oneapi
- porting
---

## Prelude

[In our first part](/updates/2024/07/31/porting-ai-codes-from-cuda-to-sycl-and-oneapi-one-llama-at-a-time-part-one) we looked at the conversion from CUDA to SYCL via the whole-project migration tool, SYCLomatic. Now we are going to take this portable code and run it across an NVIDIA and an Intel GPU.

## Building on the NVIDIA system

Now we are going to build the converted code directly using the CMake file that SYCLomatic has created for us, and then build the main binary for llama.cpp.

```shell
$ cd dpct_out && mkdir syclbuild && cd syclbuild
$ MKLROOT=/home/ruyman/soft/mkl CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80 -DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda -L${MKLROOT}/lib"
$ make main
```

Note that we are now building not with the CUDA compiler but with the Intel SYCL compiler, so we pass the CC and CXX variables accordingly. We also manually pass the target triple (`-fsycl-targets=nvptx64-nvidia-cuda`), which tells the SYCL compiler to generate code for NVIDIA CUDA architectures (using PTX). We can now run our model using the following command:

```shell
$ ONEAPI_DEVICE_SELECTOR=cuda:gpu ./bin/main -m ../../models/llama-2-7b-chat.Q4_K_M.gguf -ngl 128 --no-mmap
```

The environment variable `ONEAPI_DEVICE_SELECTOR` allows users to override the default selection mechanism of the SYCL queue in favour of a user-defined setting. The default selection in this case would use OpenCL for the CPU, which won't work because we explicitly built for NVIDIA GPUs.
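
If you want to see what the runtime would pick, a small stand-alone SYCL program (a sketch for illustration, not part of llama.cpp) can print the default device and everything else that is visible; running it with different `ONEAPI_DEVICE_SELECTOR` values shows the selection changing without recompiling:

```cpp
#include <iostream>
#include <sycl/sycl.hpp>

// Print the device a default-constructed queue would use, plus all
// devices the SYCL runtime can see on this system.
int main() {
    sycl::queue q{sycl::default_selector_v};
    std::cout << "Default queue device: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";

    for (const auto& dev : sycl::device::get_devices()) {
        std::cout << "  available: "
                  << dev.get_info<sycl::info::device::name>() << " ("
                  << dev.get_platform().get_info<sycl::info::platform::name>()
                  << ")\n";
    }
}
```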

The conversion out of the box won't be fast, as it won't be using the most optimized path for NVIDIA. But it is a good starting point that allows you to try your SYCL code on the existing environment before moving to a new machine with an Intel GPU, and you can also re-use your CI infrastructure to test the SYCL path.

## Running on an Intel GPU system

To prove we now have a truly portable application, let's take this code, build it, and run it for an Intel GPU.

Log onto your system with the Intel Data Center Max GPU and repeat the cloning and building-for-CUDA steps, so that you can run intercept-build on the new system, or copy over the DPCT-generated project. Now, let's configure and build for Intel GPUs, using the original CMake flags we used to convert the project.

```shell
$ CC=icx CXX=icpx cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=80
```

Yes, you still use the CUBLAS and CUDA CMake flags: the user-visible CMake flags don't change, but the internal logic in the CMake file generated by SYCLomatic handles finding the paths to the Intel oneAPI Base Toolkit dependencies. Once it is configured, you can build the main binary:

```shell
$ make main
```

This builds llama.cpp for the default target – Intel GPUs (using SPIR-V binaries). To run llama on your Intel GPU, just use the Level Zero GPU backend, as shown below:

```shell
$ ONEAPI_DEVICE_SELECTOR=level_zero:gpu ./bin/main -m ../../llama-2-7b-chat.Q4_K_M.gguf --no-mmap -ngl 128
```

Now this is the same application running on an Intel GPU with no user intervention! That means all the heavy lifting is done by the tool, and you can focus on optimization and refactoring of the generated code.

## Conclusions

In this article we have shown a practical use case of converting a CUDA C++ AI application to SYCL, and a popular one at that! The conversion works straight out of the box, with no code changes needed. The SYCLomatic tool is there to assist you with porting applications from CUDA to SYCL, so it gives you useful warning messages and introduces code that you can later replace with code that better suits your application.

We have also shown that the same code works, without any modification, on two completely different GPUs, NVIDIA and Intel, with the potential for others through the use of open-standard SYCL. Although llama.cpp already has a CUDA backend, having the SYCL backend run on both platforms means we can re-use CI infrastructure for testing and run the application on a wider set of platforms with fewer code changes.

The current SYCL backend in upstream llama.cpp started as a DPCT conversion, not too dissimilar to the one we just did in this article. Developers have been working on the SYCL backend to improve performance on a wide variety of platforms (NVIDIA, AMD, and Intel GPUs on client and datacenter, and others including RISC-V), but we still re-use some of the original code that SYCLomatic generated for us. That original conversion saved several engineering months in getting something up and running, and allowed us to focus on the important parts of the project: performance and code quality.

If you want help porting a CUDA application to SYCL, or have questions about anything in this article, reach out to us.

static/css/styled.scss

Lines changed: 18 additions & 0 deletions

```diff
@@ -811,6 +811,24 @@ body {
     width: 100%;
     height: auto;
   }
+
+  code {
+    padding: .1rem .2rem;
+    background-color: #d0d0d0;
+    display: inline-block;
+    border-radius: 6px;
+  }
+
+  pre code {
+    display: block;
+    max-width: 100%;
+    word-break: break-word;
+    white-space: break-spaces;
+    background-color: var(--hint-color);
+    color: white;
+    padding: 1rem;
+    border-radius: 12px;
+  }
   }
 }
```