LLMs
====

-Large-language models are AI models that can understand and generate
-text, primarily using transformer architectures. This page is about
-running them on a HPC cluster. This requires
-programming experience and knowledge of using the cluster
-(:ref:`tutorials`), but allows maximum computational power for the
-least cost. :doc:`Aalto RSE </rse/index>` maintains these models and
-can provide help with using these, even to users who aren't
-computational experts.
+Large-language models (LLMs) are AI models that can understand and generate
+text, primarily using transformer architectures. They are extensively used for tools and
+tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.

-Because the size of model weights are typically very large and the interest in the
-models is high, so we provide our users with pre-downloaded model weights in various formats, along with instructions on how to load these weights for inference purposes, retraining, and fine-tuning tasks. We also provide a dedicated python environment (run ``module load scicomp-llm-env`` to activate it) that has many commonly used python libraries installed for you to test the models quickly.
+This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to
+get familiar with the basics of using the cluster, including running jobs and using Python (:ref:`tutorials`).

+.. note::
+
+   If at any point something doesn't work, or you are unsure how to get started or how to proceed, do not hesitate to contact :doc:`the Aalto RSEs </rse/index>`.
+
+   You can visit us at :ref:`the daily Zoom help session at 13.00-14.00 <garage>`.
+

HuggingFace Models
~~~~~~~~~~~~~~~~~~~
-The simplest way to use an open-source LLM(Large Language Model) is through the tools and pre-trained models hub from huggingface.
-Huggingface is a popular platform for NLP(Natural Language Processing) tasks. It provides a user-friendly interface through the transformers library to load and run various pre-trained models.
-Most open-source models from Huggingface are widely supported and integrated with the transformers library.
-We are keeping our eyes on the latest models and have downloaded some of them for you. If you need any other models, please contact us.

-Run command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of all the available models.
+The simplest way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their `🤗 Transformers Python library <https://huggingface.co/docs/transformers/en/index>`__.

+HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and use these models in your projects.

-To access Huggingface models:
+You can explore their offerings at `🤗 HuggingFace <https://huggingface.co/>`__.

-.. tabs::
+.. note::

-   .. group-tab:: slurm/shell script
+   We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact :doc:`the Aalto RSEs </rse/index>`.

-      Load the module for huggingface models and setup environment variables:
+   Run the command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of all the available models.
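+
+   For example, to check whether a particular model family (here Mistral, purely as an illustration) is already cached, you can filter the listing:
+
+   .. code-block:: bash
+
+      ls /scratch/shareddata/dldata/huggingface-hub-cache/hub | grep -i mistral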

-      .. code-block:: bash
-
-         # this will set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
-         module load model-huggingface/all
+Below is an example of how to use the 🤗 Transformers `pipeline() <https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline>`__ to load a pre-trained model and use it for question answering.

-         # this will force transformer to load model(s) from local hub instead of download and load model(s) from remote hub.
-         export TRANSFORMERS_OFFLINE=1
-         export HF_HUB_OFFLINE=1

-         python your_script.py
+Example: Question Answering
+---------------------------

-   .. group-tab:: jupyter notebook
+In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.
+
+.. code-block:: bash
+
+   #!/bin/bash
+   #SBATCH --time=00:30:00
+   #SBATCH --cpus-per-task=4
+   #SBATCH --mem=40GB
+   #SBATCH --gpus=1
+   #SBATCH --output huggingface.%J.out
+   #SBATCH --error huggingface.%J.err

-      In jupyter notebook, one can set up all necessary environment variables directly:
+   # Set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
+   module load model-huggingface

-      .. code-block:: python
+   # Load the Python environment that provides HuggingFace Transformers
+   module load scicomp-llm-env

-         ## Force transformer to load model(s) from local hub instead of download and load model(s) from remote hub.
-         ## IMPORTANT: This must be executed before importing the transformers library
-         import os
-         os.environ['TRANSFORMERS_OFFLINE'] = '1'
-         os.environ['HF_HUB_OFFLINE'] = '1'
-         os.environ['HF_HOME']='/scratch/shareddata/dldata/huggingface-hub-cache'
+   # Force transformers to load model(s) from the local hub instead of downloading them from the remote hub
+   export TRANSFORMERS_OFFLINE=1
+   export HF_HUB_OFFLINE=1

+   python your_script.py
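+
+Save the script to a file (for example ``run_llm.sh``; the name is only an illustration) and submit it to the queue with ``sbatch``:
+
+.. code-block:: bash
+
+   sbatch run_llm.sh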

-Here is a Python script using huggingface model.
+The ``your_script.py`` script below uses the HuggingFace model ``mistralai/Mistral-7B-Instruct-v0.1``, an instruction-tuned model suited for conversation and question answering.

.. code-block:: python

-   from transformers import AutoModelForCausalLM, AutoTokenizer
-
-   tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
-   model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
-
-   prompt = "How many stars in the space?"
-
-   model_inputs = tokenizer([prompt], return_tensors="pt")
-   input_length = model_inputs.input_ids.shape[1]
-
-   generated_ids = model.generate(**model_inputs, max_new_tokens=20)
-   print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
-
-Raw model weights
-~~~~~~~~~~~~~~~~~~~~~~~~
-We also downloaded the following raw llama model weights (PyTorch model checkpoints), and they are managed by the following modules.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 1 1 3 2
-
-   * * Model type
-     * Model version
-     * Module command to load
-     * Description
-
-   * * Llama 2
-     * Raw Data
-     * ``module load model-llama2/raw-data``
-     * Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 7b
-     * ``module load model-llama2/7b``
-     * Raw weights of 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 7b-chat
-     * ``module load model-llama2/7b-chat``
-     * Raw weights of 7B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 13b
-     * ``module load model-llama2/13b``
-     * Raw weights of 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 13b-chat
-     * ``module load model-llama2/13b-chat``
-     * Raw weights of 13B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 70b
-     * ``module load model-llama2/70b``
-     * Raw weights of 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * Llama 2
-     * 70b-chat
-     * ``module load model-llama2/70b-chat``
-     * Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-   * * CodeLlama
-     * Raw Data
-     * ``module load model-codellama/raw-data``
-     * Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-   * * CodeLlama
-     * 7b
-     * ``module load model-codellama/7b``
-     * Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-   * * CodeLlama
-     * 7b-Python
-     * ``module load model-codellama/7b-python``
-     * Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-   * * CodeLlama
-     * 7b-Instruct
-     * ``module load model-codellama/7b-instruct``
-     * Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-   * * CodeLlama
-     * 13b
-     * ``module load model-codellama/13b``
-     * Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-   * * CodeLlama
-     * 13b-Python
-     * ``module load model-codellama/13b-python``
-     * Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-   * * CodeLlama
-     * 13b-Instruct
-     * ``module load model-codellama/13b-instruct``
-     * Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-   * * CodeLlama
-     * 34b
-     * ``module load model-codellama/34b``
-     * Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-   * * CodeLlama
-     * 34b-Python
-     * ``module load model-codellama/34b-python``
-     * Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-   * * CodeLlama
-     * 34b-Instruct
-     * ``module load model-codellama/34b-instruct``
-     * Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-Each module will set the following environment variables:
-
-- ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
-- ``TOKENIZER_PATH`` - File path to the tokenizer.model.
-
-Here is an example :doc:`slurm script </triton/tut/slurm>`, using the raw weights for batch inference. For detailed environment setting up, example prompts and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
-
-.. code-block:: slurm
-
-   #!/bin/bash
-   #SBATCH --time=00:25:00
-   #SBATCH --cpus-per-task=4
-   #SBATCH --mem=20GB
-   #SBATCH --gpus=1
-   #SBATCH --output llama2inference-gpu.%J.out
-   #SBATCH --error llama2inference-gpu.%J.err
-
-   # get access to the model weights
-   module load model-llama2/7b
-   echo $MODEL_ROOT
-   # Expect output: /scratch/shareddata/dldata/llama-2/llama-2-7b
-   echo $TOKENIZER_PATH
-   # Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model
-
-   # activate your conda environment
-   module load mamba
-   source activate llama2env
-
-   # run batch inference
-   torchrun --nproc_per_node 1 batch_inference.py \
-       --prompts prompts.json \
-       --ckpt_dir $MODEL_ROOT \
-       --tokenizer_path $TOKENIZER_PATH \
-       --max_seq_len 512 --max_batch_size 16
-
-llama.cpp and GGUF model weights
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is another popular framework
-for running inference on LLM models with CPUs or GPUs. It provides C++ implementations of many large language models. llama.cpp uses a format called GGUF as its storage format.
-We have GGUF conversions of all Llama 2 and CodeLlama models with multiple quantization levels.
-Please contact us if you need any other GGUF models.
-NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 1 1 3 2
-
-   * * Model type
-     * Model version
-     * Module command to load
-     * Description
-
-   * * Llama 2
-     * f16-2023-08-28
-     * ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-     * Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-   * * Llama 2
-     * q4_0-2023-08-28
-     * ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-     * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-   * * Llama 2
-     * q4_1-2023-08-28
-     * ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama2 model for some raw weights)
-     * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-   * * Llama 2
-     * q8_0-2023-08-28
-     * ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-     * 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-   * * CodeLlama
-     * f16-2023-08-28
-     * ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-     * Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-   * * CodeLlama
-     * q4_0-2023-08-28
-     * ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-     * 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-   * * CodeLlama
-     * q8_0-2023-08-28
-     * ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-     * 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-Each module will set the following environment variables:
-
-- ``MODEL_ROOT`` - Folder where model weights are stored.
-- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
-
-This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
-NOTE: this example repo is mainly meant to run on CPUs, if you want to run on GPUs, you can checkout a branch "llamacpp-gpu" of this repo for details.
+   from transformers import pipeline
+   import torch

-.. code-block:: python
-
-   import os
-   from langchain.llms import LlamaCpp
+   # Initialize pipeline
+   pipe = pipeline(
+       "text-generation",  # Task type
+       model="mistralai/Mistral-7B-Instruct-v0.1",  # Model name
+       device="cuda" if torch.cuda.is_available() else "cpu",  # Use GPU if available
+       max_new_tokens=1000
+   )

-   model_path = os.environ.get('MODEL_WEIGHTS')
-   llm = LlamaCpp(model_path=model_path, verbose=False)
+   # Prepare prompts
+   prompts = ["Continue the following sequence: 1, 2, 3, 5, 8", "What is the meaning of life?"]
+
+   # Generate and print responses
+   responses = pipe(prompts)
+   print(responses)
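+
+   # Each prompt gets a list of candidate generations; by default the
+   # "generated_text" field contains the prompt followed by the model's continuation.
+   for prompt_output in responses:
+       print(prompt_output[0]["generated_text"])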
+
+You can look at the `model card <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1>`__ for more information about the model.
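+
+Instruction-tuned models such as Mistral-7B-Instruct expect their prompts in a specific chat format. With a recent version of 🤗 Transformers you can pass the pipeline a list of chat messages instead of plain strings, and the model's chat template is applied for you. A minimal sketch, reusing the ``pipe`` object from the script above:
+
+.. code-block:: python
+
+   # Minimal sketch: chat-style input; the pipeline applies the model's chat template.
+   messages = [
+       {"role": "user", "content": "Explain in one sentence what a transformer architecture is."}
+   ]
+   chat_response = pipe(messages)
+   # For chat input, "generated_text" holds the whole conversation; the last message is the model's reply.
+   print(chat_response[0]["generated_text"][-1]["content"])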
+
+
+Other Frameworks
+~~~~~~~~~~~~~~~~
+
+While HuggingFace provides a convenient way to access and use LLMs, other frameworks are also available for running them,
+such as `DeepSpeed <https://www.deepspeed.ai/tutorials/inference-tutorial/>`__ and `LangChain <https://python.langchain.com/docs/how_to/local_llms/>`__.
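+
+For instance, LangChain can wrap a local 🤗 Transformers pipeline so that a locally loaded model can be used inside a LangChain application. The sketch below is only an illustration: it assumes the ``langchain-huggingface`` package is available in your environment (it is not necessarily included in ``scicomp-llm-env``).
+
+.. code-block:: python
+
+   # Illustrative sketch only: wrap a local transformers pipeline for use with LangChain.
+   # Requires the langchain-huggingface package in addition to transformers.
+   from transformers import pipeline
+   from langchain_huggingface import HuggingFacePipeline
+
+   hf_pipe = pipeline("text-generation",
+                      model="mistralai/Mistral-7B-Instruct-v0.1",
+                      max_new_tokens=200)
+   llm = HuggingFacePipeline(pipeline=hf_pipe)
+   print(llm.invoke("What is a large language model?"))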
+
+If you need assistance running LLMs in these or other frameworks, please contact :doc:`the Aalto RSEs </rse/index>`.



More examples
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~
+
+Aalto RSE has prepared a repository with miscellaneous examples of using LLMs on Triton. You can find it `here <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
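+
+To try the examples on Triton, you can clone the repository into your working directory, for example:
+
+.. code-block:: bash
+
+   git clone https://github.com/AaltoSciComp/llm-examples.git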

-Starting a local API
---------------------------
-With the pre-downloaded model weights, you are also able create an API endpoint locally. For detailed examples, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.