
Commit 2385328

Update LLM section for winter kickstart (#761)
Simplify the LLM page ahead of the winter kickstart:

- Discard llama checkpoint section (obsolete).
- Remove Jupyter notebook parts (not all users have access to Jupyter with gpus on Triton).
- Switch to using the huggingface pipeline in the Python example.
1 parent b4aa046 commit 2385328

File tree

1 file changed (+69, -250 lines)


triton/apps/llms.rst

Lines changed: 69 additions & 250 deletions
@@ -1,283 +1,102 @@
11
LLMs
22
====
33

4-
Large-language models are AI models that can understand and generate
5-
text, primarily using transformer architectures. This page is about
6-
running them on a HPC cluster. This requires
7-
programming experience and knowledge of using the cluster
8-
(:ref:`tutorials`), but allows maximum computational power for the
9-
least cost. :doc:`Aalto RSE </rse/index>` maintains these models and
10-
can provide help with using these, even to users who aren't
11-
computational experts.
4+
Large language models (LLMs) are AI models that can understand and generate
5+
text, primarily using transformer architectures. They are extensively used for tools and
6+
tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.
127

13-
Because the size of model weights are typically very large and the interest in the
14-
models is high, so we provide our users with pre-downloaded model weights in various formats, along with instructions on how to load these weights for inference purposes, retraining, and fine-tuning tasks. We also provide a dedicated python environment (run ``module load scicomp-llm-env`` to activate it) that has many commonly used python libraries installed for you to test the models quickly.
8+
This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to
9+
get familiar with the basics of using the cluster, including running jobs and using Python (:ref:`tutorials`).
1510

11+
.. note::
12+
13+
If at any point something doesn't work, or you are unsure how to get started or how to proceed, do not hesitate to contact :doc:`the Aalto RSEs </rse/index>`.
14+
15+
You can visit us at :ref:`the daily Zoom help session at 13.00-14.00 <garage>`.
16+
1617

1718
HuggingFace Models
1819
~~~~~~~~~~~~~~~~~~~
19-
The simplest way to use an open-source LLM(Large Language Model) is through the tools and pre-trained models hub from huggingface.
20-
Huggingface is a popular platform for NLP(Natural Language Processing) tasks. It provides a user-friendly interface through the transformers library to load and run various pre-trained models.
21-
Most open-source models from Huggingface are widely supported and integrated with the transformers library.
22-
We are keeping our eyes on the latest models and have downloaded some of them for you. If you need any other models, please contact us.
2320

24-
Run command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of all the available models.
21+
The simplest way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their `🤗 Transformers Python library <https://huggingface.co/docs/transformers/en/index>`__.
2522

23+
HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and utilize these models in your projects.
2624

27-
To access Huggingface models:
25+
You can explore their offerings at `🤗 HuggingFace <https://huggingface.co/>`__.
2826

29-
.. tabs::
27+
.. note::
3028

31-
.. group-tab:: slurm/shell script
29+
We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact :doc:`the Aalto RSEs </rse/index>`.
3230

33-
Load the module for huggingface models and setup environment variables:
31+
Run the command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of available models.
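To check programmatically which models are already in the shared cache, here is a minimal sketch using the ``huggingface_hub`` package (assuming it is available in the environment you load; the cache path is the shared one mentioned above):

.. code-block:: python

    # Minimal sketch: list the model repositories in the shared HuggingFace cache.
    # Assumes the huggingface_hub package is installed in your environment.
    from huggingface_hub import scan_cache_dir

    cache_info = scan_cache_dir("/scratch/shareddata/dldata/huggingface-hub-cache/hub")
    for repo in sorted(cache_info.repos, key=lambda r: r.repo_id):
        print(repo.repo_id)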
3432

35-
.. code-block:: bash
36-
37-
# this will set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
38-
module load model-huggingface/all
33+
Below is an example of how to use the 🤗 Transformers `pipeline() <https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline>`__ to load a pre-trained model and use it for question answering.
3934

40-
# this will force transformer to load model(s) from local hub instead of download and load model(s) from remote hub.
41-
export TRANSFORMERS_OFFLINE=1
42-
export HF_HUB_OFFLINE=1
4335

44-
python your_script.py
36+
Example: Question Answering
37+
---------------------------
4538

46-
.. group-tab:: jupyter notebook
39+
In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.
40+
41+
.. code-block:: bash
42+
43+
#!/bin/bash
44+
#SBATCH --time=00:30:00
45+
#SBATCH --cpus-per-task=4
46+
#SBATCH --mem=40GB
47+
#SBATCH --gpus=1
48+
#SBATCH --output huggingface.%J.out
49+
#SBATCH --error huggingface.%J.err
4750
48-
In jupyter notebook, one can set up all necessary environment variables directly:
51+
# Set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
52+
module load model-huggingface
4953
50-
.. code-block:: python
54+
# Load Python environment to use HuggingFace Transformers
55+
module load scicomp-llm-env
5156
52-
## Force transformer to load model(s) from local hub instead of download and load model(s) from remote hub.
53-
## IMPORTANT: This must be executed before importing the transformers library
54-
import os
55-
os.environ['TRANSFORMERS_OFFLINE'] = '1'
56-
os.environ['HF_HUB_OFFLINE'] = '1'
57-
os.environ['HF_HOME']='/scratch/shareddata/dldata/huggingface-hub-cache'
57+
# Force transformers to load model(s) from the local cache instead of downloading them from the remote hub.
58+
export TRANSFORMERS_OFFLINE=1
59+
export HF_HUB_OFFLINE=1
5860
61+
python your_script.py
5962
60-
Here is a Python script using huggingface model.
63+
The ``your_script.py`` Python script below uses the HuggingFace model ``mistralai/Mistral-7B-Instruct-v0.1`` for conversational and instruction-following tasks.
6164

6265
.. code-block:: python
6366
64-
from transformers import AutoModelForCausalLM, AutoTokenizer
65-
66-
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
67-
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
68-
69-
prompt = "How many stars in the space?"
70-
71-
model_inputs = tokenizer([prompt], return_tensors="pt")
72-
input_length = model_inputs.input_ids.shape[1]
73-
74-
generated_ids = model.generate(**model_inputs, max_new_tokens=20)
75-
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
76-
77-
Raw model weights
78-
~~~~~~~~~~~~~~~~~~~~~~~~
79-
We also downloaded the following raw llama model weights (PyTorch model checkpoints), and they are managed by the following modules.
80-
81-
.. list-table::
82-
:header-rows: 1
83-
:widths: 1 1 3 2
84-
85-
* * Model type
86-
* Model version
87-
* Module command to load
88-
* Description
89-
90-
* * Llama 2
91-
* Raw Data
92-
* ``module load model-llama2/raw-data``
93-
* Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.
94-
95-
* * Llama 2
96-
* 7b
97-
* ``module load model-llama2/7b``
98-
* Raw weights of 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
99-
100-
* * Llama 2
101-
* 7b-chat
102-
* ``module load model-llama2/7b-chat``
103-
* Raw weights of 7B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
104-
105-
* * Llama 2
106-
* 13b
107-
* ``module load model-llama2/13b``
108-
* Raw weights of 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
109-
110-
* * Llama 2
111-
* 13b-chat
112-
* ``module load model-llama2/13b-chat``
113-
* Raw weights of 13B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
114-
115-
* * Llama 2
116-
* 70b
117-
* ``module load model-llama2/70b``
118-
* Raw weights of 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
119-
120-
* * Llama 2
121-
* 70b-chat
122-
* ``module load model-llama2/70b-chat``
123-
* Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
124-
125-
* * CodeLlama
126-
* Raw Data
127-
* ``module load model-codellama/raw-data``
128-
* Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
129-
130-
* * CodeLlama
131-
* 7b
132-
* ``module load model-codellama/7b``
133-
* Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
134-
135-
* * CodeLlama
136-
* 7b-Python
137-
* ``module load model-codellama/7b-python``
138-
* Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
139-
* * CodeLlama
140-
* 7b-Instruct
141-
* ``module load model-codellama/7b-instruct``
142-
* Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
143-
144-
* * CodeLlama
145-
* 13b
146-
* ``module load model-codellama/13b``
147-
* Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
148-
149-
* * CodeLlama
150-
* 13b-Python
151-
* ``module load model-codellama/13b-python``
152-
* Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
153-
* * CodeLlama
154-
* 13b-Instruct
155-
* ``module load model-codellama/13b-instruct``
156-
* Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
157-
158-
* * CodeLlama
159-
* 34b
160-
* ``module load model-codellama/34b``
161-
* Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
162-
163-
* * CodeLlama
164-
* 34b-Python
165-
* ``module load model-codellama/34b-python``
166-
* Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
167-
* * CodeLlama
168-
* 34b-Instruct
169-
* ``module load model-codellama/34b-instruct``
170-
* Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
171-
172-
Each module will set the following environment variables:
173-
174-
- ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
175-
- ``TOKENIZER_PATH`` - File path to the tokenizer.model.
176-
177-
Here is an example :doc:`slurm script </triton/tut/slurm>`, using the raw weights for batch inference. For detailed environment setting up, example prompts and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
178-
179-
.. code-block:: slurm
180-
181-
#!/bin/bash
182-
#SBATCH --time=00:25:00
183-
#SBATCH --cpus-per-task=4
184-
#SBATCH --mem=20GB
185-
#SBATCH --gpus=1
186-
#SBATCH --output llama2inference-gpu.%J.out
187-
#SBATCH --error llama2inference-gpu.%J.err
188-
189-
# get access to the model weights
190-
module load model-llama2/7b
191-
echo $MODEL_ROOT
192-
# Expect output: /scratch/shareddata/dldata/llama-2/llama-2-7b
193-
echo $TOKENIZER_PATH
194-
# Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model
195-
196-
# activate your conda environment
197-
module load mamba
198-
source activate llama2env
199-
200-
# run batch inference
201-
torchrun --nproc_per_node 1 batch_inference.py \
202-
--prompts prompts.json \
203-
--ckpt_dir $MODEL_ROOT \
204-
--tokenizer_path $TOKENIZER_PATH \
205-
--max_seq_len 512 --max_batch_size 16
206-
207-
llama.cpp and GGUF model weights
208-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
209-
210-
`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is another popular framework
211-
for running inference on LLM models with CPUs or GPUs. It provides C++ implementations of many large language models. llama.cpp uses a format called GGUF as its storage format.
212-
We have GGUF conversions of all Llama 2 and CodeLlama models with multiple quantization levels.
213-
Please contact us if you need any other GGUF models.
214-
NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.
215-
216-
.. list-table::
217-
:header-rows: 1
218-
:widths: 1 1 3 2
219-
220-
* * Model type
221-
* Model version
222-
* Module command to load
223-
* Description
224-
225-
* * Llama 2
226-
* f16-2023-08-28
227-
* ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
228-
* Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
229-
230-
* * Llama 2
231-
* q4_0-2023-08-28
232-
* ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
233-
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
234-
235-
* * Llama 2
236-
* q4_1-2023-08-28
237-
* ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama2 model for some raw weights)
238-
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
239-
240-
* * Llama 2
241-
* q8_0-2023-08-28
242-
* ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
243-
* 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
244-
245-
* * CodeLlama
246-
* f16-2023-08-28
247-
* ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
248-
* Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
249-
250-
* * CodeLlama
251-
* q4_0-2023-08-28
252-
* ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
253-
* 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
254-
255-
* * CodeLlama
256-
* q8_0-2023-08-28
257-
* ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
258-
* 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
259-
260-
Each module will set the following environment variables:
261-
262-
- ``MODEL_ROOT`` - Folder where model weights are stored.
263-
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
264-
265-
This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
266-
NOTE: this example repo is mainly meant to run on CPUs, if you want to run on GPUs, you can checkout a branch "llamacpp-gpu" of this repo for details.
67+
from transformers import pipeline
68+
import torch
26769
268-
.. code-block:: python
269-
270-
import os
271-
from langchain.llms import LlamaCpp
70+
# Initialize pipeline
71+
pipe = pipeline(
72+
"text-generation", # Task type
73+
model="mistralai/Mistral-7B-Instruct-v0.1", # Model name
74+
device="cuda" if torch.cuda.is_available() else "cpu", # Use GPU if available
75+
max_new_tokens=1000
76+
)
27277
273-
model_path = os.environ.get('MODEL_WEIGHTS')
274-
llm = LlamaCpp(model_path=model_path, verbose=False)
78+
# Prepare prompts
79+
prompts = ["Continue the following sequence: 1, 2, 3, 5, 8", "What is the meaning of life?"]
80+
81+
# Generate and print responses
82+
responses = pipe(prompts)
83+
print(responses)
84+
85+
You can look at the `model card <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1>`__ for more information about the model.
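For a list of prompts, the ``text-generation`` pipeline returns one list of generations per prompt, each generation being a dictionary with a ``generated_text`` field. As a minimal sketch (assuming the default output format of the pipeline), the individual answers can be printed like this:

.. code-block:: python

    # Minimal sketch: unpack the pipeline output. For a list of prompts, the
    # result is a list with one list of generations per prompt.
    for prompt, generations in zip(prompts, responses):
        print("Prompt:", prompt)
        # Each generation is a dict; "generated_text" holds the model output.
        print("Answer:", generations[0]["generated_text"])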
86+
87+
88+
Other Frameworks
89+
~~~~~~~~~~~~~~~~
90+
91+
While HuggingFace provides a convenient way to access and use LLMs, there are other frameworks available for running them,
92+
such as `DeepSpeed <https://www.deepspeed.ai/tutorials/inference-tutorial/>`__ and `LangChain <https://python.langchain.com/docs/how_to/local_llms/>`__.
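For illustration, below is a rough sketch of GPU inference with DeepSpeed, loosely following the linked DeepSpeed inference tutorial. Treat it as a starting point rather than a tested recipe: exact argument names can vary between DeepSpeed versions.

.. code-block:: python

    # Rough sketch of DeepSpeed-accelerated inference, loosely following the
    # linked DeepSpeed inference tutorial; arguments may differ between versions.
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "mistralai/Mistral-7B-Instruct-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

    # Wrap the model with the DeepSpeed inference engine (fp16, kernel injection).
    ds_engine = deepspeed.init_inference(model, dtype=torch.float16,
                                         replace_with_kernel_inject=True)
    model = ds_engine.module

    inputs = tokenizer("What is the meaning of life?", return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))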
93+
94+
If you need assistance running LLMs in these or other frameworks, please contact :doc:`the Aalto RSEs </rse/index>`.
27595

27696

27797
More examples
278-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
98+
~~~~~~~~~~~~~
99+
100+
Aalto RSE has prepared a repository with miscellaneous examples of using LLMs on Triton. You can find it `here <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
279101

280-
Starting a local API
281-
--------------------------
282-
With the pre-downloaded model weights, you are also able create an API endpoint locally. For detailed examples, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
283102
