Commit df07884

Merge pull request #680 from AaltoSciComp/llm-page
triton/apps/llms: Adding a new page about LLMs
2 parents 859d642 + bf72254 commit df07884

File tree

1 file changed: +332, -0 lines


triton/apps/llms.rst

Lines changed: 332 additions & 0 deletions
@@ -0,0 +1,332 @@
LLMs
====

Large language models (LLMs) are AI models that can understand and generate text,
primarily using transformer architectures.

Because the model weights are typically very large and interest in these models is
high, we provide our users with pre-downloaded model weights and instructions on how
to load these weights for inference or for retraining and fine-tuning the models.


Pre-downloaded model weights
----------------------------

Raw model weights
~~~~~~~~~~~~~~~~~

We have downloaded the following raw model weights (PyTorch model checkpoints):

.. list-table::
   :header-rows: 1
   :widths: 1 1 3 2

   * * Model type
     * Model version
     * Module command to load
     * Description

   * * Llama 2
     * Raw Data
     * ``module load model-llama2/raw-data``
     * Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 7b
     * ``module load model-llama2/7b``
     * Raw weights of the 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 7b-chat
     * ``module load model-llama2/7b-chat``
     * Raw weights of the 7B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 13b
     * ``module load model-llama2/13b``
     * Raw weights of the 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 13b-chat
     * ``module load model-llama2/13b-chat``
     * Raw weights of the 13B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 70b
     * ``module load model-llama2/70b``
     * Raw weights of the 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 70b-chat
     * ``module load model-llama2/70b-chat``
     * Raw weights of the 70B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * CodeLlama
     * Raw Data
     * ``module load model-codellama/raw-data``
     * Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 7b
     * ``module load model-codellama/7b``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 7b-Python
     * ``module load model-codellama/7b-python``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 7b-Instruct
     * ``module load model-codellama/7b-instruct``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

   * * CodeLlama
     * 13b
     * ``module load model-codellama/13b``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 13b-Python
     * ``module load model-codellama/13b-python``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 13b-Instruct
     * ``module load model-codellama/13b-instruct``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

   * * CodeLlama
     * 34b
     * ``module load model-codellama/34b``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 34b-Python
     * ``module load model-codellama/34b-python``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 34b-Instruct
     * ``module load model-codellama/34b-instruct``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where the model weights are stored, i.e., the PyTorch model checkpoint directory.
- ``TOKENIZER_PATH`` - File path to the ``tokenizer.model`` file.

Here is an example `Slurm <https://scicomp.aalto.fi/triton/tut/slurm/>`__ script that uses the raw weights for batch inference. For details on setting up the environment, example prompts, and the Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.

.. code-block:: slurm

    #!/bin/bash
    #SBATCH --time=00:25:00
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=20GB
    #SBATCH --gres=gpu:1
    #SBATCH --output=llama2inference-gpu.%J.out
    #SBATCH --error=llama2inference-gpu.%J.err

    # get the model weights
    module load model-llama2/7b
    echo $MODEL_ROOT
    # Expected output: /scratch/shareddata/dldata/llama-2/llama-2-7b
    echo $TOKENIZER_PATH
    # Expected output: /scratch/shareddata/dldata/llama-2/tokenizer.model

    # activate your conda environment
    module load miniconda
    source activate llama2env

    # run batch inference
    torchrun --nproc_per_node 1 batch_inference.py \
        --prompts prompts.json \
        --ckpt_dir $MODEL_ROOT \
        --tokenizer_path $TOKENIZER_PATH \
        --max_seq_len 512 --max_batch_size 16
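
For orientation, here is a minimal sketch of what such a batch inference script could look like, built on Meta's reference `llama <https://github.com/facebookresearch/llama>`__ package. The actual ``batch_inference.py`` lives in the linked repo and may differ.

.. code-block:: python

    # Illustrative sketch only: assumes Meta's reference ``llama`` package is
    # installed in the active conda environment and that prompts.json holds a
    # list of prompt strings (at most --max_batch_size of them).
    import argparse
    import json

    from llama import Llama

    parser = argparse.ArgumentParser()
    parser.add_argument("--prompts", required=True)
    parser.add_argument("--ckpt_dir", required=True)
    parser.add_argument("--tokenizer_path", required=True)
    parser.add_argument("--max_seq_len", type=int, default=512)
    parser.add_argument("--max_batch_size", type=int, default=16)
    args = parser.parse_args()

    with open(args.prompts) as f:
        prompts = json.load(f)

    # Build the model from the pre-downloaded checkpoint and tokenizer
    generator = Llama.build(
        ckpt_dir=args.ckpt_dir,
        tokenizer_path=args.tokenizer_path,
        max_seq_len=args.max_seq_len,
        max_batch_size=args.max_batch_size,
    )

    # Generate completions for the whole batch of prompts
    results = generator.text_completion(prompts, max_gen_len=128)
    for prompt, result in zip(prompts, results):
        print(prompt)
        print(result["generation"])
        print("-" * 40)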

Model weight conversions
------------------------

Usually, models produced in research are stored as weights from PyTorch or other
frameworks. For inference, we also provide models that have already been converted to other formats.


Huggingface Models
~~~~~~~~~~~~~~~~~~

Currently, we have the following Huggingface models stored on Triton. Please contact us if you need any other models.

.. list-table::
   :header-rows: 1
   :widths: 1 1

   * * Model type
     * Huggingface model identifier

   * * Text Generation
     * mistralai/Mistral-7B-v0.1

   * * Text Generation
     * mistralai/Mistral-7B-Instruct-v0.1

   * * Text Generation
     * tiiuae/falcon-7b

   * * Text Generation
     * tiiuae/falcon-7b-instruct

   * * Text Generation
     * tiiuae/falcon-40b

   * * Text Generation
     * tiiuae/falcon-40b-instruct

   * * Text Generation
     * meta-llama/Llama-2-7b-hf

   * * Text Generation
     * meta-llama/Llama-2-13b-hf

   * * Text Generation
     * meta-llama/Llama-2-70b-hf

   * * Text Generation
     * codellama/CodeLlama-7b-hf

   * * Text Generation
     * codellama/CodeLlama-13b-hf

   * * Text Generation
     * codellama/CodeLlama-34b-hf

   * * Translation
     * Helsinki-NLP/opus-mt-en-fi

   * * Translation
     * Helsinki-NLP/opus-mt-fi-en

   * * Translation
     * t5-base

   * * Fill Mask
     * bert-base-uncased

   * * Fill Mask
     * bert-base-cased

   * * Fill Mask
     * distilbert-base-uncased

   * * Text to Speech
     * microsoft/speecht5_hifigan

   * * Text to Speech
     * facebook/hf-seamless-m4t-large

   * * Automatic Speech Recognition
     * openai/whisper-large-v3

   * * Token Classification
     * dslim/bert-base-NER-uncased

All Huggingface models can be loaded with ``module load model-huggingface/all``.
Here is a Python script that uses one of these Huggingface models:

.. code-block:: python

    # Force transformers to load model(s) from the local hub instead of downloading
    # them from the remote hub. NOTE: this must be set before importing transformers.
    import os
    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

    prompt = "How many stars are there in space?"

    model_inputs = tokenizer([prompt], return_tensors="pt")
    input_length = model_inputs.input_ids.shape[1]

    generated_ids = model.generate(**model_inputs, max_new_tokens=20)
    print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
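
The snippet above runs on the CPU. As a rough sketch (not from the linked examples), the same model can be placed on a GPU, assuming the ``accelerate`` package is available in your environment and your job has requested a GPU:

.. code-block:: python

    # Illustrative sketch: load the model in half precision and let accelerate
    # place the weights on the available GPU(s) via device_map="auto".
    import os
    os.environ['TRANSFORMERS_OFFLINE'] = '1'

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
    model = AutoModelForCausalLM.from_pretrained(
        "mistralai/Mistral-7B-v0.1",
        torch_dtype=torch.float16,
        device_map="auto",
    )

    prompt = "How many stars are there in space?"
    model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    generated_ids = model.generate(**model_inputs, max_new_tokens=20)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])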


llama.cpp and GGUF
~~~~~~~~~~~~~~~~~~

`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is a popular framework
for running LLM inference on CPUs or GPUs. llama.cpp stores models in a format
called GGUF.

We have llama.cpp conversions of all Llama 2 and CodeLlama models with multiple quantization levels.

NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of the CodeLlama weights in a .gguf file.

.. list-table::
   :header-rows: 1
   :widths: 1 1 3 2

   * * Model type
     * Model version
     * Module command to load
     * Description

   * * Llama 2
     * f16-2023-12-04
     * ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
     * Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

   * * Llama 2
     * q4_0-2023-12-04
     * ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
     * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

   * * Llama 2
     * q4_1-2023-12-04
     * ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama 2 model for some raw weights)
     * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

   * * Llama 2
     * q8_0-2023-12-04
     * ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
     * 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

   * * CodeLlama
     * f16-2023-12-04
     * ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
     * Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.

   * * CodeLlama
     * q4_0-2023-12-04
     * ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
     * 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.

   * * CodeLlama
     * q8_0-2023-12-04
     * ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
     * 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where the model weights are stored.
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
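
For example, following the note above, you can load a quantized CodeLlama conversion and check where the GGUF file is (a minimal illustration using only the module names listed above):

.. code-block:: bash

    # First load the raw weights, then the llama.cpp conversion module
    module load model-codellama/34b
    module load codellama.cpp/q8_0-2023-12-04

    # MODEL_WEIGHTS now points to the 8-bit .gguf file
    echo $MODEL_WEIGHTS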

This Python code snippet is part of a 'Chat with Your PDF Documents' example that uses LangChain with model weights stored in a .gguf file. For details on setting up the environment and the full Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.

.. code-block:: python

    import os
    from langchain.llms import LlamaCpp

    # Load the GGUF weights pointed to by the module's environment variable
    model_path = os.environ.get('MODEL_WEIGHTS')
    llm = LlamaCpp(model_path=model_path, verbose=False)
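
Once loaded, the model can be queried like any other LangChain LLM. A minimal usage example (not part of the linked example; with older LangChain versions use ``llm(...)`` instead of ``llm.invoke(...)``):

.. code-block:: python

    # Runs fully locally against the GGUF weights loaded above
    answer = llm.invoke("Summarize what the GGUF file format is in one sentence.")
    print(answer)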
324+
325+
326+
More examples
327+
------------------------------------------------------------
328+
329+
Starting a local API
330+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
331+
With the pre-downloaded model weights, you are also able create an API endpoint locally. For detailed examples, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
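
One way to do this (a sketch, assuming the ``llama-cpp-python`` package with its server extra is installed in your environment) is to serve a GGUF conversion of the weights through its built-in OpenAI-compatible server:

.. code-block:: bash

    # Load a GGUF conversion of the weights (see the llama.cpp section above)
    module load model-llama2/7b
    module load model-llama.cpp/q8_0-2023-12-04

    # Start a local OpenAI-compatible API on the node (assumes
    # `pip install llama-cpp-python[server]` was done in the active environment)
    python -m llama_cpp.server --model $MODEL_WEIGHTS --host 127.0.0.1 --port 8080

Clients on the same node can then send requests to ``http://127.0.0.1:8080/v1/...``.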