Commit bf72254

author
Yu Tian
committed
Adding CodeLlama and Huggingface Models
1 parent 07af4e0 commit bf72254

1 file changed

+163
-48
lines changed

triton/apps/llms.rst

Lines changed: 163 additions & 48 deletions
@@ -2,21 +2,18 @@ LLMs
22
====
33

44

5-
.. highlight:: bash
6-
7-
Large-language models are AI models that can understand and generate text
8-
using transformer architectures.
5+
Large-language models are AI models that can understand and generate text,
6+
primarily using transformer architectures.
97

108
Because the model weights are typically very large and the interest in the
11-
models is high, we provide our users pre-downloaded model weights and
12-
instructions on how to run inference and training on the models.
9+
models is high, we provide our users with pre-downloaded model weights and instructions on how to load these weights for inference or for retraining and fine-tuning the models.
1310

1411

1512
Pre-downloaded model weights
1613
----------------------------
1714
Raw model weights
1815
~~~~~~~~~~~~~~~~~
19-
We have downloaded the following models weights (PyTorch model checkpoint directories):
16+
We have downloaded the following raw model weights (PyTorch model checkpoints):
2017

2118
.. list-table::
2219
:header-rows: 1
@@ -28,7 +25,7 @@ We have downloaded the following models weights (PyTorch model checkpoint direct
2825
* Description
2926

3027
* * Llama 2
31-
* Raw data
28+
* Raw Data
3229
* ``module load model-llama2/raw-data``
3330
* Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.
3431

@@ -62,12 +59,59 @@ We have downloaded the following models weights (PyTorch model checkpoint direct
6259
* ``module load model-llama2/70b-chat``
6360
* Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
6461

62+
* * CodeLlama
63+
* Raw Data
64+
* ``module load model-codellama/raw-data``
65+
* Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
66+
67+
* * CodeLlama
68+
* 7b
69+
* ``module load model-codellama/7b``
70+
* Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
71+
72+
* * CodeLlama
73+
* 7b-Python
74+
* ``module load model-codellama/7b-python``
75+
* Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
76+
* * CodeLlama
77+
* 7b-Instruct
78+
* ``module load model-codellama/7b-instruct``
79+
* Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
80+
81+
* * CodeLlama
82+
* 13b
83+
* ``module load model-codellama/13b``
84+
* Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
85+
86+
* * CodeLlama
87+
* 13b-Python
88+
* ``module load model-codellama/13b-python``
89+
* Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
90+
* * CodeLlama
91+
* 13b-Instruct
92+
* ``module load model-codellama/13b-instruct``
93+
* Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
94+
95+
* * CodeLlama
96+
* 34b
97+
* ``module load model-codellama/34b``
98+
* Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
99+
100+
* * CodeLlama
101+
* 34b-Python
102+
* ``module load model-codellama/34b-python``
103+
* Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
104+
* * CodeLlama
105+
* 34b-Instruct
106+
* ``module load model-codellama/34b-instruct``
107+
* Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
108+
65109
Each module will set the following environment variables:
66110

67111
- ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
68112
- ``TOKENIZER_PATH`` - File path to the tokenizer.model.
69113
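
As a minimal sketch of how a Python program might pick up these variables, the helper below reads ``MODEL_ROOT`` and ``TOKENIZER_PATH`` from the environment and fails early if a model module has not been loaded. The function name ``resolve_model_paths`` and the ``REQUIRED_VARS`` constant are hypothetical and are not part of the linked example repository.

.. code-block:: python

   import os
   from pathlib import Path

   # Variables that the model modules described above are expected to export.
   REQUIRED_VARS = ("MODEL_ROOT", "TOKENIZER_PATH")


   def resolve_model_paths(env=None):
       """Return (model_root, tokenizer_path) as Path objects.

       ``env`` defaults to os.environ; passing a plain dict makes testing easy.
       """
       env = os.environ if env is None else env
       # Collect every variable that is unset or empty so the error lists them all.
       missing = [name for name in REQUIRED_VARS if not env.get(name)]
       if missing:
           raise RuntimeError(
               "Missing " + ", ".join(missing) + "; run 'module load' for a model first."
           )
       return Path(env["MODEL_ROOT"]), Path(env["TOKENIZER_PATH"])

Calling ``resolve_model_paths()`` with no arguments inside a job that has loaded one of the model modules should return the checkpoint directory and the tokenizer file path.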

70-
Here is an example slurm script using the raw weights to do batch inference. For detailed environment setting up, example prompts and python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
114+
Here is an example `Slurm <https://scicomp.aalto.fi/triton/tut/slurm/>`__ script that uses the raw weights for batch inference. For detailed environment setup, example prompts and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
71115

72116
.. code-block:: slurm
73117
@@ -86,7 +130,7 @@ Here is an example slurm script using the raw weights to do batch inference. For
86130
echo $TOKENIZER_PATH
87131
# Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model
88132
89-
# activate conda environment
133+
# activate your conda environment
90134
module load miniconda
91135
source activate llama2env
92136
@@ -99,46 +143,110 @@ Here is an example slurm script using the raw weights to do batch inference. For
99143
100144
Model weight conversions
101145
------------------------
102-
Usually models produced in research are stored as weights from PyTorch or other
146+
Usually, models produced in research are stored as weights from PyTorch or other
103147
frameworks. For inference, we also provide models that are already converted to other formats.
104148

105149

106150
Huggingface Models
107151
~~~~~~~~~~~~~~~~~~~
108152

109153

110-
We have the following Huggingface models stored:
154+
Currently, we have the following Huggingface models stored on Triton. Please contact us if you need any other models.
111155

112156
.. list-table::
113157
:header-rows: 1
114-
:widths: 1 1 3 2
158+
:widths: 1 1
115159

116160
* * Model type
117-
* Model version
118-
* Module command to load
119-
* Description
161+
* Huggingface model identifier
120162

121-
* * Llama 2
122-
*
123-
* Module command to load
124-
* Description
163+
* * Text Generation
164+
* mistralai/Mistral-7B-v0.1
165+
166+
* * Text Generation
167+
* mistralai/Mistral-7B-Instruct-v0.1
125168

126-
All Huggingface models can be loaded with: ``module load model-huggingface/all``,
127-
Here is a python script using huggingface model.
169+
* * Text Generation
170+
* tiiuae/falcon-7b
171+
172+
* * Text Generation
173+
* tiiuae/falcon-7b-instruct
174+
175+
* * Text Generation
176+
* tiiuae/falcon-40b
177+
178+
* * Text Generation
179+
* tiiuae/falcon-40b-instruct
180+
181+
* * Text Generation
182+
* meta-llama/Llama-2-7b-hf
183+
184+
* * Text Generation
185+
* meta-llama/Llama-2-13b-hf
186+
187+
* * Text Generation
188+
* meta-llama/Llama-2-70b-hf
189+
190+
* * Text Generation
191+
* codellama/CodeLlama-7b-hf
192+
193+
* * Text Generation
194+
* codellama/CodeLlama-13b-hf
195+
196+
* * Text Generation
197+
* codellama/CodeLlama-34b-hf
198+
199+
* * Translation
200+
* Helsinki-NLP/opus-mt-en-fi
201+
202+
* * Translation
203+
* Helsinki-NLP/opus-mt-fi-en
204+
205+
* * Translation
206+
* t5-base
207+
208+
* * Fill Mask
209+
* bert-base-uncased
210+
211+
* * Fill Mask
212+
* bert-base-cased
213+
214+
* * Fill Mask
215+
* distilbert-base-uncased
216+
217+
* * Text to Speech
218+
* microsoft/speecht5_hifigan
219+
220+
* * Text to Speech
221+
* facebook/hf-seamless-m4t-large
222+
223+
* * Automatic Speech Recognition
224+
* openai/whisper-large-v3
225+
226+
* * Token Classification
227+
* dslim/bert-base-NER-uncased
228+
229+
230+
231+
All Huggingface models can be loaded with ``module load model-huggingface/all``.
232+
Here is a Python script that uses a Huggingface model.
128233

129234
.. code-block:: python
130235
131-
#force transformer to use local hub instead of download from remote hub
236+
# Force transformers to load models from the local hub instead of downloading them from the remote hub. NOTE: this must be set before importing transformers.
132237
import os
133238
os.environ['TRANSFORMERS_OFFLINE'] = '1'
134239
135240
from transformers import AutoModelForCausalLM, AutoTokenizer
136241
137242
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
138243
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
244+
139245
prompt = "How many stars in the space?"
246+
140247
model_inputs = tokenizer([prompt], return_tensors="pt")
141248
input_length = model_inputs.input_ids.shape[1]
249+
142250
generated_ids = model.generate(**model_inputs, max_new_tokens=20)
143251
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
144252
@@ -151,9 +259,9 @@ llama.cpp and GGUF
151259
for running inference on LLM models with CPUs or GPUs. llama.cpp uses a format
152260
called GGUF as its storage format.
153261

154-
We have llama.cpp conversions of all models with multiple quantizations levels.
262+
We have llama.cpp conversions of all Llama 2 and CodeLlama models with multiple quantization levels.
155263

156-
Before loading the modules load a module for the model weight you want to use.
264+
NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.
157265

158266
.. list-table::
159267
:header-rows: 1
@@ -164,32 +272,47 @@ Before loading the modules load a module for the model weight you want to use.
164272
* Module command to load
165273
* Description
166274

167-
* * Llama 2
275+
* * Llama 2
168276
* f16-2023-08-28
169-
* ``module load model-llama.cpp/f16-2023-08-28`` (after loading a Llama 2 model for some weight)
170-
* Half precision version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
277+
* ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
278+
* Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
171279

172-
* * Llama 2
280+
* * Llama 2
173281
* q4_0-2023-08-28
174-
* ``module load model-llama.cpp/q4_0-2023-08-28`` (after loading a Llama 2 model for some weight)
175-
* 4-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
282+
* ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
283+
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
176284

177285
* * Llama 2
178286
* q4_1-2023-08-28
179-
* ``module load model-llama.cpp/q4_1-2023-08-28`` (after loading a Llama 2 model for some weight)
180-
* 4-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
287+
* ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama 2 model for some raw weights)
288+
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
181289

182-
* * Llama 2
290+
* * Llama 2
291+
* q8_0-2023-08-28
292+
* ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
293+
* 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
294+
295+
* * CodeLlama
296+
* f16-2023-08-28
297+
* ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
298+
* Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
299+
300+
* * CodeLlama
301+
* q4_0-2023-08-28
302+
* ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
303+
* 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
304+
305+
* * CodeLlama
183306
* q8_0-2023-08-28
184-
* ``module load model-llama.cpp/q8_0-2023-08-28`` (after loading a Llama 2 model for some weight)
185-
* 8-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
307+
* ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
308+
* 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
186309

187310
Each module will set the following environment variables:
188311

189312
- ``MODEL_ROOT`` - Folder where model weights are stored.
190-
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF format.
313+
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
191314
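
Before handing ``MODEL_WEIGHTS`` to an inference library, it can help to confirm that the variable is set and points at a GGUF file. The sketch below does this with the standard library only, relying on the fact that GGUF files begin with the four ASCII bytes ``GGUF``. The helper names ``looks_like_gguf`` and ``gguf_weights_from_env`` are hypothetical and do not come from the linked example repository.

.. code-block:: python

   import os
   from pathlib import Path

   # GGUF files start with this 4-byte magic value.
   GGUF_MAGIC = b"GGUF"


   def looks_like_gguf(path):
       """Return True if the file at ``path`` starts with the GGUF magic bytes."""
       with open(path, "rb") as handle:
           return handle.read(len(GGUF_MAGIC)) == GGUF_MAGIC


   def gguf_weights_from_env(env=None):
       """Return the path in MODEL_WEIGHTS after basic sanity checks.

       ``env`` defaults to os.environ; passing a plain dict makes testing easy.
       """
       env = os.environ if env is None else env
       weights = env.get("MODEL_WEIGHTS")
       if not weights:
           raise RuntimeError("MODEL_WEIGHTS is not set; load a llama.cpp module first.")
       path = Path(weights)
       if not path.is_file():
           raise FileNotFoundError(f"No file found at {path}")
       if not looks_like_gguf(path):
           raise ValueError(f"{path} does not start with the GGUF magic bytes")
       return path

The returned path can then be passed to whichever GGUF-capable loader your environment provides.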

192-
This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
315+
This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
193316

194317
.. code-block:: python
195318
@@ -203,15 +326,7 @@ This Python code snippet is part of a 'Chat with Your PDF Documents' example, ut
203326
More examples
204327
------------------------------------------------------------
205328

206-
Running an interactive chat via a local API
329+
Starting a local API
207330
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
208-
With the predownloaded model weights, you are also able create an API endpoint locally and initiate an interactive chat interface directly from your shell or command line environment. For detailed setup insturctions, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/gpt4all-api>`__.
209-
210-
211-
Running llama with huggingface
212-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
213-
214-
215-
Running inference with LangChain
216-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
331+
With the pre-downloaded model weights, you are also able to create an API endpoint locally. For detailed examples, you can check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
217332
