LLMs
====


Large language models (LLMs) are AI models that can understand and generate text,
primarily using transformer architectures.

Because the model weights are typically very large and interest in these models is
high, we provide our users with pre-downloaded model weights and instructions on how
to load them for inference or for retraining and fine-tuning the models.


Pre-downloaded model weights
----------------------------
Raw model weights
~~~~~~~~~~~~~~~~~
We have downloaded the following raw model weights (PyTorch model checkpoints):

.. list-table::
   :header-rows: 1
   :widths: 1 1 3 2

   * * Model type
     * Model version
     * Module command to load
     * Description

   * * Llama 2
     * Raw Data
     * ``module load model-llama2/raw-data``
     * Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 7b
     * ``module load model-llama2/7b``
     * Raw weights of the 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 7b-chat
     * ``module load model-llama2/7b-chat``
     * Raw weights of the 7B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 13b
     * ``module load model-llama2/13b``
     * Raw weights of the 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 13b-chat
     * ``module load model-llama2/13b-chat``
     * Raw weights of the 13B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 70b
     * ``module load model-llama2/70b``
     * Raw weights of the 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * Llama 2
     * 70b-chat
     * ``module load model-llama2/70b-chat``
     * Raw weights of the 70B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

   * * CodeLlama
     * Raw Data
     * ``module load model-codellama/raw-data``
     * Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 7b
     * ``module load model-codellama/7b``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 7b-Python
     * ``module load model-codellama/7b-python``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 7b-Instruct
     * ``module load model-codellama/7b-instruct``
     * Raw weights of the 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

   * * CodeLlama
     * 13b
     * ``module load model-codellama/13b``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 13b-Python
     * ``module load model-codellama/13b-python``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 13b-Instruct
     * ``module load model-codellama/13b-instruct``
     * Raw weights of the 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

   * * CodeLlama
     * 34b
     * ``module load model-codellama/34b``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

   * * CodeLlama
     * 34b-Python
     * ``module load model-codellama/34b-python``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.

   * * CodeLlama
     * 34b-Instruct
     * ``module load model-codellama/34b-instruct``
     * Raw weights of the 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where the model weights are stored, i.e., the PyTorch model checkpoint directory.
- ``TOKENIZER_PATH`` - File path to the ``tokenizer.model`` file.

Here is an example `Slurm <https://scicomp.aalto.fi/triton/tut/slurm/>`__ script that uses the raw weights to do batch inference. For detailed environment setup, example prompts, and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.

.. code-block:: slurm

   #!/bin/bash
   #SBATCH --time=00:25:00
   #SBATCH --cpus-per-task=4
   #SBATCH --mem=20GB
   #SBATCH --gres=gpu:1
   #SBATCH --output=llama2inference-gpu.%J.out
   #SBATCH --error=llama2inference-gpu.%J.err

   # get the model weights
   module load model-llama2/7b
   echo $MODEL_ROOT
   # Expected output: /scratch/shareddata/dldata/llama-2/llama-2-7b
   echo $TOKENIZER_PATH
   # Expected output: /scratch/shareddata/dldata/llama-2/tokenizer.model

   # activate your conda environment
   module load miniconda
   source activate llama2env

   # run batch inference
   torchrun --nproc_per_node 1 batch_inference.py \
       --prompts prompts.json \
       --ckpt_dir $MODEL_ROOT \
       --tokenizer_path $TOKENIZER_PATH \
       --max_seq_len 512 --max_batch_size 16

Model weight conversions
------------------------
Models produced in research are usually stored as weights from PyTorch or other
frameworks. For inference, we also provide models that have already been converted
to other formats.


Huggingface Models
~~~~~~~~~~~~~~~~~~


Currently, we have the following Huggingface models stored on Triton. Please contact us if you need any other models.

.. list-table::
   :header-rows: 1
   :widths: 1 1

   * * Model type
     * Huggingface model identifier

   * * Text Generation
     * mistralai/Mistral-7B-v0.1

   * * Text Generation
     * mistralai/Mistral-7B-Instruct-v0.1

   * * Text Generation
     * tiiuae/falcon-7b

   * * Text Generation
     * tiiuae/falcon-7b-instruct

   * * Text Generation
     * tiiuae/falcon-40b

   * * Text Generation
     * tiiuae/falcon-40b-instruct

   * * Text Generation
     * meta-llama/Llama-2-7b-hf

   * * Text Generation
     * meta-llama/Llama-2-13b-hf

   * * Text Generation
     * meta-llama/Llama-2-70b-hf

   * * Text Generation
     * codellama/CodeLlama-7b-hf

   * * Text Generation
     * codellama/CodeLlama-13b-hf

   * * Text Generation
     * codellama/CodeLlama-34b-hf

   * * Translation
     * Helsinki-NLP/opus-mt-en-fi

   * * Translation
     * Helsinki-NLP/opus-mt-fi-en

   * * Translation
     * t5-base

   * * Fill Mask
     * bert-base-uncased

   * * Fill Mask
     * bert-base-cased

   * * Fill Mask
     * distilbert-base-uncased

   * * Text to Speech
     * microsoft/speecht5_hifigan

   * * Text to Speech
     * facebook/hf-seamless-m4t-large

   * * Automatic Speech Recognition
     * openai/whisper-large-v3

   * * Token Classification
     * dslim/bert-base-NER-uncased

All Huggingface models can be loaded with ``module load model-huggingface/all``.
Here is a Python script that uses one of the Huggingface models listed above.

.. code-block:: python

   # Force transformers to load model(s) from the local cache instead of
   # downloading them from the remote hub.
   # NOTE: this must be set before importing transformers.
   import os
   os.environ['TRANSFORMERS_OFFLINE'] = '1'

   from transformers import AutoModelForCausalLM, AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
   model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

   prompt = "How many stars in the space?"

   model_inputs = tokenizer([prompt], return_tensors="pt")
   input_length = model_inputs.input_ids.shape[1]

   generated_ids = model.generate(**model_inputs, max_new_tokens=20)
   print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
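
The other model types in the table can be used in much the same way. As a further illustration, here is a minimal sketch (not taken from the repo above; it assumes the same offline setup after ``module load model-huggingface/all``) that runs one of the pre-downloaded translation models through the ``pipeline`` API:

.. code-block:: python

   # Force offline loading, as before; must be set before importing transformers.
   import os
   os.environ['TRANSFORMERS_OFFLINE'] = '1'

   from transformers import pipeline

   # "Helsinki-NLP/opus-mt-en-fi" is one of the translation models in the table above
   translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fi")
   print(translator("The weather is nice today.")[0]["translation_text"])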


llama.cpp and GGUF
~~~~~~~~~~~~~~~~~~

`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is a popular framework
for running inference on LLMs with CPUs or GPUs. llama.cpp uses GGUF as its
storage format.

We have llama.cpp conversions of all Llama 2 and CodeLlama models at multiple quantization levels.

NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of the CodeLlama weights in a .gguf file.

.. list-table::
   :header-rows: 1
   :widths: 1 1 3 2

   * * Model type
     * Model version
     * Module command to load
     * Description

   * * Llama 2
     * f16-2023-12-04
     * ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 raw-weights module)
     * Half-precision version of Llama 2 weights, converted with llama.cpp on 4 December 2023.

   * * Llama 2
     * q4_0-2023-12-04
     * ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 raw-weights module)
     * 4-bit integer version of Llama 2 weights, converted with llama.cpp on 4 December 2023.

   * * Llama 2
     * q4_1-2023-12-04
     * ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama 2 raw-weights module)
     * 4-bit integer version of Llama 2 weights, converted with llama.cpp on 4 December 2023.

   * * Llama 2
     * q8_0-2023-12-04
     * ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 raw-weights module)
     * 8-bit integer version of Llama 2 weights, converted with llama.cpp on 4 December 2023.

   * * CodeLlama
     * f16-2023-12-04
     * ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama raw-weights module)
     * Half-precision version of CodeLlama weights, converted with llama.cpp on 4 December 2023.

   * * CodeLlama
     * q4_0-2023-12-04
     * ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama raw-weights module)
     * 4-bit integer version of CodeLlama weights, converted with llama.cpp on 4 December 2023.

   * * CodeLlama
     * q8_0-2023-12-04
     * ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama raw-weights module)
     * 8-bit integer version of CodeLlama weights, converted with llama.cpp on 4 December 2023.

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where the model weights are stored.
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.

This Python code snippet is part of a 'Chat with Your PDF Documents' example that uses LangChain with model weights stored in a .gguf file. For detailed environment setup and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.

.. code-block:: python

   import os
   from langchain.llms import LlamaCpp

   model_path = os.environ.get('MODEL_WEIGHTS')
   llm = LlamaCpp(model_path=model_path, verbose=False)
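
As a quick sanity check (a minimal sketch, not part of the PDF-chat example itself; ``invoke`` requires LangChain 0.1 or newer, older releases use ``llm(...)`` instead):

.. code-block:: python

   # Run a single prompt through the GGUF model loaded above
   print(llm.invoke("In one sentence, what is a GGUF file?"))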


More examples
-------------

Starting a local API
~~~~~~~~~~~~~~~~~~~~
With the pre-downloaded model weights, you are also able to create an API endpoint locally. For detailed examples, you can check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
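
One possible approach (an assumption, not necessarily what the repo uses) is the OpenAI-compatible server bundled with ``llama-cpp-python``, pointed at a GGUF file provided by the modules above:

.. code-block:: bash

   # Load a raw-weights module and its GGUF conversion first
   module load model-llama2/7b
   module load model-llama.cpp/q8_0-2023-12-04

   # Requires `pip install llama-cpp-python[server]` in your environment
   python -m llama_cpp.server --model $MODEL_WEIGHTS --port 8000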