Adding CodeLlama and Huggingface Models

Yu Tian · Yu Tian · commit bf72254aeaf1 · 2023-12-09T15:49:11.000+02:00
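
Note for reviewers: the pages touched by this commit state that each model module exports ``MODEL_ROOT`` together with either ``TOKENIZER_PATH`` (raw weights) or ``MODEL_WEIGHTS`` (llama.cpp conversions). A minimal sketch of how a user script might pick these up is shown here. The helper name ``resolve_model_paths`` is hypothetical and not part of the documented examples, and the sketch assumes only that the variables behave as described in the diff.

```python
import os
from pathlib import Path


def resolve_model_paths(env=None):
    """Collect the model paths exported by a loaded model module.

    Hypothetical helper: MODEL_ROOT is required, while TOKENIZER_PATH
    (raw-weight modules) and MODEL_WEIGHTS (llama.cpp modules) are
    included only when they are set.
    """
    env = os.environ if env is None else env
    root = env.get("MODEL_ROOT")
    if not root:
        raise RuntimeError(
            "MODEL_ROOT is not set; run 'module load <model module>' first"
        )
    paths = {"MODEL_ROOT": Path(root)}
    for name in ("TOKENIZER_PATH", "MODEL_WEIGHTS"):
        value = env.get(name)
        if value:
            paths[name] = Path(value)
    return paths


if __name__ == "__main__":
    try:
        # Print each exported path and whether it exists on this machine.
        for name, path in resolve_model_paths().items():
            print(f"{name}={path} (exists: {path.exists()})")
    except RuntimeError as err:
        print(err)
```

Run without any model module loaded, the script prints the error message rather than a list of paths, which makes a missing ``module load`` visible before any weights are read.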
diff --git a/triton/apps/llms.rst b/triton/apps/llms.rst
@@ -2,21 +2,18 @@ LLMs
 ====
 
 
-.. highlight:: bash
-
-Large-language models are AI models that can understand and generate text
-using transformer architectures.
+Large language models (LLMs) are AI models that can understand and generate text,
+primarily using transformer architectures.
 
 Because the model weights are typically very large and the interest in the
-models is high, we provide our users pre-downloaded model weights and
-instructions on how to run inference and training on the models.
+models is high, we provide our users with pre-downloaded model weights and instructions on how to load these weights for inference or for retraining and fine-tuning the models.
 
 
 Pre-downloaded model weights
 ----------------------------
 Raw model weights
 ~~~~~~~~~~~~~~~~~
-We have downloaded the following models weights (PyTorch model checkpoint directories):
+We have downloaded the following raw model weights (PyTorch model checkpoints):
 
 .. list-table::
   :header-rows: 1
@@ -28,7 +25,7 @@ We have downloaded the following models weights (PyTorch model checkpoint direct
     * Description
 
   * * Llama 2
-    * Raw data
+    * Raw Data
     * ``module load model-llama2/raw-data``
     * Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.
 
@@ -62,12 +59,59 @@ We have downloaded the following models weights (PyTorch model checkpoint direct
     * ``module load model-llama2/70b-chat``
     * Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
 
+  * * CodeLlama
+    * Raw Data
+    * ``module load model-codellama/raw-data``
+    * Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
+
+  * * CodeLlama
+    * 7b
+    * ``module load model-codellama/7b``
+    * Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
+
+  * * CodeLlama
+    * 7b-Python
+    * ``module load model-codellama/7b-python``
+    * Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
+  * * CodeLlama
+    * 7b-Instruct
+    * ``module load model-codellama/7b-instruct``
+    * Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
+
+  * * CodeLlama
+    * 13b
+    * ``module load model-codellama/13b``
+    * Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
+
+  * * CodeLlama
+    * 13b-Python
+    * ``module load model-codellama/13b-python``
+    * Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
+  * * CodeLlama
+    * 13b-Instruct
+    * ``module load model-codellama/13b-instruct``
+    * Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
+
+  * * CodeLlama
+    * 34b
+    * ``module load model-codellama/34b``
+    * Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
+
+  * * CodeLlama
+    * 34b-Python
+    * ``module load model-codellama/34b-python``
+    * Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
+  * * CodeLlama
+    * 34b-Instruct
+    * ``module load model-codellama/34b-instruct``
+    * Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
+
 Each module will set the following environment variables:
 
 - ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
 - ``TOKENIZER_PATH`` - File path to the tokenizer.model. 
 
-Here is an example slurm script using the raw weights to do batch inference. For detailed environment setting up, example prompts and python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
+Here is an example `Slurm <https://scicomp.aalto.fi/triton/tut/slurm/>`__ script that uses the raw weights for batch inference. For detailed environment setup, example prompts, and Python code, see `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
 
 .. code-block:: slurm
 
@@ -86,7 +130,7 @@ Here is an example slurm script using the raw weights to do batch inference. For
   echo $TOKENIZER_PATH
   # Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model
   
-  # activate conda environment
+  # activate your conda environment
   module load miniconda
   source activate llama2env
 
@@ -99,46 +143,110 @@ Here is an example slurm script using the raw weights to do batch inference. For
      
 Model weight conversions
 ------------------------
-Usually models produced in research are stored as weights from PyTorch or other
+Usually, models produced in research are stored as weights from PyTorch or other
 frameworks. As for inference, we also have models that are already converted to different formats.
 
 
 Huggingface Models
 ~~~~~~~~~~~~~~~~~~~
 
 
-We have the following Huggingface models stored:
+Currently, we have the following Huggingface models stored on Triton. Please contact us if you need any other models.
 
 .. list-table::
   :header-rows: 1
-  :widths: 1 1 3 2
+  :widths: 1 1
 
   * * Model type
-    * Model version
-    * Module command to load
-    * Description
+    * Huggingface model identifier
 
-  * * Llama 2
-    * 
-    * Module command to load
-    * Description
+  * * Text Generation
+    * mistralai/Mistral-7B-v0.1
+
+  * * Text Generation
+    * mistralai/Mistral-7B-Instruct-v0.1
 
-All Huggingface models can be loaded with:  ``module load model-huggingface/all``,
-Here is a python script using huggingface model.
+  * * Text Generation
+    * tiiuae/falcon-7b
+
+  * * Text Generation
+    * tiiuae/falcon-7b-instruct
+
+  * * Text Generation
+    * tiiuae/falcon-40b
+
+  * * Text Generation
+    * tiiuae/falcon-40b-instruct
+  
+  * * Text Generation
+    * meta-llama/Llama-2-7b-hf
+
+  * * Text Generation
+    * meta-llama/Llama-2-13b-hf
+
+  * * Text Generation
+    * meta-llama/Llama-2-70b-hf
+
+  * * Text Generation
+    * codellama/CodeLlama-7b-hf
+
+  * * Text Generation
+    * codellama/CodeLlama-13b-hf
+
+  * * Text Generation
+    * codellama/CodeLlama-34b-hf
+
+  * * Translation
+    * Helsinki-NLP/opus-mt-en-fi
+
+  * * Translation
+    * Helsinki-NLP/opus-mt-fi-en
+
+  * * Translation
+    * t5-base
+  
+  * * Fill Mask
+    * bert-base-uncased
+
+  * * Fill Mask
+    * bert-base-cased
+
+  * * Fill Mask
+    * distilbert-base-uncased
+
+  * * Text to Speech
+    * microsoft/speecht5_hifigan
+  
+  * * Text to Speech
+    * facebook/hf-seamless-m4t-large
+
+  * * Automatic Speech Recognition
+    * openai/whisper-large-v3
+
+  * * Token Classification
+    * dslim/bert-base-NER-uncased
+
+
+
+All Huggingface models can be loaded with ``module load model-huggingface/all``.
+Here is a Python script that uses a Huggingface model.
 
 .. code-block:: python
 
-  #force transformer to use local hub instead of download from remote hub
+  # Force transformers to load models from the local hub instead of downloading them from the remote hub. NOTE: this must be set before importing transformers.
   import os
   os.environ['TRANSFORMERS_OFFLINE'] = '1'
 
   from transformers import AutoModelForCausalLM, AutoTokenizer
 
   tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
   model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+
   prompt = "How many stars in the space?"
+
   model_inputs = tokenizer([prompt], return_tensors="pt")
   input_length = model_inputs.input_ids.shape[1]
+
   generated_ids = model.generate(**model_inputs, max_new_tokens=20)
   print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
 
@@ -151,9 +259,9 @@ llama.cpp and GGUF
 for running inference on LLM models with CPUs or GPUs. llama.cpp uses a format
 called GGUF as its storage format.
 
-We have llama.cpp conversions of all models with multiple quantizations levels.
+We have llama.cpp conversions of all Llama 2 and CodeLlama models with multiple quantization levels.
 
-Before loading the modules load a module for the model weight you want to use.
+NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.
 
 .. list-table::
   :header-rows: 1
@@ -164,32 +272,47 @@ Before loading the modules load a module for the model weight you want to use.
     * Module command to load
     * Description
 
-  * * Llama 2
+  * * Llama 2 
     * f16-2023-08-28
-    * ``module load model-llama.cpp/f16-2023-08-28`` (after loading a Llama 2 model for some weight)
-    * Half precision version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
+    * ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
+    * Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
 
-  * * Llama 2
+  * * Llama 2 
     * q4_0-2023-08-28
-    * ``module load model-llama.cpp/q4_0-2023-08-28`` (after loading a Llama 2 model for some weight)
-    * 4-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
+    * ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
+    * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
 
   * * Llama 2
     * q4_1-2023-08-28
-    * ``module load model-llama.cpp/q4_1-2023-08-28`` (after loading a Llama 2 model for some weight)
-    * 4-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
+    * ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama 2 model for some raw weights)
+    * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
 
-  * * Llama 2
+  * * Llama 2 
+    * q8_0-2023-08-28
+    * ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
+    * 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
+
+  * * CodeLlama
+    * f16-2023-08-28
+    * ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
+    * Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
+
+  * * CodeLlama
+    * q4_0-2023-08-28
+    * ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
+    * 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
+
+  * * CodeLlama
     * q8_0-2023-08-28
-    * ``module load model-llama.cpp/q8_0-2023-08-28`` (after loading a Llama 2 model for some weight)
-    * 8-bit integer version of Llama 2 weights done with llama.cpp on 28th of Aug 2023.
+    * ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
+    * 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
 
 Each module will set the following environment variables:
 
 - ``MODEL_ROOT`` - Folder where model weights are stored.
-- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF format.
+- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
 
-This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
+This Python code snippet is part of a 'Chat with Your PDF Documents' example that uses LangChain and model weights stored in a .gguf file. For detailed environment setup and Python code, see `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
 
 .. code-block:: python
   
@@ -203,15 +326,7 @@ This Python code snippet is part of a 'Chat with Your PDF Documents' example, ut
 More examples
 ------------------------------------------------------------
 
-Running an interactive chat via a local API
+Starting a local API
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-With the predownloaded model weights, you are also able create an API endpoint locally and initiate an interactive chat interface directly from your shell or command line environment. For detailed setup insturctions, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/gpt4all-api>`__.
-
-
-Running llama with huggingface
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-
-Running inference with LangChain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+With the pre-downloaded model weights, you can also create an API endpoint locally. For detailed examples, check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.