AaltoSciComp · ruokolt · Feb 25, 2025 · Feb 21, 2025 · Feb 21, 2025 · Feb 24, 2025
diff --git a/triton/apps/llms.rst b/triton/apps/llms.rst
@@ -1,283 +1,102 @@
 LLMs
 ====
 
-Large-language models are AI models that can understand and generate
-text, primarily using transformer architectures.  This page is about
-running them on a HPC cluster.  This requires
-programming experience and knowledge of using the cluster
-(:ref:`tutorials`), but allows maximum computational power for the
-least cost.  :doc:`Aalto RSE </rse/index>` maintains these models and
-can provide help with using these, even to users who aren't
-computational experts.
+Large language models (LLMs) are AI models that can understand and generate
+text, primarily using transformer architectures. They are widely used in tools and
+tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.
 
-Because the size of model weights are typically very large and the interest in the
-models is high, so we provide our users with pre-downloaded model weights in various formats, along with instructions on how to load these weights for inference purposes, retraining, and fine-tuning tasks. We also provide a dedicated python environment (run ``module load scicomp-llm-env`` to activate it) that has many commonly used python libraries installed for you to test the models quickly.
+This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to
+get familiar with the basics of using the cluster, including running jobs and using Python (:ref:`tutorials`).
 
+.. note::
+
+    If at any point something doesn't work, or you are unsure how to get started or proceed, do not hesitate to contact :doc:`the Aalto RSEs </rse/index>`.
+
+    You can visit us at :ref:`the daily Zoom help session at 13.00-14.00 <garage>`.
+
 
 HuggingFace Models
 ~~~~~~~~~~~~~~~~~~~
-The simplest way to use an open-source LLM(Large Language Model) is through the tools and pre-trained models hub from huggingface.
-Huggingface is a popular platform for NLP(Natural Language Processing) tasks. It provides a user-friendly interface through the transformers library to load and run various pre-trained models.
-Most open-source models from Huggingface are widely supported and integrated with the transformers library.
-We are keeping our eyes on the latest models and have downloaded some of them for you. If you need any other models, please contact us.
 
-Run command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of all the available models.
+The simplest way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their `🤗 Transformers Python library <https://huggingface.co/docs/transformers/en/index>`__. 
 
+HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and utilize these models in your projects.
 
-To access Huggingface models: 
+You can explore their offerings at `🤗 HuggingFace <https://huggingface.co/>`__.
 
-.. tabs::
+.. note::
 
-  .. group-tab:: slurm/shell script
+  We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact :doc:`the Aalto RSEs </rse/index>`.
 
-    Load the module for huggingface models and setup environment variables:
+  Run ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of available models.
 
-    .. code-block:: bash
-
-      # this will set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
-      module load model-huggingface/all
+Below is an example of how to use the 🤗 Transformers `pipeline() <https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline>`__ to load a pre-trained model and use it for question answering.
 
-      # this will force transformer to load model(s) from local hub instead of download and load model(s) from remote hub. 
-      export TRANSFORMERS_OFFLINE=1
-      export HF_HUB_OFFLINE=1
 
-      python your_script.py
+Example: Question Answering
+---------------------------
 
-  .. group-tab:: jupyter notebook
+In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.
+
+  .. code-block:: bash
+
+    #!/bin/bash
+    #SBATCH --time=00:30:00
+    #SBATCH --cpus-per-task=4
+    #SBATCH --mem=40GB
+    #SBATCH --gpus=1
+    #SBATCH --output huggingface.%J.out
+    #SBATCH --error huggingface.%J.err
 
-    In jupyter notebook, one can set up all necessary environment variables directly:
+    # Set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
+    module load model-huggingface
 
-    .. code-block:: python
+    # Load Python environment to use HuggingFace Transformers
+    module load scicomp-llm-env
 
-     ## Force transformer to load model(s) from local hub instead of download and load model(s) from remote hub. 
-     ## IMPORTANT: This must be executed before importing the transformers library
-      import os
-      os.environ['TRANSFORMERS_OFFLINE'] = '1'
-      os.environ['HF_HUB_OFFLINE'] = '1'
-      os.environ['HF_HOME']='/scratch/shareddata/dldata/huggingface-hub-cache'
+    # Force Transformers to load models from the local cache instead of downloading them from the remote hub.
+    export TRANSFORMERS_OFFLINE=1
+    export HF_HUB_OFFLINE=1
 
+    python your_script.py
 
-Here is a Python script using huggingface model.
+The ``your_script.py`` Python script uses the HuggingFace model ``mistralai/Mistral-7B-Instruct-v0.1``, which is tuned for conversations and instruction following.
 
 .. code-block:: python
 
-  from transformers import AutoModelForCausalLM, AutoTokenizer
-
-  tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
-  model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
-
-  prompt = "How many stars in the space?"
-
-  model_inputs = tokenizer([prompt], return_tensors="pt")
-  input_length = model_inputs.input_ids.shape[1]
-
-  generated_ids = model.generate(**model_inputs, max_new_tokens=20)
-  print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
-
-Raw model weights
-~~~~~~~~~~~~~~~~~~~~~~~~
-We also downloaded the following raw llama model weights (PyTorch model checkpoints), and they are managed by the following modules. 
-
-.. list-table::
-  :header-rows: 1
-  :widths: 1 1 3 2
-
-  * * Model type
-    * Model version
-    * Module command to load
-    * Description
-
-  * * Llama 2
-    * Raw Data
-    * ``module load model-llama2/raw-data``
-    * Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 7b
-    * ``module load model-llama2/7b``
-    * Raw weights of 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 7b-chat
-    * ``module load model-llama2/7b-chat``
-    * Raw weights of 7B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 13b
-    * ``module load model-llama2/13b``
-    * Raw weights of 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 13b-chat
-    * ``module load model-llama2/13b-chat``
-    * Raw weights of 13B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 70b
-    * ``module load model-llama2/70b``
-    * Raw weights of 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * Llama 2
-    * 70b-chat
-    * ``module load model-llama2/70b-chat``
-    * Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
-
-  * * CodeLlama
-    * Raw Data
-    * ``module load model-codellama/raw-data``
-    * Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-  * * CodeLlama
-    * 7b
-    * ``module load model-codellama/7b``
-    * Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-  * * CodeLlama
-    * 7b-Python
-    * ``module load model-codellama/7b-python``
-    * Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-  * * CodeLlama
-    * 7b-Instruct
-    * ``module load model-codellama/7b-instruct``
-    * Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-  * * CodeLlama
-    * 13b
-    * ``module load model-codellama/13b``
-    * Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-  * * CodeLlama
-    * 13b-Python
-    * ``module load model-codellama/13b-python``
-    * Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-  * * CodeLlama
-    * 13b-Instruct
-    * ``module load model-codellama/13b-instruct``
-    * Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-  * * CodeLlama
-    * 34b
-    * ``module load model-codellama/34b``
-    * Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.
-
-  * * CodeLlama
-    * 34b-Python
-    * ``module load model-codellama/34b-python``
-    * Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
-  * * CodeLlama
-    * 34b-Instruct
-    * ``module load model-codellama/34b-instruct``
-    * Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.
-
-Each module will set the following environment variables:
-
-- ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
-- ``TOKENIZER_PATH`` - File path to the tokenizer.model. 
-
-Here is an example :doc:`slurm script </triton/tut/slurm>`, using the raw weights for batch inference. For detailed environment setting up, example prompts and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.
-
-.. code-block:: slurm
-
-  #!/bin/bash
-  #SBATCH --time=00:25:00
-  #SBATCH --cpus-per-task=4
-  #SBATCH --mem=20GB
-  #SBATCH --gpus=1
-  #SBATCH --output llama2inference-gpu.%J.out
-  #SBATCH --error llama2inference-gpu.%J.err
-
-  # get access to the model weights
-  module load model-llama2/7b
-  echo $MODEL_ROOT
-  # Expect output: /scratch/shareddata/dldata/llama-2/llama-2-7b
-  echo $TOKENIZER_PATH
-  # Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model
-
-  # activate your conda environment
-  module load mamba
-  source activate llama2env
-
-  # run batch inference
-  torchrun --nproc_per_node 1 batch_inference.py \
-    --prompts prompts.json \
-    --ckpt_dir $MODEL_ROOT \
-    --tokenizer_path $TOKENIZER_PATH \
-    --max_seq_len 512 --max_batch_size 16
-
-llama.cpp and GGUF model weights
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is another popular framework
-for running inference on LLM models with CPUs or GPUs. It provides C++ implementations of many large language models. llama.cpp uses a format called GGUF as its storage format.
-We have GGUF conversions of all Llama 2 and CodeLlama models with multiple quantization levels. 
-Please contact us if you need any other GGUF models. 
-NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.
-
-.. list-table::
-  :header-rows: 1
-  :widths: 1 1 3 2
-
-  * * Model type
-    * Model version
-    * Module command to load
-    * Description
-
-  * * Llama 2 
-    * f16-2023-08-28
-    * ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-    * Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-  * * Llama 2 
-    * q4_0-2023-08-28
-    * ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-    * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-  * * Llama 2
-    * q4_1-2023-08-28
-    * ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama2 model for some raw weights)
-    * 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-  * * Llama 2 
-    * q8_0-2023-08-28
-    * ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
-    * 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.
-
-  * * CodeLlama
-    * f16-2023-08-28
-    * ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-    * Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-  * * CodeLlama
-    * q4_0-2023-08-28
-    * ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-    * 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-  * * CodeLlama
-    * q8_0-2023-08-28
-    * ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
-    * 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
-
-Each module will set the following environment variables:
-
-- ``MODEL_ROOT`` - Folder where model weights are stored.
-- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
-
-This Python code snippet is part of a 'Chat with Your PDF Documents' example, utilizing LangChain and leveraging model weights stored in a .gguf file. For detailed environment setting up and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
-NOTE: this example repo is mainly meant to run on CPUs, if you want to run on GPUs, you can checkout a branch "llamacpp-gpu" of this repo for details. 
+  from transformers import pipeline
+  import torch
 
-.. code-block:: python
-
-  import os
-  from langchain.llms import LlamaCpp
+  # Initialize pipeline
+  pipe = pipeline(
+    "text-generation",  # Task type
+    model="mistralai/Mistral-7B-Instruct-v0.1",  # Model name
+    device="cuda" if torch.cuda.is_available() else "cpu",  # Use GPU if available
+    max_new_tokens=1000,  # Upper limit on generated tokens per response
+  )
 
-  model_path = os.environ.get('MODEL_WEIGHTS')
-  llm = LlamaCpp(model_path=model_path, verbose=False)
+  # Prepare prompts
+  prompts = ["Continue the following sequence: 1, 2, 3, 5, 8", "What is the meaning of life?"]
+
+  # Generate and print responses
+  responses = pipe(prompts) 
+  print(responses)
+
+You can look at the `model card <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1>`__ for more information about the model.
+
+
+Other Frameworks
+~~~~~~~~~~~~~~~~
+
+While HuggingFace provides a convenient way to access and use LLMs, there are other frameworks available for running LLMs, 
+such as `DeepSpeed <https://www.deepspeed.ai/tutorials/inference-tutorial/>`__ and `LangChain <https://python.langchain.com/docs/how_to/local_llms/>`__.
+
+If you need assistance running LLMs in these or other frameworks, please contact :doc:`the Aalto RSEs </rse/index>`.
 
 
 More examples
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+~~~~~~~~~~~~~
+
+Aalto RSE has prepared a repository with miscellaneous examples of using LLMs on Triton: `AaltoSciComp/llm-examples <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
 
-Starting a local API
---------------------------
-With the pre-downloaded model weights, you are also able create an API endpoint locally. For detailed examples, you can checkout `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.