319 changes: 69 additions & 250 deletions triton/apps/llms.rst
@@ -1,283 +1,102 @@
LLMs
====

Large-language models are AI models that can understand and generate
text, primarily using transformer architectures. This page is about
running them on an HPC cluster. This requires
programming experience and knowledge of using the cluster
(:ref:`tutorials`), but allows maximum computational power for the
least cost. :doc:`Aalto RSE </rse/index>` maintains these models and
can provide help with using these, even to users who aren't
computational experts.
Large-language models (LLMs) are AI models that can understand and generate
text, primarily using transformer architectures. They are extensively used for tools and
tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.

Because model weights are typically very large and interest in the
models is high, we provide our users with pre-downloaded model weights in various formats, along with instructions on how to load these weights for inference, retraining, and fine-tuning. We also provide a dedicated Python environment (run ``module load scicomp-llm-env`` to activate it) with many commonly used Python libraries installed, so that you can test the models quickly.
This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to
get familiar with the basics of using the cluster, including running jobs and using Python (:ref:`tutorials`).

.. note::

If at any point something doesn't work, or you are unsure how to get started or proceed, do not hesitate to contact :doc:`the Aalto RSEs </rse/index>`.

You can visit us at :ref:`the daily Zoom help session at 13.00-14.00 <garage>`.


HuggingFace Models
~~~~~~~~~~~~~~~~~~~
The simplest way to use an open-source LLM (Large Language Model) is through the tools and pre-trained model hub from HuggingFace.
HuggingFace is a popular platform for NLP (Natural Language Processing) tasks. It provides a user-friendly interface through the transformers library to load and run various pre-trained models.
Most open-source models from HuggingFace are widely supported and integrated with the transformers library.
We are keeping our eyes on the latest models and have downloaded some of them for you. If you need any other models, please contact us.

Run the command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of available models.
The simplest way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their `🤗 Transformers Python library <https://huggingface.co/docs/transformers/en/index>`__.

HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and utilize these models in your projects.

To access Huggingface models:
You can explore their offerings at `🤗 HuggingFace <https://huggingface.co/>`__.

.. tabs::
.. note::

.. group-tab:: slurm/shell script
We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact :doc:`the Aalto RSEs </rse/index>`.

Load the module for huggingface models and setup environment variables:
Run the command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of available models.
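The same listing can be produced from Python. The sketch below assumes the usual HuggingFace hub cache layout, in which each model lives in a folder named ``models--<organisation>--<name>``; the helper name ``list_cached_models`` is hypothetical and not part of any library.

```python
from pathlib import Path

# Shared cache location mentioned above.
HUB_DIR = "/scratch/shareddata/dldata/huggingface-hub-cache/hub"


def list_cached_models(hub_dir=HUB_DIR):
    """Return model identifiers found in a HuggingFace hub cache folder.

    Assumption: folders are named ``models--<organisation>--<name>``.
    A model name that itself contains ``--`` would be split incorrectly.
    """
    names = []
    for entry in sorted(Path(hub_dir).iterdir()):
        if entry.is_dir() and entry.name.startswith("models--"):
            names.append(entry.name[len("models--"):].replace("--", "/"))
    return names


if __name__ == "__main__":
    # Only meaningful on the cluster, where the shared cache exists.
    if Path(HUB_DIR).is_dir():
        for name in list_cached_models():
            print(name)
```

The returned identifiers have the same form as the model names passed to ``from_pretrained()`` or ``pipeline()``.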

.. code-block:: bash

# this will set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
module load model-huggingface/all
Below is an example of how to use the 🤗 Transformers `pipeline() <https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline>`__ to load a pre-trained model and use it for question answering.

# Force Transformers to load models from the local cache instead of downloading them from the remote hub.
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1

python your_script.py
Example: Question Answering
---------------------------

.. group-tab:: jupyter notebook
In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.

.. code-block:: bash

#!/bin/bash
#SBATCH --time=00:30:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=40GB
#SBATCH --gpus=1
#SBATCH --output huggingface.%J.out
#SBATCH --error huggingface.%J.err

In jupyter notebook, one can set up all necessary environment variables directly:
# Set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
module load model-huggingface

.. code-block:: python
# Load Python environment to use HuggingFace Transformers
module load scicomp-llm-env

## Force Transformers to load models from the local cache instead of downloading them from the remote hub.
## IMPORTANT: This must be executed before importing the transformers library
import os
os.environ['TRANSFORMERS_OFFLINE'] = '1'
os.environ['HF_HUB_OFFLINE'] = '1'
os.environ['HF_HOME']='/scratch/shareddata/dldata/huggingface-hub-cache'
# Force Transformers to load models from the local cache instead of downloading them from the remote hub.
export TRANSFORMERS_OFFLINE=1
export HF_HUB_OFFLINE=1

python your_script.py
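When the script is started without the exports above (for example from a notebook), the same variables can be set from Python, as long as this happens before ``transformers`` is imported. The following is a minimal sketch; the helper name ``configure_offline_cache`` is hypothetical, and the default path is the shared cache location given on this page.

```python
import os
import sys

HF_CACHE = "/scratch/shareddata/dldata/huggingface-hub-cache"


def configure_offline_cache(cache_dir=HF_CACHE):
    """Point HuggingFace libraries at the shared cache and disable downloads.

    Returns False if ``transformers`` was already imported, in which case
    the settings may come too late to take effect.
    """
    os.environ["HF_HOME"] = cache_dir
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    os.environ["HF_HUB_OFFLINE"] = "1"
    return "transformers" not in sys.modules


if __name__ == "__main__":
    if not configure_offline_cache():
        print("Warning: transformers was imported before the cache was configured.")
```

Call the helper at the very top of the script or notebook, before any HuggingFace import.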

Here is a Python script that uses a HuggingFace model.
The ``your_script.py`` Python script uses a HuggingFace model, ``mistralai/Mistral-7B-Instruct-v0.1``, for conversations and instructions.

Check failure on line 63 in triton/apps/llms.rst

GitHub Actions / check-warnings (3.12)

'any' reference target not found: your_script.py

Check failure on line 63 in triton/apps/llms.rst

GitHub Actions / check-warnings (3.12)

'any' reference target not found: mistralai/Mistral-7B-Instruct-v0.1

.. code-block:: python

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

prompt = "How many stars in the space?"

model_inputs = tokenizer([prompt], return_tensors="pt")
input_length = model_inputs.input_ids.shape[1]

generated_ids = model.generate(**model_inputs, max_new_tokens=20)
print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])

Raw model weights
~~~~~~~~~~~~~~~~~~~~~~~~
We also downloaded the following raw llama model weights (PyTorch model checkpoints), and they are managed by the following modules.

.. list-table::
:header-rows: 1
:widths: 1 1 3 2

* * Model type
* Model version
* Module command to load
* Description

* * Llama 2
* Raw Data
* ``module load model-llama2/raw-data``
* Raw weights of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 7b
* ``module load model-llama2/7b``
* Raw weights of 7B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 7b-chat
* ``module load model-llama2/7b-chat``
* Raw weights of 7B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 13b
* ``module load model-llama2/13b``
* Raw weights of 13B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 13b-chat
* ``module load model-llama2/13b-chat``
* Raw weights of 13B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 70b
* ``module load model-llama2/70b``
* Raw weights of 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * Llama 2
* 70b-chat
* ``module load model-llama2/70b-chat``
* Raw weights of 70B parameter chat optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.

* * CodeLlama
* Raw Data
* ``module load model-codellama/raw-data``
* Raw weights of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

* * CodeLlama
* 7b
* ``module load model-codellama/7b``
* Raw weights of 7B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

* * CodeLlama
* 7b-Python
* ``module load model-codellama/7b-python``
* Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
* * CodeLlama
* 7b-Instruct
* ``module load model-codellama/7b-instruct``
* Raw weights of 7B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

* * CodeLlama
* 13b
* ``module load model-codellama/13b``
* Raw weights of 13B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

* * CodeLlama
* 13b-Python
* ``module load model-codellama/13b-python``
* Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
* * CodeLlama
* 13b-Instruct
* ``module load model-codellama/13b-instruct``
* Raw weights of 13B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

* * CodeLlama
* 34b
* ``module load model-codellama/34b``
* Raw weights of 34B parameter version of `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__.

* * CodeLlama
* 34b-Python
* ``module load model-codellama/34b-python``
* Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, specifically designed for Python.
* * CodeLlama
* 34b-Instruct
* ``module load model-codellama/34b-instruct``
* Raw weights of 34B parameter version `CodeLlama <https://ai.meta.com/blog/code-llama-large-language-model-coding/>`__, designed for instruction following.

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where model weights are stored, i.e., PyTorch model checkpoint directory.
- ``TOKENIZER_PATH`` - File path to the tokenizer.model.
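A Python script can pick up these variables from its environment. The sketch below is one way to do that with an early, readable failure when no model module has been loaded; the helper name ``resolve_llama_paths`` is hypothetical.

```python
import os
from pathlib import Path


def resolve_llama_paths(env=None):
    """Return (checkpoint directory, tokenizer file) from the module variables.

    Raises RuntimeError when a variable is missing, which usually means
    that no ``model-llama2`` or ``model-codellama`` module was loaded.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("MODEL_ROOT", "TOKENIZER_PATH") if not env.get(name)]
    if missing:
        raise RuntimeError("Load a model module first; unset: " + ", ".join(missing))
    return Path(env["MODEL_ROOT"]), Path(env["TOKENIZER_PATH"])


if __name__ == "__main__":
    try:
        ckpt_dir, tokenizer_path = resolve_llama_paths()
        print(ckpt_dir, tokenizer_path)
    except RuntimeError as error:
        print(error)
```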

Here is an example :doc:`slurm script </triton/tut/slurm>` that uses the raw weights for batch inference. For detailed environment setup, example prompts, and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/batch-inference-llama2>`__.

.. code-block:: slurm

#!/bin/bash
#SBATCH --time=00:25:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=20GB
#SBATCH --gpus=1
#SBATCH --output llama2inference-gpu.%J.out
#SBATCH --error llama2inference-gpu.%J.err

# get access to the model weights
module load model-llama2/7b
echo $MODEL_ROOT
# Expect output: /scratch/shareddata/dldata/llama-2/llama-2-7b
echo $TOKENIZER_PATH
# Expect output: /scratch/shareddata/dldata/llama-2/tokenizer.model

# activate your conda environment
module load mamba
source activate llama2env

# run batch inference
torchrun --nproc_per_node 1 batch_inference.py \
--prompts prompts.json \
--ckpt_dir $MODEL_ROOT \
--tokenizer_path $TOKENIZER_PATH \
--max_seq_len 512 --max_batch_size 16

llama.cpp and GGUF model weights
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`llama.cpp <https://github.com/ggerganov/llama.cpp>`__ is another popular framework
for running inference on LLM models with CPUs or GPUs. It provides C++ implementations of many large language models. llama.cpp uses a format called GGUF as its storage format.
We have GGUF conversions of all Llama 2 and CodeLlama models with multiple quantization levels.
Please contact us if you need any other GGUF models.
NOTE: Before loading the following modules, one must first load a module for the raw model weights. For example, run ``module load model-codellama/34b`` first, and then run ``module load codellama.cpp/q8_0-2023-12-04`` to get the 8-bit integer version of CodeLlama weights in a .gguf file.

.. list-table::
:header-rows: 1
:widths: 1 1 3 2

* * Model type
* Model version
* Module command to load
* Description

* * Llama 2
* f16-2023-08-28
* ``module load model-llama.cpp/f16-2023-12-04`` (after loading a Llama 2 model for some raw weights)
* Half precision version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

* * Llama 2
* q4_0-2023-08-28
* ``module load model-llama.cpp/q4_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

* * Llama 2
* q4_1-2023-08-28
* ``module load model-llama.cpp/q4_1-2023-12-04`` (after loading a Llama2 model for some raw weights)
* 4-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

* * Llama 2
* q8_0-2023-08-28
* ``module load model-llama.cpp/q8_0-2023-12-04`` (after loading a Llama 2 model for some raw weights)
* 8-bit integer version of Llama 2 weights done with llama.cpp on 4th of Dec 2023.

* * CodeLlama
* f16-2023-08-28
* ``module load codellama.cpp/f16-2023-12-04`` (after loading a CodeLlama model for some raw weights)
* Half precision version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.

* * CodeLlama
* q4_0-2023-08-28
* ``module load codellama.cpp/q4_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
* 4-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.

* * CodeLlama
* q8_0-2023-08-28
* ``module load codellama.cpp/q8_0-2023-12-04`` (after loading a CodeLlama model for some raw weights)
* 8-bit integer version of CodeLlama weights done with llama.cpp on 4th of Dec 2023.
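To choose between the quantization levels, it helps to estimate how much memory the weights alone will take. The sketch below uses nominal bits-per-weight figures for the llama.cpp formats (an assumption based on their block layouts: a 16-bit scale shared by 32 weights gives 4.5 bits for ``q4_0`` and 8.5 bits for ``q8_0``). It ignores tensors that are kept in higher precision and the memory needed for the context, so treat the result as a lower bound.

```python
# Nominal bits stored per weight for each format (assumed values, see above).
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_1": 5.0, "q4_0": 4.5}


def weights_size_gb(n_params, quantization):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quantization] / 8 / 1e9


if __name__ == "__main__":
    for level in BITS_PER_WEIGHT:
        print(f"7B parameters at {level}: ~{weights_size_gb(7e9, level):.1f} GB")
```

The estimate can guide the ``--mem`` request of a job, with extra headroom for the runtime itself.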

Each module will set the following environment variables:

- ``MODEL_ROOT`` - Folder where model weights are stored.
- ``MODEL_WEIGHTS`` - Path to the model weights in GGUF file format.
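Before handing ``MODEL_WEIGHTS`` to a framework, a quick sanity check can confirm that the variable points at a GGUF file. The sketch below relies on the GGUF header layout (the four bytes ``GGUF`` followed by a little-endian 32-bit version number); the helper name ``read_gguf_version`` is hypothetical.

```python
import os
import struct


def read_gguf_version(path):
    """Return the format version stored in a GGUF file header.

    Raises ValueError if the file does not start with the ``GGUF`` magic.
    """
    with open(path, "rb") as handle:
        header = handle.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])
    return version


if __name__ == "__main__":
    weights = os.environ.get("MODEL_WEIGHTS")
    if weights:
        print(f"{weights}: GGUF version {read_gguf_version(weights)}")
    else:
        print("MODEL_WEIGHTS is not set; load a llama.cpp module first.")
```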

This Python code snippet is part of a 'Chat with Your PDF Documents' example that uses LangChain and model weights stored in a .gguf file. For detailed environment setup and Python code, please check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/chat-with-pdf>`__.
NOTE: this example repo is mainly meant to run on CPUs. If you want to run on GPUs, check out the ``llamacpp-gpu`` branch of this repo for details.
Contributor

Agree with the deletion, too old models. But maybe add a line: "If you need assistance running LLMs in formats other than HuggingFace's, please contact the RSEs."

from transformers import pipeline
import torch

.. code-block:: python

import os
from langchain.llms import LlamaCpp
# Initialize pipeline
pipe = pipeline(
"text-generation", # Task type
model="mistralai/Mistral-7B-Instruct-v0.1", # Model name
device="cuda" if torch.cuda.is_available() else "cpu", # Use GPU if available
max_new_tokens=1000
)

model_path = os.environ.get('MODEL_WEIGHTS')
llm = LlamaCpp(model_path=model_path, verbose=False)
# Prepare prompts
prompts = ["Continue the following sequence: 1, 2, 3, 5, 8", "What is the meaning of life?"]

# Generate and print responses
responses = pipe(prompts)
print(responses)

You can look at the `model card <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1>`__ for more information about the model.
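Printing ``responses`` directly shows a nested structure. The sketch below assumes that a ``text-generation`` pipeline called with a list of prompts returns one list per prompt, each holding dictionaries with a ``generated_text`` key (and that the generated text includes the prompt by default). The helper name ``extract_texts`` is hypothetical, and the mock data stands in for real model output so that the snippet runs without a GPU.

```python
def extract_texts(responses):
    """Return one generated string per prompt from nested pipeline output."""
    texts = []
    for per_prompt in responses:
        # A prompt may map to a list of candidates or to a single dict.
        candidates = per_prompt if isinstance(per_prompt, list) else [per_prompt]
        texts.append(candidates[0]["generated_text"])
    return texts


if __name__ == "__main__":
    # Mock data shaped like the assumed pipeline output; not real model output.
    mock_responses = [
        [{"generated_text": "Continue the following sequence: 1, 2, 3, 5, 8 ..."}],
        [{"generated_text": "What is the meaning of life? ..."}],
    ]
    for text in extract_texts(mock_responses):
        print(text)
        print("-" * 40)
```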


Other Frameworks
~~~~~~~~~~~~~~~~

While HuggingFace provides a convenient way to access and use LLMs, there are other frameworks available for running LLMs,
such as `DeepSpeed <https://www.deepspeed.ai/tutorials/inference-tutorial/>`__ and `LangChain <https://python.langchain.com/docs/how_to/local_llms/>`__.

If you need assistance running LLMs in these or other frameworks, please contact :doc:`the Aalto RSEs </rse/index>`.


More examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~

AaltoRSE has prepared a repository with miscellaneous examples of using LLMs on Triton. You can find it `here <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.

Starting a local API
--------------------------
With the pre-downloaded model weights, you are also able to create an API endpoint locally. For detailed examples, you can check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
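As a rough illustration of the idea (not the approach used in the linked repository), the sketch below serves a single ``/generate`` endpoint with Python's standard library. The function ``generate_reply`` is a hypothetical placeholder that would be replaced by a real model call; the endpoint name and the JSON fields are likewise assumptions made for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate_reply(prompt):
    """Hypothetical placeholder: replace with a real model call."""
    return f"(stub reply to: {prompt})"


class GenerateHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/generate":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"reply": generate_reply(payload.get("prompt", ""))}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # Silence per-request logging.


def start_server(port=0):
    """Start the server on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), GenerateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(server, prompt):
    """Send one prompt to the local endpoint and return the reply text."""
    request = urllib.request.Request(
        f"http://127.0.0.1:{server.server_port}/generate",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Bypass any configured HTTP proxy, since the server is local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=10) as response:
        return json.loads(response.read())["reply"]


if __name__ == "__main__":
    server = start_server()
    print(ask(server, "What is the meaning of life?"))
    server.shutdown()
    server.server_close()
```

Binding to ``127.0.0.1`` keeps the endpoint reachable only from the node where the job runs.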
