Update LLM section for winter kickstart #761 (merged)
8 commits by ruokolt (42f786a, 11b3fca, 7271cd6, d69e0ba, 3b5a037, 4b54543, 4040dbe, f516432): "simplify instructions, discard llama checkpoint section (obsolete)", followed by a series of "update" commits.
@@ -1,283 +1,102 @@
LLMs
====

Large-language models (LLMs) are AI models that can understand and generate
text, primarily using transformer architectures. They are extensively used for tools and
tasks such as chatbots, translation, summarization, sentiment analysis, and question answering.

This page is about running LLMs on Aalto Triton. As a prerequisite, it is recommended to
get familiar with the basics of using the cluster, including running jobs and using Python (:ref:`tutorials`).
.. note::

   If at any point something doesn't work, or you are unsure how to get started or how to proceed, do not hesitate to contact :doc:`the Aalto RSEs </rse/index>`.

   You can visit us at :ref:`the daily Zoom help session at 13.00-14.00 <garage>`.
HuggingFace Models
~~~~~~~~~~~~~~~~~~

The simplest way to use pre-trained open-source LLMs is to access them through HuggingFace and to leverage their `🤗 Transformers Python library <https://huggingface.co/docs/transformers/en/index>`__.

HuggingFace provides a wide range of tools and pre-trained models, making it easy to integrate and utilize these models in your projects.

You can explore their offerings at `🤗 HuggingFace <https://huggingface.co/>`__.

.. note::

   We are keeping an eye on the latest models and have pre-downloaded some of them for you. If you need any other models, please contact :doc:`the Aalto RSEs </rse/index>`.

   Run the command ``ls /scratch/shareddata/dldata/huggingface-hub-cache/hub`` to see the full list of available models.
Below is an example of how to use the 🤗 Transformers `pipeline() <https://huggingface.co/docs/transformers/v4.49.0/en/main_classes/pipelines#transformers.pipeline>`__ to load a pre-trained model and use it for question answering.
Example: Question Answering
---------------------------

In the following sbatch script, we request computational resources, load the necessary modules, and run a Python script that uses a HuggingFace model for question answering.

.. code-block:: bash

    #!/bin/bash
    #SBATCH --time=00:30:00
    #SBATCH --cpus-per-task=4
    #SBATCH --mem=40GB
    #SBATCH --gpus=1
    #SBATCH --output huggingface.%J.out
    #SBATCH --error huggingface.%J.err

    # Set HF_HOME to /scratch/shareddata/dldata/huggingface-hub-cache
    module load model-huggingface

    # Load the Python environment that provides HuggingFace Transformers
    module load scicomp-llm-env

    # Force transformers to load model(s) from the local hub instead of
    # downloading them from the remote hub.
    export TRANSFORMERS_OFFLINE=1
    export HF_HUB_OFFLINE=1

    python your_script.py

The ``your_script.py`` Python script uses the HuggingFace model ``mistralai/Mistral-7B-Instruct-v0.1`` for conversations and instruction following.
.. code-block:: python

    from transformers import pipeline
    import torch

    # Initialize the text-generation pipeline
    pipe = pipeline(
        "text-generation",                           # Task type
        model="mistralai/Mistral-7B-Instruct-v0.1",  # Model name
        device="cuda" if torch.cuda.is_available() else "cpu",  # Use GPU if available
        max_new_tokens=1000,
    )

    # Prepare prompts
    prompts = ["Continue the following sequence: 1, 2, 3, 5, 8",
               "What is the meaning of life?"]

    # Generate and print responses
    responses = pipe(prompts)
    print(responses)

You can look at the `model card <https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1>`__ for more information about the model.
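The sbatch script above exports the offline switches before launching Python. If you instead set them from inside Python, this must happen before ``transformers`` is imported anywhere in the process, since the library reads these variables at import time. A minimal stdlib-only sketch (the ``HF_HOME`` path below is normally set for you by ``module load model-huggingface``):

```python
import os

# Set these BEFORE any `import transformers` in the process.
os.environ["TRANSFORMERS_OFFLINE"] = "1"
os.environ["HF_HUB_OFFLINE"] = "1"
# Point HF_HOME at the shared cache if the module has not already done so.
os.environ.setdefault("HF_HOME", "/scratch/shareddata/dldata/huggingface-hub-cache")


def offline_mode_enabled() -> bool:
    """True when both offline switches are set."""
    return all(os.environ.get(v) == "1"
               for v in ("TRANSFORMERS_OFFLINE", "HF_HUB_OFFLINE"))


print(offline_mode_enabled())  # prints True
```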
Other Frameworks
~~~~~~~~~~~~~~~~

While HuggingFace provides a convenient way to access and use LLMs, there are other frameworks available for running LLMs, such as `DeepSpeed <https://www.deepspeed.ai/tutorials/inference-tutorial/>`__ and `LangChain <https://python.langchain.com/docs/how_to/local_llms/>`__.

If you need assistance running LLMs in these or other frameworks, please contact :doc:`the Aalto RSEs </rse/index>`.
More examples
~~~~~~~~~~~~~

AaltoRSE has prepared a repository with miscellaneous examples of using LLMs on Triton. You can find it `here <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
Starting a local API
--------------------

With the pre-downloaded model weights, you can also create an API endpoint locally. For detailed examples, check out `this repo <https://github.com/AaltoSciComp/llm-examples/tree/main/>`__.
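The linked repository covers the server side. For orientation, below is a stdlib-only sketch of the client side, building (but not sending) a chat-completion request for a hypothetical OpenAI-compatible local endpoint; the URL, port, and model name are placeholders, not values taken from the repository:

```python
import json
import urllib.request


def build_chat_request(url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completion request for an
    OpenAI-compatible local endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1/chat/completions",  # placeholder address
    "mistralai/Mistral-7B-Instruct-v0.1",
    "Continue the following sequence: 1, 2, 3, 5, 8",
)
# Sending would be: urllib.request.urlopen(req) once a server is running.
```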
Review comment: Agree with the deletion, too old models. But maybe add a line: "If you need assistance running LLMs in formats other than HuggingFace's, please contact the RSEs."