
Commit 6f63e7c

Author: Yu Tian
add example scripts for each model file format
1 parent b318466 commit 6f63e7c

File tree: 1 file changed (+65, −8 lines)


triton/apps/llms.rst

Lines changed: 65 additions & 8 deletions
@@ -15,7 +15,7 @@ instructions on how to run inference and training on the models.
 Pre-downloaded model weights
 ****************************
 
-We have downloaded following models weights:
+We have downloaded the following model weights (PyTorch model checkpoint directories):
 
 .. list-table::
    :header-rows: 1
@@ -38,7 +38,7 @@ We have downloaded the following model weights:
 
 * * Llama 2
   * 7b-chat
-  * ``module load model-llama2/7b``
+  * ``module load model-llama2/7b-chat``
   * Raw weights of the 7B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
 
 * * Llama 2
@@ -53,19 +53,49 @@ We have downloaded the following model weights:
 
 * * Llama 2
   * 70b
-  * ``module load model-llama2/13b``
+  * ``module load model-llama2/70b``
   * Raw weights of the 70B parameter version of `Llama 2 <https://ai.meta.com/llama/>`__.
 
 * * Llama 2
   * 70b-chat
-  * ``module load model-llama2/13b-chat``
+  * ``module load model-llama2/70b-chat``
   * Raw weights of the 70B parameter chat-optimized version of `Llama 2 <https://ai.meta.com/llama/>`__.
 
 Each module will set the following environment variables:
 
-- ``MODEL_ROOT`` - Folder where model weights are stored.
-
-
+- ``MODEL_ROOT`` - Folder where the model weights are stored, i.e., the PyTorch model checkpoint directory.
+- ``TOKENIZER_PATH`` - File path to the ``tokenizer.model``.
+
+Here is an example Slurm script that uses the raw weights for batch inference. For detailed environment setup, example prompts, and Python code, please check out `this repo <>`__.
+
+.. code-block:: slurm
+
+   #!/bin/bash
+   #SBATCH --time=00:25:00
+   #SBATCH --cpus-per-task=4
+   #SBATCH --mem=20GB
+   #SBATCH --gres=gpu:1
+   #SBATCH --output=llama2inference-gpu.%J.out
+   #SBATCH --error=llama2inference-gpu.%J.err
+
+   # Get the model weights
+   module load model-llama2/7b
+   echo $MODEL_ROOT
+   # Expected output: /scratch/shareddata/dldata/llama-2/llama-2-7b
+   echo $TOKENIZER_PATH
+   # Expected output: /scratch/shareddata/dldata/llama-2/tokenizer.model
+
+   # Activate the conda environment
+   module load miniconda
+   source activate llama2env
+
+   # Run batch inference
+   torchrun --nproc_per_node 1 batch_inference.py \
+       --prompts prompts.json \
+       --ckpt_dir $MODEL_ROOT \
+       --tokenizer_path $TOKENIZER_PATH \
+       --max_seq_len 512 --max_batch_size 16
+
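The repository link above is left unfilled in the commit, so for orientation here is a minimal sketch of what a ``batch_inference.py`` compatible with the flags above could look like. It assumes Meta's ``llama`` reference package (facebookresearch/llama) and a ``prompts.json`` containing a JSON list of prompt strings; the actual script in the linked repo may differ.

.. code-block:: python

   # Hypothetical sketch of batch_inference.py, assuming Meta's `llama`
   # reference package and a prompts.json holding a JSON list of strings,
   # e.g. ["What is a neural network?", "Explain beam search."]
   import json

   import fire
   from llama import Llama


   def main(prompts: str, ckpt_dir: str, tokenizer_path: str,
            max_seq_len: int = 512, max_batch_size: int = 16,
            temperature: float = 0.6, top_p: float = 0.9,
            max_gen_len: int = 64):
       # Load the prompts passed via --prompts
       with open(prompts) as f:
           prompt_list = json.load(f)

       # Build the generator from the raw checkpoint directory and tokenizer
       generator = Llama.build(
           ckpt_dir=ckpt_dir,
           tokenizer_path=tokenizer_path,
           max_seq_len=max_seq_len,
           max_batch_size=max_batch_size,
       )

       # Generate completions in batches no larger than max_batch_size
       for i in range(0, len(prompt_list), max_batch_size):
           batch = prompt_list[i:i + max_batch_size]
           results = generator.text_completion(
               batch, max_gen_len=max_gen_len,
               temperature=temperature, top_p=top_p,
           )
           for prompt, result in zip(batch, results):
               print(prompt, "->", result["generation"])


   if __name__ == "__main__":
       fire.Fire(main)

Note that for the reference checkpoints the ``--nproc_per_node`` value has to match the model-parallel size of the checkpoint (1 for 7B, 2 for 13B, 8 for 70B).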
 Model weight conversions
 ************************
 
@@ -78,7 +108,7 @@ We also have models that are already converted to different formats.
 Huggingface
 -----------
 
-All Huggingface models can be loaded with: ``module load model-huggingface/all``
+
 
 We have the following Huggingface models stored:
 
@@ -96,6 +126,24 @@ We have the following Huggingface models stored:
    * Module command to load
    * Description
 
+All Huggingface models can be loaded with ``module load model-huggingface/all``.
+Here is a Python script that uses a Hugging Face model:
+
+.. code-block:: python
+
+   # Force transformers to use the local hub instead of downloading from the remote hub
+   import os
+   os.environ['TRANSFORMERS_OFFLINE'] = '1'
+
+   from transformers import AutoModelForCausalLM, AutoTokenizer
+
+   # Load the tokenizer and model from the locally stored weights
+   tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+   model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+
+   # Tokenize a prompt, generate a short completion, and print only the new tokens
+   prompt = "How many stars are there in space?"
+   model_inputs = tokenizer([prompt], return_tensors="pt")
+   input_length = model_inputs.input_ids.shape[1]
+   generated_ids = model.generate(**model_inputs, max_new_tokens=20)
+   print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])
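The snippet above runs on the CPU by default. On a GPU node, a variant sketch that loads the model in half precision could look as follows (assuming a CUDA GPU is available; ``device_map="auto"`` additionally requires the ``accelerate`` package):

.. code-block:: python

   # Variant: load the model in half precision and place it on available GPUs
   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
   model = AutoModelForCausalLM.from_pretrained(
       "mistralai/Mistral-7B-v0.1",
       torch_dtype=torch.float16,
       device_map="auto",
   )

   # Move the tokenized inputs to the same device as the model
   prompt = "How many stars are there in space?"
   model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
   generated_ids = model.generate(**model_inputs, max_new_tokens=20)
   print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])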
 
 llama.cpp and GGUF
 ------------------
@@ -142,6 +190,15 @@ Each module will set the following environment variables:
 - ``MODEL_ROOT`` - Folder where model weights are stored.
 - ``MODEL_WEIGHTS`` - Path to the model weights in GGUF format.
 
+This Python snippet is part of a "Chat with Your PDF Documents" example that uses LangChain with model weights stored in a ``.gguf`` file. For detailed environment setup and Python code, please check out `this repo <>`__.
+
+.. code-block:: python
+
+   import os
+   from langchain.llms import LlamaCpp
+
+   # Point LlamaCpp at the GGUF weights exposed by the module
+   model_path = os.environ.get('MODEL_WEIGHTS')
+   llm = LlamaCpp(model_path=model_path, verbose=False)
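Once constructed, the ``LlamaCpp`` object behaves like any other LangChain LLM. A minimal usage sketch, assuming a recent LangChain release where LLMs expose ``invoke`` (the prompt text is illustrative, not from the commit):

.. code-block:: python

   # Run a single completion against the local GGUF model
   answer = llm.invoke("What is a large language model? Answer in one sentence.")
   print(answer)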
 
 Ollama models
 -------------
