This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Commit 45ec014

Merge branch 'pytorch:main' into patch-24

2 parents c7f61be + cbc72a4

File tree: 8 files changed, +385 −70 lines

.ci/scripts/run-docs

Lines changed: 17 additions & 0 deletions
```diff
@@ -125,3 +125,20 @@ if [ "$1" == "native" ]; then
   bash -x ./run-native.sh
   echo "::endgroup::"
 fi
+
+if [ "$1" == "distributed" ]; then
+
+  echo "::group::Create script to run distributed"
+  python3 torchchat/utils/scripts/updown.py --file docs/distributed.md > ./run-distributed.sh
+  # for good measure, if something happened to updown processor,
+  # and it did not error out, fail with an exit 1
+  echo "exit 1" >> ./run-distributed.sh
+  echo "::endgroup::"
+
+  echo "::group::Run distributed"
+  echo "*******************************************"
+  cat ./run-distributed.sh
+  echo "*******************************************"
+  bash -x ./run-distributed.sh
+  echo "::endgroup::"
+fi
```
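
For reference, a sketch of how this new branch could be exercised locally, assuming the script is invoked with the target name as its first positional argument (as the `"$1"` checks above indicate):

```bash
# Sketch: run the new "distributed" docs target the same way the existing
# targets (e.g. "native") are selected, via the first positional argument.
bash .ci/scripts/run-docs distributed
```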

.github/workflows/runner-cuda-dtype.yml

Lines changed: 1 addition & 1 deletion
```diff
@@ -52,7 +52,7 @@ jobs:
 
           python torchchat.py export --checkpoint-path ${MODEL_DIR}/stories15M.pt --output-aoti-package-path /tmp/model.pt2
 
-          ./cmake-out/aoti_run /tmp/model.pt2 -d CUDA -z ${MODEL_DIR}/tokenizer.model -i "${PROMPT}"
+          ./cmake-out/aoti_run /tmp/model.pt2 -z ${MODEL_DIR}/tokenizer.model -i "${PROMPT}"
 
         done
```

README.md

Lines changed: 8 additions & 1 deletion
````diff
@@ -69,6 +69,13 @@ aliases.
 |[tinyllamas/stories42M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories42M`.|
 |[tinyllamas/stories110M](https://huggingface.co/karpathy/tinyllamas/tree/main)||Toy model for `generate`. Alias to `stories110M`.|
 |[openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b)||Best for `generate`. Alias to `open-llama`.|
+| [ibm-granite/granite-3b-code-instruct-128k](https://huggingface.co/ibm-granite/granite-3b-code-instruct-128k) || Alias to `granite-code` and `granite-code-3b`.|
+| [ibm-granite/granite-8b-code-instruct-128k](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) || Alias to `granite-code-8b`.|
+| [ibm-granite/granite-3.0-2b-instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) || Alias to `granite3-2b` and `granite3`.|
+| [ibm-granite/granite-3.0-8b-instruct](https://huggingface.co/ibm-granite/granite-3.0-8b-instruct) || Alias to `granite3-8b`.|
+| [ibm-granite/granite-3.1-2b-instruct](https://huggingface.co/ibm-granite/granite-3.1-2b-instruct) || Alias to `granite3.1-2b` and `granite3.1`.|
+| [ibm-granite/granite-3.1-8b-instruct](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct) || Alias to `granite3.1-8b`.|
+
 
 ## Installation
 The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.
@@ -334,7 +341,7 @@ torchchat/utils/scripts/build_native.sh aoti
 
 Then run the compiled executable, with the pt2.
 ```bash
-cmake-out/aoti_run exportedModels/llama3_1_artifacts.pt2 -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
+cmake-out/aoti_run exportedModels/llama3_1_artifacts.pt2 -z `python3 torchchat.py where llama3.1`/tokenizer.model -i "Once upon a time"
 ```
 
 ## Mobile Execution
````
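
As with the other aliases in this table, the new Granite entries are presumably usable directly as model names in torchchat commands; a sketch, mirroring the `generate` invocations used elsewhere in the README (the choice of alias and prompt here is illustrative):

```bash
# Sketch: use one of the newly added Granite aliases with generate,
# the same way other aliases from the table are used in the README.
python3 torchchat.py generate granite3.1 --prompt "Write a function that merges two sorted lists"
```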

docs/distributed.md

Lines changed: 125 additions & 0 deletions
# Distributed Inference with torchchat

torchchat seamlessly supports distributed inference for large language models (LLMs) on GPUs.
At present, torchchat supports distributed inference using Python only.

## Installation
The following steps require that you have [Python 3.10](https://www.python.org/downloads/release/python-3100/) installed.

> [!TIP]
> torchchat uses the latest changes from various PyTorch projects, so it's highly recommended that you use a venv (using the commands below) or conda.

[skip default]: begin
```bash
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
```
[skip default]: end

[shell default]: ./install/install_requirements.sh

## Login to HF for Downloading Weights
Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account and a Hugging Face user access token with the write role, as documented by Hugging Face.

Log into Hugging Face:

[prefix default]: HF_TOKEN="${SECRET_HF_TOKEN_PERIODIC}"

```
huggingface-cli login
```
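
If you prefer to fetch weights ahead of time, the `download` subcommand described in the root README can be used; a minimal sketch, assuming the `llama3.1` alias used in the examples that follow:

```bash
# Optional: pre-download the weights used in the examples below
# (see the root README for the full list of model aliases).
python3 torchchat.py download llama3.1
```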
## Enabling Distributed torchchat Inference

To enable distributed inference, use the option `--distributed`. In addition, `--tp <num>` and `--pp <num>` let you specify the degree of each kind of parallelism, where `tp` refers to tensor parallelism and `pp` refers to pipeline parallelism.
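
The examples below use 4 GPUs by combining `--tp 2` and `--pp 2`. As a sketch, assuming the two degrees multiply to the total number of GPUs in use (an inference from those examples, not something this guide states explicitly), an 8-GPU run could look like:

```bash
# Hypothetical 8-GPU layout: 4-way tensor parallelism inside each of
# 2 pipeline stages (4 x 2 = 8 ranks). Adjust to match your hardware.
python3 torchchat.py generate llama3.1 --distributed --tp 4 --pp 2 \
  --prompt "write me a story about a boy and his bear"
```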
## Generate Output with Distributed torchchat Inference

To generate output using distributed inference with 4 GPUs, you can use:
```
python3 torchchat.py generate llama3.1 --distributed --tp 2 --pp 2 --prompt "write me a story about a boy and his bear"
```

## Chat with Distributed torchchat Inference

This mode allows you to chat with an LLM interactively, using distributed inference. The following example uses 4 GPUs:

[skip default]: begin
```bash
python3 torchchat.py chat llama3.1 --max-new-tokens 10 --distributed --tp 2 --pp 2
```
[skip default]: end

## A Server with Distributed torchchat Inference

This mode exposes a REST API for interacting with a model.
The server follows the [OpenAI API specification](https://platform.openai.com/docs/api-reference/chat) for chat completions.

To test out the REST API, **you'll need 2 terminals**: one to host the server, and one to send the request.

In one terminal, start the server to run with 4 GPUs:

[skip default]: begin

```bash
python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2
```
[skip default]: end

<!--
[shell default]: python3 torchchat.py server llama3.1 --distributed --tp 2 --pp 2 & server_pid=$! ; sleep 180 # wait for server to be ready to accept requests
-->

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See api/api.py for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.

**Example Input + Output**

```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```
[skip default]: begin
```
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
```

[skip default]: end

<!--
[shell default]: kill ${server_pid}
-->

</details>

[end default]: end

docs/local-model.md

Lines changed: 138 additions & 0 deletions
# Using Local Models in torchchat

Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on using local copies of model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.

## Prerequisites
To work with local models, you need:
1. **Model weights**: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
2. **Tokenizer**: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
3. **Parameter file**: (a) a custom parameter file in JSON format, or (b) a pre-existing parameter file selected with `--params-path` or `--params-table`, or (c) a pathname that is matched against known models by the longest substring of the configuration name, using the same algorithm as GPT-fast.

Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection.

Let's use a local download of the stories15M tinyllama model as an example:

```
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
```

## Using Local Checkpoints

Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. The tokenizer is loaded from the same directory as the checkpoint, under the name `tokenizer.model`, unless specified separately. The first example below obtains the model parameters by name matching against known models, because `stories15M` is one of the models known to torchchat, with its configuration stored in `torchchat/model_params`.

### Example 1: Basic Text Generation

```
python3 torchchat.py generate \
  --checkpoint-path stories15M/stories15M.pt \
  --prompt "Hello, my name is"
```

### Example 2: Providing Additional Artifacts
The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:
```
python3 torchchat.py generate \
  --prompt "Once upon a time" \
  --checkpoint-path stories15M/stories15M.pt \
  --params-path stories15M/model.json \
  --tokenizer-path stories15M/tokenizer.model
```

Alternatively, for known models we can select a particular architecture configuration from `torchchat/model_params` with `--params-table`:

```
python3 torchchat.py generate \
  --prompt "Once upon a time" \
  --checkpoint-path stories15M/stories15M.pt \
  --params-table stories15M \
  --tokenizer-path stories15M/tokenizer.model
```

## Using GGUF Models
Torchchat supports loading models in GGUF format using the `--gguf-file` flag. Refer to GGUF.md for additional documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels such as F16, F32, Q4_0, and Q6_K. Model configuration information is obtained directly from the GGUF file, simplifying setup and obviating the need for a separate `model.json` model architecture specification.
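
As a minimal sketch of the flag described above (the paths below are placeholders, assuming a GGUF checkpoint and its matching tokenizer have already been downloaded locally; see GGUF.md for the supported workflow):

```bash
# Sketch: generate from a local GGUF file. The file names are hypothetical;
# substitute the paths to your own GGUF model and tokenizer.
python3 torchchat.py generate \
  --gguf-file models/model.Q4_0.gguf \
  --tokenizer-path models/tokenizer.model \
  --prompt "Once upon a time"
```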
## Using Local Models
Torchchat supports all commands, such as chat, browser, server, and export, with local models. (In fact, known models simply download and populate the parameters specified for local models.)
Here is an example setup for running a server with a local model:

[skip default]: begin
```
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
```
[skip default]: end

[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See api/api.py for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

<details>

<summary>Example Query</summary>
Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will await the full response from the server.

**Example: Using the Server**
A model server used with a local model works like any other torchchat server. You can test it by sending a request with `curl`:
```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

[shell default]: kill ${server_pid}

</details>

For more information about using different commands, see the root README.md and refer to the Advanced Users Guide for further details on advanced configurations and parameter tuning.

[end default]: end
