
Commit b4547fd

Create local-model.md (#1448)
Initial documentation on how to use local checkpoints with torchchat. xref: #1446
1 parent 6c3f2b8 commit b4547fd

File tree: 1 file changed, docs/local-model.md (+138, -0)

# Using Local Models in torchchat

Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on utilizing local copies of
model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.

## Prerequisites

To work with local models, you need:

1. **Model Weights**: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
2. **Tokenizer**: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
3. **Parameter File**: (a) A custom parameter file in JSON format, (b) a pre-existing parameter file selected with `--params-path` or `--params-table`, or (c) a pathname that is matched against known models by the longest substring of the configuration name, using the same algorithm as GPT-fast.

Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection.

Let's use a local download of the stories15M tinyllama model as an example:

```
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
```
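
After these commands, the checkpoint, tokenizer, and parameter file all sit in the same `stories15M` directory, which is the layout needed for automatic detection. A quick sanity check (the listing below is illustrative and shows only the files created above):

```
ls stories15M
# model.json  stories15M.pt  tokenizer.model
```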

## Using Local Checkpoints

Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. The tokenizer is
loaded from the same directory as the checkpoint under the name `tokenizer.model`, unless specified separately.
This example obtains the model parameters by matching the name against known models, because `stories15M` is one of the
models known to torchchat, with a configuration stored in `torchchat/model_params`:

### Example 1: Basic Text Generation

```
python3 torchchat.py generate \
  --checkpoint-path stories15M/stories15M.pt \
  --prompt "Hello, my name is"
```

### Example 2: Providing Additional Artifacts

The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:

```
python3 torchchat.py generate \
  --prompt "Once upon a time" \
  --checkpoint-path stories15M/stories15M.pt \
  --params-path stories15M/model.json \
  --tokenizer-path stories15M/tokenizer.model
```

Alternatively, we can specify the architecture configuration for known models using `--params-table`
to select a particular architecture from `torchchat/model_params`:

```
python3 torchchat.py generate \
  --prompt "Once upon a time" \
  --checkpoint-path stories15M/stories15M.pt \
  --params-table stories15M \
  --tokenizer-path stories15M/tokenizer.model
```

Because `model.json` was copied from `torchchat/model_params/stories15M.json`, both invocations select the same architecture configuration.

## Using GGUF Models

Torchchat supports loading models in GGUF format using the `--gguf-file` flag. Refer to GGUF.md for additional
documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels such as F16, F32, Q4_0, and Q6_K. Model
configuration information is obtained directly from the GGUF file, simplifying setup and obviating the
need for a separate `model.json` model architecture specification.
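
As a rough sketch, the invocation mirrors the checkpoint examples above, with the GGUF file taking the place of the checkpoint. The file paths below are placeholders and the flag name simply follows the prose above; consult GGUF.md for the authoritative set of options:

```
python3 torchchat.py generate \
  --gguf-file path/to/model-Q4_0.gguf \
  --tokenizer-path path/to/tokenizer.model \
  --prompt "Once upon a time"
```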

## Using local models

Torchchat supports all commands, such as chat, browser, server, and export, with local models. (In fact,
known models simply download and populate the same parameters used for local models.)
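
For instance, the chat command can point at the same local checkpoint (a minimal sketch, assuming `chat` accepts the same `--checkpoint-path` flag shown in the generate examples above):

```
python3 torchchat.py chat --checkpoint-path stories15M/stories15M.pt
```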

Here is an example setup for running a server with a local model:

[skip default]: begin
```
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
```
[skip default]: end

[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `#api/api.pyi` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking GitHub issue](https://github.com/pytorch/torchchat/issues/973).

<details>

<summary>Example Query</summary>

Setting `stream` to "true" in the request emits a response in chunks. If `stream` is unset or not "true", then the client will
await the full response from the server.

**Example: using the server**

A model server used with a local model works like any other torchchat server. You can test it by sending a request with `curl`:

```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

[shell default]: kill ${server_pid}

</details>

For more information about using different commands, see the root README.md and refer to the Advanced Users Guide for further details on advanced configurations and parameter tuning.

[end default]: end
