# Using Local Models in Torchchat
Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on using local copies of
model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.

## Prerequisites
To work with local models, you need:
1. **Model Weights**: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
2. **Tokenizer**: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
3. **Parameter File**: (a) A custom parameter file in JSON format, or (b) a pre-existing parameter file specified with `--params-path`
   or `--params-table`, or (c) a pathname that is matched against known models by longest substring in configuration name, using the same algorithm as GPT-fast.

Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection.
Let’s use a local download of the stories15M tinyllama model as an example:

```
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
```
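
After these steps, the directory should contain all three artifacts side by side, which is the layout torchchat expects for automatic detection:

```
# list the downloaded artifacts; expected output shown as a comment
ls stories15M
# model.json  stories15M.pt  tokenizer.model
```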

## Using Local Checkpoints
Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. The tokenizer is
loaded from the same directory as the checkpoint, with the name `tokenizer.model`, unless specified separately.
This example obtains the model parameters by name matching against known models, because `stories15M` is one of the
models known to torchchat, with a configuration stored in `torchchat/model_params`:

### Example 1: Basic Text Generation

```
python3 torchchat.py generate \
 --checkpoint-path stories15M/stories15M.pt \
 --prompt "Hello, my name is"
```

### Example 2: Providing Additional Artifacts
The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:

```
python3 torchchat.py generate \
 --prompt "Once upon a time" \
 --checkpoint-path stories15M/stories15M.pt \
 --params-path stories15M/model.json \
 --tokenizer-path stories15M/tokenizer.model
```

Alternatively, for known models we can select the architecture configuration with `--params-table`,
which names a particular configuration in `torchchat/model_params`:

```
python3 torchchat.py generate \
 --prompt "Once upon a time" \
 --checkpoint-path stories15M/stories15M.pt \
 --params-table stories15M \
 --tokenizer-path stories15M/tokenizer.model
```

## Using GGUF Models
Torchchat supports loading models in GGUF format with the `--gguf-file` flag. Refer to GGUF.md for additional
documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels, such as F16, F32, Q4_0, and Q6_K. Model
configuration information is obtained directly from the GGUF file, simplifying setup and obviating the
need for a separate `model.json` model architecture specification.
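
Below is a minimal sketch of running generation against a GGUF file, using the `--gguf-file` flag described above. The paths are placeholders; substitute your own local GGUF model and its tokenizer:

```
# hypothetical paths for illustration only
python3 torchchat.py generate \
 --gguf-file path/to/model.gguf \
 --tokenizer-path path/to/tokenizer.model \
 --prompt "Once upon a time"
```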

## Using local models
Torchchat supports all commands, such as chat, browser, server and export, with local models. (In fact, for
known models, torchchat simply downloads the weights and populates the same parameters used for local models.)
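
For example, assuming the stories15M checkpoint downloaded earlier, you can start an interactive chat with the local model:

```
python3 torchchat.py chat --checkpoint-path stories15M/stories15M.pt
```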

Here is an example setup for running a server with a local model:

[skip default]: begin
```
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
```
[skip default]: end

[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `api/api.pyi` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request emits the response in chunks. If `stream` is unset or not "true", the client
awaits the full response from the server.

**Example: using the server**
A model server used with a local model works like any other torchchat server. You can test it by sending a request with `curl`:
```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

[shell default]: kill ${server_pid}

</details>

For more information about using different commands, see the root README.md and refer to the Advanced Users Guide for further details on advanced configurations and parameter tuning.

[end default]: end