# Using Local Models in Torchchat
Torchchat provides powerful capabilities for running large language models (LLMs) locally. This guide focuses on using local copies of
model checkpoints or models in GGUF format to create a chat application. It also highlights relevant options for advanced users.

## Prerequisites
To work with local models, you need:
1. **Model Weights**: A checkpoint file (e.g., `.pth`, `.pt`) or a GGUF file (e.g., `.gguf`).
2. **Tokenizer**: A tokenizer model file. This can be in either SentencePiece or TikToken format, depending on the tokenizer used with the model.
3. **Parameter File**: (a) A custom parameter file in JSON format, or (b) a pre-existing parameter file specified with `--params-path`
   or `--params-table`, or (c) a pathname that is matched against known models by longest substring in configuration name, using the same algorithm as GPT-fast.

Ensure the tokenizer and parameter files are in the same directory as the checkpoint or GGUF file for automatic detection.
Let’s use a local download of the stories15M tinyllama model as an example:

```
mkdir stories15M
cd stories15M
wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.pt
wget https://github.com/karpathy/llama2.c/raw/refs/heads/master/tokenizer.model
cp ../torchchat/model_params/stories15M.json model.json
cd ..
```
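
After these steps, the directory should contain all three artifacts side by side, which is the layout torchchat expects for automatic detection:

```
# list the downloaded artifacts; expected output shown as a comment
ls stories15M
# model.json  stories15M.pt  tokenizer.model
```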

## Using Local Checkpoints
Torchchat provides the CLI flag `--checkpoint-path` for specifying local model weights. The tokenizer is
loaded from the same directory as the checkpoint, with the name `tokenizer.model`, unless specified separately.
This example obtains the model parameters by name matching against known models, because `stories15M` is one of the
models known to torchchat, with a configuration stored in `torchchat/model_params`:

### Example 1: Basic Text Generation

```
python3 torchchat.py generate \
 --checkpoint-path stories15M/stories15M.pt \
 --prompt "Hello, my name is"
```

### Example 2: Providing Additional Artifacts
The following is an example of how to specify a local model checkpoint, the model architecture, and a tokenizer file:

```
python3 torchchat.py generate \
 --prompt "Once upon a time" \
 --checkpoint-path stories15M/stories15M.pt \
 --params-path stories15M/model.json \
 --tokenizer-path stories15M/tokenizer.model
```

Alternatively, for known models we can select the architecture configuration with `--params-table`,
which names a particular configuration in `torchchat/model_params`:

```
python3 torchchat.py generate \
 --prompt "Once upon a time" \
 --checkpoint-path stories15M/stories15M.pt \
 --params-table stories15M \
 --tokenizer-path stories15M/tokenizer.model
```

## Using GGUF Models
Torchchat supports loading models in GGUF format with the `--gguf-file` flag. Refer to GGUF.md for additional
documentation about using GGUF files in torchchat.

The GGUF format is compatible with several quantization levels, such as F16, F32, Q4_0, and Q6_K. Model
configuration information is obtained directly from the GGUF file, simplifying setup and obviating the
need for a separate `model.json` model architecture specification.
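
Below is a minimal sketch of running generation against a GGUF file, using the `--gguf-file` flag described above. The paths are placeholders; substitute your own local GGUF model and its tokenizer:

```
# hypothetical paths for illustration only
python3 torchchat.py generate \
 --gguf-file path/to/model.gguf \
 --tokenizer-path path/to/tokenizer.model \
 --prompt "Once upon a time"
```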

## Using local models
Torchchat supports all commands, such as chat, browser, server and export, with local models. (In fact, for
known models, torchchat simply downloads the weights and populates the same parameters used for local models.)
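
For example, assuming the stories15M checkpoint downloaded earlier, you can start an interactive chat with the local model:

```
python3 torchchat.py chat --checkpoint-path stories15M/stories15M.pt
```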

Here is an example setup for running a server with a local model:

[skip default]: begin
```
python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt
```
[skip default]: end

[shell default]: python3 torchchat.py server --checkpoint-path stories15M/stories15M.pt & server_pid=$! ; sleep 90 # wait for server to be ready to accept requests

In another terminal, query the server using `curl`. Depending on the model configuration, this query might take a few minutes to respond.

> [!NOTE]
> Since this feature is under active development, not every parameter is consumed. See `api/api.pyi` for details on
> which request parameters are implemented. If you encounter any issues, please comment on the [tracking Github issue](https://github.com/pytorch/torchchat/issues/973).

<details>
<summary>Example Query</summary>

Setting `stream` to "true" in the request emits the response in chunks. If `stream` is unset or not "true", the client
awaits the full response from the server.

**Example: using the server**
A model server used with a local model works like any other torchchat server. You can test it by sending a request with `curl`:
```
curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
```

[shell default]: kill ${server_pid}

</details>

For more information about using different commands, see the root README.md and refer to the Advanced Users Guide for further details on advanced configurations and parameter tuning.

[end default]: end