Corrected several typographical errors and improved grammar throughout the README for better clarity and professionalism. Changes include fixing word forms, possessives, and minor phrasing issues.
Co-authored-by: Romain Huet <[email protected]>
vLLM recommends using [`uv`](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible web server. The following command will automatically download the model and start the server.
```bash
uv pip install --pre vllm==0.10.1+gptoss \
```
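Once the server is running, any OpenAI-compatible client can talk to it. Here is a minimal sketch with the `openai` Python client; the port (vLLM's default of 8000) and the model name are assumptions, so adjust both to your deployment.

```python
# Query the local vLLM server through its OpenAI-compatible API.
# The port (8000) and the model name are assumptions; adjust to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```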
This repository provides a collection of reference implementations.
### Requirements
- Python 3.12
- On macOS: Install the Xcode CLI tools with `xcode-select --install`
- On Linux: These reference implementations require CUDA
- On Windows: These reference implementations have not been tested on Windows. Try using solutions like Ollama if you are trying to run the model locally.
We include an inefficient reference PyTorch implementation in [gpt_oss/torch/model.py](gpt_oss/torch/model.py). This code uses basic PyTorch operators to show the exact model architecture, with a small addition of supporting tensor parallelism in MoE so that the larger model can run with this code (e.g., on 4xH100 or 2xH200). In this implementation, we upcast all weights to BF16 and run the model in BF16.
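As a rough illustration of the BF16 upcast described above, a minimal sketch follows; the checkpoint path is a placeholder, and this is not the repository's actual loading code.

```python
# Upcast all floating-point weights to BF16 before running the reference model.
# "checkpoint.pt" is a placeholder path, not an artifact shipped with the repo.
import torch

state_dict = torch.load("checkpoint.pt", map_location="cpu")
state_dict = {
    name: t.to(torch.bfloat16) if t.is_floating_point() else t
    for name, t in state_dict.items()
}
```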
To run the reference implementation, install the dependencies:
```shell
pip install -e ".[torch]"
```
We also include two system tools for the model: browsing and python container.
### Terminal Chat
The terminal chat application is a basic example of how to use the harmony format together with the PyTorch, Triton, and vLLM implementations. It also exposes both the python and browser tools as optional tools that can be used.
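For a flavor of the harmony side of this, here is a minimal rendering sketch using the `openai_harmony` package; the API names follow the harmony documentation and should be treated as assumptions rather than a description of what the chat app does internally.

```python
# Render a small conversation into completion tokens with the harmony format.
# API names are based on the openai_harmony docs; treat them as assumptions.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
conversation = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "What is the weather in SF?")]
)
tokens = encoding.render_conversation_for_completion(conversation, Role.ASSISTANT)
```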
You can start this server with the following inference backends:
- `triton` — uses the triton implementation
- `metal` — uses the metal implementation on Apple Silicon only
- `ollama` — uses the Ollama /api/generate API as an inference solution (see the sketch after this list)
- `vllm` — uses your installed vllm version to perform inference
- `transformers` — uses your installed transformers version to perform local inference
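For reference, the `ollama` backend above builds on Ollama's `/api/generate` endpoint. A direct call to that endpoint looks roughly like this; a local Ollama install on its default port is assumed, and the model tag is a placeholder.

```python
# Call Ollama's /api/generate endpoint directly (the API the `ollama` backend uses).
# Assumes Ollama is running locally on its default port; the model tag is a placeholder.
import json
import urllib.request

payload = {"model": "gpt-oss:20b", "prompt": "Say hello.", "stream": False}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])
```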
We released the models with native quantization support. Specifically, we use [MXFP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) for the linear projection weights in the MoE layer. We store the MoE tensor in two parts:
- `tensor.blocks` stores the actual fp4 values. We pack every two values in one `uint8` value.
- `tensor.scales` stores the block scale. The block scaling is done along the last dimension for all MXFP4 tensors.
All other tensors will be in BF16. We also recommend using BF16 as the activation precision for the model.
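To make the layout concrete, here is a rough dequantization sketch. The nibble order, the block size of 32, and the power-of-two (bias 127) scale encoding are assumptions taken from the MX spec, not verified against this repository's loaders.

```python
# Illustrative MXFP4 dequantization: `blocks` packs two fp4 (E2M1) values per
# uint8 along the last dimension, `scales` holds one shared exponent per block.
# Nibble order, block size 32, and the bias-127 scale encoding are assumptions.
import torch

FP4_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=torch.bfloat16,
)

def dequantize_mxfp4(blocks: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Split each uint8 into two 4-bit codes and look up their E2M1 values.
    lo = FP4_VALUES[(blocks & 0x0F).long()]
    hi = FP4_VALUES[(blocks >> 4).long()]
    values = torch.stack([lo, hi], dim=-1).flatten(start_dim=-2)
    # Apply the per-block power-of-two scale along the last dimension.
    scale = torch.exp2(scales.to(torch.float32) - 127.0).to(torch.bfloat16)
    blocks_of_32 = values.view(*scales.shape, 32)
    return (blocks_of_32 * scale.unsqueeze(-1)).flatten(start_dim=-2)
```

Details such as whether the low or high nibble comes first in each byte are exactly the kind of thing to confirm against the actual weight loader.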