
docs(readme): grammar, Harmony capitalization, MoE & LM Studio wording #98

Open · wants to merge 3 commits into `main`
README.md (20 changes: 10 additions & 10 deletions)
@@ -16,7 +16,7 @@ Welcome to the gpt-oss series, [OpenAI's open-weight models](https://openai.com/
We're releasing two flavors of these open models:

- `gpt-oss-120b` — for production, general purpose, high reasoning use cases that fit into a single 80GB GPU (like NVIDIA H100 or AMD MI300X) (117B parameters with 5.1B active parameters)
- `gpt-oss-20b` — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)
- `gpt-oss-20b` — for lower latency and local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained using our [harmony response format][harmony] and should only be used with this format; otherwise, they will not work correctly.
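
If you build prompts by hand rather than through a chat template, the `openai-harmony` package can render the conversation for you. A minimal sketch, assuming the package's published Python API (the names below reflect its documented usage and may differ in detail):

```python
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

# Load the gpt-oss Harmony encoding and render a conversation into prompt tokens.
enc = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
convo = Conversation.from_messages(
    [Message.from_role_and_content(Role.USER, "What is MXFP4?")]
)
prompt_tokens = enc.render_conversation_for_completion(convo, Role.ASSISTANT)
```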

@@ -33,7 +33,7 @@ Both models were trained using our [harmony response format][harmony] and should

#### Transformers

You can use `gpt-oss-120b` and `gpt-oss-20b` with the Transformers library. If you use Transformers' chat template, it will automatically apply the [harmony response format][harmony]. If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [`openai-harmony`][harmony] package.
You can use `gpt-oss-120b` and `gpt-oss-20b` with the Transformers library. If you use the Transformers chat template, it will automatically apply the [Harmony response format][harmony]. If you use `model.generate` directly, you need to apply the Harmony format manually using the chat template or use our [`openai-harmony`][harmony] package.

```python
from transformers import pipeline
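
# The diff truncates this snippet; the remainder below is a minimal sketch assuming
# the standard Transformers text-generation pipeline (the model ID and prompt are
# illustrative, not necessarily the README's exact example).
model_id = "openai/gpt-oss-20b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",   # pick the checkpoint's native precision automatically
    device_map="auto",
)

# The chat template applies the Harmony response format for us.
messages = [
    {"role": "user", "content": "Explain MXFP4 quantization in one paragraph."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1])
```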
@@ -100,7 +100,7 @@ ollama run gpt-oss:120b

#### LM Studio

If you are using [LM Studio](https://lmstudio.ai/) you can use the following commands to download.
If you are using [LM Studio](https://lmstudio.ai/), you can use the following commands to download the models.

```bash
# gpt-oss-20b
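# The diff cuts this snippet off here; assuming LM Studio's `lms` command-line
# tool, the download commands would look roughly like:
lms get openai/gpt-oss-20b

# gpt-oss-120b
lms get openai/gpt-oss-120b
```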
@@ -120,8 +120,8 @@ This repository provides a collection of reference implementations:
- [`triton`](#reference-triton-implementation-single-gpu) — a more optimized implementation using [PyTorch](https://pytorch.org/) & [Triton](https://github.com/triton-lang/triton) incl. using CUDA graphs and basic caching
- [`metal`](#reference-metal-implementation) — a Metal-specific implementation for running the models on Apple Silicon hardware
- **Tools:**
- [`browser`](#browser) — a reference implementation of the browser tool the models got trained on
- [`python`](#python) — a stateless reference implementation of the python tool the model got trained on
- [`browser`](#browser) — a reference implementation of the browser tool the models were trained on
- [`python`](#python) — a stateless reference implementation of the Python tool the models were trained on
- **Client examples:**
- [`chat`](#terminal-chat) — a basic terminal chat application that uses the PyTorch or Triton implementations for inference along with the python and browser tools
- [`responses_api`](#responses-api) — an example Responses API compatible server that implements the browser tool along with other Responses-compatible functionality
@@ -221,7 +221,7 @@ The implementation will get automatically compiled when running the `.[metal]` i
```shell
pip install -e ".[metal]"
```

To perform inference you'll need to first convert the SafeTensor weights from Hugging Face into the right format using:
To perform inference you'll need to first convert the SafeTensors weights from Hugging Face into the right format using:

```shell
python gpt_oss/metal/scripts/create-local-model.py -s <model_dir> -d <output_file>
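# Example invocation (paths are hypothetical placeholders):
#   python gpt_oss/metal/scripts/create-local-model.py -s gpt-oss-20b/original/ -d gpt-oss-20b-metal.bin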
@@ -279,7 +279,7 @@ options:
```

> [!NOTE]
> The torch and triton implementations require original checkpoint under `gpt-oss-120b/original/` and `gpt-oss-20b/original/` respectively. While vLLM uses the Hugging Face converted checkpoint under `gpt-oss-120b/` and `gpt-oss-20b/` root directory respectively.
> The torch and Triton implementations require the original checkpoints under `gpt-oss-120b/original/` and `gpt-oss-20b/original/` respectively, while vLLM uses the Hugging Face-converted checkpoints under the `gpt-oss-120b/` and `gpt-oss-20b/` root directories.
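
For reference, the two checkpoint layouts could be fetched along these lines with the Hugging Face CLI; the repo IDs and flags below are assumptions based on the standard `huggingface-cli download` interface, not commands taken from this README.

```bash
# Original checkpoint, used by the torch and Triton implementations
huggingface-cli download openai/gpt-oss-120b --include "original/*" --local-dir gpt-oss-120b/

# Hugging Face-converted checkpoint, used by vLLM (repo root files only)
huggingface-cli download openai/gpt-oss-120b --exclude "original/*" --local-dir gpt-oss-120b/
```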

### Responses API

@@ -466,12 +466,12 @@ if last_message.recipient == "python":

### Precision format

We released the models with native quantization support. Specifically, we use [MXFP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) for the linear projection weights in the MoE layer. We store the MoE tensor in two parts:
We released the models with native quantization support. Specifically, we use [MXFP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) for the linear projection weights in the MoE layer. We store each MoE tensor in two parts:

- `tensor.blocks` stores the actual fp4 values. We pack every two values in one `uint8` value.
- `tensor.blocks` stores the actual FP4 values. We pack every two values into one `uint8` value.
- `tensor.scales` stores the block scale. The block scaling is done among the last dimension for all MXFP4 tensors.

All other tensors will be in BF16. We also recommend using BF16 as the activation precision for the model.
All other tensors are in BF16. We also recommend using BF16 as the activation precision for the model.
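
As an illustration of the layout described above, here is a minimal NumPy sketch of how a packed MXFP4 tensor could be expanded back to floating point. The nibble order, 32-element block size, and E8M0 scale bias are assumptions about the MX format, not details taken from this repository's loaders.

```python
import numpy as np

# FP4 (E2M1) code table: 1 sign, 2 exponent, 1 mantissa bit -> 16 representable values.
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequantize_mxfp4(blocks: np.ndarray, scales: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Expand packed FP4 nibbles and apply per-block power-of-two scales.

    blocks: uint8 array, last dim = n_values // 2 (two FP4 codes per byte)
    scales: uint8 array, last dim = n_values // block_size (E8M0 exponents)
    """
    lo = blocks & 0x0F                     # first FP4 code in each byte (assumed order)
    hi = blocks >> 4                       # second FP4 code in each byte
    codes = np.stack([lo, hi], axis=-1).reshape(*blocks.shape[:-1], -1)
    values = FP4_VALUES[codes]             # decode 4-bit codes to real numbers
    # One shared power-of-two scale per block along the last dimension.
    scale = np.exp2(scales.astype(np.float32) - 127.0)
    values = values.reshape(*scales.shape, block_size) * scale[..., None]
    return values.reshape(*blocks.shape[:-1], -1)
```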

### Recommended Sampling Parameters
