
Extremely high perplexity on openai/gpt-oss-20b with WikiText-2 (raw) #40990

@kuantuna

Description

System Info

  • transformers version: 4.56.1
  • Platform: Linux-6.5.0-1025-gcp-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.35.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • DeepSpeed version: 0.17.3+cu126.pt27.v0.17.3.recogni2
  • PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Script:

#!/usr/bin/env python

import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Config
MODEL_NAME = "openai/gpt-oss-20b"
SPLIT = "test"  # WikiText-2 (raw) test split
CONTEXT_LENGTH = 2048  # evaluation window size
DTYPE = torch.bfloat16
DEVICE_MAP = "auto"


def main():
    # Load tokenizer & model
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=DTYPE, device_map=DEVICE_MAP).eval()

    # Load dataset and build one long token stream (no special tokens)
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split=SPLIT)
    encs = tok([row["text"] for row in ds], add_special_tokens=False)
    flat_ids = [tid for seq in encs["input_ids"] for tid in seq]
    ids = torch.tensor(flat_ids, dtype=torch.long)

    # Keep first 10% of tokens
    n_keep = max(1, int(0.10 * ids.numel()))
    ids = ids[:n_keep]

    # Keep only full CONTEXT_LENGTH windows
    n_windows = ids.numel() // CONTEXT_LENGTH
    if n_windows == 0:
        raise ValueError(f"Not enough tokens ({ids.numel()}) for a single {CONTEXT_LENGTH}-token window.")
    ids = ids[: n_windows * CONTEXT_LENGTH].view(n_windows, CONTEXT_LENGTH)

    # Forward passes
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(n_windows):
            x = ids[i : i + 1].to(model.device)  # [1, L]
            out = model(input_ids=x, labels=x)  # HF shifts labels internally
            contrib = x.size(1) - 1  # L-1 positions contribute
            total_nll += out.loss.item() * contrib  # sum NLL
            total_tokens += contrib

    avg_nll = total_nll / total_tokens
    ppl = math.exp(avg_nll)

    # Detailed prints
    print("\n=== Repro Config ===")
    print(f"model_name:       {MODEL_NAME}")
    print(f"split:            {SPLIT}")
    print(f"context_length:   {CONTEXT_LENGTH}")
    print(f"dtype:            {DTYPE}")
    print(f"device_map:       {DEVICE_MAP}")
    print(f"tokens_total:     {ids.numel()}")
    print(f"num_segments:     {n_windows}")
    print(f"bos/eos/pad:      {tok.bos_token}/{tok.eos_token}/{tok.pad_token}")

    print("\n=== Results ===")
    print(f"tokens_scored:    {total_tokens}")
    print(f"avg_nll:          {avg_nll:.6f}")
    print(f"perplexity:       {ppl:.3f}\n")


if __name__ == "__main__":
    main()

Output:

=== Repro Config ===
model_name:       openai/gpt-oss-20b
split:            test
context_length:   2048
dtype:            torch.bfloat16
device_map:       auto
tokens_total:     28672
num_segments:     14
bos/eos/pad:      <|startoftext|>/<|return|>/<|endoftext|>

=== Results ===
tokens_scored:    28658
avg_nll:          5.977535
perplexity:       394.467

Expected behavior

When evaluating openai/gpt-oss-20b on the WikiText-2 (raw) test split with a standard perplexity script (non-overlapping 2048-token windows over the raw token stream), the reported perplexity is extremely high (~394). This is surprising: a 20B-parameter GPT-class model would normally be expected to achieve far lower perplexity on this benchmark.
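
For context, a stride-based sliding-window evaluation (where only the newest tokens of each window are scored and the preceding tokens serve purely as context) usually lowers perplexity only slightly, so it would not explain a gap of this size, but it does rule out window-boundary effects. Below is a minimal sketch, assuming `model` and `tok` from the script above and a 1-D `ids` tensor of the flattened WikiText tokens (i.e. before the `.view` reshape); `STRIDE = 512` is an arbitrary choice for illustration:

import math

import torch

CONTEXT_LENGTH = 2048
STRIDE = 512  # arbitrary; smaller stride = more reused context, slower eval

total_nll, total_tokens = 0.0, 0
prev_end = 0
with torch.no_grad():
    for begin in range(0, ids.numel(), STRIDE):
        end = min(begin + CONTEXT_LENGTH, ids.numel())
        x = ids[begin:end].unsqueeze(0).to(model.device)  # [1, <=CONTEXT_LENGTH]
        labels = x.clone()
        labels[:, : -(end - prev_end)] = -100  # mask tokens already scored in earlier windows
        out = model(input_ids=x, labels=labels)
        n_scored = (labels[:, 1:] != -100).sum().item()  # positions actually scored after the shift
        total_nll += out.loss.item() * n_scored
        total_tokens += n_scored
        prev_end = end
        if end == ids.numel():
            break

print(f"strided perplexity: {math.exp(total_nll / total_tokens):.3f}")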

Clarification would be helpful: does this indicate a bug in the Transformers integration, or are GPT-OSS models simply not intended to be evaluated directly as causal LMs without special formatting?

Note: The model card mentions the “harmony” chat template for usage, but it is unclear whether that formatting is also required (or even meaningful) when computing perplexity on a plain-text corpus like WikiText-2.
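
In case it helps triage, one rough way to probe the formatting question is to score the same text once as raw tokens and once wrapped in the chat template. This is only a sketch under assumptions: it assumes `tok.apply_chat_template` emits the harmony format for this checkpoint, the user prompt wording is arbitrary, and the template/control tokens are scored along with the passage, so the two losses are a coarse comparison rather than a like-for-like perplexity. It reuses `model` and `tok` from the script above:

import torch
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
passage = "\n".join(row["text"] for row in ds if row["text"].strip())[:4000]

# Raw causal-LM scoring, as in the repro script (single window).
plain = tok(passage, return_tensors="pt", add_special_tokens=False).input_ids[:, :2048].to(model.device)
with torch.no_grad():
    plain_loss = model(input_ids=plain, labels=plain).loss.item()

# Harmony-formatted scoring: present the same passage as an assistant turn.
chat = tok.apply_chat_template(
    [{"role": "user", "content": "Continue the article."},
     {"role": "assistant", "content": passage}],
    return_tensors="pt",
)[:, :2048].to(model.device)
with torch.no_grad():
    chat_loss = model(input_ids=chat, labels=chat).loss.item()

print(f"plain loss: {plain_loss:.4f}   chat-template loss: {chat_loss:.4f}")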
