
Extremely high perplexity on openai/gpt-oss-20b with WikiText-2 (raw) #40990

@kuantuna

Description

System Info

  • transformers version: 4.56.1
  • Platform: Linux-6.5.0-1025-gcp-x86_64-with-glibc2.35
  • Python version: 3.11.10
  • Huggingface_hub version: 0.35.0
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • DeepSpeed version: 0.17.3+cu126.pt27.v0.17.3.recogni2
  • PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no
  • Using GPU in script?: yes
  • GPU type: NVIDIA A100-SXM4-40GB

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Script:

#!/usr/bin/env python

import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Config
MODEL_NAME = "openai/gpt-oss-20b"
SPLIT = "test"  # WikiText-2 (raw) test split
CONTEXT_LENGTH = 2048  # evaluation window size
DTYPE = torch.bfloat16
DEVICE_MAP = "auto"


def main():
    # Load tokenizer & model
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=DTYPE, device_map=DEVICE_MAP).eval()

    # Load dataset and build one long token stream (no special tokens)
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split=SPLIT)
    encs = tok([row["text"] for row in ds], add_special_tokens=False)
    flat_ids = [tid for seq in encs["input_ids"] for tid in seq]
    ids = torch.tensor(flat_ids, dtype=torch.long)

    # Keep first 10% of tokens
    n_keep = max(1, int(0.10 * ids.numel()))
    ids = ids[:n_keep]

    # Keep only full CONTEXT_LENGTH windows
    n_windows = ids.numel() // CONTEXT_LENGTH
    if n_windows == 0:
        raise ValueError(f"Not enough tokens ({ids.numel()}) for a single {CONTEXT_LENGTH}-token window.")
    ids = ids[: n_windows * CONTEXT_LENGTH].view(n_windows, CONTEXT_LENGTH)

    # Forward passes
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(n_windows):
            x = ids[i : i + 1].to(model.device)  # [1, L]
            out = model(input_ids=x, labels=x)  # HF shifts labels internally
            contrib = x.size(1) - 1  # L-1 positions contribute
            total_nll += out.loss.item() * contrib  # sum NLL
            total_tokens += contrib

    avg_nll = total_nll / total_tokens
    ppl = math.exp(avg_nll)

    # Detailed prints
    print("\n=== Repro Config ===")
    print(f"model_name:       {MODEL_NAME}")
    print(f"split:            {SPLIT}")
    print(f"context_length:   {CONTEXT_LENGTH}")
    print(f"dtype:            {DTYPE}")
    print(f"device_map:       {DEVICE_MAP}")
    print(f"tokens_total:     {ids.numel()}")
    print(f"num_segments:     {n_windows}")
    print(f"bos/eos/pad:      {tok.bos_token}/{tok.eos_token}/{tok.pad_token}")

    print("\n=== Results ===")
    print(f"tokens_scored:    {total_tokens}")
    print(f"avg_nll:          {avg_nll:.6f}")
    print(f"perplexity:       {ppl:.3f}\n")


if __name__ == "__main__":
    main()

Output:

=== Repro Config ===
model_name:       openai/gpt-oss-20b
split:            test
context_length:   2048
dtype:            torch.bfloat16
device_map:       auto
tokens_total:     28672
num_segments:     14
bos/eos/pad:      <|startoftext|>/<|return|>/<|endoftext|>

=== Results ===
tokens_scored:    28658
avg_nll:          5.977535
perplexity:       394.467

Expected behavior

When evaluating openai/gpt-oss-20b on the WikiText-2 (raw) test split with a standard perplexity script (non-overlapping 2048-token windows over the raw token stream), the reported perplexity is extremely high (~394). This is surprising: a 20B-parameter GPT-class model would normally be expected to achieve far lower perplexity on this benchmark.
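
For context, a stride-based sliding-window evaluation (where only the newest tokens of each window are scored and the preceding tokens serve purely as context) usually lowers perplexity only slightly, so it would not explain a gap of this size, but it does rule out window-boundary effects. Below is a minimal sketch, assuming `model` and `tok` from the script above and a 1-D `ids` tensor of the flattened WikiText tokens (i.e. before the `.view` reshape); `STRIDE = 512` is an arbitrary choice for illustration:

import math

import torch

CONTEXT_LENGTH = 2048
STRIDE = 512  # arbitrary; smaller stride = more reused context, slower eval

total_nll, total_tokens = 0.0, 0
prev_end = 0
with torch.no_grad():
    for begin in range(0, ids.numel(), STRIDE):
        end = min(begin + CONTEXT_LENGTH, ids.numel())
        x = ids[begin:end].unsqueeze(0).to(model.device)  # [1, <=CONTEXT_LENGTH]
        labels = x.clone()
        labels[:, : -(end - prev_end)] = -100  # mask tokens already scored in earlier windows
        out = model(input_ids=x, labels=labels)
        n_scored = (labels[:, 1:] != -100).sum().item()  # positions actually scored after the shift
        total_nll += out.loss.item() * n_scored
        total_tokens += n_scored
        prev_end = end
        if end == ids.numel():
            break

print(f"strided perplexity: {math.exp(total_nll / total_tokens):.3f}")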

Clarification would be helpful: does this indicate a bug in the Transformers integration, or are GPT-OSS models simply not intended to be evaluated directly as causal LMs without special formatting?

Note: The model card mentions the “harmony” chat template for usage, but it is unclear whether that formatting is also required (or even meaningful) when computing perplexity on a plain-text corpus like WikiText-2.
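
In case it helps triage, one rough way to probe the formatting question is to score the same text once as raw tokens and once wrapped in the chat template. This is only a sketch under assumptions: it assumes `tok.apply_chat_template` emits the harmony format for this checkpoint, the user prompt wording is arbitrary, and the template/control tokens are scored along with the passage, so the two losses are a coarse comparison rather than a like-for-like perplexity. It reuses `model` and `tok` from the script above:

import torch
from datasets import load_dataset

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
passage = "\n".join(row["text"] for row in ds if row["text"].strip())[:4000]

# Raw causal-LM scoring, as in the repro script (single window).
plain = tok(passage, return_tensors="pt", add_special_tokens=False).input_ids[:, :2048].to(model.device)
with torch.no_grad():
    plain_loss = model(input_ids=plain, labels=plain).loss.item()

# Harmony-formatted scoring: present the same passage as an assistant turn.
chat = tok.apply_chat_template(
    [{"role": "user", "content": "Continue the article."},
     {"role": "assistant", "content": passage}],
    return_tensors="pt",
)[:, :2048].to(model.device)
with torch.no_grad():
    chat_loss = model(input_ids=chat, labels=chat).loss.item()

print(f"plain loss: {plain_loss:.4f}   chat-template loss: {chat_loss:.4f}")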
