System Info
- transformers version: 4.56.1
- Platform: Linux-6.5.0-1025-gcp-x86_64-with-glibc2.35
- Python version: 3.11.10
- Huggingface_hub version: 0.35.0
- Safetensors version: 0.6.2
- Accelerate version: 1.10.1
- Accelerate config: not found
- DeepSpeed version: 0.17.3+cu126.pt27.v0.17.3.recogni2
- PyTorch version (accelerator?): 2.7.1+cu126 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA A100-SXM4-40GB
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Script (concatenates the WikiText-2 raw test split into one token stream, keeps the first 10% of tokens, and scores non-overlapping 2048-token windows):
#!/usr/bin/env python
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
# Config
MODEL_NAME = "openai/gpt-oss-20b"
SPLIT = "test" # WikiText-2 (raw) test split
CONTEXT_LENGTH = 2048 # evaluation window size
DTYPE = torch.bfloat16
DEVICE_MAP = "auto"
def main():
    # Load tokenizer & model
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=DTYPE, device_map=DEVICE_MAP).eval()

    # Load dataset and build one long token stream (no special tokens)
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split=SPLIT)
    encs = tok([row["text"] for row in ds], add_special_tokens=False)
    flat_ids = [tid for seq in encs["input_ids"] for tid in seq]
    ids = torch.tensor(flat_ids, dtype=torch.long)

    # Keep first 10% of tokens
    n_keep = max(1, int(0.10 * ids.numel()))
    ids = ids[:n_keep]

    # Keep only full CONTEXT_LENGTH windows
    n_windows = ids.numel() // CONTEXT_LENGTH
    if n_windows == 0:
        raise ValueError(f"Not enough tokens ({ids.numel()}) for a single {CONTEXT_LENGTH}-token window.")
    ids = ids[: n_windows * CONTEXT_LENGTH].view(n_windows, CONTEXT_LENGTH)

    # Forward passes
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for i in range(n_windows):
            x = ids[i : i + 1].to(model.device)      # [1, L]
            out = model(input_ids=x, labels=x)       # HF shifts labels internally
            contrib = x.size(1) - 1                  # L-1 positions contribute
            total_nll += out.loss.item() * contrib   # sum NLL
            total_tokens += contrib

    avg_nll = total_nll / total_tokens
    ppl = math.exp(avg_nll)

    # Detailed prints
    print("\n=== Repro Config ===")
    print(f"model_name: {MODEL_NAME}")
    print(f"split: {SPLIT}")
    print(f"context_length: {CONTEXT_LENGTH}")
    print(f"dtype: {DTYPE}")
    print(f"device_map: {DEVICE_MAP}")
    print(f"tokens_total: {ids.numel()}")
    print(f"num_segments: {n_windows}")
    print(f"bos/eos/pad: {tok.bos_token}/{tok.eos_token}/{tok.pad_token}")

    print("\n=== Results ===")
    print(f"tokens_scored: {total_tokens}")
    print(f"avg_nll: {avg_nll:.6f}")
    print(f"perplexity: {ppl:.3f}\n")

if __name__ == "__main__":
    main()
Output:
=== Repro Config ===
model_name: openai/gpt-oss-20b
split: test
context_length: 2048
dtype: torch.bfloat16
device_map: auto
tokens_total: 28672
num_segments: 14
bos/eos/pad: <|startoftext|>/<|return|>/<|endoftext|>
=== Results ===
tokens_scored: 28658
avg_nll: 5.977535
perplexity: 394.467
Expected behavior
When evaluating openai/gpt-oss-20b on the WikiText-2 (raw) test split with a standard perplexity script, the reported perplexity is extremely high (~394). This is surprising, as a 20B-parameter GPT-class model should normally achieve much lower perplexity on this benchmark.
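As a sanity check on the script's aggregation (not on the model), the reported perplexity is simply the exponential of the averaged NLL, so the number is internally consistent and the high value comes from the per-token losses themselves:

import math
math.exp(5.977535)  # ≈ 394.47, matching the reported perplexity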
Clarification would be helpful to determine whether this behavior indicates a bug in the Transformers integration, or whether GPT-OSS models are simply not intended to be evaluated directly as causal LMs without special formatting.
Note: The model card mentions a “harmony” chat template for usage, but it is unclear whether special formatting is required when performing perplexity evaluation on a corpus like WikiText.
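For illustration only, here is a minimal sketch of what a chat-formatted scoring pass might look like, reusing model, tok, and torch from the repro script above. The chat_formatted_nll helper is hypothetical and assumes that tok.apply_chat_template emits the harmony formatting mentioned in the model card; whether this is the intended way to compute perplexity on a raw corpus is exactly the open question.

# Hypothetical helper (not a confirmed recipe): score a text chunk after
# wrapping it in the tokenizer's chat template, then compare against the
# raw-text perplexity reported above.
def chat_formatted_nll(model, tok, text):
    messages = [{"role": "user", "content": text}]
    ids = tok.apply_chat_template(messages, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(input_ids=ids, labels=ids)  # loss averaged over ids.size(1) - 1 positions
    return out.loss.item(), ids.size(1) - 1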