gpt2 tokenizer returning undefined tokens everywhere #1313

@chonknick

System Info

"@huggingface/transformers": "^3.5.1",
"typescript": "^5.8.3"

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

The gpt2 tokenizer is leaving undefined entries scattered through the encoded output. Since it's a byte-level BPE tokenizer, it should be able to encode any input, yet it isn't.

Here's what I am seeing from a unit test:

Tokens: [
  15496, 11,    995,
  0,     770,   318,
  257,   1332,  286,
  262,   11241, undefined,
  13,    632,   815,
  307,   1498,  284,
  16058, 428,   2420,
  656,   4833,  5207,
  13
]
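
The undefined entry shows up at index 11 of the output above. To narrow down which part of the input it corresponds to, a check along these lines can help. This is just a sketch, assuming encode() yields a plain number array and that decode() accepts a single-id array:

import { AutoTokenizer } from "@huggingface/transformers";

// Decode each id individually to see which input fragment maps to the
// undefined entry (assumes encode() returns number[] as in the repro below).
async function locateUndefined(text: string) {
    const tokenizer = await AutoTokenizer.from_pretrained("gpt2");
    const ids = await tokenizer.encode(text);
    ids.forEach((id, i) => {
        const piece = id === undefined ? "<undefined>" : tokenizer.decode([id]);
        console.log(i, id, JSON.stringify(piece));
    });
}

locateUndefined("Hello, world! This is a test of the token chunker. It should be able to chunk this text into smaller pieces.");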

Reproduction

import { AutoTokenizer } from "@huggingface/transformers";

async function testGPT2() {
    try {
        const tokenizer = await AutoTokenizer.from_pretrained("gpt2");
        const text = "Hello, world! This is a test of the token chunker. It should be able to chunk this text into smaller pieces.";

        // encode() should return a plain array of token ids
        const encoding = await tokenizer.encode(text);
        console.log("Input IDs:", encoding);

        // Check for undefined entries in the ids
        const hasUndefined = encoding.some(id => id === undefined);
        console.log("Input IDs contain undefined:", hasUndefined);

        const decodedText = tokenizer.decode(encoding, { skip_special_tokens: true });
        console.log("Decoded text:", decodedText);

    } catch (error) {
        console.error("Error during direct tokenizer test:", error);
    }
}

testGPT2();
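
For comparison, calling the tokenizer directly (the call form returns a BatchEncoding whose input_ids is a Tensor rather than a plain array) may show whether the undefined entries are specific to encode(). A rough sketch of that check, assuming Tensor.tolist() is available:

import { AutoTokenizer } from "@huggingface/transformers";

// Compare encode() with the tokenizer's call form to see whether the
// undefined ids also appear in the input_ids Tensor.
async function compareForms() {
    const tokenizer = await AutoTokenizer.from_pretrained("gpt2");
    const text = "Hello, world! This is a test of the token chunker. It should be able to chunk this text into smaller pieces.";

    const fromEncode = await tokenizer.encode(text);
    const { input_ids } = await tokenizer(text);

    console.log("encode():", fromEncode);
    console.log("call form:", input_ids.tolist());
}

compareForms();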
