Closed
Labels: bug (Something isn't working)
System Info
"@huggingface/transformers": "^3.5.1",
"typescript": "^5.8.3"
Environment/Platform
- Website/web-app
- Browser extension
- Server-side (e.g., Node.js, Deno, Bun)
- Desktop app (e.g., Electron)
- Other (e.g., VSCode extension)
Description
The gpt2 tokenizer is leaving undefined tokens scattered through the output. Since it's a byte-level BPE tokenizer, it should be able to encode any input, and yet it isn't.
Here's what I am seeing from a unit test:
Tokens: [
15496, 11, 995,
0, 770, 318,
257, 1332, 286,
262, 11241, undefined,
13, 632, 815,
307, 1498, 284,
16058, 428, 2420,
656, 4833, 5207,
13
]
Reproduction
import { AutoTokenizer } from "@huggingface/transformers";

async function testGPT2() {
  try {
    const tokenizer = await AutoTokenizer.from_pretrained("gpt2");
    const text = "Hello, world! This is a test of the token chunker. It should be able to chunk this text into smaller pieces.";
    const encoding = await tokenizer.encode(text);
    console.log("Input IDs:", encoding);

    // Check for undefined in input_ids
    const hasUndefined = encoding.some((id) => id === undefined);
    console.log("Input IDs contain undefined:", hasUndefined);

    const decodedText = tokenizer.decode(encoding, { skip_special_tokens: true });
    console.log("Decoded text:", decodedText);
  } catch (error) {
    console.error("Error during direct tokenizer test:", error);
  }
}

testGPT2();