
Memory leak in BPE when using encode from PreTrainedTokenizer #1282

@nkalra0123


System Info

Issue observed with "@huggingface/transformers": "3.4.2"

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

For more details, see these related issues:

daulet/tokenizers#35

huggingface/tokenizers#1680 (comment)

Reproduction

Create a PreTrainedTokenizer with AutoTokenizer.from_pretrained, then repeatedly call encode on the PreTrainedTokenizer instance. encode uses BPE internally, and the BPE model caches the result for every token it processes:

const cached = this.cache.get(token);
if (cached !== undefined) {
    return cached;
}

// ... the BPE merge loop computes `result` for this token ...

// Save the result to the cache.
// The cache has no size limit and no eviction, so it grows with every
// distinct token that is encoded.
this.cache.set(token, result);

/**
 * Apply Byte-Pair-Encoding (BPE) to a given token. Efficient heap-based priority
 * queue implementation adapted from https://github.com/belladoreai/llama-tokenizer-js.
 * @param {string} token The token to encode.
 * @returns {string[]} The BPE encoded tokens.
 */
bpe(token)
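
To make the growth visible, a minimal sketch along the following lines can be run under Node.js. This is illustrative only: it assumes an ESM script with top-level await, 'Xenova/gpt2' is just an arbitrary BPE-based checkpoint, and the loop bounds are arbitrary. Because every distinct input adds cache entries that are never evicted, heapUsed keeps climbing:

// Minimal sketch (assumptions: Node.js ESM, 'Xenova/gpt2' as an example
// BPE checkpoint, arbitrary loop bounds).
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt2');

for (let i = 0; i < 1_000_000; i++) {
  // Unique strings avoid cache hits, so every call adds new entries to the
  // internal BPE cache, which is never pruned.
  tokenizer.encode(`word_${i} value_${Math.random()}`);

  if (i % 100_000 === 0) {
    const mb = process.memoryUsage().heapUsed / 1024 / 1024;
    console.log(`after ${i} encodes: heapUsed ~ ${mb.toFixed(1)} MB`);
  }
}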

For further context, see:
huggingface/tokenizers#1680 (comment)
daulet/tokenizers#35
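
One possible direction for a fix, sketched here purely as an illustration and not as the library's actual API (BoundedCache and maxSize are hypothetical names), is to bound the cache, for example with a small LRU built on Map's insertion-order iteration:

// Illustrative sketch only: BoundedCache and maxSize are hypothetical and
// not part of @huggingface/transformers.
class BoundedCache {
  constructor(maxSize = 10_000) {
    this.maxSize = maxSize;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    // Re-insert so this key becomes the most recently used entry.
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // Map iterates in insertion order, so the first key is the least
      // recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

Substituting a structure like this for the unbounded cache, or simply exposing a way to clear or disable the cache, would keep memory usage flat for workloads that encode many distinct tokens.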
