
Memory leak in BPE when using encode from PreTrainedTokenizer #1282

@nkalra0123


System Info

Issue observed with "@huggingface/transformers": "3.4.2"

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

For more details, see these related issues:

daulet/tokenizers#35

huggingface/tokenizers#1680 (comment)

Reproduction

Create a PreTrainedTokenizer with AutoTokenizer.from_pretrained, then repeatedly call encode on the PreTrainedTokenizer instance. encode uses BPE internally, and the BPE model caches the result for every token it processes:

const cached = this.cache.get(token);
if (cached !== undefined) {
    return cached;
}

// ... the BPE merge loop computes `result` for this token ...

// Save the result to the cache.
// The cache has no size limit and no eviction, so it grows with every
// distinct token that is encoded.
this.cache.set(token, result);

/**
 * Apply Byte-Pair-Encoding (BPE) to a given token. Efficient heap-based priority
 * queue implementation adapted from https://github.com/belladoreai/llama-tokenizer-js.
 * @param {string} token The token to encode.
 * @returns {string[]} The BPE encoded tokens.
 */
bpe(token)
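
To make the growth visible, a minimal sketch along the following lines can be run under Node.js. This is illustrative only: it assumes an ESM script with top-level await, 'Xenova/gpt2' is just an arbitrary BPE-based checkpoint, and the loop bounds are arbitrary. Because every distinct input adds cache entries that are never evicted, heapUsed keeps climbing:

// Minimal sketch (assumptions: Node.js ESM, 'Xenova/gpt2' as an example
// BPE checkpoint, arbitrary loop bounds).
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/gpt2');

for (let i = 0; i < 1_000_000; i++) {
  // Unique strings avoid cache hits, so every call adds new entries to the
  // internal BPE cache, which is never pruned.
  tokenizer.encode(`word_${i} value_${Math.random()}`);

  if (i % 100_000 === 0) {
    const mb = process.memoryUsage().heapUsed / 1024 / 1024;
    console.log(`after ${i} encodes: heapUsed ~ ${mb.toFixed(1)} MB`);
  }
}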

For further context, see:
huggingface/tokenizers#1680 (comment)
daulet/tokenizers#35
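
One possible direction for a fix, sketched here purely as an illustration and not as the library's actual API (BoundedCache and maxSize are hypothetical names), is to bound the cache, for example with a small LRU built on Map's insertion-order iteration:

// Illustrative sketch only: BoundedCache and maxSize are hypothetical and
// not part of @huggingface/transformers.
class BoundedCache {
  constructor(maxSize = 10_000) {
    this.maxSize = maxSize;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    // Re-insert so this key becomes the most recently used entry.
    const value = this.map.get(key);
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // Map iterates in insertion order, so the first key is the least
      // recently used entry.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

Substituting a structure like this for the unbounded cache, or simply exposing a way to clear or disable the cache, would keep memory usage flat for workloads that encode many distinct tokens.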
