Add post_process_tokens and post_process_ids methods #1944

Open
ArthurZucker wants to merge 1 commit into main from feature/post-process-tokens

@ArthurZucker
Collaborator

Summary

  • Add post_process_tokens method to process tokens (strings) through post-processors
  • Add post_process_ids method to process token IDs through post-processors
  • Implement for BertProcessing, RobertaProcessing, TemplateProcessing, Sequence, and ByteLevel processors
  • Add Python bindings and comprehensive tests

Motivation

These methods enable working with token sequences directly without needing a full Encoding. This is useful for:

  • Understanding how special tokens (CLS, SEP, etc.) are added to sequences
  • Debugging tokenization pipelines
  • Educational purposes to understand post-processing behavior
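
To make the first bullet concrete, here is a minimal pure-Python sketch of what a BERT-style post-processor does to a plain list of tokens or IDs. It is an illustration only, not the implementation in this PR: the helper name `bert_style_post_process`, its signature, and the sample token IDs are all assumptions made for the example, and it does not call the `tokenizers` library.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")  # str for tokens, int for token IDs


def bert_style_post_process(
    first: Sequence[T],
    second: Optional[Sequence[T]],
    cls: T,
    sep: T,
) -> List[T]:
    """Hypothetical helper: wrap one or two sequences in the BERT layout.

    Single sequence: CLS A SEP
    Sequence pair:   CLS A SEP B SEP
    """
    out: List[T] = [cls, *first, sep]
    if second is not None:
        out += [*second, sep]
    return out


# Tokens (strings) for a single sequence.
tokens = bert_style_post_process(["hello", "world"], None, "[CLS]", "[SEP]")

# Token IDs for a sequence pair; the ID values are illustrative only.
pair_ids = bert_style_post_process([7592, 2088], [2129], 101, 102)

print(tokens)
print(pair_ids)
```

The same function handles both strings and IDs because adding special tokens only depends on sequence positions, which is presumably why the PR can expose the two methods side by side.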

Test plan

  • Rust unit tests pass (cargo test --lib)
  • Rust serialization tests pass (cargo test --test serialization)
  • Python bindings tests pass (pytest tests/bindings/test_processors.py tests/bindings/test_tokenizer.py)

🤖 Generated with Claude Code

Add two new methods to PostProcessor trait and implementations:
- post_process_tokens: processes tokens (strings) through the post-processor
- post_process_ids: processes token IDs through the post-processor

These methods enable working with token sequences directly without needing
a full Encoding. Useful for understanding how special tokens are added
without going through the full tokenization pipeline.

Implemented for: BertProcessing, RobertaProcessing, TemplateProcessing,
Sequence, and ByteLevel processors.

Includes Python bindings and tests.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
