C and C++ bindings to Tokenizers #1888
ArthurZucker
left a comment
Very open to this! Let's make sure we are broadly compatible with expectations in terms of the funcs we bind.
CPP bindings coverage improvement
fixed benchmarks
- Introduced a new submodule for Jinja2Cpp to handle chat template rendering.
- Enhanced the C++ bindings to load and apply chat templates from a configuration file.
- Added methods to retrieve special tokens and their IDs from the tokenizer configuration.
- Updated the CMake configuration to include Jinja2Cpp and link it with the tokenizers_cpp library.
- Refactored tests to validate the new chat template functionality and special token handling.
Support chat template in c++ api
|
Hi @ArthurZucker, I've made some progress on the C++ side by adding more APIs. I tried integrating https://github.com/jinja2cpp/Jinja2Cpp/ at the C++ bindings layer, but some features are limited and lead to crashes (IIRC, negative indexing is one example). Any tips or recommendations for handling templates natively would be great. |
add chat template jinja rendering with minijinja
|
Upon digging into the https://github.com/huggingface/text-generation-inference codebase, I found that it uses minijinja in Rust to render chat templates (with some workarounds for unsupported features). So I've removed Jinja rendering from the C++ bindings layer and moved it into the tokenizer core in Rust, using minijinja2. This way, all bindings can access the functionality. Disclaimer: I'm not proficient in Rust, and most of the code was written by AI agents (though I've tried to supervise them closely). Based on my testing, everything seems to work (at least for my use case, and its tests pass). |
|
I will drop this more as an FYI: check https://github.com/mlc-ai/tokenizers-cpp. It has existing C++ bindings for Tokenizers that go through the same Rust -> C -> C++ path. The C++ code also binds more than tokenizers because it includes sentencepiece, but that can be cut off. I think if you want more battle-tested code, fork tokenizers-cpp/rust into your current Once that is done, fork tokenizers_cpp.h and huggingface_tokenizer.cc into The LICENSE shouldn't be an issue. And I think https://chat.webllm.ai/ has a live deployment of the tokenizer with WASM. |
|
Hi! Is there any plan to merge this PR? I’m currently looking for tokenizers implemented in C. |
|
Actually yeah, but I need to take over a little and check a few things. It's planned for sure, though. |
Skip unknown fields in deserialization for experimental wrappers.
Adding in bindings for two more languages!
bindings/cpp
bindings/c
C is an intermediate step to bind C++ and Rust, i.e., C++ <--> C <--> Rust.
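For readers unfamiliar with the C++ <--> C <--> Rust layering, here is a minimal sketch of the pattern: a C ABI built around an opaque handle, wrapped by a C++ RAII class. Every name in it (`demo_tokenizer`, `demo_tokenizer_new`, `demo_tokenizer_free`, `demo_tokenizer_encode`, `DemoTokenizer`) is hypothetical and is not this PR's actual API. The C functions are stubbed in C++ so the example links on its own; in a real build they would presumably be exported from the Rust crate as `extern "C"` symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// ---- C layer (hypothetical names, NOT the real tokenizers API) ----
extern "C" {
struct demo_tokenizer;  // opaque handle: C and C++ callers never see its fields

demo_tokenizer* demo_tokenizer_new();
void demo_tokenizer_free(demo_tokenizer* t);
// Writes at most `cap` ids into `out` and returns the number of ids required.
std::size_t demo_tokenizer_encode(const demo_tokenizer* t, const char* text,
                                  std::uint32_t* out, std::size_t cap);
}

// ---- Stub standing in for the Rust side: one id per input byte ----
struct demo_tokenizer { int unused; };

extern "C" demo_tokenizer* demo_tokenizer_new() { return new demo_tokenizer{0}; }

extern "C" void demo_tokenizer_free(demo_tokenizer* t) { delete t; }

extern "C" std::size_t demo_tokenizer_encode(const demo_tokenizer*, const char* text,
                                             std::uint32_t* out, std::size_t cap) {
  std::size_t n = 0;
  for (const char* p = text; *p != '\0'; ++p, ++n) {
    if (n < cap) out[n] = static_cast<unsigned char>(*p);  // never writes past cap
  }
  return n;
}

// ---- C++ layer: RAII wrapper that owns the C handle ----
class DemoTokenizer {
 public:
  DemoTokenizer() : handle_(demo_tokenizer_new()) {
    if (handle_ == nullptr) throw std::runtime_error("failed to create tokenizer");
  }
  ~DemoTokenizer() { demo_tokenizer_free(handle_); }

  // The handle has a single owner, so copying is disabled.
  DemoTokenizer(const DemoTokenizer&) = delete;
  DemoTokenizer& operator=(const DemoTokenizer&) = delete;

  std::vector<std::uint32_t> encode(const std::string& text) const {
    assert(handle_ != nullptr);
    // First call with a zero-capacity buffer only asks how many ids are needed.
    const std::size_t n = demo_tokenizer_encode(handle_, text.c_str(), nullptr, 0);
    std::vector<std::uint32_t> ids(n);
    // Second call fills the buffer that C++ allocated and owns.
    demo_tokenizer_encode(handle_, text.c_str(), ids.data(), ids.size());
    return ids;
  }

 private:
  demo_tokenizer* handle_;
};
```

With this layout, C++ code only touches `DemoTokenizer`, and memory ownership never crosses the boundary implicitly: the handle is freed by the same layer that created it, and output buffers are allocated by the caller. The two-call "query size, then fill" convention is one common choice for C APIs, used here only for illustration.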
--