[Bugfix] Make Gelu Activations consistent across frameworks #753
base: main
Conversation
Okay, so after some more digging, it seems one of the main reasons not to change this would be speed: huggingface/candle#1062
I was able to address the latency issue by raising this PR: huggingface/candle#3168
Wow, awesome work on huggingface/candle#3168 @vrdn-23! I thought the compiler uses constant propagation, so I had roughly assumed that TEI uses the approximate GeLU (~= new GeLU) for faster inference, while there's only a marginal difference between the variants. We could benchmark the speed by updating the implementation and comparing. One minor point for discussion: if the latency stays comparable, it might make sense to keep this implementation. Anyway, looks great to me! I'd appreciate any feedback or thoughts you have!
Sorry for the late response, but I was away for Thanksgiving break, @kozistr!
I think that might have been true previously, but now that huggingface/candle#3168 has been merged, the gelu_erf (old gelu) implementation is actually faster than the new one. It is also the closest functional match to the outputs of the existing models, so since it seems to be a win in terms of both quality and latency, I would argue that we stick to the consistent implementation across frameworks. Would love to hear your thoughts @alvarobartt @Narsil
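For reference, the "old" GeLU is the exact, erf-based form, while the "new" GeLU is the tanh approximation; the standard definitions are:

```latex
% Exact (erf-based) GeLU -- what gelu_erf computes:
\mathrm{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right)

% Tanh approximation ("new" GeLU / gelu_pytorch_tanh):
\mathrm{GELU}_{\mathrm{tanh}}(x) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right)
```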
This commit adapts text-embeddings-inference for NVIDIA Jetson Orin (SM87) and L4 GPU (SM89), and integrates valuable community PRs.

Changes:

1. SM87/SM89 CUDA Support
   - Added compute capability 8.7 and 8.9 support
   - Modified Dockerfile-cuda-all for multi-arch builds
   - Updated compute_cap.rs for SM87/89 detection
   - Files: Dockerfile-cuda-all, cuda-all-entrypoint.sh, compute_cap.rs

2. PR huggingface#730: Qwen3 Reranker Support
   - Added classification head for Qwen3 reranking
   - Implemented template formatting system for chat-based reranking
   - Files: models/qwen3.rs, core/templates.rs, core/lib.rs

3. PR huggingface#787: Batch Notification Performance Optimization (sketched below)
   - Implemented AtomicUsize counter for batch processing
   - Reduced unnecessary notify_one() calls
   - Only the last request in a batch triggers thread notification
   - Files: core/infer.rs, router/http/server.rs, router/grpc/server.rs

4. PR huggingface#753: GeLU Activation Consistency Fix
   - Changed Gelu from approximate (gelu) to exact (gelu_erf)
   - Added NewGelu variant for backward compatibility
   - Files: layers/linear.rs

5. PR huggingface#790: StaticEmbedding Model Support
   - Added support for the 0_StaticEmbedding/ directory structure
   - Implemented fallback loading for model weights and tokenizer
   - Default to Mean pooling for StaticEmbedding models
   - Files: models/static_embedding.rs (new), lib.rs, download.rs, router/lib.rs

6. PR huggingface#746: DebertaV2 Sequence Classification Support
   - Complete DebertaV2 model implementation
   - Support for sequence classification tasks (e.g., Llama Prompt Guard)
   - CPU and CUDA device support
   - Files: models/debertav2.rs (new), lib.rs, models/mod.rs

All changes have been tested and compile successfully with: cargo check --all-targets
Compilation verified with CUDA support: cargo install --path router -F candle-cuda

Target Hardware: NVIDIA Jetson Orin AGX (SM87), L4 GPU (SM89)
Date: January 5, 2026
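As a rough illustration of the batching change in item 3 above (a hypothetical sketch, not the actual core/infer.rs code; all names are illustrative): requests bump an atomic counter and only the last request of a batch wakes the batching task, so notify_one() is called once per batch instead of once per request.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use tokio::sync::Notify;

/// Hypothetical queue handle; field and method names are illustrative, not TEI's.
struct BatchQueue {
    pending: AtomicUsize,
    notify: Notify,
}

impl BatchQueue {
    /// Enqueue one request; only the final request of a batch wakes
    /// the batching task, avoiding redundant notify_one() calls.
    fn append(self: &Arc<Self>, is_last_in_batch: bool) {
        self.pending.fetch_add(1, Ordering::SeqCst);
        if is_last_in_batch {
            self.notify.notify_one();
        }
    }

    /// Batching-task side: sleep until at least one request is pending.
    async fn wait_for_work(self: &Arc<Self>) {
        while self.pending.load(Ordering::SeqCst) == 0 {
            self.notify.notified().await;
        }
    }
}
```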
Just wanted to add a note that #784 should probably be merged before this (if accepted), so that the speed-up gained by huggingface/candle#3168 can be utilized.
What does this PR do?
This PR fixes a consistency issue in how TEI handles the GeLU activation compared to the `transformers` library and the `candle` library. It seems that the value `gelu` is meant to serialize to an old, incorrect version of how the GeLU activation was implemented (based on the comment given here), per this code snippet in transformers. This means that any config that uses the value `gelu` for the `hidden_activation` should end up using the `GeluActivation` function, which uses the `torch.erf` function. The new GeLU activation is referenced using `new_gelu` or `gelu_pytorch_tanh`. This behavior is also what the huggingface/candle repository follows here (`gelu` corresponds to `xs.gelu_erf()` and not `xs.gelu()`). This PR brings the TEI implementation in line with how transformers parses the `config.json` values and how `candle` resolves activations.
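A minimal sketch of the mapping this PR argues for (illustrative only, not the actual TEI `layers/linear.rs` or candle source; the enum name and serde attributes are assumptions):

```rust
use candle_core::{Result, Tensor};

/// Hypothetical activation enum, deserialized from the `hidden_act` /
/// `hidden_activation` field of config.json (names are illustrative).
#[derive(Debug, Clone, serde::Deserialize)]
pub enum HiddenAct {
    /// "gelu" resolves to the exact, erf-based GeLU,
    /// matching transformers' GeluActivation (torch.erf).
    #[serde(rename = "gelu")]
    Gelu,
    /// The tanh approximation, only used when explicitly requested.
    #[serde(rename = "gelu_new", alias = "gelu_pytorch_tanh")]
    NewGelu,
    #[serde(rename = "relu")]
    Relu,
}

impl HiddenAct {
    pub fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        match self {
            // Exact GeLU via the error function.
            HiddenAct::Gelu => xs.gelu_erf(),
            // Tanh-based approximation.
            HiddenAct::NewGelu => xs.gelu(),
            HiddenAct::Relu => xs.relu(),
        }
    }
}
```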
I came across this inconsistency while I was reviewing some of the code changes I had in #746, but thought this should be opened as a separate PR, given that it will slightly vary (read: correct) existing model behavior. (h/t to @bbaldino for pointing this out to me)
Please do let me know if I'm missing something obvious here as to why TEI is not in sync with how the activation functions are defined. My understanding is that this is just a bug that got carried over from legacy code that was introduced in #41.
Before submitting
- Did you update the `insta` snapshots, if applicable?

Who can review?
@Narsil OR @alvarobartt OR @kozistr