Achieving result consistency with Huggingface's implementation with manually patched kernels #1923

dibbla · 2025-10-13T22:24:12Z

Hi guys!

First, thank you for creating and sharing FlashInfer. I've experienced its impressive performance benefits firsthand through frameworks like vLLM and SGLang.

For my research, I have a specific requirement to maintain output consistency with the standard Huggingface transformers implementation. As expected with highly optimized kernels, I've observed generation differences between the results from inference engines and the baseline.

My goal is to leverage the kernel speed-ups while ensuring the generation is perfectly reproducible with the Huggingface reference. I understand this is a significant challenge.

Could you offer any guidance on this? For example, are there specific kernels, compilation flags, or configurations within FlashInfer that are designed to prioritize numerical precision and consistency with standard libraries over absolute maximum performance? My (very naive) plan is to find a way to manually patch my LLM and test out different kernels.
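To make that plan concrete, here is a minimal sketch of the kind of comparison harness I have in mind. It is pure Python and does not call any FlashInfer or transformers API: `reference_attention`, `blockwise_attention`, `max_abs_diff`, and `compare` are hypothetical names I made up for illustration, and the blockwise function is only a stand-in for an optimized kernel (it uses an online softmax over key blocks, which is mathematically the same result computed in a different order). In the real setup the candidate would be the patched kernel and the reference would be the Huggingface forward pass.

```python
import math
import random


def reference_attention(q, keys, values):
    """Single-query attention with a plain two-pass softmax (the baseline)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom
            for j in range(dim)]


def blockwise_attention(q, keys, values, block=4):
    """Same math, accumulated block by block with an online softmax.

    Hypothetical stand-in for an optimized kernel: the result is equal in
    exact arithmetic, but the floating-point operations happen in a
    different order, so the rounding can differ.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale the partial sums accumulated under the previous maximum.
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / denom for a in acc]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def compare(candidate, n_keys=37, dim=8, seed=0):
    """Run the candidate and the reference on identical seeded inputs.

    Returns (max absolute difference, whether outputs are exactly equal).
    """
    rng = random.Random(seed)
    q = [rng.gauss(0, 1) for _ in range(dim)]
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_keys)]
    ref = reference_attention(q, keys, values)
    out = candidate(q, keys, values)
    return max_abs_diff(ref, out), ref == out


if __name__ == "__main__":
    diff, exact = compare(blockwise_attention)
    print(f"max abs diff = {diff:.3e}, exactly equal = {exact}")
```

The harness reports both a tolerance-style metric and an exact-equality flag, because my requirement is closer to the second: a difference that looks negligible per layer can still flip a token once it propagates through the model.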

Replies: 0 comments
