Update dependency transformers to v4.57.1 #100
This PR contains the following updates:
transformers: ==4.50.0 -> ==4.57.1
Release Notes
huggingface/transformers (transformers)
v4.57.1: Patch release v4.57.1 (Compare Source)
This patch most notably fixes an issue with an optional dependency (optax), which resulted in parsing errors with poetry. It contains the following fixes:
v4.57.0: Qwen3-Next, Vault Gemma, Qwen3 VL, LongCat Flash, Flex OLMO, LFM2 VL, BLT, Qwen3 OMNI MoE, Parakeet, EdgeTAM, OLMO3 (Compare Source)
New model additions
Qwen3 Next
The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:
Built on this architecture, they trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency.
Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost.
Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.
For more details, please see the Qwen3-Next blog post.
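Loading goes through the standard Auto classes like any other causal LM in the library. A minimal sketch follows; the repository id is an assumption and the exact checkpoint name should be checked on the Hugging Face Hub:

```python
# Minimal text-generation sketch for a Qwen3-Next checkpoint.
# The repository id below is an assumption; check the Hub for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Summarize the idea behind sparse MoE models in two sentences."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
# Decode only the newly generated continuation, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```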
Vault Gemma
VaultGemma is a text-only decoder model derived from Gemma 2; notably, it drops the norms after the Attention and MLP blocks, and uses full attention for all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters that uses a 1024-token sequence length.
VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
Qwen3 VL
Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.
Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.
These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.
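Multimodal checkpoints like this are typically driven through a processor plus an image-text-to-text model class. The sketch below assumes a hypothetical Qwen3-VL instruct repository id and relies on the generic Auto classes:

```python
# Image + text chat sketch for a Qwen3-VL checkpoint.
# The repository id is an assumption; the Auto classes resolve the concrete model/processor types.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
# Strip the prompt tokens before decoding the answer.
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0])
```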
Longcat Flash
The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.
The abstract from the paper is the following:
We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.
Tips:
Flex Olmo
FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.
You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.
LFM2 VL
LFM2-VL is the first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency and device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs with variable resolutions.
Architecture
LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:
The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count.
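To make the tiling behavior concrete, here is a tiny arithmetic sketch (illustrative only, not the library's preprocessing code) of how many non-overlapping 512×512 patches a larger input would be split into under the scheme described above:

```python
import math

def num_patches(width: int, height: int, patch: int = 512) -> int:
    """Count the non-overlapping square patches a larger image is split into.

    Images at or below `patch` pixels per side are processed at native
    resolution (a single patch); larger ones are tiled without overlap.
    This approximates the behavior described above and ignores the optional
    thumbnail used by LFM2-VL-1.6B.
    """
    if width <= patch and height <= patch:
        return 1
    return math.ceil(width / patch) * math.ceil(height / patch)

print(num_patches(400, 300))   # 1 -> handled at native resolution, no upscaling
print(num_patches(1024, 768))  # 4 -> 2 x 2 tiles of 512x512
```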
BLT
The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer.
BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
The abstract from the paper is the following:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
Usage Tips:
- Dual Model Architecture: BLT consists of two separately trained models:
- Dynamic Patching: The model uses entropy-based dynamic patching where:
- Local Encoder: Processes byte sequences with cross-attention to patch embeddings
- Global Transformer: Processes patch-level representations with full attention across patches
- Local Decoder: Generates output with cross-attention back to the original byte sequence
- Byte-Level Tokenizer: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID; there is no need for a vocabulary (see the short sketch below).
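To illustrate the byte-level mapping described in the last point, a few lines of plain Python (conceptual only; the real tokenizer also reserves ids for special tokens, which is omitted here):

```python
# Conceptual illustration of byte-level "tokenization": text -> UTF-8 bytes -> ids.
text = "héllo"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)                          # [104, 195, 169, 108, 108, 111] -- 'é' becomes two bytes
print(bytes(byte_ids).decode("utf-8"))   # round-trips back to "héllo"
```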
Qwen3 Omni MoE
The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report from the Qwen team, Alibaba Group.
Notes
- Use [Qwen2_5OmniForConditionalGeneration] to generate audio and text output. To generate only one output type, use [Qwen2_5OmniThinkerForConditionalGeneration] for text-only and [Qwen2_5OmniTalkerForConditionalGeneration] for audio-only outputs.
- [Qwen2_5OmniForConditionalGeneration] supports only single batch size at the moment.
- processor.max_pixels: by default the maximum is set to a very large value and high-resolution visuals will not be resized, unless their resolution exceeds processor.max_pixels.
- Use the [~ProcessorMixin.apply_chat_template] method to convert chat messages to model inputs.
Parakeet
Parakeet models, introduced by NVIDIA NeMo, are models that combine a Fast Conformer encoder with connectionist temporal classification (CTC), recurrent neural network transducer (RNNT) or token and duration transducer (TDT) decoder for automatic speech recognition.
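Since these are speech recognition models, the usual entry point would be the automatic-speech-recognition pipeline. A hedged sketch follows; the repository id is an assumption and pipeline support should be confirmed against the model card for your transformers version:

```python
# ASR sketch with a Parakeet CTC checkpoint; the repository id is an assumption,
# and pipeline support should be confirmed on the model card.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")
result = asr("sample.wav")  # path or URL to an audio file
print(result["text"])
```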
Model Architecture
See [ParakeetEncoder] for the encoder implementation and details.
EdgeTAM
The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, and Bilge Soran.
EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.
OLMO3
More details to come soon 👀
Continuous batching
We are introducing Continuous Batching (CB) in this release, and we consider it a stable feature. The main use case for CB is batched generation, which makes it very efficient in the context of GRPO training or evaluation. Thanks to CB, researchers and model developers are now free to use transformers in these contexts without having to spin up an additional inference engine.
CB currently supports both full attention and sliding window attention: this means that the vast majority of models are supported, like llama, gemma3, gpt-oss.
CB is also integrated with transformers serve, which means that you can deploy transformers as an OpenAI-compatible HTTP server.
Here is a small snippet on how to use it:
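The snippet below is a minimal sketch, assuming the generate_batch entry point that recent transformers versions expose for continuous batching; exact argument names and the shape of the returned results may differ between releases, and the checkpoint id is just an example:

```python
# Continuous-batching sketch; generate_batch and the result attributes are assumptions
# based on recent transformers versions and may vary slightly by release.
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_id = "Qwen/Qwen3-4B-Instruct-2507"  # any full- or sliding-window-attention model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

prompts = [
    "Explain continuous batching in one sentence.",
    "Write a haiku about GPUs.",
]
# CB takes one tokenized request per prompt; scheduling across requests is handled internally.
batch_inputs = [tokenizer(p).input_ids for p in prompts]

generation_config = GenerationConfig(max_new_tokens=64, do_sample=False)
batch_outputs = model.generate_batch(inputs=batch_inputs, generation_config=generation_config)

for request_id, output in batch_outputs.items():
    print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True))
```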
Breaking changes
- check_model_inputs in core VLMs by @zucchini-nlp in #40342
- input_feature length and attention_mask length in WhisperFeatureExtractor by @BakerBunker in #39221
- _prepare_generation_config by @manueldeprada in #40715
- center_crop fast equivalent to slow by @yonigozlan in #40856
Bugfixes and improvements
- pytest-rerunfailures<16.0 by @ydshieh in #40561
- test_all_params_have_gradient=False for DeepseekV2ModelTest by @ydshieh in #40566
- test_eager_matches_sdpa_inference not run for CLIP by @ydshieh in #40581
- remi-or to run-slow by @ydshieh in #40590
- get_*_features methods + update doc snippets by @qubvel in #40555
- TvpImageProcessingTest::test_slow_fast_equivalence by @ydshieh in #40593
- siglip flaky test_eager_matches_sdpa_inference by @ydshieh in #40584
- [Tests] Fixup duplicated mrope logic by @vasqu in #40592
- TokenizerTesterMixin temporarily by @ydshieh in #40611
- transformers serve by @McPatate in #40479
- too many request caused by AutoModelTest::test_dynamic_saving_from_local_repo by @ydshieh in #40614
- JambaModelTest.test_load_balancing_loss by @ydshieh in #40617
- deepseek_v3.md to Korean by @ssum21 in #39649
- too many requests in TestMistralCommonTokenizer by @ydshieh in #40623
- test_prompt_lookup_decoding_matches_greedy_search for voxtral by @ydshieh in #40643
- LongformerModelTest::test_attention_outputs as flaky by @ydshieh in #40655
- custom_generate Callables and unify generation args structure by @manueldeprada in #40586
- check_determinism inside test_determinism by @ydshieh in #40661
- test_fast_is_faster_than_slow for Owlv2ImageProcessingTest by @ydshieh in #40663
- test_prompt_lookup_decoding_matches_greedy_search for qwen2_audio by @ydshieh in #40664
- GitModelTest::test_beam_search_generate by @ydshieh in #40666
- tolist instead of list comprehension calling .item() by @McPatate in #40646
- Aimv2ModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels as flaky by @ydshieh in #40683
- T5GemmaModelTest::test_eager_matches_sdpa_inference being flaky by @ydshieh in #40702
- hf_hub_download by @ydshieh in #40710
- self in post-process methods by @framonmar7 in #40711
- or for grounding dino mask by @lmarshall12 in #40625
- [Gemma Embedding] Fix SWA by @vasqu in #40700
- VitMatteImageProcessingTest::test_fast_is_faster_than_slow by @ydshieh in #40713
- request_id to headers by @McPatate in #40722
- and/or _mask_function by @Cyrilvallez in #40753
- --continuous_batching by @McPatate in #40618
- continue_final_message in apply_chat_template to prevent substring matching issues by @abdokaseb in #40732
- public.cloud.experiment_url api error by @Zeyi-Lin in #40763
- PromptLookupCandidateGenerator won't generate forbidden tokens by @gante in #40726
- test_past_key_values_format and delete overwrites by @gante in #40701
- generate by @gante in #40375
- [Jetmoe] Fix RoPE by @vasqu in #40819
- self.loss_function by @qubvel in #40764
- test_modeling_common.py by @gante in #40854
- past_key_values by @gante in #40803
- rsqrt by @thalahors in #40848
- [VaultGemma] Update expectations in integration tests by @vasqu in #40855
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
To execute skipped test pipelines, write the comment /ok-to-test.
This PR has been generated by MintMaker (powered by Renovate Bot).