It may be beneficial to support the T5Gemma (and upcoming T5Gemma 2) architectures. Here's the basic idea from the transformers T5Gemma documentation:
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
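Since T5Gemma is already exposed through the standard transformers seq2seq classes, it should be usable with the usual encoder-decoder workflow. Here is a minimal sketch; the checkpoint name is an assumption for illustration only, and the exact variant names on the Hub may differ:

```python
# Sketch: running a T5Gemma checkpoint via the generic transformers seq2seq API.
# "google/t5gemma-2b-2b-ul2" is a hypothetical checkpoint id used for illustration.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-2b-2b-ul2"  # assumed variant name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The encoder consumes the input text; the decoder generates the output tokens.
inputs = tokenizer("Summarize: Encoder-decoder models separate input understanding from output generation.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```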
The upcoming T5Gemma 2 is the same idea, but based on Gemma 3. Here's an overview from the transformers T5Gemma 2 documentation:
T5Gemma 2 is a family of pretrained encoder-decoder large language models with strong multilingual, multimodal and long-context capability, available in 270M-270M, 1B-1B and 4B-4B parameters. Following T5Gemma, it is built via model adaptation (based on Gemma 3) using UL2. The architecture is similar to T5Gemma and Gemma 3, enhanced with tied word embeddings and merged self- and cross-attention to save model parameters.
These architectures modernize and improve upon T5 by blending the improved performance of modern Gemma models with the efficiency of the encoder-decoder architecture.
For reference, here are the PRs that merged model support for these architectures into transformers: