System Info
- transformers version: 5.0.0.dev0
- Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
- Python version: 3.12.3
- huggingface_hub version: 1.3.2
- safetensors version: 0.7.0
- accelerate version: 1.12.0
- Accelerate config: not installed
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- GPU type: NVIDIA GeForce RTX 4060 Laptop GPU
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from transformers import SwitchTransformersConfig, SwitchTransformersModel
# Request a 1-layer encoder and decoder with zero sparse layers
config = SwitchTransformersConfig(
num_layers=1,
num_sparse_encoder_layers=0,
num_decoder_layers=1,
num_sparse_decoder_layers=0,
vocab_size=100,
d_model=64,
d_ff=128,
num_heads=4,
d_kv=16
)
model = SwitchTransformersModel(config)
# Count how many encoder blocks were constructed as sparse (MoE) blocks
encoder_sparse_count = sum(
1 for block in model.encoder.block if block.is_sparse
)
print(f"Encoder sparse layers: {encoder_sparse_count}")The bug is in configuration_switch_transformers.py (lines 151 and 157). When num_sparse_encoder_layers=0, the code sets encoder_sparse_step = num_layers (marked with a HACK comment). Combined with the modeling logic in line 668 β when num_layers=1, sparse_step=1 triggers the sparse_step==1 condition, and incorrectly creates a sparse layer.
Expected behavior
When num_sparse_encoder_layers=0 is set, zero sparse layers should be created, regardless of the num_layers value.
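One possible direction for a fix, sketched under the assumption that a step of 0 could serve as an explicit "no sparse layers" marker (this is not a tested patch against the actual files):

```python
# Sketch of a possible fix (untested; assumes 0 can act as an explicit
# "no sparse layers" marker and is honored by the layer-construction loop).

def compute_sparse_flags(num_layers: int, num_sparse_layers: int) -> list[bool]:
    # Config side: use 0 instead of falling back to num_layers.
    sparse_step = num_layers // num_sparse_layers if num_sparse_layers > 0 else 0

    flags = []
    for i in range(num_layers):
        if sparse_step <= 0:
            flags.append(False)  # never mark layers sparse when none were requested
        else:
            flags.append(i % sparse_step == 1 or sparse_step == 1)
    return flags

print(compute_sparse_flags(num_layers=1, num_sparse_layers=0))  # -> [False]
print(compute_sparse_flags(num_layers=4, num_sparse_layers=2))  # -> [False, True, False, True]
```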