Trained a 0.96 million parameter Gemma-style language model on an Urdu corpus.
- Gemma Paper: https://arxiv.org/abs/2503.19786 - Core architecture and design principles
- RMSNorm: https://arxiv.org/abs/1910.07467 - Root Mean Square Layer Normalization
- RoPE: https://arxiv.org/abs/2104.09864 - Rotary Position Embedding methodology
- Grouped Query Attention: https://arxiv.org/abs/2305.13245 - Memory efficient attention mechanism
- SwiGLU/GELU: https://arxiv.org/abs/2002.05202 - Gated linear unit activations
A scaled-down version of Google's Gemma architecture with the following components, configured via GemmaConfig (illustrative sketches of the main components follow the list):
- GemmaAttention: Multi-head attention with grouped query attention (num_queries_per_kv), RoPE positional embeddings applied via apply_rotary_emb(), and causal masking using a pre-computed triangular mask
- GemmaMLP: Feed-forward network with GELU-gated linear units: the GELU-activated gate_proj output multiplies the up_proj output elementwise, and down_proj projects the result back to the hidden size
- GemmaDecoderLayer: Transformer block combining self_attn and mlp with pre-normalization using RMSNorm
- RMSNorm: Root Mean Square Layer Normalization with optional unit offset (add_unit_offset=True) and learnable weight parameter
- tinyGemma: Complete model with token embeddings scaled by sqrt(hidden_size) and weights tied between the embedder and the language modeling head
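
A minimal sketch of grouped query attention with rotary embeddings, assuming a standalone apply_rotary_emb helper and hypothetical constructor arguments (hidden_size, num_heads, num_kv_heads); the actual GemmaAttention reads these values from GemmaConfig and may differ in details such as the exact RoPE formulation:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def apply_rotary_emb(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by position-dependent angles (RoPE).

    x: (batch, heads, seq_len, head_dim) with an even head_dim.
    positions: (seq_len,) integer token positions.
    """
    half = x.shape[-1] // 2
    # One frequency per channel pair, as in the RoPE paper.
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32, device=x.device) / half))
    angles = positions.float()[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # "Rotate half" formulation of the rotary embedding.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).type_as(x)


class GroupedQueryAttention(nn.Module):
    """Multi-head attention where several query heads share each key/value head."""

    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.num_queries_per_kv = num_heads // num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        positions = torch.arange(t, device=x.device)
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Rotary position embedding is applied to queries and keys only.
        q = apply_rotary_emb(q, positions)
        k = apply_rotary_emb(k, positions)
        # Each KV head serves num_queries_per_kv query heads.
        k = k.repeat_interleave(self.num_queries_per_kv, dim=1)
        v = v.repeat_interleave(self.num_queries_per_kv, dim=1)
        # Pre-computed triangular mask: positions above the diagonal are blocked.
        mask = torch.full((t, t), float("-inf"), device=x.device).triu(diagonal=1)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim) + mask
        out = F.softmax(scores, dim=-1) @ v
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

With num_kv_heads smaller than num_heads, the key/value projections and caches shrink accordingly, which is the memory saving grouped query attention targets.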
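A sketch of the gated feed-forward block, assuming Gemma's tanh-approximate GELU; intermediate_size stands in for the corresponding GemmaConfig field:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GemmaMLP(nn.Module):
    """Feed-forward block: gelu(gate_proj(x)) * up_proj(x), projected back by down_proj."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The GELU-activated gate modulates the up projection elementwise.
        return self.down_proj(F.gelu(self.gate_proj(x), approximate="tanh") * self.up_proj(x))
```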
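A sketch of RMSNorm with the unit offset described above; the epsilon value and the zero initialization of the weight are assumptions based on how Gemma typically applies the (1 + weight) scale:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization with Gemma's optional unit offset."""

    def __init__(self, dim: int, eps: float = 1e-6, add_unit_offset: bool = True):
        super().__init__()
        self.eps = eps
        self.add_unit_offset = add_unit_offset
        # Learnable scale; zero-initialized so (1 + weight) starts as an identity scale.
        self.weight = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension.
        x_normed = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        scale = (1 + self.weight) if self.add_unit_offset else self.weight
        return x_normed * scale
```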
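A short sketch of the embedding scaling and weight tying mentioned in the tinyGemma bullet; the class and method names here are illustrative, not the repo's actual API:

```python
import math

import torch
import torch.nn as nn


class TinyGemmaHeadSketch(nn.Module):
    """Illustrates embedding scaling and embedder/LM-head weight tying."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.embedder = nn.Embedding(vocab_size, hidden_size)

    def embed(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Token embeddings are scaled by sqrt(hidden_size) before the decoder stack.
        return self.embedder(token_ids) * math.sqrt(self.hidden_size)

    def lm_logits(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Tied weights: the output head reuses the embedding matrix instead of a separate projection.
        return hidden_states @ self.embedder.weight.t()
```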
Achieved convergence on the Urdu corpus with the following performance metrics:
Final Training Metrics (5000 iterations):
- Training Loss: 2.7668
- Validation Loss: 2.9250
- Validation Perplexity: 18.6348
- Learning Rate: 3e-4 with the AdamW optimizer
- Batch Size: 16 with 2 gradient accumulation steps (training setup sketched below)
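
A hedged sketch of a training loop matching the reported hyperparameters (AdamW at 3e-4, batch size 16, 2 gradient accumulation steps, 5000 iterations); the stand-in model and random-data batcher are placeholders, not this repo's code. Note also that the reported perplexity is simply the exponential of the validation loss: exp(2.9250) ≈ 18.63.

```python
import math

import torch
import torch.nn as nn

# Hyperparameters as reported above; the model and data below are placeholders.
learning_rate = 3e-4
batch_size = 16
grad_accum_steps = 2
max_iters = 5000
vocab_size, seq_len, hidden = 4096, 128, 64

model = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)


def get_batch():
    # Placeholder for the Urdu corpus loader: random token ids with next-token targets.
    data = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
    return data[:, :-1], data[:, 1:]


for it in range(max_iters):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(grad_accum_steps):
        inputs, targets = get_batch()
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
        # Scale so the accumulated gradient matches one effective batch of 32 sequences.
        (loss / grad_accum_steps).backward()
    optimizer.step()

# Validation perplexity is exp of the mean validation cross-entropy:
# exp(2.9250) ≈ 18.63, consistent with the reported value.
print(math.exp(2.9250))
```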
MIT License