
Conversation

@oleksost (Contributor) commented on Jan 9, 2026

✨ Description

Addressing #442

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Change A
  2. Change B

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

tscholak and others added 30 commits November 27, 2025 19:24
- Rename Apriel2CheckpointFormat to Apriel2TextCheckpointFormat for text-only models
- Add new Apriel2CheckpointFormat for multimodal models (tabled for now)
- Replace num_hidden_layers with num_blocks in decoder config (Fast-LLM convention)
- Update test fixtures to use num_blocks in decoder configs
- Fix stochastic mixer preprocess() to collect attention_mask from nested mixers
- Add cache initialization to Apriel2GatedDeltaNet for lazy allocation
- Use past_key_values (plural) consistently per HuggingFace convention
- Update test code to use model.model.decoder.blocks[idx] accessor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…aches

- Test 1: Empty cache vs filled cache - verifies cache is being used at all
- Test 2: Corrupted cache (zeros) vs correct cache - verifies cache VALUES matter
- Derive cache dimensions from actual forward pass (handles different attention configs)
- Fix: original test used wrong attribute names (key_cache/value_cache instead of key/value)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Update modeling_apriel2.py to use direct dict access instead of helper
  methods (config.embeddings["max_position_embeddings"] instead of
  config.get_max_position_embeddings())
- Fix activation export in vision adapter converter to use .hf_name
  instead of .value for proper round-trip conversion
- Fix MultiModalInferenceRunner naming in multimodal/config.py
- Raise NotImplementedError for multimodal HF wrapper (not implemented)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Multimodal converter: stratified inheritance from Pixtral/LLaVA
  - Inherit get_converters for Attention, Block, Encoder, Adapter (shares weight conversion logic)
  - Standalone PatchConvolutionConverter (different paths, no meaningful sharing)
  - Override all import_config/export_config (different naming and nested structure)
- Remove verbose docstrings and self-narrative comments from all Apriel2 files

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Introduces convert_from_llava.py which converts Llava/Pixtral models
(like Apriel 1.5) to Apriel2 format. The converter handles:
- Config conversion from Llava to Apriel2 format
- Weight mapping between different naming conventions
- Vision encoder, projector, and language model weights
- Support for both local paths and HuggingFace model IDs

Test coverage includes:
- Config conversion validation
- Component-level forward pass equivalence (embeddings, vision encoder,
  projector, language model layers)
- Full model forward pass equivalence for text-only inputs
- Multimodal forward pass validation (image + text inputs)
- Apriel 1.5 large model conversion test (marked as slow)

Note: Multimodal numerical equivalence is not possible due to
architectural differences between Pixtral and Apriel2 vision encoders
(Pixtral produces (size/16)^2 - 1 patches vs Apriel2's (size/16)^2).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Refactors the Llava-to-Apriel2 converter to cleanly separate concerns:

1. **convert_from_llava.py** - Pure format conversion (Llava -> Apriel2)
   - Config conversion: 1-to-1 mapping of Llava config to Apriel2 format
   - Weight conversion: Pure name mapping, no transformations
   - No surgery logic - just format translation

2. **surgery.py** - Generic Apriel2 -> Apriel2 transformation
   - Layer-by-layer conversion using converter registry
   - For stochastic mixers, source is always the main mixer
   - Supports wrapping attention with stochastic mixer
   - Random initialization for incompatible conversions (e.g., attention -> mamba)

3. **converters.py** - Converter registry and implementations
   - Identity: forall a. a -> a
   - Bidirectional: attention <-> sliding_window
   - Random init utilities for mamba, attention, gated_delta_net

Benefits:
- Surgery can be applied to ANY Apriel2 model, not just converted ones
- Easy to add new source formats (Qwen, Llama, etc.)
- No intermediate persistence - all operations on in-memory state dicts
- Cleaner code: 725 lines removed in refactor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Add expr_plan.py: declarative weight transformation with composable
  expressions (Ref, Slice, Concat, Init, Reshape) and streaming executor (sketched below)
- Implement MIL (Mamba Initialization from LLM) for attention->mamba surgery
- Remove legacy converters.py and surgery.py (imperative approach)
- Simplify convert_from_llava.py to use plan-based streaming only
- Update tests to use new expr_plan API

The plan system enables:
- Composable conversions via plan composition (Llava->Apriel2->Modified)
- Memory-efficient streaming execution with ref-counting
- Declarative, inspectable transformation plans
- W path builder for readable key construction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
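
A minimal, self-contained sketch of the expression-plan idea described above. The names Ref, Slice, and Concat come from the commit message, but the definitions and the tiny executor below are illustrative assumptions, not the PR's actual expr_plan implementation.

```python
# Hypothetical mini version of a declarative weight-transformation plan.
from dataclasses import dataclass

import torch


@dataclass
class Ref:          # read a tensor from the source checkpoint by key
    key: str


@dataclass
class Slice:        # take expr[start:stop] along dim 0
    expr: object
    start: int
    stop: int


@dataclass
class Concat:       # concatenate sub-expressions along dim 0
    exprs: list


def evaluate(expr, source: dict[str, torch.Tensor]) -> torch.Tensor:
    if isinstance(expr, Ref):
        return source[expr.key]
    if isinstance(expr, Slice):
        return evaluate(expr.expr, source)[expr.start:expr.stop]
    if isinstance(expr, Concat):
        return torch.cat([evaluate(e, source) for e in expr.exprs], dim=0)
    raise TypeError(f"unknown expression: {expr!r}")


# A plan is just target key -> expression; here a fused QKV weight is split apart.
source = {"qkv.weight": torch.randn(12, 4)}
plan = {
    "q_proj.weight": Slice(Ref("qkv.weight"), 0, 4),
    "kv_proj.weight": Concat([Slice(Ref("qkv.weight"), 4, 8), Slice(Ref("qkv.weight"), 8, 12)]),
}
converted = {key: evaluate(expr, source) for key, expr in plan.items()}
```

Because plans are plain data, they can be composed (Llava->Apriel2 followed by Apriel2->Modified) and inspected before any weights are touched, which is what makes the streaming, ref-counted execution possible.
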
Key changes:
- Add GatedDeltaNet (DIL) conversion from attention weights
- Support stochastic mixer with multiple sub-mixers (attention + mamba/GDN)
- Add dt_init_floor parameter for Mamba dt_bias initialization
- Fix plan tree collapsing to merge layers but not projections
- Add example YAML configs for hybrid architectures

The tree collapsing fix ensures that layers [0..47] are merged at the
blocks level while projections (q_proj, k_proj, etc.) remain separate.
This is achieved by tracking which positions vary within each group
and only allowing merges when the cross-group variation matches.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
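
A hypothetical helper (not the PR's code) illustrating the merge criterion described in the paragraph above:

```python
def varying_positions(keys: list[str]) -> set[int]:
    """Indices of dotted-path segments that differ across a group of keys."""
    parts = [key.split(".") for key in keys]
    return {i for i in range(len(parts[0])) if len({p[i] for p in parts}) > 1}


layer_group = ["blocks.0.mixer.q_proj.weight", "blocks.1.mixer.q_proj.weight"]
proj_group = ["blocks.0.mixer.q_proj.weight", "blocks.0.mixer.k_proj.weight"]

print(varying_positions(layer_group))  # {1}: only the block index varies
print(varying_positions(proj_group))   # {3}: only the projection name varies
# Groups whose varying position is the block index collapse into blocks.[0..47];
# groups that vary elsewhere (the projections) stay expanded.
```
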
- Add SafetensorLoader context manager for O(1) key lookup across sharded files (see the sketch below)
- Add ShardedSafetensorWriter for streaming output with configurable shard size
- Update convert_from_llava.py to use streaming pipeline
- Bounds peak memory to ~5GB instead of ~30GB for large models

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
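
A self-contained sketch (not the PR's SafetensorLoader) of how the standard safetensors index file gives O(1) key-to-shard lookup, so each output tensor can be produced and written without materializing the full state dict:

```python
import json
from pathlib import Path

from safetensors import safe_open


class ShardedLoader:
    """Resolve a tensor name to its shard via the index, then load just that tensor."""

    def __init__(self, model_dir: str):
        self._dir = Path(model_dir)
        index = json.loads((self._dir / "model.safetensors.index.json").read_text())
        self._weight_map = index["weight_map"]  # tensor name -> shard filename

    def get(self, key: str):
        shard_path = self._dir / self._weight_map[key]
        with safe_open(str(shard_path), framework="pt") as shard:
            return shard.get_tensor(key)  # memory-mapped read of a single tensor
```

The writer side of the pipeline does the inverse: it accumulates produced tensors and flushes a new output shard whenever the configured size cap is reached, which is what bounds peak memory during conversion.
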
…erter

- Split monolithic expr_plan.py into conversion/ subpackage:
  - expr.py: Expression DSL types (Ref, Slice, Concat, Init, Reshape)
  - render.py: Plan rendering and tree visualization
  - executor.py: Plan execution and streaming executor
  - io.py: SafetensorLoader and ShardedSafetensorWriter
  - converters.py: MIL/DIL converters and surgery planning

- Move Llava-specific code into conversion/llava/:
  - config.py: Llava config to Apriel2 config conversion
  - plan.py: Llava to Apriel2 weight plan builder

- Create source-format agnostic convert.py:
  - Registry pattern for source formats (SOURCE_FORMATS dict; sketched below)
  - Auto-detection via detect_source_format()
  - Generic build_plan() and convert() functions

- Update tests to use new imports and add seed=0 to execute() calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
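
A minimal sketch of the registry pattern named above. SOURCE_FORMATS and detect_source_format come from the commit message; the registration decorator and the exact detection rule shown here are assumptions for illustration.

```python
from typing import Callable

SOURCE_FORMATS: dict[str, Callable[[dict], dict]] = {}  # format name -> plan builder


def register_source_format(name: str):
    def decorator(build_plan_fn: Callable[[dict], dict]):
        SOURCE_FORMATS[name] = build_plan_fn
        return build_plan_fn
    return decorator


@register_source_format("llava")
def build_llava_plan(source_config: dict) -> dict:
    ...  # Llava -> Apriel2 weight plan (lives in conversion/llava/plan.py)


def detect_source_format(config: dict) -> str:
    # Illustrative rule: Llava-style configs carry a vision_config section.
    if config.get("model_type") == "llava" or "vision_config" in config:
        return "llava"
    raise ValueError("unrecognized source format")
```
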
The GDN uses DIL initialization which maps attention Q/K/V/O weights
to GDN projections. Only conv_kernel_size needs to be specified -
other dimensions (num_value_heads, num_key_heads, head dims) are
automatically derived from the source attention config.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
CLI changes:
- Support multiple --surgery/-s args for chaining surgeries
- Add apriel2 as source format (surgery-only mode, no conversion)
- Auto-detect Apriel2 configs by model_type or decoder field

New modules:
- config.py: compose_configs for declarative config composition
- test_compose_configs.py: Monoid laws and config composition tests (see the sketch below)
- test_plan_composition_torture.py: Cycling surgeries for stochastic mixers

Bug fixes:
- Increase cache correctness tolerance in test_modeling (GPU precision)
- Comment out GDN conv1d.bias (Qwen3NextGatedDeltaNet has bias=False)

Documentation cleanup:
- Remove verbose Args/Returns sections (prefer type signatures)
- Condense inline comments to essential "what and why"
- Remove historical context, focus on current design
- Shorten function docstrings to one-liners where obvious

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
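
A hedged sketch of what the "monoid laws" tests plausibly exercise for compose_configs; only the function name comes from this PR, and the recursive-merge rule below is an assumption for illustration.

```python
def compose_configs(base: dict, overlay: dict) -> dict:
    out = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = compose_configs(out[key], value)  # merge nested sections
        else:
            out[key] = value  # overlay wins on leaves
    return out


a = {"decoder": {"num_blocks": 48}}
b = {"decoder": {"block": {"mixer": {"type": "stochastic"}}}}
c = {"decoder": {"block": {"mixer": {"main": "attention"}}}}

# Associativity and identity, checked on these example configs:
assert compose_configs(compose_configs(a, b), c) == compose_configs(a, compose_configs(b, c))
assert compose_configs(a, {}) == a and compose_configs({}, a) == a
```
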
Aligns Apriel2 external HF model naming with upstream Fast-LLM's
VisionEncoderConfig which renamed patch_convolution → embeddings.

Changes:
- Rename Apriel2PatchConvolution class to Apriel2Embeddings
- Rename .conv/.norm to .patch_embeddings/.normalization
- Update all weight paths and config keys
- Add image_sizes support to Apriel2 for dynamic image cropping
- Enable HuggingFace wrapper for multimodal models

No backwards compatibility shims - clean break since no Apriel2
checkpoints exist yet.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…uite

- Fix tensor contiguity issue in Apriel2Embeddings.forward that caused
  ~4.7e-7 numerical differences vs Pixtral. The transpose operation
  creates a non-contiguous tensor, and RMSNorm produces slightly
  different results on non-contiguous tensors due to FP computation
  order differences (a minimal reproduction sketch follows below).

- Add test_equivalence.py with source-of-truth isolation testing
  philosophy: each component is tested by using Pixtral's output as
  input to both models, ensuring strict 1e-6 tolerance and pinpointing
  exactly which component has a bug if tests fail.

- Remove redundant forward-pass tests from test_convert_from_llava.py
  that are now covered by the comprehensive equivalence test suite.

- Add model_pair fixture and various input configurations for thorough
  testing across different batch sizes and image configurations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
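
A minimal reproduction sketch of the contiguity effect described above (assumes PyTorch >= 2.4 for torch.nn.RMSNorm; this is not the Apriel2Embeddings code). Depending on the backend, the two outputs may match exactly or differ at roughly the 1e-7 level reported above.

```python
import torch

x = torch.randn(4, 64, 16)
y = x.transpose(1, 2)              # (4, 16, 64) non-contiguous view
norm = torch.nn.RMSNorm(64)

out_view = norm(y)                 # reduction order depends on memory layout
out_contig = norm(y.contiguous())  # the fix: normalize a contiguous tensor
print(y.is_contiguous())           # False
print((out_view - out_contig).abs().max())
```
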
The external Apriel2 HuggingFace model removed the `.self_attn` wrapper
indirection from attention layers. This updates the converters to match:

- Vision encoder: `mixer.self_attn` -> `mixer`
- Text decoder attention blocks: `mixer.self_attn` -> `mixer`
- Stochastic mixer attention: `mixers.{name}.self_attn` -> `mixers.{name}`

Without this fix, weight conversion produced warnings about unused weights
at `mixer.self_attn.*` paths and uninitialized weights at `mixer.*` paths.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Test validates that triton=True and triton=False produce equivalent
attention outputs for both FastLLM's Rotary2D and Pixtral's
PixtralRotaryEmbedding implementations.

Key findings:
- Layout conversion between real/interleaved formats works correctly
- FastLLM and Pixtral use different frequency calculations (so that cross-implementation comparison is skipped)
- Uses convert_rotary_complex_to_real/convert_rotary_real_to_complex
  for weight layout conversion (same as model converters)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Inlined Qwen3NextGatedDeltaNet into Apriel2GatedDeltaNet, removing external dependency
- Aligned all weight names with Fast-LLM: in_proj_qkvz, in_proj_ba, convolution, out_proj, dt_bias, A_log, norm
- Aligned config params with Fast-LLM: value_heads, key_heads, key_head_dim, value_head_dim
- Added FLA imports with pure PyTorch fallbacks for chunk_gated_delta_rule and rms_norm_gated
- Added GatedRMSNormalization class matching Fast-LLM's implementation
- Fixed cache initialization to check per-mixer conv_state before using precomputed states
- Fixed causal_conv1d_update tensor shape handling for single-token decode
- Updated all converter paths and test fixtures to use new naming convention

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Some models (like Apriel-1.5-15b-Thinker) have head_dim != hidden_size // num_heads.
The config explicitly stores head_dim, but we were computing it incorrectly.

Now we check for explicit head_dim first, falling back to computation only
when not present or None.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
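
A minimal sketch of the resolution order described above (hypothetical helper; the PR's converter expresses this in its own config handling):

```python
def resolve_head_dim(config: dict) -> int:
    head_dim = config.get("head_dim")
    if head_dim is not None:  # e.g. Apriel-1.5-15b-Thinker stores head_dim explicitly
        return head_dim
    return config["hidden_size"] // config["num_attention_heads"]  # fallback when absent or None
```
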
@oleksost oleksost requested a review from jlamypoirier January 9, 2026 19:11
@oleksost oleksost closed this Jan 9, 2026
@oleksost oleksost removed the request for review from jlamypoirier January 9, 2026 19:11
@oleksost oleksost deleted the loss_masking branch January 9, 2026 19:13