
feat: integrate OPUS data selection into training pipeline#548

Draft
NSR9 wants to merge 15 commits into refactor/consolidation from feat/opus-data-selection

Conversation

NSR9 (Collaborator) commented Feb 27, 2026

Summary

  • Integrates OPUS (arXiv:2602.05400) dynamic per-step data selection into the LLM training pipeline
  • Each training step scores N candidate samples and trains on the best rho*N, guided by AdamW preconditioner geometry
  • Implemented as a composable middleware (OpusDataSelector) with minimal changes to existing train.py and main.py
  • Fully toggleable — disabled by default (opus.enabled: false), zero overhead when off

Architecture

DataLoader (2x batch) → OpusDataSelector.select_batch() → selected batch (1x) → train_epoch (unchanged)
  • Scoring pass runs on unwrapped model_engine.module (bypasses DeepSpeed ZeRO gradient sync)
  • Uses standard CE (not FusedLinearCE) for scoring to support ghost hook gradient capture
  • Preconditioner uses DeepSpeed's official safe_get_full_optimizer_state() API
  • RandomInDistributionProxyProvider auto-resets on epoch boundaries
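The per-step flow above can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the function name `select_batch` comes from the description, but the greedy top-k signature here is an assumption (the actual selector uses Boltzmann sampling and gradient-based scores).

```python
def select_batch(candidates, scores, rho):
    """Keep the best rho*N candidates by score (greedy top-k sketch)."""
    n_keep = max(1, int(rho * len(candidates)))
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:n_keep]]

batch = [f"sample_{i}" for i in range(8)]          # 2x candidate batch (N=8)
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]  # pretend per-sample scores
selected = select_batch(batch, scores, rho=0.5)    # trained-on batch is 1x
print(selected)  # ['sample_1', 'sample_3', 'sample_5', 'sample_7']
```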

New files (llm/src/llm/opus/)

| File | Description |
| --- | --- |
| `config.py` | `OpusConfig` dataclass with `from_dict()` factory |
| `ghost.py` | `GhostCollector` — forward/backward hooks for activation + gradient capture |
| `countsketch.py` | `CountSketchProjector` — dimensionality reduction for gradient sketches |
| `preconditioner.py` | `AdamWPreconditionerView` — reads optimizer state for scoring geometry |
| `selector.py` | `OpusSelector` — scoring + Boltzmann sampling selection |
| `proxy.py` | Proxy data providers (random in-distribution, benchmark) |
| `data_selector.py` | `OpusDataSelector` — composable middleware encapsulating the full pipeline |
| `distributed.py` | Thin wrappers around `torch.distributed` |
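A `from_dict()` factory on a dataclass typically just filters the parsed YAML section down to known fields. The sketch below is a hypothetical stand-in for `config.py`: `enabled: false` matches the stated default, but `rho` and `temperature` are assumed field names for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class OpusConfig:
    """Illustrative stand-in for opus/config.py; field names beyond `enabled` are assumed."""
    enabled: bool = False     # matches the opus.enabled: false default
    rho: float = 0.5          # hypothetical: fraction of candidates kept per step
    temperature: float = 1.0  # hypothetical: Boltzmann sampling temperature

    @classmethod
    def from_dict(cls, d):
        # Ignore unknown keys so extra YAML entries don't crash config parsing.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})

cfg = OpusConfig.from_dict({"enabled": True, "rho": 0.25, "unknown_key": 1})
print(cfg.enabled, cfg.rho)  # True 0.25
```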

Modified files

  • llm/main.py — Parse opus: config, create candidate+proxy dataloaders, init OpusDataSelector
  • llm/src/llm/train.py — Add opus_selector param, call select_batch() before forward, refresh preconditioner after step
  • llm/configs/config.yaml — Add opus: section (disabled by default)
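The `train.py` change can be sketched as an optional hook around the existing step, assuming the two method names given in this description (`select_batch`, and a preconditioner refresh after the optimizer step); the exact signatures are assumptions.

```python
def train_epoch(dataloader, step_fn, opus_selector=None):
    """Sketch of the modified loop: opus_selector is optional, zero-cost when None."""
    losses = []
    for batch in dataloader:
        if opus_selector is not None:
            batch = opus_selector.select_batch(batch)   # 2x candidates -> 1x batch
        loss = step_fn(batch)                           # unchanged forward/backward/step
        if opus_selector is not None:
            opus_selector.refresh_preconditioner()      # re-read AdamW state after step
        losses.append(loss)
    return losses

class _DummySelector:
    """Toy selector that keeps the first half of each batch."""
    def select_batch(self, b): return b[: len(b) // 2]
    def refresh_preconditioner(self): pass

out = train_epoch([[1, 2, 3, 4]], step_fn=sum, opus_selector=_DummySelector())
print(out)  # [3]  (trains on the first half, [1, 2])
```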

Test plan

  • 16 unit tests covering config, countsketch, ghost, preconditioner, selector, data_selector, public API
  • All existing logger tests pass (no regressions)
  • Import smoke test passes
  • GPU smoke test: 10 steps with OPUS enabled on single GPU
  • A/B comparison: 100 steps OPUS vs baseline loss curves

🤖 Generated with Claude Code

NSR9 and others added 15 commits February 27, 2026 06:39
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy self-contained OPUS modules from production source and add tests
for CountSketch determinism/shape and GhostCollector hook lifecycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
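The determinism property this commit tests is standard for CountSketch: the bucket and sign assignments are drawn once from a seeded RNG, so the same seed always yields the same sketch. A minimal pure-Python version (not the PR's implementation) looks like this:

```python
import random

class CountSketchProjector:
    """Minimal CountSketch: maps dim-d vectors to dim-k sketches.
    Deterministic for a fixed seed; illustrative, not the production class."""
    def __init__(self, d, k, seed=0):
        rng = random.Random(seed)
        self.buckets = [rng.randrange(k) for _ in range(d)]      # coordinate -> bucket
        self.signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # random +/- sign
        self.k = k

    def project(self, x):
        out = [0.0] * self.k
        for j, v in enumerate(x):
            out[self.buckets[j]] += self.signs[j] * v
        return out

p1 = CountSketchProjector(d=6, k=3, seed=42)
p2 = CountSketchProjector(d=6, k=3, seed=42)
print(p1.project([1.0] * 6) == p2.project([1.0] * 6))  # True: same seed, same sketch
```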
Copy selector.py from OpusImplementation_updated and fix absolute
imports (production.config, production.distributed) to relative
imports (.config, .distributed). Add tests for OpusSelector init
and SelectionResult dataclass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the data selection middleware that sits between DataLoader and
training loop, orchestrating GhostCollector scoring, CountSketch feature
building, and Boltzmann selection to pick the best rho*N samples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
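Boltzmann selection, as named in this commit, samples candidates without replacement with probability proportional to exp(score / T), so it interpolates between greedy top-k (T → 0) and uniform random (T → ∞). The sketch below shows the technique only; the PR's selector additionally handles non-finite scores and distributed gathering.

```python
import math, random

def boltzmann_select(scores, n_keep, temperature=1.0, seed=0):
    """Sample n_keep indices without replacement, P(i) proportional to exp(score_i / T)."""
    rng = random.Random(seed)
    remaining = list(range(len(scores)))
    picked = []
    while len(picked) < n_keep:
        weights = [math.exp(scores[i] / temperature) for i in remaining]
        i = rng.choices(remaining, weights=weights, k=1)[0]
        picked.append(i)
        remaining.remove(i)
    return sorted(picked)

# Low temperature concentrates mass on the two high-score candidates.
idx = boltzmann_select([0.1, 5.0, 0.2, 4.0], n_keep=2, temperature=0.1)
print(idx)
```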
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move OPUS selection before device transfer to avoid wasted GPU copy
- Use RandomInDistributionProxyProvider instead of raw iter() to auto-reset
  on epoch boundaries (prevents silent degradation to random selection)
- Support both ProxyProvider and raw iterator in OpusDataSelector

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
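The auto-reset behavior this commit describes amounts to rebuilding the proxy iterator on `StopIteration` instead of letting selection silently degrade. A hedged sketch, with an assumed factory-based constructor (the real `RandomInDistributionProxyProvider` signature may differ):

```python
class RandomInDistributionProxyProvider:
    """Sketch: restarts the proxy iterator at epoch boundaries instead of
    silently falling back to random selection. Constructor is an assumption."""
    def __init__(self, make_loader):
        self._make_loader = make_loader          # factory, e.g. lambda: dataloader
        self._it = iter(make_loader())

    def next_proxy_batch(self):
        try:
            return next(self._it)
        except StopIteration:
            self._it = iter(self._make_loader())  # epoch boundary: reset and continue
            return next(self._it)

provider = RandomInDistributionProxyProvider(lambda: [10, 20])
print([provider.next_proxy_batch() for _ in range(5)])  # [10, 20, 10, 20, 10]
```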
… critical bug where scores from different candidates on different GPUs were being summed as if they were the same sample
Added explicit warning when candidate scores are non-finite — logs how many scores are bad, and a stronger warning when ALL scores are non-finite (full degradation to random)
…l 4 hook closures; captured activations and gradients now own their memory and cannot be corrupted by in-place ops. Tradeoff: higher memory-copy overhead each OPUS scoring pass (noticeable at large batch/seq/hidden sizes).
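The memory-ownership tradeoff in this commit can be illustrated without PyTorch: a hook that stores a reference to a buffer is corrupted by later in-place writes, while a hook that copies owns its data at the cost of the copy. The class below is a toy analogy, not the `GhostCollector` code.

```python
class GhostHook:
    """Toy analogy for the hook fix: capture by reference vs. by copy."""
    def __init__(self, copy_on_capture):
        self.copy_on_capture = copy_on_capture
        self.captured = None

    def __call__(self, activation):
        # copy_on_capture=False mimics the pre-fix bug: the hook aliases
        # the live buffer, so later in-place ops corrupt the capture.
        self.captured = list(activation) if self.copy_on_capture else activation

buggy, fixed = GhostHook(False), GhostHook(True)
act = [1.0, 2.0]
buggy(act)
fixed(act)
act[0] = -99.0            # a later in-place op on the same buffer
print(buggy.captured[0])  # -99.0: corrupted
print(fixed.captured[0])  # 1.0: owns its memory
```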
…terable on StopIteration between epochs, and surface fallback exceptions with type + full traceback (exc_info=True)
@pankaj1311 pankaj1311 marked this pull request as draft March 2, 2026 13:07