
feat: integrate OPUS data selection into training pipeline#548

Draft
NSR9 wants to merge 15 commits into refactor/consolidation from feat/opus-data-selection

Conversation

NSR9 (Collaborator) commented Feb 27, 2026

Summary

  • Integrates OPUS (arXiv:2602.05400) dynamic per-step data selection into the LLM training pipeline
  • Each training step scores N candidate samples and trains on the best rho*N, guided by AdamW preconditioner geometry
  • Implemented as a composable middleware (OpusDataSelector) with minimal changes to existing train.py and main.py
  • Fully toggleable — disabled by default (opus.enabled: false), zero overhead when off

Architecture

DataLoader (2x batch) → OpusDataSelector.select_batch() → selected batch (1x) → train_epoch (unchanged)
  • Scoring pass runs on unwrapped model_engine.module (bypasses DeepSpeed ZeRO gradient sync)
  • Uses standard CE (not FusedLinearCE) for scoring to support ghost hook gradient capture
  • Preconditioner uses DeepSpeed's official safe_get_full_optimizer_state() API
  • RandomInDistributionProxyProvider auto-resets on epoch boundaries
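The per-step flow above can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the function name `select_batch` comes from the description, but the greedy top-k signature here is an assumption (the actual selector uses Boltzmann sampling and gradient-based scores).

```python
def select_batch(candidates, scores, rho):
    """Keep the best rho*N candidates by score (greedy top-k sketch)."""
    n_keep = max(1, int(rho * len(candidates)))
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in ranked[:n_keep]]

batch = [f"sample_{i}" for i in range(8)]          # 2x candidate batch (N=8)
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7, 0.3, 0.6]  # pretend per-sample scores
selected = select_batch(batch, scores, rho=0.5)    # trained-on batch is 1x
print(selected)  # ['sample_1', 'sample_3', 'sample_5', 'sample_7']
```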

New files (llm/src/llm/opus/)

| File | Description |
| --- | --- |
| `config.py` | `OpusConfig` dataclass with `from_dict()` factory |
| `ghost.py` | `GhostCollector` — forward/backward hooks for activation + gradient capture |
| `countsketch.py` | `CountSketchProjector` — dimensionality reduction for gradient sketches |
| `preconditioner.py` | `AdamWPreconditionerView` — reads optimizer state for scoring geometry |
| `selector.py` | `OpusSelector` — scoring + Boltzmann sampling selection |
| `proxy.py` | Proxy data providers (random in-distribution, benchmark) |
| `data_selector.py` | `OpusDataSelector` — composable middleware encapsulating the full pipeline |
| `distributed.py` | Thin wrappers around `torch.distributed` |
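A `from_dict()` factory on a dataclass typically just filters the parsed YAML section down to known fields. The sketch below is a hypothetical stand-in for `config.py`: `enabled: false` matches the stated default, but `rho` and `temperature` are assumed field names for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class OpusConfig:
    """Illustrative stand-in for opus/config.py; field names beyond `enabled` are assumed."""
    enabled: bool = False     # matches the opus.enabled: false default
    rho: float = 0.5          # hypothetical: fraction of candidates kept per step
    temperature: float = 1.0  # hypothetical: Boltzmann sampling temperature

    @classmethod
    def from_dict(cls, d):
        # Ignore unknown keys so extra YAML entries don't crash config parsing.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})

cfg = OpusConfig.from_dict({"enabled": True, "rho": 0.25, "unknown_key": 1})
print(cfg.enabled, cfg.rho)  # True 0.25
```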

Modified files

  • llm/main.py — Parse opus: config, create candidate+proxy dataloaders, init OpusDataSelector
  • llm/src/llm/train.py — Add opus_selector param, call select_batch() before forward, refresh preconditioner after step
  • llm/configs/config.yaml — Add opus: section (disabled by default)
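The `train.py` change can be sketched as an optional hook around the existing step, assuming the two method names given in this description (`select_batch`, and a preconditioner refresh after the optimizer step); the exact signatures are assumptions.

```python
def train_epoch(dataloader, step_fn, opus_selector=None):
    """Sketch of the modified loop: opus_selector is optional, zero-cost when None."""
    losses = []
    for batch in dataloader:
        if opus_selector is not None:
            batch = opus_selector.select_batch(batch)   # 2x candidates -> 1x batch
        loss = step_fn(batch)                           # unchanged forward/backward/step
        if opus_selector is not None:
            opus_selector.refresh_preconditioner()      # re-read AdamW state after step
        losses.append(loss)
    return losses

class _DummySelector:
    """Toy selector that keeps the first half of each batch."""
    def select_batch(self, b): return b[: len(b) // 2]
    def refresh_preconditioner(self): pass

out = train_epoch([[1, 2, 3, 4]], step_fn=sum, opus_selector=_DummySelector())
print(out)  # [3]  (trains on the first half, [1, 2])
```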

Test plan

  • 16 unit tests covering config, countsketch, ghost, preconditioner, selector, data_selector, public API
  • All existing logger tests pass (no regressions)
  • Import smoke test passes
  • GPU smoke test: 10 steps with OPUS enabled on single GPU
  • A/B comparison: 100 steps OPUS vs baseline loss curves

🤖 Generated with Claude Code

NSR9 and others added 15 commits February 27, 2026 06:39
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy self-contained OPUS modules from production source and add tests
for CountSketch determinism/shape and GhostCollector hook lifecycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
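The determinism property this commit tests is standard for CountSketch: the bucket and sign assignments are drawn once from a seeded RNG, so the same seed always yields the same sketch. A minimal pure-Python version (not the PR's implementation) looks like this:

```python
import random

class CountSketchProjector:
    """Minimal CountSketch: maps dim-d vectors to dim-k sketches.
    Deterministic for a fixed seed; illustrative, not the production class."""
    def __init__(self, d, k, seed=0):
        rng = random.Random(seed)
        self.buckets = [rng.randrange(k) for _ in range(d)]      # coordinate -> bucket
        self.signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # random +/- sign
        self.k = k

    def project(self, x):
        out = [0.0] * self.k
        for j, v in enumerate(x):
            out[self.buckets[j]] += self.signs[j] * v
        return out

p1 = CountSketchProjector(d=6, k=3, seed=42)
p2 = CountSketchProjector(d=6, k=3, seed=42)
print(p1.project([1.0] * 6) == p2.project([1.0] * 6))  # True: same seed, same sketch
```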
Copy selector.py from OpusImplementation_updated and fix absolute
imports (production.config, production.distributed) to relative
imports (.config, .distributed). Add tests for OpusSelector init
and SelectionResult dataclass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implements the data selection middleware that sits between DataLoader and
training loop, orchestrating GhostCollector scoring, CountSketch feature
building, and Boltzmann selection to pick the best rho*N samples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
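Boltzmann selection, as named in this commit, samples candidates without replacement with probability proportional to exp(score / T), so it interpolates between greedy top-k (T → 0) and uniform random (T → ∞). The sketch below shows the technique only; the PR's selector additionally handles non-finite scores and distributed gathering.

```python
import math, random

def boltzmann_select(scores, n_keep, temperature=1.0, seed=0):
    """Sample n_keep indices without replacement, P(i) proportional to exp(score_i / T)."""
    rng = random.Random(seed)
    remaining = list(range(len(scores)))
    picked = []
    while len(picked) < n_keep:
        weights = [math.exp(scores[i] / temperature) for i in remaining]
        i = rng.choices(remaining, weights=weights, k=1)[0]
        picked.append(i)
        remaining.remove(i)
    return sorted(picked)

# Low temperature concentrates mass on the two high-score candidates.
idx = boltzmann_select([0.1, 5.0, 0.2, 4.0], n_keep=2, temperature=0.1)
print(idx)
```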
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move OPUS selection before device transfer to avoid wasted GPU copy
- Use RandomInDistributionProxyProvider instead of raw iter() to auto-reset
  on epoch boundaries (prevents silent degradation to random selection)
- Support both ProxyProvider and raw iterator in OpusDataSelector

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
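The auto-reset behavior this commit describes amounts to rebuilding the proxy iterator on `StopIteration` instead of letting selection silently degrade. A hedged sketch, with an assumed factory-based constructor (the real `RandomInDistributionProxyProvider` signature may differ):

```python
class RandomInDistributionProxyProvider:
    """Sketch: restarts the proxy iterator at epoch boundaries instead of
    silently falling back to random selection. Constructor is an assumption."""
    def __init__(self, make_loader):
        self._make_loader = make_loader          # factory, e.g. lambda: dataloader
        self._it = iter(make_loader())

    def next_proxy_batch(self):
        try:
            return next(self._it)
        except StopIteration:
            self._it = iter(self._make_loader())  # epoch boundary: reset and continue
            return next(self._it)

provider = RandomInDistributionProxyProvider(lambda: [10, 20])
print([provider.next_proxy_batch() for _ in range(5)])  # [10, 20, 10, 20, 10]
```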
… critical bug where scores from different candidates on different GPUs were being summed as if they were the same sample
Added explicit warning when candidate scores are non-finite — logs how many scores are bad, and a stronger warning when ALL scores are non-finite (full degradation to random)
…l 4 hook closures; captured activations and gradients now own their memory and cannot be corrupted by in-place ops. Tradeoff: higher memory-copy overhead each OPUS scoring pass (noticeable at large batch/seq/hidden sizes).
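The memory-ownership tradeoff in this commit can be illustrated without PyTorch: a hook that stores a reference to a buffer is corrupted by later in-place writes, while a hook that copies owns its data at the cost of the copy. The class below is a toy analogy, not the `GhostCollector` code.

```python
class GhostHook:
    """Toy analogy for the hook fix: capture by reference vs. by copy."""
    def __init__(self, copy_on_capture):
        self.copy_on_capture = copy_on_capture
        self.captured = None

    def __call__(self, activation):
        # copy_on_capture=False mimics the pre-fix bug: the hook aliases
        # the live buffer, so later in-place ops corrupt the capture.
        self.captured = list(activation) if self.copy_on_capture else activation

buggy, fixed = GhostHook(False), GhostHook(True)
act = [1.0, 2.0]
buggy(act)
fixed(act)
act[0] = -99.0            # a later in-place op on the same buffer
print(buggy.captured[0])  # -99.0: corrupted
print(fixed.captured[0])  # 1.0: owns its memory
```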
…terable on StopIteration between epochs, and surface fallback exceptions with type + full traceback (exc_info=True)
@pankaj1311 pankaj1311 marked this pull request as draft March 2, 2026 13:07