feat: integrate OPUS data selection into training pipeline#548
Draft
NSR9 wants to merge 15 commits into refactor/consolidation from
Conversation
Copy self-contained OPUS modules from production source and add tests for CountSketch determinism/shape and GhostCollector hook lifecycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
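The determinism/shape properties this commit tests can be illustrated with a minimal CountSketch sketch. This is a hypothetical stand-in for the PR's `countsketch.py` (class name reused for readability; the seeded-hash construction is an assumption, not the actual code): the same seed must yield the same hashes, hence the same sketch.

```python
import numpy as np

class CountSketchProjector:
    """Minimal CountSketch: project d-dim vectors down to k dims using a
    seeded bucket hash and sign hash. Illustrative only."""

    def __init__(self, d: int, k: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.h = rng.integers(0, k, size=d)        # bucket hash [d] -> [k]
        self.s = rng.choice([-1.0, 1.0], size=d)   # sign hash  [d] -> {+-1}
        self.k = k

    def project(self, x: np.ndarray) -> np.ndarray:
        y = np.zeros(self.k)
        np.add.at(y, self.h, self.s * x)           # y[h[i]] += s[i] * x[i]
        return y

# Checks in the spirit of the commit's determinism/shape tests:
p1 = CountSketchProjector(d=64, k=8, seed=42)
p2 = CountSketchProjector(d=64, k=8, seed=42)
x = np.arange(64, dtype=float)
assert p1.project(x).shape == (8,)                 # output shape is k
assert np.allclose(p1.project(x), p2.project(x))   # same seed, same sketch
```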
Copy selector.py from OpusImplementation_updated and fix absolute imports (production.config, production.distributed) to relative imports (.config, .distributed). Add tests for OpusSelector init and SelectionResult dataclass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
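For context, a result dataclass of the kind this commit tests might look like the sketch below. The field names here are assumptions, not the PR's actual `SelectionResult` definition:

```python
from dataclasses import dataclass, field

@dataclass
class SelectionResult:
    """Hypothetical shape of a selection result; fields are illustrative."""
    selected_indices: list          # positions chosen from the candidate batch
    scores: list                    # per-candidate scores used for selection
    fallback: bool = False          # True when selection degraded to random
    meta: dict = field(default_factory=dict)

r = SelectionResult(selected_indices=[0, 3], scores=[0.9, 0.1, 0.2, 0.8])
assert r.fallback is False and len(r.selected_indices) == 2
```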
Implements the data selection middleware that sits between DataLoader and training loop, orchestrating GhostCollector scoring, CountSketch feature building, and Boltzmann selection to pick the best rho*N samples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
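The final selection step can be sketched as Boltzmann (softmax-over-scores) sampling of `round(rho * N)` candidates. This is a minimal sketch of that step only, assuming a standard temperature-scaled softmax; the function name and `temperature` parameter are illustrative, and the real pipeline also runs GhostCollector scoring and CountSketch feature building first:

```python
import numpy as np

def boltzmann_select(scores: np.ndarray, rho: float, temperature: float = 1.0,
                     seed: int = 0) -> np.ndarray:
    """Sample round(rho * N) candidate indices without replacement,
    with probability proportional to exp(score / temperature)."""
    n = len(scores)
    m = max(1, round(rho * n))
    logits = scores / temperature
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=m, replace=False, p=probs)

picked = boltzmann_select(np.array([3.0, 0.1, 2.5, 0.2]), rho=0.5)
assert len(picked) == 2                     # rho * N = 0.5 * 4 samples kept
```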
- Move OPUS selection before device transfer to avoid a wasted GPU copy
- Use RandomInDistributionProxyProvider instead of raw iter() to auto-reset on epoch boundaries (prevents silent degradation to random selection)
- Support both ProxyProvider and raw iterator in OpusDataSelector

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
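The auto-reset behavior above can be sketched in a few lines. The interface is an assumption based on the commit message (the real provider wraps a DataLoader); the key point is that exhaustion re-creates the iterator instead of silently yielding nothing:

```python
class RandomInDistributionProxyProvider:
    """Sketch: re-create the iterator when the underlying dataloader is
    exhausted, so epoch boundaries never starve the proxy stream.
    Method name next_batch() is illustrative."""

    def __init__(self, dataloader):
        self._dataloader = dataloader
        self._it = iter(dataloader)

    def next_batch(self):
        try:
            return next(self._it)
        except StopIteration:
            self._it = iter(self._dataloader)   # auto-reset on epoch boundary
            return next(self._it)

provider = RandomInDistributionProxyProvider([1, 2])
got = [provider.next_batch() for _ in range(5)]  # wraps around the 2-item "epoch"
assert got == [1, 2, 1, 2, 1]
```

A raw `iter()` would instead raise StopIteration on the third call, which is exactly the silent-degradation failure mode the commit guards against.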
… critical bug where scores from different candidates on different GPUs were being summed as if they were the same sample
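The invariant behind this fix can be shown without any GPU: per-rank score vectors cover *different* candidates, so merging them must concatenate in rank order, not sum elementwise. This is a pure-Python illustration of the invariant, not the actual `torch.distributed` gather code:

```python
import numpy as np

# Each rank scored a different slice of the global candidate set.
rank0_scores = np.array([0.9, 0.1])   # scores for candidates 0, 1
rank1_scores = np.array([0.3, 0.7])   # scores for candidates 2, 3

buggy = rank0_scores + rank1_scores                # treats index i on every
                                                   # rank as the same sample
fixed = np.concatenate([rank0_scores, rank1_scores])

assert fixed.shape == (4,)             # one score per global candidate
assert buggy.shape != fixed.shape      # the sum collapses 4 samples into 2
```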
Added explicit warning when candidate scores are non-finite — logs how many scores are bad, and a stronger warning when ALL scores are non-finite (full degradation to random)
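A minimal sketch of that guard, assuming non-finite scores are masked out of sampling with `-inf`; the function name, exact log messages, and the masking choice are illustrative, not the PR's code:

```python
import logging
import numpy as np

log = logging.getLogger("opus")

def check_scores(scores: np.ndarray) -> np.ndarray:
    """Warn with a count when some scores are non-finite, escalate when
    ALL are (full degradation to random), and mask the bad entries."""
    bad = ~np.isfinite(scores)
    if bad.all():
        log.warning("ALL %d candidate scores are non-finite; "
                    "selection degrades to random", scores.size)
    elif bad.any():
        log.warning("%d/%d candidate scores are non-finite",
                    bad.sum(), scores.size)
    return np.where(bad, -np.inf, scores)   # bad entries get zero weight

out = check_scores(np.array([1.0, np.nan, 2.0]))
assert out[1] == -np.inf and out[0] == 1.0
```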
…l 4 hook closures; captured activations and gradients now own their memory and cannot be corrupted by in-place ops. Tradeoff: higher memory-copy overhead on each OPUS scoring pass (noticeable at large batch/seq/hidden sizes).
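The own-their-memory fix can be illustrated without PyTorch: store a *copy* of each captured array so later in-place ops on the live tensor cannot reach back into the capture. Here a plain callback and numpy array stand in for a forward hook and tensor; the class name is reused for readability only:

```python
import numpy as np

class GhostCollector:
    """Sketch of the capture fix. In the real PR this happens inside the
    4 forward/backward hook closures."""

    def __init__(self):
        self.captured = []

    def hook(self, activation: np.ndarray):
        # .copy() is the numpy analogue of tensor.detach().clone():
        # one extra memory copy per scoring pass buys immutability.
        self.captured.append(activation.copy())

collector = GhostCollector()
act = np.ones(4)
collector.hook(act)
act *= 0                                    # in-place op after capture
assert collector.captured[0].sum() == 4.0   # the captured copy is untouched
```

Without the `.copy()`, the stored reference would now read all zeros, which is exactly the in-place corruption the commit fixes.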
…terable on StopIteration between epochs, and surface fallback exceptions with type + full traceback (exc_info=True)
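The surface-the-exception part can be sketched as a wrapper that logs the exception type plus full traceback via `exc_info=True` before falling back. Function and parameter names are illustrative, not the PR's API:

```python
import logging

log = logging.getLogger("opus")

def select_with_fallback(select_fn, batch):
    """Run selection; on failure, log type + traceback and fall back to
    the unselected batch instead of swallowing the error."""
    try:
        return select_fn(batch)
    except Exception as exc:
        log.warning("OPUS selection failed (%s); falling back to full batch",
                    type(exc).__name__, exc_info=True)
        return batch

def broken(_):
    raise RuntimeError("boom")

assert select_with_fallback(broken, [1, 2, 3]) == [1, 2, 3]
```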
Summary
- Adds OPUS data selection middleware (`OpusDataSelector`) with minimal changes to existing `train.py` and `main.py`
- Disabled by default (`opus.enabled: false`), zero overhead when off

Architecture

- Operates on `model_engine.module` (bypasses DeepSpeed ZeRO gradient sync)
- Reads optimizer state through the `safe_get_full_optimizer_state()` API
- `RandomInDistributionProxyProvider` auto-resets on epoch boundaries

New files (`llm/src/llm/opus/`)

- `config.py`: `OpusConfig` dataclass with `from_dict()` factory
- `ghost.py`: `GhostCollector`, forward/backward hooks for activation+gradient capture
- `countsketch.py`: `CountSketchProjector`, dimensionality reduction for gradient sketches
- `preconditioner.py`: `AdamWPreconditionerView`, reads optimizer state for scoring geometry
- `selector.py`: `OpusSelector`, scoring + Boltzmann sampling selection
- `proxy.py`
- `data_selector.py`: `OpusDataSelector`, composable middleware encapsulating the full pipeline
- `distributed.py`: `torch.distributed`

Modified files

- `llm/main.py`: parse `opus:` config, create candidate+proxy dataloaders, init `OpusDataSelector`
- `llm/src/llm/train.py`: add `opus_selector` param, call `select_batch()` before forward, refresh preconditioner after step
- `llm/configs/config.yaml`: add `opus:` section (disabled by default)

Test plan
🤖 Generated with Claude Code