Add discrete audio training experiments (Issue #1699) #2686

Open · potsawee wants to merge 5 commits into marin-community:main
Conversation
This commit introduces the full discrete audio experiment suite, which was developed in a long-lived branch and squashed into this single commit to preserve a clean history. All changes/scripts are contained within `experiments/audio/` to avoid impacting core infrastructure. Key additions include:

- **Training Recipes**: Scripts for training models (from 135M to 4B), including annealing and warm-start strategies.
- **Scaling Studies**: IsoFLOP analysis scripts for budget/size curve fitting.
- **Data Pipeline**: Tokenization scripts for multilingual speech (YODAS2, Emilia, CVSS) and text-only data (Nemotron).
- **Fine-Tuning**: Implementation of voice-preserving S2ST using the CVSS dataset.
- **Supporting Modules**: Configs for data mixing, model definitions, and checkpoint conversion.
- **README**: `experiments/audio/README.md` provides details on all these scripts.

Ref: marin-community#1699
- Introduce `enforce_bos` / `enforce_eos` config fields and plumb them through experiment defaults and tokenize config types (`TokenizeConfig`, `HfTokenizeConfig`).
- Update the tokenization path to pass these flags into `preprocessor_for_format`, defaulting unset values (`None`) to `True` to preserve existing behavior.
- This was originally implemented and used for the audio experiments, but was somehow missing from the previous audio commit. See: https://github.com/marin-community/marin/pull/1765/changes
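The None-means-unset defaulting described above can be sketched as follows. This is a minimal, hypothetical reduction of the real config types, keeping only the two new flags; it is not the actual `TokenizeConfig` definition:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizeConfig:
    # Hypothetical reduction of the real TokenizeConfig; only the new
    # flags are shown. None means "unset by the user".
    enforce_bos: Optional[bool] = None
    enforce_eos: Optional[bool] = None


def resolve_bos_eos(cfg: TokenizeConfig) -> tuple[bool, bool]:
    # Unset (None) values default to True, so existing configs keep the
    # old behavior of always enforcing BOS/EOS during tokenization.
    bos = True if cfg.enforce_bos is None else cfg.enforce_bos
    eos = True if cfg.enforce_eos is None else cfg.enforce_eos
    return bos, eos
```

Data that was prepared with BOS and EOS already attached would then set `enforce_bos=False` / `enforce_eos=False` explicitly, while untouched configs behave exactly as before.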
- Rename `tokenize_cooldown` imports to `tokenize_finetune` (the file was renamed).
- Add `AnnealConfig` to `experiments.audio.audio_defaults`, as the `experiments.anneal_config` file was deleted.
- Replace the deprecated `CheckpointConversionRequest` with `convert_checkpoint_to_hf_step`.
- Create a local `LMMixtureDatasetConfig` alias in `audio_defaults.py` to break a circular import with levanter.
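One generic way to break an import cycle like this is to defer the import to first use. The sketch below is illustrative only: the `lazy_class` helper is hypothetical, a stdlib class stands in for `LMMixtureDatasetConfig`, and this is not the actual alias used in `audio_defaults.py`:

```python
import importlib


def lazy_class(module_name: str, class_name: str):
    """Return a zero-argument loader that imports the class on first call.

    Deferring the import from module-load time to call time is one way
    to break a cycle where module A imports B and B imports A.
    """
    def load():
        return getattr(importlib.import_module(module_name), class_name)
    return load


# OrderedDict stands in for LMMixtureDatasetConfig in this sketch.
load_config_cls = lazy_class("collections", "OrderedDict")
cfg_cls = load_config_cls()  # the import happens only here, at call time
```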
Overview
This PR merges the `marin-audio` work (Issue #1699) -- investigating discrete audio language model training.

Note: to avoid impacting `main`, almost all changes are contained within `experiments/audio/`. The exception is the `enforce_bos` / `enforce_eos` flags (in `experiments/defaults.py` and `lib/marin/src/marin/processing/tokenize/tokenize.py`), added because some data was prepared with BOS and EOS added beforehand. This change was previously discussed in https://github.com/marin-community/marin/pull/1765/changes

Audio Experiment Contents
This PR includes scripts for training, scaling-law studies for discrete audio, and data preparation.
1. Training Runs
We use the existing Llama/Qwen3 architectures (note that we define small sizes for Qwen3 in `experiments/audio/qwen3.py`, like a 135M model) and utilize existing processed data. These experiment numbers are aligned with my report.

- `exp1699_marin_yodas2.py`
- `exp1699_marin_yodas2_anneal.py`
- `exp1699_nemotron_sweep.py`
- `exp1699_ablate_tokens.py`
- `exp1699_marin_audio_all.py`
- `exp1699_marin_audio_ws.py`
- `exp1699_data_mix_*.py`

2. IsoFLOP Study
Scripts to generate scaling-law curves.
- `isoflop_audio_sweep.py`: Generates a sweep of model configurations at fixed FLOP budgets, imitating Will's isoflop experiment in Relative Scaling Laws.
- `isoflop_audio_target.py`: Trains specific (budget, model size) targets for curve fitting (e.g., when more setups are needed at some budgets).

3. Data Preparation & Utilities
- Tokenization scripts (`tokenize_*.py`) included for YODAS2, Emilia, MLS (En), Nemotron, and CVSS.
- `audio_sft_cvss.py` implements voice-preserving S2ST fine-tuning.
- Supporting modules: `audio_defaults.py`, data mixing configs, and HuggingFace checkpoint converters.
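For the IsoFLOP study in section 2, pairing model sizes with token counts at a fixed compute budget is commonly done with the C ≈ 6ND approximation. The sketch below is illustrative only; the function name, model sizes, and budget are assumptions, not values taken from the actual sweep scripts:

```python
def tokens_for_budget(flop_budget: float, n_params: float) -> int:
    # Standard approximation: training compute C ≈ 6 * N * D
    # (N = parameters, D = training tokens), so D = C / (6 * N).
    return int(flop_budget / (6 * n_params))


# Example sweep at a single fixed budget (sizes and budget illustrative).
budget = 1e19
for n_params in [135e6, 360e6, 1e9]:
    tokens = tokens_for_budget(budget, n_params)
    print(f"{n_params / 1e6:.0f}M params -> {tokens / 1e9:.2f}B tokens")
```

A sweep script would then train each (model size, token count) pair at several budgets, and a curve-fitting step would locate the compute-optimal size per budget.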