
Add discrete audio training experiments (Issue #1699) #2686

Open
potsawee wants to merge 5 commits into marin-community:main from potsawee:audio-release-pr

Conversation


@potsawee commented Feb 6, 2026

Overview

This PR merges the marin-audio work (Issue #1699), which investigates discrete audio language model training.

Note:

  1. This work was developed in a long-lived branch over the past couple of months. To keep a clean history and avoid sync conflicts, I've squashed the development history into a single commit on top of main.
  2. This should close Marin Audio #1699. I've written up a report on the findings and insights from these experiments, to be shared separately.
  3. All changes are experiment scripts, added and contained within experiments/audio/.
  4. For (3), the one exception is that we add configurable BOS/EOS enforcement to tokenization (in experiments/defaults.py and lib/marin/src/marin/processing/tokenize/tokenize.py), because some data was prepared with BOS and EOS already added. This change was discussed previously in https://github.com/marin-community/marin/pull/1765/changes
  5. Credits to @Helw150 @Woodygan for some commits that have been squashed.

Audio Experiment Contents

This PR includes scripts for training, scaling laws for discrete audio, and data preparation.

1. Training Runs

We use the existing Llama/Qwen3 architectures (note that we define small Qwen3 sizes, such as a 135M model, in experiments/audio/qwen3.py) and reuse existing processed data. The experiment numbers below align with my report.

| Script | Description |
| --- | --- |
| exp1699_marin_yodas2.py | Exp1.1: Train a 600M model on 500B YODAS2 tokens (multilingual). |
| exp1699_marin_yodas2_anneal.py | Exp1.3: Annealing runs for Exp1.1 with different data mixes (last 50K steps). |
| exp1699_nemotron_sweep.py | Exp1.4: Study the optimal text-only (Nemotron) vs. speech (YODAS2) ratio at 150M. |
| exp1699_ablate_tokens.py | Exp1.7: Ablation study on token types (semantic vs. acoustic vs. text). |
| exp1699_marin_audio_all.py | Exp2.1: Train all sizes (135M to 4B) from scratch for 500B tokens. |
| exp1699_marin_audio_ws.py | Exp2.2: Train 1.7B and 600M models with warm-start from Qwen3x. |
| exp1699_data_mix_*.py | Various data-mixture studies at 150M. |
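The data-mixture studies above vary the relative weight of each source at a fixed token budget. As a rough illustration only (the function and names here are hypothetical, not the PR's API), a mix can be expressed as normalized weights over sources:

```python
def tokens_per_source(mix_weights: dict[str, float], total_tokens: int) -> dict[str, int]:
    """Split a total token budget across data sources by normalized weight."""
    z = sum(mix_weights.values())
    return {name: int(total_tokens * w / z) for name, w in mix_weights.items()}

# e.g. a 30/70 text/speech split over a 10B-token run
budget = tokens_per_source({"nemotron_text": 0.3, "yodas2_speech": 0.7}, 10_000_000_000)
```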

2. IsoFLOP Study

Scripts to generate scaling-law (isoFLOP) curves.

  • isoflop_audio_sweep.py: Generate a sweep of model configurations at fixed FLOP budgets, imitating Will's isoFLOP experiment in Relative Scaling Laws.
  • isoflop_audio_target.py: Train specific (budget, model size) targets for curve fitting (e.g., budgets where the sweep needs additional points for a good fit).
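The core idea of an isoFLOP sweep can be sketched as follows: for a fixed compute budget C, enumerate model sizes N and derive the matching token count D from the standard C ≈ 6·N·D approximation. This is a generic sketch, not the actual sweep script; the function name and accounting are illustrative.

```python
def isoflop_points(flop_budget: float, model_sizes: list[float]) -> list[tuple[float, int]]:
    """Return (params, tokens) pairs that all cost roughly flop_budget FLOPs,
    using the C ~= 6 * N * D training-compute approximation."""
    return [(n, int(flop_budget / (6 * n))) for n in model_sizes]

# e.g. one budget swept over 135M-1.7B parameter models; each point trains
# fewer tokens as the model grows, tracing out one isoFLOP curve.
sweep = isoflop_points(1e20, [135e6, 600e6, 1.7e9])
```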

3. Data Preparation & Utilities

  • Tokenization: Scripts (tokenize_*.py) included for YODAS2, Emilia, MLS (En), Nemotron, and CVSS.
  • Fine-Tuning: audio_sft_cvss.py implements voice-preserving S2ST fine-tuning.
  • Others: Includes audio_defaults.py, data mixing configs, and HuggingFace checkpoint converters.

This commit introduces the full discrete audio experiment suite, which was developed in a long-lived branch and squashed into this single commit to preserve a clean history. All changes/scripts are contained within 'experiments/audio/' to avoid impacting core infrastructure.

Key additions include:
- **Training Recipes**: Scripts for training models (from 135M to 4B), including annealing and warm-start strategies.
- **Scaling Studies**: IsoFLOP analysis scripts for budget/size curve fitting.
- **Data Pipeline**: Tokenization scripts for multilingual speech (YODAS2, Emilia, CVSS) and text-only data (Nemotron).
- **Fine-Tuning**: Implementation of voice-preserving S2ST using the CVSS dataset.
- **Supporting Modules**: Configs for data mixing, model definitions, and checkpoint conversion.
- **README**: `experiments/audio/README.md` provides details on all of these scripts.

Ref: marin-community#1699
- Introduce enforce_bos / enforce_eos config fields and plumb them through experiment defaults and tokenize config types (TokenizeConfig, HfTokenizeConfig).
- Update tokenization path to pass these flags into preprocessor_for_format, defaulting unset values (None) to True to preserve existing behavior.
- This was originally implemented and used for the audio experiments, but was somehow missing from the previous audio commit. See: https://github.com/marin-community/marin/pull/1765/changes
- Rename tokenize_cooldown imports to tokenize_finetune (file was renamed)
- Add AnnealConfig to experiments.audio.audio_defaults as experiments.anneal_config file was deleted
- Replace deprecated CheckpointConversionRequest with convert_checkpoint_to_hf_step
- Create local LMMixtureDatasetConfig alias in audio_defaults.py to break circular import with levanter
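The BOS/EOS flag behavior described above (unset values defaulting to True so existing configs are unaffected) can be sketched as follows. The dataclass and helper here are assumed shapes for illustration, not the actual marin `TokenizeConfig` API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizeConfigSketch:
    # None means "unset": resolved to True to preserve the previous behavior
    # of always adding BOS/EOS during tokenization.
    enforce_bos: Optional[bool] = None
    enforce_eos: Optional[bool] = None

def resolve_flags(cfg: TokenizeConfigSketch) -> tuple[bool, bool]:
    """Map unset (None) flags to True, mirroring the default-preserving rule."""
    return (
        True if cfg.enforce_bos is None else cfg.enforce_bos,
        True if cfg.enforce_eos is None else cfg.enforce_eos,
    )

# Data that already has BOS/EOS baked in can opt out explicitly:
flags = resolve_flags(TokenizeConfigSketch(enforce_bos=False, enforce_eos=False))
```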

@potsawee commented Feb 6, 2026

  • Fixes pushed. Some upstream experiment files had changed, so the audio experiment scripts were updated to reflect those changes.
  • All audio experiment scripts (experiments/audio/*) pass the CI tests.
  • The remaining CI failures are all in test_scaling_laws.py. They appear to be due to missing Hugging Face credentials for the gated Llama-3.1 model; I'm not sure there is anything I can do about them.
