
Add discrete audio training experiments (Issue #1699) #2686

Open
potsawee wants to merge 5 commits into marin-community:main from potsawee:audio-release-pr

Conversation


@potsawee commented Feb 6, 2026

Overview

This PR merges the marin-audio work (Issue #1699), which investigates discrete audio language model training.

Note:

  1. This work was developed in a long-lived branch over the past couple of months. To keep a clean history and avoid sync conflicts, I've squashed the development history into a single commit on top of main.
  2. This should close Marin Audio #1699. I've written up a report on the findings and insights from these experiments, to be shared separately.
  3. All changes are experiment scripts, added and contained within experiments/audio/.
  4. For (3), the one exception is that we add configurable BOS/EOS enforcement to tokenization (in experiments/defaults.py and lib/marin/src/marin/processing/tokenize/tokenize.py), because some data was prepared with BOS and EOS already added. This change was discussed previously in https://github.com/marin-community/marin/pull/1765/changes
  5. Credits to @Helw150 @Woodygan for some commits that have been squashed.

Audio Experiment Contents

This PR includes scripts for training, scaling laws for discrete audio, and data preparation.

1. Training Runs

We use the existing Llama/Qwen3 architectures (note that we define small Qwen3 sizes, such as a 135M model, in experiments/audio/qwen3.py) and reuse existing processed data. The experiment numbers below align with my report.

| Script | Description |
| --- | --- |
| exp1699_marin_yodas2.py | Exp1.1: Train a 600M model on 500B YODAS2 tokens (multilingual). |
| exp1699_marin_yodas2_anneal.py | Exp1.3: Annealing runs for Exp1.1 with different data mixes (last 50K steps). |
| exp1699_nemotron_sweep.py | Exp1.4: Study the optimal text-only (Nemotron) vs. speech (YODAS2) ratio at 150M. |
| exp1699_ablate_tokens.py | Exp1.7: Ablation study on token types (semantic vs. acoustic vs. text). |
| exp1699_marin_audio_all.py | Exp2.1: Train all sizes (135M to 4B) from scratch for 500B tokens. |
| exp1699_marin_audio_ws.py | Exp2.2: Train 1.7B and 600M models with warm-start from Qwen3x. |
| exp1699_data_mix_*.py | Various data-mixture studies at 150M. |
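The data-mixture studies above vary the relative weight of each source at a fixed token budget. As a rough illustration only (the function and names here are hypothetical, not the PR's API), a mix can be expressed as normalized weights over sources:

```python
def tokens_per_source(mix_weights: dict[str, float], total_tokens: int) -> dict[str, int]:
    """Split a total token budget across data sources by normalized weight."""
    z = sum(mix_weights.values())
    return {name: int(total_tokens * w / z) for name, w in mix_weights.items()}

# e.g. a 30/70 text/speech split over a 10B-token run
budget = tokens_per_source({"nemotron_text": 0.3, "yodas2_speech": 0.7}, 10_000_000_000)
```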

2. IsoFLOP Study

Scripts to generate scaling-law (isoFLOP) curves.

  • isoflop_audio_sweep.py: Generate a sweep of model configurations at fixed FLOP budgets, imitating Will's isoFLOP experiment in Relative Scaling Laws.
  • isoflop_audio_target.py: Train specific (budget, model size) targets for curve fitting (e.g., budgets where the sweep needs additional points for a good fit).
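The core idea of an isoFLOP sweep can be sketched as follows: for a fixed compute budget C, enumerate model sizes N and derive the matching token count D from the standard C ≈ 6·N·D approximation. This is a generic sketch, not the actual sweep script; the function name and accounting are illustrative.

```python
def isoflop_points(flop_budget: float, model_sizes: list[float]) -> list[tuple[float, int]]:
    """Return (params, tokens) pairs that all cost roughly flop_budget FLOPs,
    using the C ~= 6 * N * D training-compute approximation."""
    return [(n, int(flop_budget / (6 * n))) for n in model_sizes]

# e.g. one budget swept over 135M-1.7B parameter models; each point trains
# fewer tokens as the model grows, tracing out one isoFLOP curve.
sweep = isoflop_points(1e20, [135e6, 600e6, 1.7e9])
```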

3. Data Preparation & Utilities

  • Tokenization: Scripts (tokenize_*.py) included for YODAS2, Emilia, MLS (En), Nemotron, and CVSS.
  • Fine-Tuning: audio_sft_cvss.py implements voice-preserving S2ST fine-tuning.
  • Others: Includes audio_defaults.py, data mixing configs, and HuggingFace checkpoint converters.

This commit introduces the full discrete audio experiment suite, which was developed in a long-lived branch and squashed into this single commit to preserve a clean history. All changes/scripts are contained within 'experiments/audio/' to avoid impacting core infrastructure.

Key additions include:
- **Training Recipes**: Scripts for training models (from 135M to 4B), including annealing and warm-start strategies.
- **Scaling Studies**: IsoFLOP analysis scripts for budget/size curve fitting.
- **Data Pipeline**: Tokenization scripts for multilingual speech (YODAS2, Emilia, CVSS) and text-only data (Nemotron).
- **Fine-Tuning**: Implementation of voice-preserving S2ST using the CVSS dataset.
- **Supporting Modules**: Configs for data mixing, model definitions, and checkpoint conversion.
- **README**: `experiments/audio/README.md` provides details on all of these scripts.

Ref: marin-community#1699
- Introduce enforce_bos / enforce_eos config fields and plumb them through experiment defaults and tokenize config types (TokenizeConfig, HfTokenizeConfig).
- Update tokenization path to pass these flags into preprocessor_for_format, defaulting unset values (None) to True to preserve existing behavior.
- This was originally implemented and used for the audio experiments, but was somehow missing from the previous audio commit. See: https://github.com/marin-community/marin/pull/1765/changes
- Rename tokenize_cooldown imports to tokenize_finetune (file was renamed)
- Add AnnealConfig to experiments.audio.audio_defaults as experiments.anneal_config file was deleted
- Replace deprecated CheckpointConversionRequest with convert_checkpoint_to_hf_step
- Create local LMMixtureDatasetConfig alias in audio_defaults.py to break circular import with levanter
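The BOS/EOS flag behavior described above (unset values defaulting to True so existing configs are unaffected) can be sketched as follows. The dataclass and helper here are assumed shapes for illustration, not the actual marin `TokenizeConfig` API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizeConfigSketch:
    # None means "unset": resolved to True to preserve the previous behavior
    # of always adding BOS/EOS during tokenization.
    enforce_bos: Optional[bool] = None
    enforce_eos: Optional[bool] = None

def resolve_flags(cfg: TokenizeConfigSketch) -> tuple[bool, bool]:
    """Map unset (None) flags to True, mirroring the default-preserving rule."""
    return (
        True if cfg.enforce_bos is None else cfg.enforce_bos,
        True if cfg.enforce_eos is None else cfg.enforce_eos,
    )

# Data that already has BOS/EOS baked in can opt out explicitly:
flags = resolve_flags(TokenizeConfigSketch(enforce_bos=False, enforce_eos=False))
```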

@potsawee commented Feb 6, 2026

  • Fixes pushed. Some upstream experiment files had changed, so the audio experiment scripts were updated to reflect those changes.
  • All audio experiment scripts (experiments/audio/*) pass the CI tests.
  • The remaining CI failures are all in test_scaling_laws.py. They appear to be due to missing Hugging Face credentials for the gated Llama-3.1 model; I'm not sure there is anything I can do about them.
