SourikAdhikary/AudioTokenLab

AudioTokenLab

Python | License: MIT | Benchmark: Modal L4 | Tokenizer: EnCodec | Datasets: LibriSpeech+ | Run: 100 clips | Broader Run: 75 clips | Held-out Selector: 30 clips

Audio-token compression benchmarks for speech and voice-model infrastructure.

AudioTokenLab measures how much discrete audio-token streams can be compressed before speech quality breaks. It is built around the practical serving question behind audio LMs, voice agents, speech-to-speech systems, and TTS:

Can we reduce audio-token memory and latency without destroying intelligibility or speaker identity?

The current benchmark uses EnCodec 24 kHz tokens, reconstructs compressed audio, and evaluates the result with ASR WER/CER, SpeechBrain speaker similarity, reconstruction metrics, and KV-cache estimates.

The repo now also includes the next-stage research hooks:

  • broader multi-corpus speech manifests beyond LibriSpeech
  • VAD-aware and linear learned-selector token retention strategies
  • subjective listening-study sheets for human ratings
  • serving-stack reports for transformer prefill/KV-cache tradeoffs, with an optional PyTorch microbenchmark

Current Result

The latest run is a 100-clip LibriSpeech dev-clean benchmark on Modal L4:

Dataset: LibriSpeech dev-clean
Clips: 100
Speakers: 40
Chapters: 97
Tokenizer: EnCodec 24 kHz, 6 kbps target bandwidth
ASR evaluator: faster-whisper tiny.en
Speaker evaluator: SpeechBrain ECAPA

Strategy               Token Reduction  Mean WER  WER 95% CI      Speaker Sim  KV Savings
baseline               0.00%            9.39%     6.83%-12.40%    1.000        0.00 MB
uniform                49.94%           36.72%    31.09%-43.30%   0.527        230.17 MB
acoustic_salience      49.94%           14.77%    11.75%-18.13%   0.824        230.17 MB
energy_tuned_e4_t1_o2  49.94%           14.98%    12.24%-18.20%   0.831        230.17 MB
patch                  74.91%           99.72%    99.29%-100.00%  0.019        345.27 MB

The important comparison is not baseline vs compressed audio. It is uniform dropping vs salience-based dropping at the same token budget:

  • Uniform 2x frame dropping gets roughly 50% token reduction, but WER jumps to 36.72%.
  • Acoustic salience keeps the same roughly 50% token reduction, but WER is 14.77%.
  • The tuned energy variant has similar WER at 14.98% and the best compressed speaker similarity at 0.831.

See the full report: REPORT.md

Broader Speech Result

The broader benchmark runs the same EnCodec pipeline on 75 clips across three sources:

LibriSpeech dev-clean: 25 clips
MInDS-14 en-US:        25 clips
FLEURS en_us:          25 clips
Strategies:            7
Evaluated samples:     525
Modal run:             ap-GvQq49rJPkHf3SMC3joT5H

Strategy            Token Reduction  Mean WER  Speaker Sim  KV Savings
baseline            0.00%            36.05%    1.000        0.00 MB
uniform             49.95%           60.61%    0.460        232.97 MB
acoustic_salience   49.95%           49.39%    0.792        232.97 MB
energy_salience     49.95%           49.05%    0.796        232.97 MB
linear_selector_v1  49.95%           50.38%    0.792        232.97 MB
vad_salience        49.95%           51.15%    0.792        232.97 MB
patch               74.92%           99.55%    0.043        349.47 MB

This mixed-domain run is harder for tiny.en than LibriSpeech alone, so absolute WER is higher (36.05% baseline versus 9.39%). The important signal is still the same-budget comparison: energy_salience keeps roughly 50% token reduction while lowering WER by 11.56 percentage points relative to uniform dropping (49.05% versus 60.61%) and preserving much more speaker identity (0.796 versus 0.460 speaker similarity).

Held-Out Trained Selector Result

trained_selector_v1 was distilled from the 75-clip broader run, then evaluated on a fresh held-out slice that skips the first 25 valid clips per source and uses a stronger ASR evaluator:

LibriSpeech dev-clean: 10 held-out clips
MInDS-14 en-US:        10 held-out clips
FLEURS en_us:          10 held-out clips
Strategies:            8
Evaluated samples:     240
ASR evaluator:         faster-whisper small.en
Modal run:             ap-OaT2jqTQV5pGMO8mLoe16i

Strategy             Token Reduction  Mean WER  Speaker Sim  KV Savings
baseline             0.00%            38.31%    1.000        0.00 MB
uniform              49.95%           49.93%    0.466        228.25 MB
energy_salience      49.95%           48.91%    0.817        228.25 MB
trained_selector_v1  49.95%           48.10%    0.821        228.25 MB
patch                74.94%           99.87%    0.055        342.43 MB

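On this held-out slice, trained_selector_v1 is the best 50% reduction strategy by both WER and speaker similarity, though its margin over energy_salience is small (48.10% versus 48.91% WER, 0.821 versus 0.817 speaker similarity). The confidence intervals are still wide because the run was kept to 30 clips to limit compute credits. It nonetheless closes the main validation gap: the trained selector has now been evaluated on clips it was not fitted on, rather than only fitted and smoke-tested.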

Result Artifacts

Tracked, repo-visible artifacts:

Generated full run artifacts are intentionally ignored from git and written under modal-runs/.

The broader-speech smoke artifact is intentionally small: 3 clips across LibriSpeech, MInDS-14, and FLEURS; 7 strategies; 21 reconstructed/evaluated samples. Its purpose is to verify the pipeline and artifact generation.

What It Measures

AudioTokenLab reports:

  • token count and token reduction
  • estimated transformer KV-cache footprint and savings
  • encode/decode runtime and real-time factor
  • reconstruction MSE, MAE, SNR, and duration drift
  • downstream ASR WER/CER with bootstrap confidence intervals
  • speaker similarity against baseline reconstruction
  • failure cases with transcript and audio examples

The goal is to make tradeoffs obvious: memory savings are only useful if the resulting audio remains intelligible and voice-preserving.
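As a rough illustration of the WER and bootstrap-interval metrics listed above, the sketch below computes a word-level edit distance and a percentile bootstrap over per-clip WER. It is not the code in asr_eval.py: the function names, the per-clip averaging, the percentile method, and the resample count are all assumptions made for illustration.

```python
import random


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def clip_wer(reference, hypothesis):
    """WER for one clip; guards against an empty reference."""
    errors, ref_len = word_errors(reference, hypothesis)
    return errors / max(ref_len, 1)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-clip scores."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


pairs = [("the cat sat", "the cat sat"), ("turn the light on", "turn light off")]
scores = [clip_wer(ref, hyp) for ref, hyp in pairs]
print(scores, bootstrap_ci(scores))
```

Resampling whole clips keeps each clip's errors together, which is why small runs produce wide intervals.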

How It Works

audio dataset
  -> audio tokenizer
  -> compression strategy
  -> reconstruction
  -> ASR + speaker + signal evaluation
  -> CSV / JSON / HTML / chart artifacts

The main neural-codec benchmark uses EnCodec. The repo also includes dependency-light dummy and mu-law tokenizers for fast local tests.
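For readers unfamiliar with the mu-law baseline, here is a minimal sketch of standard 8-bit mu-law companding used as a sample-level tokenizer. It assumes float samples in [-1, 1] and 256 levels; the function names are hypothetical and the repo's own adapter under tokenizers/ may differ.

```python
import math

MU = 255  # standard 8-bit mu-law constant, giving tokens in [0, 255]


def mulaw_encode(samples, mu=MU):
    """Map float samples in [-1, 1] to integer tokens in [0, mu]."""
    tokens = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range input
        # Compress amplitude logarithmically while keeping the sign.
        y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
        # Shift [-1, 1] to [0, mu] and round to the nearest level.
        tokens.append(int(round((y + 1.0) / 2.0 * mu)))
    return tokens


def mulaw_decode(tokens, mu=MU):
    """Invert mulaw_encode up to quantization error."""
    samples = []
    for t in tokens:
        y = 2.0 * t / mu - 1.0
        samples.append(math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y))
    return samples


wave = [0.0, 0.5, -0.25, 1.0, -1.0]
tokens = mulaw_encode(wave)
restored = mulaw_decode(tokens)
print(tokens)
print([round(v, 3) for v in restored])
```

Because each token is a single integer per sample, a tokenizer like this needs no model weights, which suits fast local tests.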

Compression Strategies

Strategy               Description
baseline               No compression.
uniform                Keep every Nth EnCodec frame. Simple and cheap, but destructive.
acoustic_salience      Keep the frame with the strongest local RVQ-token transition inside each window, then repeat-fill the decode timeline.
energy_salience        Combine token-transition score with frame-energy/onset cues.
energy_tuned_e4_t1_o2  Tuned energy-salience variant from the 100-clip run.
vad_salience           Uses frame-energy speech activity, short-run filtering, and hangover to keep likely speech/onset frames.
linear_selector_v1     Linear frame selector hook over energy, onset, transition, and speech-activity features. Weights can be swapped for trained weights.
patch                  Average codec IDs across frame windows. Kept as a failure baseline because arithmetic over discrete codec IDs is not meaningful.
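The difference between uniform and acoustic_salience can be sketched in a few lines. This is an illustrative reading of the descriptions above, not the code in compression.py: frames are modeled as tuples of codebook IDs, the transition score is assumed to be a count of changed IDs, and all function names are hypothetical.

```python
def uniform_indices(num_frames, stride=2):
    """Keep the first frame of every window of `stride` frames."""
    return list(range(0, num_frames, stride))


def transition_score(frames, i):
    """Count codebooks whose ID changed relative to the previous frame."""
    if i == 0:
        return 0
    return sum(a != b for a, b in zip(frames[i - 1], frames[i]))


def salience_indices(frames, stride=2):
    """Keep the frame with the largest transition score in each window."""
    kept = []
    for start in range(0, len(frames), stride):
        end = min(start + stride, len(frames))
        # max() returns the earliest index on ties.
        kept.append(max(range(start, end), key=lambda i: transition_score(frames, i)))
    return kept


def repeat_fill(frames, kept, stride=2):
    """Repeat each kept frame so the decode timeline keeps its length."""
    filled = []
    for window, index in enumerate(kept):
        length = min(stride, len(frames) - window * stride)
        filled.extend([frames[index]] * length)
    return filled


# Toy stream with a sound change at frame 3.
frames = [(1, 1), (1, 1), (1, 1), (2, 3), (2, 3), (2, 3)]
print(uniform_indices(len(frames)))   # window starts only
print(salience_indices(frames))       # picks the onset frame in the middle window
print(repeat_fill(frames, salience_indices(frames)))
```

Both selectors keep one frame per window, so the token budget is identical; only the choice of which frame survives differs.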

Installation

Create a local editable install:

python3 -m pip install -e .

Optional extras:

python3 -m pip install -e '.[encodec]'
python3 -m pip install -e '.[modal]'
python3 -m pip install -e '.[asr]'
python3 -m pip install -e '.[speaker]'
python3 -m pip install -e '.[datasets]'
python3 -m pip install -e '.[serving]'

For the full Modal benchmark, you need Modal configured:

modal setup

Local Usage

Dependency-free demo:

audiotokenlab profile --config experiments/configs/demo.json
audiotokenlab report runs/demo

Quiet-segment workload:

audiotokenlab profile --config experiments/configs/quiet_demo.json
audiotokenlab report runs/quiet_demo

Mu-law tokenizer baseline:

audiotokenlab profile --config experiments/configs/mulaw_demo.json
audiotokenlab report runs/mulaw_demo

Optional EnCodec local run:

python3 -m pip install -e '.[encodec]'
audiotokenlab profile --config experiments/configs/encodec_demo.json
audiotokenlab report runs/encodec_demo

Modal Benchmarks

Small EnCodec smoke run:

modal run modal_app.py

Synthetic speech plus ASR smoke run:

modal run modal_app.py --speech-asr

Full tuned LibriSpeech benchmark:

modal run modal_app.py --librispeech-asr --max-clips 100 --strategy-set tuned

This downloads LibriSpeech dev-clean on Modal, selects clips across speakers/chapters, converts FLAC to 24 kHz mono WAV, runs EnCodec compression/reconstruction, evaluates ASR, computes speaker similarity, and writes local artifacts under modal-runs/encodec_librispeech_asr/.

Expected full-run artifacts:

modal-runs/encodec_librispeech_asr/
  manifest.json
  metrics.csv
  dashboard.html
  asr_metrics.csv
  asr_summary.json
  speaker_metrics.csv
  speaker_summary.json
  publication_summary.json
  summary_chart.svg
  listening_examples.md
  samples/
    *.wav

Broader speech benchmark:

modal run modal_app.py --broader-speech-asr --max-clips-per-source 25 --strategy-set extended --serving-microbench

This builds one manifest from LibriSpeech plus public Hugging Face speech corpora such as MInDS-14 and FLEURS, then runs the same EnCodec + ASR + speaker pipeline. The corpus builder is source-configurable, so you can point it at TED-LIUM, Common Voice, VoxPopuli, or internal manifests when access is available. Upstream dataset access and licenses remain governed by each provider.

Small smoke variant:

modal run modal_app.py --broader-speech-asr --max-clips-per-source 1 --strategy-set extended

Fit selector weights from an evaluated run:

PYTHONPATH=src python3 -m audiotokenlab train-selector \
  modal-runs/encodec_broader_speech_asr \
  --output experiments/results/encodec_broader_speech_asr_modal_2026-06-16_trained_selector.json
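To make the role of the fitted weights concrete, here is a minimal sketch of how a linear selector could score frames over the four feature families named under Compression Strategies. The feature keys, the weight layout, and the one-frame-per-window rule are assumptions for illustration; the JSON schema written by train-selector is not documented here and may differ.

```python
FEATURES = ("energy", "onset", "transition", "speech_activity")


def frame_score(features, weights, bias=0.0):
    """Weighted sum of per-frame features (hypothetical feature names)."""
    return bias + sum(weights[name] * features[name] for name in FEATURES)


def select_per_window(frame_features, weights, window=2):
    """Keep the highest-scoring frame index in each window."""
    kept = []
    for start in range(0, len(frame_features), window):
        end = min(start + window, len(frame_features))
        kept.append(max(range(start, end),
                        key=lambda i: frame_score(frame_features[i], weights)))
    return kept


# Illustrative weights and features, not values from any benchmark run.
weights = {"energy": 1.0, "onset": 1.0, "transition": 1.0, "speech_activity": 1.0}
frame_features = [
    {"energy": 0.1, "onset": 0.0, "transition": 0.0, "speech_activity": 0.0},
    {"energy": 0.9, "onset": 1.0, "transition": 0.5, "speech_activity": 1.0},
    {"energy": 0.8, "onset": 0.0, "transition": 0.2, "speech_activity": 1.0},
    {"energy": 0.2, "onset": 0.0, "transition": 0.1, "speech_activity": 0.0},
]
print(select_per_window(frame_features, weights))
```

Swapping in different weights changes which frames survive without changing the token budget.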

Evaluate the trained selector alongside the extended baselines:

modal run modal_app.py \
  --broader-speech-asr \
  --max-clips-per-source 10 \
  --skip-clips-per-source 25 \
  --strategy-set trained \
  --asr-model small.en \
  --serving-microbench

Remote held-out validation for strategy_set=trained: ap-OaT2jqTQV5pGMO8mLoe16i. Earlier smoke validation: ap-zN30fQb7Yz3jV025WF9ssr.

The serving report consumes metrics.csv and writes:

serving_stack_report.json
serving_stack_report.md

It estimates transformer prefill attention work, decode KV-read reduction, and KV-cache savings. With --serving-microbench, it also runs a reference PyTorch transformer layer on representative token lengths.
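The three estimates scale differently with token count, which a short sketch makes visible. The architecture below (24 layers, 16 KV heads, head dimension 64, fp16) is hypothetical and chosen only for illustration, so the numbers will not match the KV Savings columns in the tables above; the formulas are the usual back-of-envelope ones, not necessarily those in serving.py.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes held in the KV cache: one key and one value vector per token, head, and layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def prefill_attention_work(num_tokens):
    """Relative self-attention cost during prefill, which grows with tokens squared."""
    return num_tokens ** 2


# Hypothetical model and sequence lengths.
arch = dict(num_layers=24, num_kv_heads=16, head_dim=64)
baseline_tokens, compressed_tokens = 6000, 3000

saved = kv_cache_bytes(baseline_tokens, **arch) - kv_cache_bytes(compressed_tokens, **arch)
print(f"KV-cache savings: {saved / 2**20:.2f} MiB")
print("decode KV-read ratio:", compressed_tokens / baseline_tokens)
print("prefill attention ratio:",
      prefill_attention_work(compressed_tokens) / prefill_attention_work(baseline_tokens))
```

Under these assumptions, halving the token count halves KV-cache memory and per-step KV reads but cuts prefill attention work to a quarter.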

Subjective listening artifacts are generated with publication artifacts:

listening_study.csv
listening_study.md
listening_study.json

The CSV is an anonymized rating sheet for MOS, intelligibility, speaker match, and artifact notes.

After ratings are filled in:

PYTHONPATH=src python3 -m audiotokenlab summarize-listening \
  modal-runs/encodec_broader_speech_asr/listening_study.csv

This writes listening_study_rating_summary.json and listening_study_rating_summary.md.
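As an illustration of what such a summary involves, the sketch below averages MOS ratings per strategy from a filled-in sheet. The column names (sample_id, mos) and the separate sample-to-strategy key are assumptions made for this example; the actual sheet layout and the summarize-listening implementation may differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def summarize_mos(csv_text, sample_to_strategy):
    """Mean MOS per strategy, skipping rows the rater left blank."""
    ratings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get("mos") or "").strip()
        if not value:
            continue
        # The sheet is anonymized, so strategies are looked up through a key.
        ratings[sample_to_strategy[row["sample_id"]]].append(float(value))
    return {strategy: mean(scores) for strategy, scores in ratings.items()}


# Made-up ratings for illustration only.
sheet = "sample_id,mos\ns01,4\ns02,2\ns03,5\ns04,\n"
key = {"s01": "baseline", "s02": "uniform", "s03": "baseline", "s04": "uniform"}
summary = summarize_mos(sheet, key)
print(summary)
```

Keeping the strategy key outside the rating sheet is what lets raters score samples blind.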

Repository Layout

src/audiotokenlab/
  compression.py          compression strategies
  profiling.py            per-clip benchmark metrics
  asr_eval.py             WER/CER evaluation and bootstrap CIs
  speaker_eval.py         SpeechBrain speaker-similarity evaluation
  publication.py          chart and listening-example artifacts
  listening_study.py      subjective rating sheets
  serving.py              transformer-serving estimates and optional torch microbenchmarks
  corpora.py              broader speech dataset manifest builders
  reporting.py            CSV/JSON/HTML reporting
  tokenizers/             dummy, mu-law, and EnCodec tokenizer adapters

experiments/
  configs/                local run configs
  results/                tracked benchmark summaries and public artifacts

modal_app.py              Modal GPU benchmark entrypoint
REPORT.md                 current benchmark report

Why This Matters

Audio models pay for long audio contexts in tokens, just like text models do. Codec-token streams can be dense, and in autoregressive or transformer-style audio systems they affect:

  • prefill latency
  • decode latency
  • KV-cache memory
  • batching efficiency
  • serving cost
  • time-to-first-audio

Naively dropping audio tokens saves memory but can erase phonetics, timing, speaker identity, and prosody. AudioTokenLab is a measurement harness for finding better quality/cost tradeoffs.
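To give a sense of scale, the arithmetic below estimates token density for the 24 kHz EnCodec configuration at the 6 kbps target bandwidth used in the benchmark. The hop size of 320 samples and the 1024-entry codebooks reflect the published 24 kHz EnCodec model as commonly described; whether AudioTokenLab counts frames or individual codebook IDs as "tokens" is not stated here, so both are shown. Function names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 24_000
HOP_SAMPLES = 320      # total encoder stride assumed for the 24 kHz model
CODEBOOK_SIZE = 1024   # entries per residual codebook (10 bits each)


def frames_per_second():
    return SAMPLE_RATE_HZ / HOP_SAMPLES


def codebooks_for_bandwidth(bandwidth_bps):
    """Number of residual codebooks that fit in the target bitrate."""
    bits_per_code = math.log2(CODEBOOK_SIZE)
    return int(bandwidth_bps // (frames_per_second() * bits_per_code))


def codec_counts(duration_s, bandwidth_bps):
    """Return (frames, codebook IDs) for a clip of the given duration."""
    frames = int(duration_s * frames_per_second())
    return frames, frames * codebooks_for_bandwidth(bandwidth_bps)


frames, ids = codec_counts(duration_s=60, bandwidth_bps=6000)
print(frames_per_second(), codebooks_for_bandwidth(6000), frames, ids)
```

Under these assumptions a single minute of speech is thousands of frames, which is why a 50% frame reduction has a visible effect on context length.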

Research Context

AudioTokenLab is motivated by recent work on efficient audio-token modeling and long-form speech systems:

This repo is not trying to reproduce those systems end to end. It builds the measurement layer around the bottleneck they expose: audio-token cost.

Current Limitations

  • The tracked headline LibriSpeech and broader runs are engineering benchmarks, not publication-grade estimates.
  • The held-out trained-selector run is compact and credit-aware at 30 clips, so confidence intervals are still wide.
  • ASR uses faster-whisper evaluators, which are practical regression tests but not oracles for speech quality.
  • Speaker similarity uses one pretrained embedding model.
  • trained_selector_v1 is distilled from strategy-level benchmark outcomes, not trained from frame-level labels.
  • Listening-study sheets and summarization are implemented, but human ratings are not included in the repo.
  • KV-cache savings are architecture estimates, not measurements from a deployed audio transformer.

Roadmap

Completed:

  • Local profiling pipeline
  • Dummy and mu-law tokenizer baselines
  • Optional EnCodec backend
  • Modal GPU benchmark path
  • LibriSpeech real-speech benchmark
  • ASR WER/CER with bootstrap confidence intervals
  • SpeechBrain speaker similarity
  • Salience-based compression baselines
  • Energy-salience tuning run
  • Public report, chart, and listening examples
  • Broader multi-corpus benchmark across LibriSpeech, MInDS-14, and FLEURS
  • VAD-aware selector baseline
  • Linear learned-selector integration point
  • Trained selector weight artifact distilled from ASR and speaker outcomes
  • Held-out trained-selector benchmark with faster-whisper small.en
  • Subjective listening-study artifact generation
  • Subjective listening-study rating summarizer
  • Transformer serving-stack report and optional PyTorch microbenchmark

Next:

  • Collect human listening ratings and summarize MOS/intelligibility/speaker-match results
  • Integrate a production-grade audio-token transformer or voice-agent serving loop
  • Add more licensed or internal speech corpora through the source-configurable corpus builder

Tests

PYTHONPATH=src python3 -m unittest discover tests

License

This project is released under the MIT License.

Dataset and model licenses remain governed by their upstream providers.
