Audio-token compression benchmarks for speech and voice-model infrastructure.
AudioTokenLab measures how much discrete audio-token streams can be compressed before speech quality breaks. It is built around the practical serving question behind audio LMs, voice agents, speech-to-speech systems, and TTS:
Can we reduce audio-token memory and latency without destroying intelligibility or speaker identity?
The current benchmark uses EnCodec 24 kHz tokens, reconstructs compressed audio, and evaluates the result with ASR WER/CER, SpeechBrain speaker similarity, reconstruction metrics, and KV-cache estimates.
The repo now also includes the next-stage research hooks:
- broader multi-corpus speech manifests beyond LibriSpeech
- VAD-aware and linear learned-selector token retention strategies
- subjective listening-study sheets for human ratings
- serving-stack reports for transformer prefill/KV-cache tradeoffs, with an optional PyTorch microbenchmark
The latest run is a 100-clip LibriSpeech dev-clean benchmark on Modal L4:
Dataset: LibriSpeech dev-clean
Clips: 100
Speakers: 40
Chapters: 97
Tokenizer: EnCodec 24 kHz, 6 kbps target bandwidth
ASR evaluator: faster-whisper tiny.en
Speaker evaluator: SpeechBrain ECAPA
| Strategy | Token Reduction | Mean WER | WER 95% CI | Speaker Sim | KV Savings |
|---|---|---|---|---|---|
| baseline | 0.00% | 9.39% | 6.83%-12.40% | 1.000 | 0.00 MB |
| uniform | 49.94% | 36.72% | 31.09%-43.30% | 0.527 | 230.17 MB |
| acoustic_salience | 49.94% | 14.77% | 11.75%-18.13% | 0.824 | 230.17 MB |
| energy_tuned_e4_t1_o2 | 49.94% | 14.98% | 12.24%-18.20% | 0.831 | 230.17 MB |
| patch | 74.91% | 99.72% | 99.29%-100.00% | 0.019 | 345.27 MB |
The important comparison is not baseline vs compressed audio. It is uniform dropping vs salience-based dropping at the same token budget:
- Uniform 2x frame dropping gets roughly 50% token reduction, but WER jumps to 36.72%.
- Acoustic salience keeps the same roughly 50% token reduction, but WER is 14.77%.
- The tuned energy variant has similar WER at 14.98% and the best compressed speaker similarity at 0.831.
See the full report: REPORT.md
The broader benchmark runs the same EnCodec pipeline on 75 clips across three sources:
LibriSpeech dev-clean: 25 clips
MInDS-14 en-US: 25 clips
FLEURS en_us: 25 clips
Strategies: 7
Evaluated samples: 525
Modal run: ap-GvQq49rJPkHf3SMC3joT5H
| Strategy | Token Reduction | Mean WER | Speaker Sim | KV Savings |
|---|---|---|---|---|
| baseline | 0.00% | 36.05% | 1.000 | 0.00 MB |
| uniform | 49.95% | 60.61% | 0.460 | 232.97 MB |
| acoustic_salience | 49.95% | 49.39% | 0.792 | 232.97 MB |
| energy_salience | 49.95% | 49.05% | 0.796 | 232.97 MB |
| linear_selector_v1 | 49.95% | 50.38% | 0.792 | 232.97 MB |
| vad_salience | 49.95% | 51.15% | 0.792 | 232.97 MB |
| patch | 74.92% | 99.55% | 0.043 | 349.47 MB |
This mixed-domain run is harder than LibriSpeech for tiny.en, so absolute WER is higher. The important signal is still the same-budget comparison: energy_salience keeps roughly 50% token reduction while improving WER by 11.56 percentage points over uniform dropping and preserving much more speaker identity.
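The same-budget comparison above is plain arithmetic over the table. The sketch below recomputes it from the reported numbers; the `RESULTS` dictionary and the `gain_over` helper are illustrative names, not part of the repo.

```python
# Mean WER (%) and speaker similarity copied from the broader-run table.
# All strategies listed here share the same ~50% token budget.
RESULTS = {
    "uniform": {"wer": 60.61, "speaker_sim": 0.460},
    "acoustic_salience": {"wer": 49.39, "speaker_sim": 0.792},
    "energy_salience": {"wer": 49.05, "speaker_sim": 0.796},
    "linear_selector_v1": {"wer": 50.38, "speaker_sim": 0.792},
    "vad_salience": {"wer": 51.15, "speaker_sim": 0.792},
}


def gain_over(reference: str, candidate: str) -> tuple:
    """Return (WER improvement in percentage points, speaker-similarity gain)."""
    ref, cand = RESULTS[reference], RESULTS[candidate]
    wer_gain = round(ref["wer"] - cand["wer"], 2)
    sim_gain = round(cand["speaker_sim"] - ref["speaker_sim"], 3)
    return wer_gain, sim_gain


if __name__ == "__main__":
    for name in RESULTS:
        if name != "uniform":
            print(name, gain_over("uniform", name))
```

Comparing against `uniform` rather than `baseline` keeps the token budget fixed, so any difference comes from which frames are kept, not from how many.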
trained_selector_v1 was distilled from the 75-clip broader run, then evaluated on a fresh held-out slice that skips the first 25 valid clips per source and uses a stronger ASR evaluator:
LibriSpeech dev-clean: 10 held-out clips
MInDS-14 en-US: 10 held-out clips
FLEURS en_us: 10 held-out clips
Strategies: 8
Evaluated samples: 240
ASR evaluator: faster-whisper small.en
Modal run: ap-OaT2jqTQV5pGMO8mLoe16i
| Strategy | Token Reduction | Mean WER | Speaker Sim | KV Savings |
|---|---|---|---|---|
| baseline | 0.00% | 38.31% | 1.000 | 0.00 MB |
| uniform | 49.95% | 49.93% | 0.466 | 228.25 MB |
| energy_salience | 49.95% | 48.91% | 0.817 | 228.25 MB |
| trained_selector_v1 | 49.95% | 48.10% | 0.821 | 228.25 MB |
| patch | 74.94% | 99.87% | 0.055 | 342.43 MB |
On this held-out slice, trained_selector_v1 is the best 50% reduction strategy by both WER and speaker similarity. The confidence intervals are still wide because this is a compact, credit-aware run, but it closes the main validation gap: the trained selector has now been evaluated on held-out clips instead of existing only as a fitted artifact plus a smoke test.
Tracked, repo-visible artifacts:
- 100-clip result summary JSON
- 100-clip summary chart
- listening examples
- committed example WAVs
- 75-clip broader publication summary
- 75-clip broader summary chart
- 75-clip broader serving report
- 75-clip broader listening-study sheet
- 75-clip trained selector artifact
- held-out trained selector publication summary
- held-out trained selector ASR summary
- held-out trained selector speaker summary
- held-out trained selector ASR evaluator metadata
- held-out trained selector summary chart
- held-out trained selector serving report
- held-out trained selector listening-study sheet
- held-out listening-study rating summary JSON
- held-out listening-study rating summary Markdown
- broader-speech smoke publication summary
- broader-speech smoke serving report
- broader-speech smoke listening-study sheet
Generated full run artifacts are intentionally ignored from git and written under modal-runs/.
The broader-speech smoke artifact is intentionally small: 3 clips across LibriSpeech, MInDS-14, and FLEURS; 7 strategies; 21 reconstructed/evaluated samples. Its purpose is to verify the pipeline and artifact generation.
AudioTokenLab reports:
- token count and token reduction
- estimated transformer KV-cache footprint and savings
- encode/decode runtime and real-time factor
- reconstruction MSE, MAE, SNR, and duration drift
- downstream ASR WER/CER with bootstrap confidence intervals
- speaker similarity against baseline reconstruction
- failure cases with transcript and audio examples
The goal is to make tradeoffs obvious: memory savings are only useful if the resulting audio remains intelligible and voice-preserving.
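As a rough illustration of the WER and bootstrap-CI reporting listed above, here is a minimal sketch using only the standard library. It assumes per-clip WER averaged over clips and a percentile bootstrap; the function names are hypothetical and the evaluation in asr_eval.py may normalize text or resample differently.

```python
import random


def word_errors(reference: str, hypothesis: str) -> tuple:
    """Return (word-level edit distance, reference word count)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def clip_wer(reference: str, hypothesis: str) -> float:
    """WER for one clip; guards against an empty reference."""
    errors, n_words = word_errors(reference, hypothesis)
    return errors / max(n_words, 1)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-clip values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    wers = [clip_wer("the cat sat on the mat", "the cat sit on mat"),
            clip_wer("hello world", "hello world")]
    print(sum(wers) / len(wers), bootstrap_ci(wers))
```

Resampling whole clips, rather than individual words, keeps each clip's errors together, which is why small runs produce wide intervals.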
audio dataset
-> audio tokenizer
-> compression strategy
-> reconstruction
-> ASR + speaker + signal evaluation
-> CSV / JSON / HTML / chart artifacts
The main neural-codec benchmark uses EnCodec. The repo also includes dependency-light dummy and mu-law tokenizers for fast local tests.
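For readers unfamiliar with mu-law tokenization, the sketch below shows standard 8-bit mu-law companding in NumPy, which maps each waveform sample to one of 256 token IDs. It is a generic illustration with hypothetical function names, not the repo's tokenizer adapter.

```python
import numpy as np

MU = 255  # 8-bit mu-law: token IDs in [0, 255]


def mulaw_encode(wave: np.ndarray, mu: int = MU) -> np.ndarray:
    """Map float audio in [-1, 1] to integer tokens in [0, mu]."""
    wave = np.clip(wave, -1.0, 1.0)
    # Logarithmic companding gives quiet samples finer resolution.
    companded = np.sign(wave) * np.log1p(mu * np.abs(wave)) / np.log1p(mu)
    return np.round((companded + 1.0) / 2.0 * mu).astype(np.int64)


def mulaw_decode(tokens: np.ndarray, mu: int = MU) -> np.ndarray:
    """Invert mulaw_encode up to quantization error."""
    companded = 2.0 * tokens.astype(np.float64) / mu - 1.0
    return np.sign(companded) * np.expm1(np.abs(companded) * np.log1p(mu)) / mu


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 24000, endpoint=False)
    tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
    tokens = mulaw_encode(tone)
    error = np.max(np.abs(mulaw_decode(tokens) - tone))
    print(tokens.min(), tokens.max(), error)
```

A tokenizer like this emits one token per sample, so it is useful for fast pipeline tests but is far denser than a neural codec such as EnCodec.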
| Strategy | Description |
|---|---|
| baseline | No compression. |
| uniform | Keep every Nth EnCodec frame. Simple and cheap, but destructive. |
| acoustic_salience | Keep the frame with the strongest local RVQ-token transition inside each window, then repeat-fill the decode timeline. |
| energy_salience | Combine token-transition score with frame-energy/onset cues. |
| energy_tuned_e4_t1_o2 | Tuned energy-salience variant from the 100-clip run. |
| vad_salience | Uses frame-energy speech activity, short-run filtering, and hangover to keep likely speech/onset frames. |
| linear_selector_v1 | Linear frame selector hook over energy, onset, transition, and speech-activity features. Weights can be swapped for trained weights. |
| patch | Average codec IDs across frame windows. Kept as a failure baseline because arithmetic over discrete codec IDs is not meaningful. |
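To make the difference between uniform and transition-based selection concrete, here is a minimal NumPy sketch. It assumes tokens shaped `[n_codebooks, n_frames]` and scores a frame by how many codebooks changed ID relative to the previous frame; that scoring rule and all function names are assumptions for illustration, and the logic in compression.py may differ.

```python
import numpy as np


def uniform_keep(n_frames: int, window: int = 2) -> np.ndarray:
    """Indices of the first frame in every window (every Nth frame)."""
    return np.arange(0, n_frames, window)


def transition_scores(tokens: np.ndarray) -> np.ndarray:
    """Per-frame count of codebooks whose ID differs from the previous frame."""
    changed = tokens[:, 1:] != tokens[:, :-1]
    scores = np.zeros(tokens.shape[1])
    scores[1:] = changed.sum(axis=0)
    return scores


def salience_keep(tokens: np.ndarray, window: int = 2) -> np.ndarray:
    """Keep the highest-scoring frame inside each window."""
    scores = transition_scores(tokens)
    n_frames = tokens.shape[1]
    return np.array([
        start + int(np.argmax(scores[start:start + window]))
        for start in range(0, n_frames, window)
    ])


def repeat_fill(tokens: np.ndarray, keep: np.ndarray, window: int = 2) -> np.ndarray:
    """Rebuild a full-length timeline by repeating each kept frame over its window."""
    n_frames = tokens.shape[1]
    return np.repeat(tokens[:, keep], window, axis=1)[:, :n_frames]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.integers(0, 1024, size=(8, 11))  # 8 codebooks, 11 frames
    for keep in (uniform_keep(11), salience_keep(tokens)):
        reduction = 1.0 - len(keep) / tokens.shape[1]
        print(keep, f"{reduction:.2%}", repeat_fill(tokens, keep).shape)
```

Both selectors keep one frame per window, so they land on the same token budget; only the choice of frame differs. Repeat-filling keeps the decoded duration unchanged, which is what allows ASR and speaker metrics to be compared clip for clip.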
Create a local editable install:
python3 -m pip install -e .
Optional extras:
python3 -m pip install -e '.[encodec]'
python3 -m pip install -e '.[modal]'
python3 -m pip install -e '.[asr]'
python3 -m pip install -e '.[speaker]'
python3 -m pip install -e '.[datasets]'
python3 -m pip install -e '.[serving]'
For the full Modal benchmark, you need Modal configured:
modal setup
Dependency-free demo:
audiotokenlab profile --config experiments/configs/demo.json
audiotokenlab report runs/demo
Quiet-segment workload:
audiotokenlab profile --config experiments/configs/quiet_demo.json
audiotokenlab report runs/quiet_demo
Mu-law tokenizer baseline:
audiotokenlab profile --config experiments/configs/mulaw_demo.json
audiotokenlab report runs/mulaw_demo
Optional EnCodec local run:
python3 -m pip install -e '.[encodec]'
audiotokenlab profile --config experiments/configs/encodec_demo.json
audiotokenlab report runs/encodec_demo
Small EnCodec smoke run:
modal run modal_app.py
Synthetic speech plus ASR smoke run:
modal run modal_app.py --speech-asr
Full tuned LibriSpeech benchmark:
modal run modal_app.py --librispeech-asr --max-clips 100 --strategy-set tuned
This downloads LibriSpeech dev-clean on Modal, selects clips across speakers/chapters, converts FLAC to 24 kHz mono WAV, runs EnCodec compression/reconstruction, evaluates ASR, computes speaker similarity, and writes local artifacts under modal-runs/encodec_librispeech_asr/.
Expected full-run artifacts:
modal-runs/encodec_librispeech_asr/
  manifest.json
  metrics.csv
  dashboard.html
  asr_metrics.csv
  asr_summary.json
  speaker_metrics.csv
  speaker_summary.json
  publication_summary.json
  summary_chart.svg
  listening_examples.md
  samples/
    *.wav
Broader speech benchmark:
modal run modal_app.py --broader-speech-asr --max-clips-per-source 25 --strategy-set extended --serving-microbench
This builds one manifest from LibriSpeech plus public Hugging Face speech corpora such as MInDS-14 and FLEURS, then runs the same EnCodec + ASR + speaker pipeline. The corpus builder is source-configurable, so you can point it at TED-LIUM, Common Voice, VoxPopuli, or internal manifests when access is available. Upstream dataset access and licenses remain governed by each provider.
Small smoke variant:
modal run modal_app.py --broader-speech-asr --max-clips-per-source 1 --strategy-set extended
Fit selector weights from an evaluated run:
PYTHONPATH=src python3 -m audiotokenlab train-selector \
modal-runs/encodec_broader_speech_asr \
--output experiments/results/encodec_broader_speech_asr_modal_2026-06-16_trained_selector.json
Evaluate the trained selector alongside the extended baselines:
modal run modal_app.py \
--broader-speech-asr \
--max-clips-per-source 10 \
--skip-clips-per-source 25 \
--strategy-set trained \
--asr-model small.en \
--serving-microbench
Remote held-out validation for strategy_set=trained: ap-OaT2jqTQV5pGMO8mLoe16i. Earlier smoke validation: ap-zN30fQb7Yz3jV025WF9ssr.
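The linear selector described in the strategy table scores frames from energy, onset, transition, and speech-activity features. A minimal sketch of that idea follows; the feature definitions, the 0.1 activity threshold, the placeholder weights, and the function names are all assumptions for illustration and do not reflect the trained artifact.

```python
import numpy as np

FEATURES = ("energy", "onset", "transition", "speech_activity")


def frame_features(frame_energy: np.ndarray, transition: np.ndarray) -> np.ndarray:
    """Stack per-frame features into an [n_frames, 4] matrix."""
    energy = frame_energy / (frame_energy.max() + 1e-9)
    # Onset: positive energy rise relative to the previous frame.
    onset = np.maximum(np.diff(energy, prepend=energy[0]), 0.0)
    trans = transition / (transition.max() + 1e-9)
    # Crude speech-activity flag from a fixed energy threshold (assumed).
    speech = (energy > 0.1).astype(float)
    return np.stack([energy, onset, trans, speech], axis=1)


def select_frames(features: np.ndarray, weights: np.ndarray, window: int = 2) -> np.ndarray:
    """Score frames as a weighted feature sum and keep the best frame per window."""
    scores = features @ weights
    return np.array([
        start + int(np.argmax(scores[start:start + window]))
        for start in range(0, len(scores), window)
    ])


if __name__ == "__main__":
    energy = np.array([0.0, 1.0, 0.2, 0.1, 0.9, 0.0])
    transition = np.array([0.0, 3.0, 1.0, 0.0, 2.0, 0.0])
    weights = np.ones(len(FEATURES))  # placeholder, not trained values
    print(select_frames(frame_features(energy, transition), weights))
```

Because the selector is just a dot product per frame, swapping in different weights changes which frames are kept without changing the token budget.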
The serving report consumes metrics.csv and writes:
serving_stack_report.json
serving_stack_report.md
It estimates transformer prefill attention work, decode KV-read reduction, and KV-cache savings. With --serving-microbench, it also runs a reference PyTorch transformer layer on representative token lengths.
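The kind of estimate described above can be sketched with the usual KV-cache sizing formula: two tensors (keys and values) per layer, each `n_kv_heads * head_dim` values per token. The model shape below is a made-up placeholder and the names are hypothetical; serving.py may use a different reference architecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerShape:
    # Placeholder architecture, chosen only for illustration.
    n_layers: int = 24
    n_kv_heads: int = 16
    head_dim: int = 64
    bytes_per_value: int = 2  # fp16 / bf16


def kv_cache_bytes(n_tokens: int, shape: TransformerShape) -> int:
    """Keys plus values, for every layer, for every cached token."""
    per_token = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * shape.bytes_per_value
    return per_token * n_tokens


def serving_estimate(baseline_tokens: int, kept_tokens: int,
                     shape: TransformerShape = TransformerShape()) -> dict:
    """Relative serving cost of a compressed stream versus the baseline."""
    ratio = kept_tokens / baseline_tokens
    saved = kv_cache_bytes(baseline_tokens, shape) - kv_cache_bytes(kept_tokens, shape)
    return {
        "kv_savings_mb": saved / 1e6,
        "decode_kv_read_ratio": ratio,         # KV reads per decode step scale linearly
        "prefill_attention_ratio": ratio ** 2,  # attention scores scale quadratically
    }


if __name__ == "__main__":
    print(serving_estimate(baseline_tokens=1000, kept_tokens=500))
```

Only the attention-score term of prefill is quadratic in sequence length; projections and MLP blocks scale linearly, so end-to-end prefill speedups will typically be smaller than the squared ratio suggests.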
Subjective listening artifacts are generated with publication artifacts:
listening_study.csv
listening_study.md
listening_study.json
The CSV is an anonymized rating sheet for MOS, intelligibility, speaker match, and artifact notes.
After ratings are filled in:
PYTHONPATH=src python3 -m audiotokenlab summarize-listening \
modal-runs/encodec_broader_speech_asr/listening_study.csv
This writes listening_study_rating_summary.json and listening_study_rating_summary.md.
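The summarization step amounts to grouping filled-in ratings by strategy and averaging each rating field. The sketch below shows that with the standard library; the column names (`strategy`, `mos`, `intelligibility`, `speaker_match`) and the sample rows are invented placeholders, and the real sheet is anonymized, so its layout and the de-anonymization step will differ.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

RATING_FIELDS = ("mos", "intelligibility", "speaker_match")  # assumed column names

# Placeholder rows for illustration only; these are not study results.
SAMPLE_CSV = """strategy,mos,intelligibility,speaker_match
strategy_a,2,3,2
strategy_a,3,3,
strategy_b,4,4,4
strategy_b,4,5,3
"""


def summarize(csv_text: str) -> dict:
    """Average each rating field per strategy, skipping blank cells."""
    ratings = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        for field in RATING_FIELDS:
            value = (row.get(field) or "").strip()
            if value:
                ratings[row["strategy"]][field].append(float(value))
    return {
        strategy: {field: round(mean(vals), 2) for field, vals in fields.items()}
        for strategy, fields in ratings.items()
    }


if __name__ == "__main__":
    for strategy, scores in summarize(SAMPLE_CSV).items():
        print(strategy, scores)
```

Skipping blank cells lets partially completed sheets be summarized without treating missing ratings as zeros.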
src/audiotokenlab/
  compression.py       compression strategies
  profiling.py         per-clip benchmark metrics
  asr_eval.py          WER/CER evaluation and bootstrap CIs
  speaker_eval.py      SpeechBrain speaker-similarity evaluation
  publication.py       chart and listening-example artifacts
  listening_study.py   subjective rating sheets
  serving.py           transformer-serving estimates and optional torch microbenchmarks
  corpora.py           broader speech dataset manifest builders
  reporting.py         CSV/JSON/HTML reporting
  tokenizers/          dummy, mu-law, and EnCodec tokenizer adapters
experiments/
  configs/             local run configs
  results/             tracked benchmark summaries and public artifacts
modal_app.py           Modal GPU benchmark entrypoint
REPORT.md              current benchmark report
Audio models pay for long audio contexts in tokens, just like text models do. Codec-token streams can be dense, and in autoregressive or transformer-style audio systems they affect:
- prefill latency
- decode latency
- KV-cache memory
- batching efficiency
- serving cost
- time-to-first-audio
Naively dropping audio tokens saves memory but can erase phonetics, timing, speaker identity, and prosody. AudioTokenLab is a measurement harness for finding better quality/cost tradeoffs.
AudioTokenLab is motivated by recent work on efficient audio-token modeling and long-form speech systems:
- TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech
- Speech-XL
- LLM-Codec
- Ultra-Low Latency Streaming Speech Synthesis via Block-Wise Generation
- Building Enterprise Realtime Voice Agents from Scratch
This repo is not trying to reproduce those systems end to end. It builds the measurement layer around the bottleneck they expose: audio-token cost.
- The tracked headline LibriSpeech and broader runs are engineering benchmarks, not publication-grade estimates.
- The held-out trained-selector run is compact and credit-aware at 30 clips, so confidence intervals are still wide.
- ASR uses faster-whisper evaluators, which are practical regression tests but not oracles for speech quality.
- Speaker similarity uses one pretrained embedding model.
- trained_selector_v1 is distilled from strategy-level benchmark outcomes, not trained from frame-level labels.
- Listening-study sheets and summarization are implemented, but human ratings are not included in the repo.
- KV-cache savings are architecture estimates, not measurements from a deployed audio transformer.
Completed:
- Local profiling pipeline
- Dummy and mu-law tokenizer baselines
- Optional EnCodec backend
- Modal GPU benchmark path
- LibriSpeech real-speech benchmark
- ASR WER/CER with bootstrap confidence intervals
- SpeechBrain speaker similarity
- Salience-based compression baselines
- Energy-salience tuning run
- Public report, chart, and listening examples
- Broader multi-corpus benchmark across LibriSpeech, MInDS-14, and FLEURS
- VAD-aware selector baseline
- Linear learned-selector integration point
- Trained selector weight artifact distilled from ASR and speaker outcomes
- Held-out trained-selector benchmark with faster-whisper small.en
- Subjective listening-study artifact generation
- Subjective listening-study rating summarizer
- Transformer serving-stack report and CUDA microbenchmark
Next:
- Collect human listening ratings and summarize MOS/intelligibility/speaker-match results
- Integrate a production-grade audio-token transformer or voice-agent serving loop
- Add more licensed or internal speech corpora through the source-configurable corpus builder
PYTHONPATH=src python3 -m unittest discover tests
This project is released under the MIT License.
Dataset and model licenses remain governed by their upstream providers.