The Kaldi open-source toolkit, developed in 2009 for automatic speech recognition (ASR), has played a foundational role in speech technology research. Its evolution continued in 2021 with the emergence of a follow-up trilogy (Icefall, k2, and Lhotse), demonstrating the rapid pace of innovation in speech algorithms and system design.
In response to the transformative changes in the field, such as the rise of large language models (LLMs), platforms like Hugging Face, and robust deep learning frameworks like PyTorch, this project aims to reimagine and modernize Kaldi's capabilities to meet the emerging needs of the speech research community.
The primary objective of k2 is to re-implement all core functions of Kaldi natively in generic AI/deep learning frameworks, with a focus on PyTorch. This allows the seamless integration of cutting-edge developments in deep learning (e.g., novel optimization algorithms) into speech recognition research. The primary goals of Lhotse and Icefall include delivering efficient, user-friendly tools for data preparation, recipe development, and training modern ASR models.
GitHub statistics for Lhotse, Icefall, and k2. *Within the last month (as of May 16, 2025).
| GitHub statistic | Lhotse | Icefall | k2 |
|---|---|---|---|
| Watch | 42 | 49 | 73 |
| Fork | 233 | 337 | 224 |
| Star | 1k | 1.1k | 1.2k |
| Dependent repositories | 252 | 0 | 49 |
| Merged PRs* | 5 | 5 | 1 |
| Open PRs* | 0 | 3 | 0 |
| Closed issues* | 1 | 14 | 0 |
| New issues* | 3 | 4 | 0 |
| Commits to master* | 7 | 5 | 1 |
| Additions* | 687 | 79 | 618 |
| Deletions* | 219 | 170 | 138 |
This project was supported by U.S. National Science Foundation Award Number 2120435, NSF CCRI project "CCRI: ENS: Next Generation Tools for Spoken Language Science & Technology."
Lhotse develops a modern approach to speech data preparation. Its design is inspired by data libraries commonly used in the ML community, such as pandas. Lhotse's philosophy may be summarized as "simple things should be simple, complex things should be possible."
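A minimal sketch of this workflow, assuming a local LibriSpeech copy and illustrative paths (recipe arguments may vary across Lhotse versions):

```python
from lhotse import CutSet
from lhotse.recipes import prepare_librispeech

# Each recipe scans the corpus and returns manifests, typically keyed as
# {split: {"recordings": RecordingSet, "supervisions": SupervisionSet}}.
manifests = prepare_librispeech("/data/LibriSpeech", output_dir="manifests")

cuts = CutSet.from_manifests(
    recordings=manifests["train-clean-100"]["recordings"],
    supervisions=manifests["train-clean-100"]["supervisions"],
)
# "Simple things simple": chainable, lazily evaluated transformations.
cuts = cuts.resample(16000).trim_to_supervisions()
cuts.to_file("manifests/cuts_train.jsonl.gz")
```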
- #1 Piotr Żelasko, 1,221 commits, 110,336 ++ / 43,656 --
- #2 Desh Raj, 248 commits, 29,279 ++ / 12,783 --
- #4 Jan (Yenda) Trmal, 33 commits, 2,093 ++ / 651 --
- #6 Amir Hussein, 28 commits, 2,747 ++ / 1,766 --
- #13 Matthew Wiesner, 13 commits, 3,425 ++ / 627 --
- #24 Yiming Wang, 7 commits, 215 ++ / 37 --
- #39 Dominik Klement, 2 commits, 1,602 ++ / 0 --
- #53 Matthew Maciejewski, 1 commit, 1,217 ++ / 0 --
- #74 Henry Li Xinyuan, 1 commit, 146 ++ / 0 --
- #76 Dongji Gao, 1 commit, 5 ++ / 3 --
GPU-accelerated Guided Source Separation (by Desh Raj)
An improved implementation of guided source separation (GSS) that leverages modern GPU-based pipelines, such as batched processing of frequencies and segments. This allows detailed ablation studies over several parameters of the GSS algorithm. Reproducible pipelines are provided for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting.
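To illustrate the kind of vectorization this enables, here is a toy PyTorch sketch (tensor shapes and the mask are illustrative assumptions, not the actual GSS code) that computes masked spatial covariance matrices for all frequency bins at once:

```python
import torch

# Toy illustration: classic CPU pipelines loop over frequency bins one at
# a time, while on a GPU all bins (and segments) can be one batch.
device = "cuda" if torch.cuda.is_available() else "cpu"
F, T, C = 257, 1000, 8  # frequency bins, frames, microphone channels
X = torch.randn(F, T, C, dtype=torch.complex64, device=device)  # multichannel STFT
mask = torch.rand(F, T, device=device)  # per-time-frequency source activity estimate

# Masked spatial covariance for *all* frequency bins in one einsum call,
# replacing an F-length Python loop.
scm = torch.einsum("ft,ftc,ftd->fcd", mask.to(X.dtype), X, X.conj())
scm = scm / mask.sum(dim=1).clamp(min=1e-8).view(F, 1, 1)
```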
- recipes/aishell3.py speech/asr
- recipes/atcosim.py speech/asr
- recipes/but_reverb_db.py reverberation database
- recipes/chime6.py speech/asr
- recipes/csj.py speech/asr
- recipes/cmu_kids.py speech/asr
- recipes/dipco.py speech/asr
- recipes/edacc.py speech/asr
- recipes/gigast.py speech-translation
- recipes/himia.py speaker-verification
- recipes/librilight.py speech/asr
- recipes/must_c.py speech-translation
- recipes/speechcommands.py speech/hotword-detection
- recipes/uwb_atcc.py speech/asr
- recipes/xbmu_amdo31.py speech/asr
- Fleurs speech/language-id
- radio stations speech/asr speech/language-id database
- SBCASE speech/diarization database
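These recipes share Lhotse's common interface: a module-level preparation function that scans the corpus and emits recording/supervision manifests. A minimal sketch, assuming the standard prepare_<corpus> naming convention and illustrative paths (check each module for its exact signature and split names):

```python
# Using the atcosim recipe from the list above as an example.
from lhotse.recipes.atcosim import prepare_atcosim

manifests = prepare_atcosim(corpus_dir="/data/ATCOSIM", output_dir="manifests")
# Some corpora return per-split dicts, others a flat manifest dict.
print(manifests.keys())
```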
Icefall is the project where k2 and Lhotse "meet". It provides the speech and language research community with a comprehensive collection of recipes for training modern speech processing systems on most of the popular speech datasets.
- #11 Piotr Żelasko, 18 commits, 993 ++ / 838 --
- #15 Ruizhe Huang, 7 commits, 95 ++ / 74 --
- #38 Dongji Gao, 2 commits, 9,565 ++ / 9 --
- #43 Henry Li Xinyuan, 1 commit, 2,124 ++ / 3 --
- #67 Amir Hussein, 1 commit, 6,114 ++ / 1 --
- #100 Jan "yenda" Trmal, 1 commit, 1 ++ / 1 --
- #2 Dan Povey, 200 commits, 13,323 ++ / 4,485 --
Continuous Streaming Multi-Talker ASR (by Desh Raj)
We investigated the Streaming Unmixing and Recognition Transducer (SURT) for continuous streaming multi-talker ASR, and demonstrated the effectiveness of dual-path LSTMs and Transformers for generalization to diverse session lengths (recipes for the LibriCSS, AMI, and ICSI datasets).
SPGISpeech (by Desh Raj)
We developed an Icefall recipe and trained models (Zipformer and stateless transducer models on Hugging Face) for SPGISpeech, a dataset consisting of 5,000 hours of recorded company earnings calls and their respective transcriptions.
MGB-2 (by Amir Hussein)
We developed an Icefall recipe and trained a model (Conformer-CTC model on Hugging Face) for the Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2) challenge on Arabic multi-dialect broadcast media.
We also developed recipes for contextual ASR: the process by which an ASR system is provided with contextual information derived from metadata associated with the audio, typically a list of words or phrases likely to be spoken, with the goal of improving recognition accuracy on named entities and other infrequent terms. Our work on contextual ASR is recognized for introducing the ConEC dataset (ConEC), followed by a method for improving neural biasing beyond shallow language model fusion (Pull request - Neural Biasing).
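As a minimal sketch of the general idea (illustrative only: the context list, bonus value, and function are hypothetical, and the neural-biasing PR goes beyond this shallow-fusion-style scoring), contextual biasing can be as simple as boosting hypotheses that match the context list during beam search:

```python
# Hypothetical biasing list derived from the audio's metadata.
CONTEXT = {"zelasko", "lhotse", "icefall"}
BONUS = 2.0  # log-score boost per matched context word (a tuning knob)

def biased_score(hyp_words: list[str], base_score: float) -> float:
    """Add a bonus for each context word the hypothesis contains."""
    matches = sum(1 for w in hyp_words if w.lower() in CONTEXT)
    return base_score + BONUS * matches

# During beam search, hypotheses would be ranked by biased_score(...)
# instead of the raw acoustic + LM score alone.
print(biased_score(["running", "lhotse", "recipes"], base_score=-12.3))
```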
Omni-temporal Classification (OTC) (by Dongji Gao, Paola Garcia, Matthew Wiesner)
Training ASR systems requires large amounts of well-curated paired data. However, human annotators usually produce "non-verbatim" transcriptions, and the resulting transcript errors can degrade the trained models. We designed and implemented Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts (OTC w/ BPE units, OTC w/ phone units).
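One ingredient of OTC can be sketched in a few lines of PyTorch (a toy illustration under assumed shapes, not the icefall implementation): a wildcard "star" token whose emission score aggregates all non-blank tokens, appended as an extra output column so the training graph can bypass unreliable transcript words at a controlled penalty:

```python
import torch

# (frames, vocab) network outputs; token id 0 is the CTC blank.
log_probs = torch.randn(10, 5).log_softmax(-1)

# Star emission = average non-blank probability, computed in log space.
star = torch.logsumexp(log_probs[:, 1:], dim=-1, keepdim=True) - torch.log(
    torch.tensor(float(log_probs.size(1) - 1))
)
# The star token gets id = vocab size; the OTC graph's penalized bypass
# and self-loop arcs then reference this extra column.
augmented = torch.cat([log_probs, star], dim=-1)
```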
ASR + LID (by Amir Hussein, Paola Garcia, Matthew Wiesner)
We created a multitask learning framework that synchronizes language identification (LID) with ASR, utilizing a neural transducer architecture. We demonstrated the efficacy of the proposed approach on conversational multilingual (Arabic, Spanish, Mandarin) and code-switched (Spanish-English, Mandarin-English) test sets (Pull request - ASR SEAME Recipe).
- Kneser-Ney language model smoothing
- LibriSpeech (partial contribution)
- Fluent Speech Commands recipe
- Recipe for the Geolocation dataset using Lhotse and Icefall
k2 brings data structures and algorithms from the field of finite state automata (FSA) into the world of deep learning. It provides efficient CPU and GPU implementations of commonly used FSA operations and integrates them seamlessly with PyTorch's tensor and automatic differentiation mechanisms, thus admitting, and benefiting from, the inner complexity of speech recognition instead of trying to remove it.
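A minimal sketch of this design, assuming k2 is installed and using illustrative shapes and labels (it mirrors the CTC-style training pipeline but is not copied from any particular recipe):

```python
import k2
import torch

# Single-utterance toy setup: (N, T, C) network outputs over 10 tokens.
log_probs = torch.randn(1, 50, 10).log_softmax(-1).requires_grad_(True)
# One supervision segment: [utterance index, start frame, num frames].
supervision = torch.tensor([[0, 0, 50]], dtype=torch.int32)

graph = k2.ctc_graph([[1, 3, 5]])  # CTC topology composed with the labels
dense = k2.DenseFsaVec(log_probs, supervision)
lattice = k2.intersect_dense(graph, dense, output_beam=10.0)
loss = -lattice.get_tot_scores(log_semiring=True, use_double_scores=True).sum()
loss.backward()  # gradients flow through the FSA ops back into log_probs
```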
- #12 Piotr Żelasko, 4 commits, 9,458 ++ / 276 --
- #13 Jan "yenda" Trmal, 4 commits, 9,314 ++ / 267 --
- #15 Yiming Wang, 3 commits, 234 ++ / 67 --
- #20 Desh Raj, 2 commits, 435 ++ / 29 --
- #27 Mahsa Yarmohammadi, 2 commits, 169 ++ / 51 --
- #35 Dongji Gao, 1 commit, 27 ++ / 10 --
- #2 Dan Povey, 214 commits, 73,771 ++ / 30,586 --
- Test whether an FSA is acyclic.
- Fast parallel computation of longest common prefixes for efficient pattern matching (kmp-LCP).
- Implementation of the Hybrid Autoregressive Transducer loss (HAT).
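For intuition, the first of these can be sketched as a serial three-color DFS in plain Python (a conceptual illustration only; k2's actual implementation is a parallel CPU/GPU algorithm over its tensor-based FSA representation):

```python
def is_acyclic(num_states: int, arcs: list[tuple[int, int]]) -> bool:
    """Detect cycles with an iterative three-color depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on stack / finished
    color = [WHITE] * num_states
    out = [[] for _ in range(num_states)]
    for src, dst in arcs:
        out[src].append(dst)
    for start in range(num_states):
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(out[start]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return False  # back edge => cycle
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(out[nxt])))
    return True

assert is_acyclic(3, [(0, 1), (1, 2)])
assert not is_acyclic(2, [(0, 1), (1, 0)])
```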
The CHiME-8 submission relied on Lhotse for data preparation, audio loading, data manipulation, and constructing the PyTorch dataloaders. Once in this format, it was possible to interface with standard Whisper training recipes in the Whisper GitHub repo or on Hugging Face; a dataloading sketch follows the list below.
- Data preparation for k2/Icefall and ESPnet.
- Prepares CHiME-8 data as Lhotse manifests.
- Manifest preparation for different toolkits (datasets included: DiPCo, Mixer 6, NOTSOFAR-1, CHiME-6).
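A minimal sketch of the Lhotse-to-PyTorch handoff (the cuts path, sampler settings, and dataset class are typical choices, not necessarily the ones used in the CHiME-8 system):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

cuts = CutSet.from_file("manifests/chime8_train_cuts.jsonl.gz")  # illustrative path
sampler = DynamicBucketingSampler(cuts, max_duration=100.0, shuffle=True)
dataset = K2SpeechRecognitionDataset()
# Lhotse samplers yield whole batches of cuts, hence batch_size=None.
dloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)

for batch in dloader:
    inputs = batch["inputs"]  # batched features/audio, ready for the model
    break
```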
Target Speaker ASR with Whisper (by BUT in collaboration with Dominik Klement, Matthew Wiesner)
The repository contains the official implementation of the following publications: Target Speaker Whisper and DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. It relied on Lhotse to homogenize data preparation across various datasets, such as AMI and LibriSpeech. This project is a collaboration between Alex Polok and Lukáš Burget from BUT and Dominik Klement and Matthew Wiesner.
- Data preparation (AMI, LibriSpeech, etc.)
Researchers and developers increasingly rely on the open-source platform Hugging Face for pre-trained models, datasets, and tools to efficiently build and deploy AI applications. The k2-fsa organization is available on Hugging Face. As of May 2025, it has published one dataset (LibriSpeech) and 18 models. Additionally, 30 HF Spaces have been released, offering inference APIs and demos for tasks such as speech recognition, text-to-speech, audio tagging, and spoken language identification using Next-gen Kaldi.