The Kaldi open-source toolkit, developed in 2009 for automatic speech recognition (ASR), has played a foundational role in speech technology research. Its evolution continued in 2021 with the emergence of a follow-up trilogy (Icefall, k2, and Lhotse), demonstrating the rapid pace of innovation in speech algorithms and system design.
In response to the transformative changes in the field, such as the rise of large language models (LLMs), platforms like Hugging Face, and robust deep learning frameworks like PyTorch, this project aims to reimagine and modernize Kaldi's capabilities to meet the emerging needs of the speech research community.
The primary objective of k2 is to re-implement all core functions of Kaldi natively in generic AI/deep learning frameworks, with a focus on PyTorch. This allows the seamless integration of cutting-edge developments in deep learning (e.g., novel optimization algorithms) into speech recognition research. The primary goals of Lhotse and Icefall include delivering efficient, user-friendly tools for data preparation, recipe development, and training modern ASR models.
GitHub statistics for Lhotse, Icefall, and k2. *Within the last month (as of May 16, 2025).
| GitHub statistic | Lhotse | Icefall | k2 |
|---|---|---|---|
| Watch | 42 | 49 | 73 |
| Fork | 233 | 337 | 224 |
| Star | 1k | 1.1k | 1.2k |
| Dependent repositories | 252 | 0 | 49 |
| Merged PRs* | 5 | 5 | 1 |
| Open PRs* | 0 | 3 | 0 |
| Closed issues* | 1 | 14 | 0 |
| New issues* | 3 | 4 | 0 |
| Commits to master* | 7 | 5 | 1 |
| Additions* | 687 | 79 | 618 |
| Deletions* | 219 | 170 | 138 |
This project was supported by U.S. National Science Foundation Award Number 2120435, NSF CCRI project "CCRI: ENS: Next Generation Tools for Spoken Language Science & Technology."
Lhotse develops a modern approach to speech data preparation. Its design is inspired by data libraries commonly used in the ML community, such as pandas. Lhotse's philosophy may be summarized as "simple things should be simple, complex things should be possible."
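A minimal sketch of this workflow, assuming a local LibriSpeech copy and illustrative paths (recipe arguments may vary across Lhotse versions):

```python
from lhotse import CutSet
from lhotse.recipes import prepare_librispeech

# Each recipe scans the corpus and returns manifests, typically keyed as
# {split: {"recordings": RecordingSet, "supervisions": SupervisionSet}}.
manifests = prepare_librispeech("/data/LibriSpeech", output_dir="manifests")

cuts = CutSet.from_manifests(
    recordings=manifests["train-clean-100"]["recordings"],
    supervisions=manifests["train-clean-100"]["supervisions"],
)
# "Simple things simple": chainable, lazily evaluated transformations.
cuts = cuts.resample(16000).trim_to_supervisions()
cuts.to_file("manifests/cuts_train.jsonl.gz")
```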
- #1 Piotr Żelasko, 1,221 commits, 110,336 ++ / 43,656 --
- #2 Desh Raj, 248 commits, 29,279 ++ / 12,783 --
- #4 Jan (Yenda) Trmal, 33 commits, 2,093 ++ / 651 --
- #6 Amir Hussein, 28 commits, 2,747 ++ / 1,766 --
- #13 Matthew Wiesner, 13 commits, 3,425 ++ / 627 --
- #24 Yiming Wang, 7 commits, 215 ++ / 37 --
- #39 Dominik Klement, 2 commits, 1,602 ++ / 0 --
- #53 Matthew Maciejewski, 1 commit, 1,217 ++ / 0 --
- #74 Henry Li Xinyuan, 1 commit, 146 ++ / 0 --
- #76 Dongji Gao, 1 commit, 5 ++ / 3 --
GPU-accelerated Guided Source Separation (by Desh Raj)
An improved implementation of guided source separation (GSS) that leverages modern GPU-based pipelines, such as batched processing of frequencies and segments. This allows detailed ablation studies over several parameters of the GSS algorithm. Reproducible pipelines are provided for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting.
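To illustrate the kind of vectorization this enables, here is a toy PyTorch sketch (tensor shapes and the mask are illustrative assumptions, not the actual GSS code) that computes masked spatial covariance matrices for all frequency bins at once:

```python
import torch

# Toy illustration: classic CPU pipelines loop over frequency bins one at
# a time, while on a GPU all bins (and segments) can be one batch.
device = "cuda" if torch.cuda.is_available() else "cpu"
F, T, C = 257, 1000, 8  # frequency bins, frames, microphone channels
X = torch.randn(F, T, C, dtype=torch.complex64, device=device)  # multichannel STFT
mask = torch.rand(F, T, device=device)  # per-time-frequency source activity estimate

# Masked spatial covariance for *all* frequency bins in one einsum call,
# replacing an F-length Python loop.
scm = torch.einsum("ft,ftc,ftd->fcd", mask.to(X.dtype), X, X.conj())
scm = scm / mask.sum(dim=1).clamp(min=1e-8).view(F, 1, 1)
```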
- recipes/aishell3.py speech/asr
- recipes/atcosim.py speech/asr
- recipes/but_reverb_db.py reverberation database
- recipes/chime6.py speech/asr
- recipes/csj.py speech/asr
- recipes/cmu_kids.py speech/asr
- recipes/dipco.py speech/asr
- recipes/edacc.py speech/asr
- recipes/gigast.py speech-translation
- recipes/himia.py speaker-verification
- recipes/librilight.py speech/asr
- recipes/must_c.py speech-translation
- recipes/speechcommands.py speech/hotword-detection
- recipes/uwb_atcc.py speech/asr
- recipes/xbmu_amdo31.py speech/asr
- Fleurs speech/language-id
- radio stations speech/asr speech/language-id database
- SBCASE speech/diarization database
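These recipes share Lhotse's common interface: a module-level preparation function that scans the corpus and emits recording/supervision manifests. A minimal sketch, assuming the standard prepare_<corpus> naming convention and illustrative paths (check each module for its exact signature and split names):

```python
# Using the atcosim recipe from the list above as an example.
from lhotse.recipes.atcosim import prepare_atcosim

manifests = prepare_atcosim(corpus_dir="/data/ATCOSIM", output_dir="manifests")
# Some corpora return per-split dicts, others a flat manifest dict.
print(manifests.keys())
```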
Icefall is the project where k2 and Lhotse "meet". It provides the speech and language research community with a comprehensive collection of recipes for training modern speech processing systems on most of the popular speech datasets.
- #11 Piotr Żelasko, 18 commits, 993 ++ / 838 --
- #15 Ruizhe Huang, 7 commits, 95 ++ / 74 --
- #38 Dongji Gao, 2 commits, 9,565 ++ / 9 --
- #43 Henry Li Xinyuan, 1 commit, 2,124 ++ / 3 --
- #67 Amir Hussein, 1 commit, 6,114 ++ / 1 --
- #100 Jan "yenda" Trmal, 1 commit, 1 ++ / 1 --
- #2 Dan Povey, 200 commits, 13,323 ++ / 4,485 --
Continuous Streaming Multi-Talker ASR (by Desh Raj)
We investigated the Streaming Unmixing and Recognition Transducer (SURT) for continuous streaming multi-talker ASR, and demonstrated the effectiveness of dual-path LSTMs and Transformers for generalization to diverse session lengths (recipes for the LibriCSS, AMI, and ICSI datasets).
SPGISpeech (by Desh Raj)
We developed an Icefall recipe and trained models (Zipformer and stateless transducer models on Hugging Face) for SPGISpeech, a dataset consisting of 5,000 hours of recorded company earnings calls and their respective transcriptions.
MGB-2 (by Amir Hussein)
We developed an Icefall recipe and trained a model (Conformer-CTC model on Hugging Face) for the Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2) challenge on Arabic multi-dialect broadcast media.
We also developed recipes for contextual ASR: the process by which an ASR system is provided with contextual information derived from metadata associated with the audio, typically a list of words or phrases likely to be spoken, with the goal of improving recognition accuracy on named entities and other infrequent terms. Our work on contextual ASR is recognized for introducing the ConEC dataset (ConEC), followed by a method for improving neural biasing beyond shallow language model fusion (Pull request - Neural Biasing).
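As a minimal sketch of the general idea (illustrative only: the context list, bonus value, and function are hypothetical, and the neural-biasing PR goes beyond this shallow-fusion-style scoring), contextual biasing can be as simple as boosting hypotheses that match the context list during beam search:

```python
# Hypothetical biasing list derived from the audio's metadata.
CONTEXT = {"zelasko", "lhotse", "icefall"}
BONUS = 2.0  # log-score boost per matched context word (a tuning knob)

def biased_score(hyp_words: list[str], base_score: float) -> float:
    """Add a bonus for each context word the hypothesis contains."""
    matches = sum(1 for w in hyp_words if w.lower() in CONTEXT)
    return base_score + BONUS * matches

# During beam search, hypotheses would be ranked by biased_score(...)
# instead of the raw acoustic + LM score alone.
print(biased_score(["running", "lhotse", "recipes"], base_score=-12.3))
```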
Omni-temporal Classification (OTC) (by Dongji Gao, Paola Garcia, Matthew Wiesner)
Training ASR systems requires large amounts of well-curated paired data. However, human annotators usually produce "non-verbatim" transcriptions, and the resulting transcript errors can degrade the trained models. We designed and implemented Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts (OTC w/ BPE units, OTC w/ phone units).
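One ingredient of OTC can be sketched in a few lines of PyTorch (a toy illustration under assumed shapes, not the icefall implementation): a wildcard "star" token whose emission score aggregates all non-blank tokens, appended as an extra output column so the training graph can bypass unreliable transcript words at a controlled penalty:

```python
import torch

# (frames, vocab) network outputs; token id 0 is the CTC blank.
log_probs = torch.randn(10, 5).log_softmax(-1)

# Star emission = average non-blank probability, computed in log space.
star = torch.logsumexp(log_probs[:, 1:], dim=-1, keepdim=True) - torch.log(
    torch.tensor(float(log_probs.size(1) - 1))
)
# The star token gets id = vocab size; the OTC graph's penalized bypass
# and self-loop arcs then reference this extra column.
augmented = torch.cat([log_probs, star], dim=-1)
```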
ASR + LID (by Amir Hussein, Paola Garcia, Matthew Wiesner)
We created a multitask learning framework that synchronizes language identification (LID) with ASR, utilizing a neural transducer architecture. We demonstrated the efficacy of the proposed approach on conversational multilingual (Arabic, Spanish, Mandarin) and code-switched (Spanish-English, Mandarin-English) test sets (Pull request - ASR SEAME Recipe).
- Kneser-Ney language model smoothing
- LibriSpeech (partial contribution)
- Fluent Speech Commands recipe
- Recipe for the Geolocation dataset using Lhotse and Icefall
k2 brings data structures and algorithms from the field of finite state automata (FSA) into the world of deep learning. It provides efficient CPU and GPU implementations of commonly used FSA operations and integrates them seamlessly with PyTorch's tensor and automatic differentiation mechanisms, thus admitting, and benefiting from, the inner complexity of speech recognition instead of trying to remove it.
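A minimal sketch of this design, assuming k2 is installed and using illustrative shapes and labels (it mirrors the CTC-style training pipeline but is not copied from any particular recipe):

```python
import k2
import torch

# Single-utterance toy setup: (N, T, C) network outputs over 10 tokens.
log_probs = torch.randn(1, 50, 10).log_softmax(-1).requires_grad_(True)
# One supervision segment: [utterance index, start frame, num frames].
supervision = torch.tensor([[0, 0, 50]], dtype=torch.int32)

graph = k2.ctc_graph([[1, 3, 5]])  # CTC topology composed with the labels
dense = k2.DenseFsaVec(log_probs, supervision)
lattice = k2.intersect_dense(graph, dense, output_beam=10.0)
loss = -lattice.get_tot_scores(log_semiring=True, use_double_scores=True).sum()
loss.backward()  # gradients flow through the FSA ops back into log_probs
```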
- #12 Piotr Żelasko, 4 commits, 9,458 ++ / 276 --
- #13 Jan "yenda" Trmal, 4 commits, 9,314 ++ / 267 --
- #15 Yiming Wang, 3 commits, 234 ++ / 67 --
- #20 Desh Raj, 2 commits, 435 ++ / 29 --
- #27 Mahsa Yarmohammadi, 2 commits, 169 ++ / 51 --
- #35 Dongji Gao, 1 commit, 27 ++ / 10 --
- #2 Dan Povey, 214 commits, 73,771 ++ / 30,586 --
- Test whether an FSA is acyclic.
- Fast parallel computation of longest common prefixes for efficient pattern matching (kmp-LCP).
- Implementation of the Hybrid Autoregressive Transducer loss (HAT).
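For intuition, the first of these can be sketched as a serial three-color DFS in plain Python (a conceptual illustration only; k2's actual implementation is a parallel CPU/GPU algorithm over its tensor-based FSA representation):

```python
def is_acyclic(num_states: int, arcs: list[tuple[int, int]]) -> bool:
    """Detect cycles with an iterative three-color depth-first search."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on stack / finished
    color = [WHITE] * num_states
    out = [[] for _ in range(num_states)]
    for src, dst in arcs:
        out[src].append(dst)
    for start in range(num_states):
        if color[start] != WHITE:
            continue
        color[start] = GRAY
        stack = [(start, iter(out[start]))]
        while stack:
            node, it = stack[-1]
            nxt = next(it, None)
            if nxt is None:
                color[node] = BLACK
                stack.pop()
            elif color[nxt] == GRAY:
                return False  # back edge => cycle
            elif color[nxt] == WHITE:
                color[nxt] = GRAY
                stack.append((nxt, iter(out[nxt])))
    return True

assert is_acyclic(3, [(0, 1), (1, 2)])
assert not is_acyclic(2, [(0, 1), (1, 0)])
```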
The CHiME-8 submission relied on Lhotse for data preparation, audio loading, data manipulation, and constructing the PyTorch dataloaders. Once in this format, it was possible to interface with standard Whisper training recipes in the Whisper GitHub repo or on Hugging Face; a dataloading sketch follows the list below.
- Data preparation for k2/Icefall and ESPnet.
- Prepares CHiME-8 data as Lhotse manifests.
- Manifest preparation for different toolkits (datasets included: DiPCo, Mixer 6, NOTSOFAR-1, CHiME-6).
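A minimal sketch of the Lhotse-to-PyTorch handoff (the cuts path, sampler settings, and dataset class are typical choices, not necessarily the ones used in the CHiME-8 system):

```python
from torch.utils.data import DataLoader
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

cuts = CutSet.from_file("manifests/chime8_train_cuts.jsonl.gz")  # illustrative path
sampler = DynamicBucketingSampler(cuts, max_duration=100.0, shuffle=True)
dataset = K2SpeechRecognitionDataset()
# Lhotse samplers yield whole batches of cuts, hence batch_size=None.
dloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)

for batch in dloader:
    inputs = batch["inputs"]  # batched features/audio, ready for the model
    break
```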
Target Speaker ASR with Whisper (by BUT in collaboration with Dominik Klement, Matthew Wiesner)
The repository contains the official implementation of the following publications: Target Speaker Whisper and DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. It relied on Lhotse to homogenize data preparation across various datasets, such as AMI and LibriSpeech. This project is a collaboration between Alex Polok and Lukáš Burget from BUT and Dominik Klement and Matthew Wiesner.
- Data preparation (AMI, LibriSpeech, etc.)
Researchers and developers increasingly rely on the open-source platform Hugging Face for pre-trained models, datasets, and tools to efficiently build and deploy AI applications. The k2-fsa organization is available on Hugging Face. As of May 2025, it has published one dataset (LibriSpeech) and 18 models. Additionally, 30 HF Spaces have been released, offering inference APIs and demos for tasks such as speech recognition, text-to-speech, audio tagging, and spoken language identification using Next-gen Kaldi.