
🔹 Introduction

[Logos: Icefall, k2, Lhotse]

The Kaldi open-source toolkit, developed in 2009 for automatic speech recognition (ASR), has played a foundational role in speech technology research. Its evolution continued in 2021 with the emergence of a follow-up trilogy (Icefall, k2, and Lhotse), demonstrating the rapid pace of innovation in speech algorithms and system design.

In response to the transformative changes in the field, such as the rise of large language models (LLMs), platforms like Hugging Face, and robust deep learning frameworks like PyTorch, this project aims to reimagine and modernize Kaldi's capabilities to meet the emerging needs of the speech research community.

The primary objective of k2 is to re-implement all core functions of Kaldi natively in generic AI/deep learning frameworks, with a focus on PyTorch. This allows seamless integration of cutting-edge developments in deep learning (e.g., novel optimization algorithms) into speech recognition research. Lhotse and Icefall, in turn, aim to deliver efficient, user-friendly tools for data preparation, recipe development, and training modern ASR models.

GitHub statistics for Lhotse, Icefall, and k2 (* = within the last month, as of May 16, 2025):

| GitHub statistic | Lhotse | Icefall | k2 |
|---|---|---|---|
| Watchers | 42 | 49 | 73 |
| Forks | 233 | 337 | 224 |
| Stars | 1k | 1.1k | 1.2k |
| Dependent repositories | 252 | 0 | 49 |
| Merged PRs* | 5 | 5 | 1 |
| Open PRs* | 0 | 3 | 0 |
| Closed issues* | 1 | 14 | 0 |
| New issues* | 3 | 4 | 0 |
| Commits to master* | 7 | 5 | 1 |
| Additions* | 687 | 79 | 618 |
| Deletions* | 219 | 170 | 138 |

Acknowledgments:

This project was supported by U.S. National Science Foundation Award Number 2120435, the NSF CCRI project "CCRI: ENS: Next Generation Tools for Spoken Language Science & Technology".

🔹 Projects

Lhotse

Lhotse implements a modern approach to speech data preparation. Its design is inspired by data libraries widely used in the ML community, such as pandas. Lhotse's philosophy can be summarized as "simple things should be simple, complex things should be possible."
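
As a small, hedged illustration of that philosophy, the sketch below prepares a LibriSpeech-style corpus with Lhotse's recipe API (the corpus and output paths are placeholders):

```python
from lhotse import CutSet
from lhotse.recipes import prepare_librispeech

# Scan the corpus and write recording/supervision manifests to disk.
manifests = prepare_librispeech(
    corpus_dir="/data/LibriSpeech",  # placeholder path
    output_dir="data/manifests",
)
part = manifests["train-clean-100"]

# Cuts bind together audio, supervisions, and (optionally) features.
cuts = CutSet.from_manifests(
    recordings=part["recordings"],
    supervisions=part["supervisions"],
)

# Pandas-like, chainable transformations over the dataset.
cuts = cuts.resample(16000).trim_to_supervisions()
```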

🎨 JHU Contributors

  • #1 Piotr Żelasko, 1,221 commits 🟩 110,336 ++ 🔴 43,656 --
  • #2 Desh Raj, 248 commits 🟩 29,279 ++ 🔴 12,783 --
  • #4 Jan (Yenda) Trmal, 33 commits 🟩 2,093 ++ 🔴 651 --
  • #6 Amir Hussein, 28 commits 🟩 2,747 ++ 🔴 1,766 --
  • #13 Matthew Wiesner, 13 commits 🟩 3,425 ++ 🔴 627 --
  • #24 Yiming Wang, 7 commits 🟩 215 ++ 🔴 37 --
  • #39 Dominik Klement, 2 commits 🟩 1,602 ++ 🔴 0 --
  • #53 Matthew Maciejewski, 1 commit 🟩 1,217 ++ 🔴 0 --
  • #74 Henry Li Xinyuan, 1 commit 🟩 146 ++ 🔴 0 --
  • #76 Dongji Gao, 1 commit 🟩 5 ++ 🔴 3 --

🎨 GPU-accelerated Guided Source Separation (by Desh Raj)

An improved implementation of GSS that leverages the power of modern GPU-based pipelines, such as batched processing of frequencies and segments. This makes it practical to perform detailed ablation studies over several parameters of the GSS algorithm. Reproducible pipelines are provided for speaker-attributed transcription of popular meeting benchmarks: LibriCSS, AMI, and AliMeeting.
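
The sketch below is a toy illustration, not the actual gss package API: it shows the kind of frequency-batched tensor operation a GPU pipeline exploits, computing mask-weighted spatial covariance (PSD) matrices for all frequency bins at once instead of looping over them.

```python
import torch

# X: multichannel complex STFT, shape (channels, freqs, frames);
# mask: per-source time-frequency mask, shape (freqs, frames).
def batched_psd(X: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    weighted = X * mask.unsqueeze(0)  # broadcast the mask over channels
    psd = torch.einsum("cft,dft->fcd", weighted, X.conj())
    return psd / mask.sum(dim=-1).clamp(min=1e-10).view(-1, 1, 1)

C, F, T = 8, 257, 1000
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(C, F, T, dtype=torch.complex64, device=device)
mask = torch.rand(F, T, device=device)
print(batched_psd(X, mask).shape)  # torch.Size([257, 8, 8])
```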

🎨 Lhotse recipes


Icefall

Icefall is the project where k2 and Lhotse "meet". It provides the speech and language research community with a comprehensive collection of recipes for training modern speech processing systems on most of the popular speech datasets.

🎨 JHU Contributors

  • #11 Piotr Żelasko, 18 commits 🟩 993 ++ 🔴 838 --
  • #15 Ruizhe Huang, 7 commits 🟩 95 ++ 🔴 74 --
  • #38 Dongji Gao, 2 commits 🟩 9,565 ++ 🔴 9 --
  • #43 Henry Li Xinyuan, 1 commit 🟩 2,124 ++ 🔴 3 --
  • #67 Amir Hussein, 1 commit 🟩 6,114 ++ 🔴 1 --
  • #100 Jan "yenda" Trmal, 1 commit 🟩 1 ++ 🔴 1 --

🎨 External Contributors

  • #2 Dan Povey, 200 commits 🟩 13,323 ++ 🔴 4,485 --

🎨 Continuous Streaming Multi-Talker ASR (by Desh Raj)

We investigated the Streaming Unmixing and Recognition Transducer (SURT) for continuous streaming multi-talker ASR, and demonstrated the effectiveness of dual-path LSTMs and Transformers for generalization to diverse session lengths (recipes for the LibriCSS, AMI, and ICSI datasets).
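
The dual-path pattern itself can be sketched in a few lines of PyTorch. This is an illustration of the general technique (chunked intra/inter-chunk modeling), not the SURT recipe; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Chunk the time axis; model within-chunk, then across-chunk context."""
    def __init__(self, dim: int, chunk: int):
        super().__init__()
        self.chunk = chunk
        # Bidirectional within a chunk (bounded latency), causal across chunks.
        self.intra = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape          # time assumed divisible by chunk size here
        n = t // self.chunk
        x = x.reshape(b * n, self.chunk, d)
        x = self.intra(x)[0]       # intra-chunk (local) modeling
        x = x.reshape(b, n, self.chunk, d).transpose(1, 2).reshape(b * self.chunk, n, d)
        x = self.inter(x)[0]       # inter-chunk (long-range) modeling
        return x.reshape(b, self.chunk, n, d).transpose(1, 2).reshape(b, t, d)

out = DualPathBlock(dim=256, chunk=50)(torch.randn(2, 400, 256))
print(out.shape)  # torch.Size([2, 400, 256])
```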

🎨 SPGISpeech (by Desh Raj)

We developed an Icefall recipe and trained models (zipformer and stateless transducer models on Hugging Face) for SPGISpeech, a dataset consisting of 5,000 hours of recorded company earnings calls and their respective transcriptions.

🎨 MGB-2 (by Amir Hussein)

We developed an Icefall recipe and trained a model (conformer-ctc model on Hugging Face) for the Multi-Dialect Broadcast News Arabic Speech Recognition (MGB-2) challenge, which targets recognition of Arabic multi-dialect broadcast media.

🎨 Contextual ASR (by Ruizhe Huang, Mahsa Yarmohammadi)

We developed recipes for contextual ASR, the process by which an ASR system is given contextual information derived from metadata associated with the audio, typically a list of words or phrases likely to be spoken, with the goal of improving recognition accuracy on named entities and other infrequent terms. Our work introduced the ConEC dataset (ConEC), followed by a method for improving neural biasing beyond shallow language-model fusion (Pull request - Neural Biasing).
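
The toy sketch below illustrates the basic idea behind contextual biasing; the phrase list, bonus value, and scoring hook are hypothetical, not the actual icefall interface.

```python
# Hypotheses that extend a phrase from the biasing list receive a score
# bonus during beam search.
BIAS_PHRASES = {("acme", "corp"), ("lhotse",)}  # hypothetical biasing list
PREFIXES = {p[:i] for p in BIAS_PHRASES for i in range(1, len(p) + 1)}

def biased_score(base_score: float, words: tuple, bonus: float = 2.0) -> float:
    # Reward the longest suffix of the hypothesis that is a prefix of some
    # biasing phrase, so the bonus guides the beam before the phrase ends.
    for k in range(len(words), 0, -1):
        if words[-k:] in PREFIXES:
            return base_score + bonus * k
    return base_score

print(biased_score(-5.0, ("we", "called", "acme")))  # partial match: -3.0
print(biased_score(-5.0, ("acme", "corp")))          # full phrase:   -1.0
```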

🎨 Omni-temporal Classification (OTC) (by Dongji Gao, Paola Garcia, Matthew Wiesner)

Training ASR systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. We designed and implemented Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts (OTC w/ BPE units, OTC w/ phone units).
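
The toy sketch below conveys the flavor of the idea under simplified assumptions (the arc format and penalty values are made up): a linear graph over the given transcript is augmented with penalized wildcard arcs, so training can tolerate transcript words that may not match what was said as well as speech that was never transcribed.

```python
# Arcs are (src_state, dst_state, label, score) tuples.
STAR = "<star>"  # wildcard token added to the vocabulary

def otc_style_graph(transcript):
    arcs = []
    for i, word in enumerate(transcript):
        arcs.append((i, i, STAR, -2.0))      # speech that was never transcribed
        arcs.append((i, i + 1, word, 0.0))   # the transcript word, taken verbatim
        arcs.append((i, i + 1, STAR, -3.0))  # transcript word may be wrong: bypass it
    arcs.append((len(transcript), len(transcript), STAR, -2.0))
    return arcs

for arc in otc_style_graph(["the", "cat", "sat"]):
    print(arc)
```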

🎨 ASR + LID (by Amir Hussein, Paola Garcia, Matthew Wiesner)

We created a multitask learning framework that synchronizes language identification (LID) with ASR using a neural transducer architecture, and demonstrated the efficacy of the approach on conversational multilingual (Arabic, Spanish, Mandarin) and code-switched (Spanish-English, Mandarin-English) test sets (Pull request - ASR SEAME Recipe).
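
A hedged sketch of the multitask idea follows: one shared encoder feeds a transducer branch (ASR) and a frame-level LID classifier. The LSTM encoder, module sizes, and loss weighting are illustrative assumptions, not the actual recipe.

```python
import torch
import torch.nn as nn

class AsrLidEncoder(nn.Module):
    def __init__(self, feat_dim=80, enc_dim=256, vocab_size=500, num_langs=3):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        self.asr_proj = nn.Linear(enc_dim, vocab_size)  # feeds the transducer joiner
        self.lid_proj = nn.Linear(enc_dim, num_langs)   # per-frame language tags

    def forward(self, feats):
        enc, _ = self.encoder(feats)
        return self.asr_proj(enc), self.lid_proj(enc)

model = AsrLidEncoder()
asr_logits, lid_logits = model(torch.randn(4, 200, 80))
# Training would combine a transducer loss with a weighted LID loss, e.g.
# loss = rnnt_loss + lam * F.cross_entropy(lid_logits.transpose(1, 2), lid_targets)
```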

🎨 Other recipes


k2

k2 brings data structures and algorithms from the field of finite-state automata (FSA) into the world of deep learning. It provides efficient CPU and GPU implementations of commonly used FSA operations and integrates them seamlessly with PyTorch's tensor and automatic differentiation mechanisms, thus admitting, and benefiting from, the inner complexity of speech recognition instead of trying to remove it.

🎨 JHU Contributors

  • #12 Piotr Żelasko, 4 commits 🟩 9,458 ++ 🔴 276 --
  • #13 Jan "yenda" Trmal, 4 commits 🟩 9,314 ++ 🔴 267 --
  • #15 Yiming Wang, 3 commits 🟩 234 ++ 🔴 67 --
  • #20 Desh Raj, 2 commits 🟩 435 ++ 🔴 29 --
  • #27 Mahsa Yarmohammadi, 2 commits 🟩 169 ++ 🔴 51 --
  • #35 Dongji Gao, 1 commit 🟩 27 ++ 🔴 10 --

🎨 External Contributors

  • #2 Dan Povey, 214 commits 🟩 73,771 ++ 🔴 30,586 --

🎨 k2 code contributions

  • Test whether an FSA is acyclic (see the sketch after this list).
  • Fast parallel computation of longest common prefixes for efficient pattern matching (kmp-LCP).
  • Implementation of the Hybrid Autoregressive Transducer loss (HAT).
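
For the first item, a minimal framework-free sketch of the acyclicity check is given below; it illustrates the concept only (k2's implementation operates on arc tensors and also runs on the GPU).

```python
def is_acyclic(num_states: int, arcs: list) -> bool:
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on the DFS stack / finished
    color = [WHITE] * num_states
    succ = [[] for _ in range(num_states)]
    for src, dst in arcs:
        succ[src].append(dst)

    def dfs(u: int) -> bool:
        color[u] = GRAY
        for v in succ[u]:
            if color[v] == GRAY:  # back edge: a cycle exists
                return False
            if color[v] == WHITE and not dfs(v):
                return False
        color[u] = BLACK
        return True

    return all(color[u] != WHITE or dfs(u) for u in range(num_states))

print(is_acyclic(3, [(0, 1), (1, 2)]))          # True
print(is_acyclic(3, [(0, 1), (1, 2), (2, 0)]))  # False
```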

🔹 Other projects

Other

🎨 CHiME-8 submission

The CHiME-8 submission relied on Lhotse for data preparation, audio loading, data manipulation, and constructing the PyTorch dataloaders. Once the data was in this format, it was possible to interface with standard Whisper training recipes in the Whisper GitHub repository or on Hugging Face.
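
A hedged sketch of that hand-off, assuming a cut manifest with precomputed features (the manifest path is a placeholder):

```python
import torch
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

cuts = CutSet.from_file("data/manifests/chime8_cuts.jsonl.gz")  # placeholder
sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
dataset = K2SpeechRecognitionDataset()
# Lhotse samplers yield whole mini-batches of cuts, hence batch_size=None.
loader = torch.utils.data.DataLoader(dataset, sampler=sampler, batch_size=None)

batch = next(iter(loader))
features = batch["inputs"]  # (batch, time, feat_dim) tensor, ready for training
```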

🎨 Target Speaker ASR with Whisper (by BUT in collaboration with Dominik Klement, Matthew Wiesner)

The repository contains the official implementation of the following publications: Target Speaker Whisper and DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition. It relies on Lhotse to homogenize data preparation across datasets such as AMI and LibriSpeech. The project is a collaboration between Alex Polok and Lukáš Burget from BUT and Dominik Klement and Matthew Wiesner.

🎨 Hugging Face

Researchers and developers increasingly rely on the open-source platform Hugging Face for pre-trained models, datasets, and tools to efficiently build and deploy AI applications. The k2-fsa organization is available on Hugging Face. As of this writing, it has published one dataset (LibriSpeech) and 18 models. Additionally, 30 Hugging Face Spaces have been released, offering inference APIs and demos for tasks such as speech recognition, text-to-speech, audio tagging, and spoken language identification using Next-gen Kaldi.
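
As a small sketch, the public huggingface_hub client can enumerate the organization's artifacts (the counts quoted above reflect May 2025):

```python
from huggingface_hub import HfApi

api = HfApi()
models = list(api.list_models(author="k2-fsa"))
datasets = list(api.list_datasets(author="k2-fsa"))
spaces = list(api.list_spaces(author="k2-fsa"))
print(f"models={len(models)} datasets={len(datasets)} spaces={len(spaces)}")
```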
