State-of-the-Art Offline Speech Recognition model for Romanian.

gabitza-tech/SpeD-RoASR

Open Source State-Of-the-Art Solution for Romanian Speech Recognition — SpeD 2025 🗣️

This repository contains the codebase and resources developed for the paper:

Open Source State-Of-the-Art Solution for Romanian Speech Recognition
Presented at SpeD 2025 – International Conference on Speech Technology and Human–Computer Dialogue

This project is built on top of the NVIDIA NeMo framework (version 2.3.1) and focuses on developing a Romanian Offline ASR model based on the FastConformer Hybrid TDT-CTC architecture. We further enhance decoding with an external KenLM N-gram language model for improved accuracy.

Our system achieves state-of-the-art (SOTA) performance across 7 different Romanian evaluation datasets.


🧠 Overview

  • 🏆 Model: FastConformer Hybrid TDT-CTC
  • 🇷🇴 Language: Romanian
  • 📊 Benchmark: 7 public evaluation datasets
  • 🪄 Features:
    • Offline ASR for Romanian
    • N-gram language model trained on Romanian text
    • Fully reproducible evaluation setup

This repository provides:

  • The model training and inference pipeline (adapted from NeMo)

  • Evaluation scripts and annotations used for benchmarking

  • Ready-to-use pretrained models hosted on Hugging Face

  • The manifests/ folder, with the annotation manifests used to evaluate our ASR model on 7 datasets, so that other researchers can benchmark their own systems under identical conditions and compare results fairly in future studies.


🤗 Hugging Face Models

These models can be directly integrated with the pipeline in this repository.


⚡ Installation Instructions

This project builds on top of NVIDIA NeMo.
We recommend installing it via Conda + Pip for most use cases.

1. Conda / Pip (Recommended)

# Create and activate a fresh environment
conda create --name nemo python==3.10.12
conda activate nemo

# Install NeMo Toolkit
pip install "nemo_toolkit[all]"

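If you want a specific NeMo version (this project was built against 2.3.1), install from source:

git clone https://github.com/NVIDIA/NeMo
cd NeMo
# REF is the branch or tag to check out; it defaults to main when unset
git checkout "${REF:-main}"
pip install '.[all]'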

2. Install Specific Domains

This project primarily uses the ASR domain, but you may install other domains if needed.

pip install "nemo_toolkit[asr]"        # ASR domain (required)
pip install "nemo_toolkit[nlp]"        # Optional
pip install "nemo_toolkit[tts]"        # Optional
pip install "nemo_toolkit[vision]"     # Optional
pip install "nemo_toolkit[multimodal]" # Optional
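
Since the project targets NeMo 2.3.1, it can help to confirm which version ended up in the environment. The following is a minimal sketch using only the Python standard library. The helper names `installed_version` and `matches_target` are illustrative, and the distribution name `nemo_toolkit` is assumed to match the name used in the `pip install` commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_target(installed, target="2.3.1"):
    """True when the installed version equals the target, ignoring a local '+...' suffix."""
    return installed is not None and installed.split("+")[0] == target


if __name__ == "__main__":
    # "nemo_toolkit" is assumed to be the distribution name installed by pip.
    found = installed_version("nemo_toolkit")
    if found is None:
        print("nemo_toolkit is not installed in this environment")
    elif matches_target(found):
        print(f"nemo_toolkit {found} matches the version this project targets")
    else:
        print(f"nemo_toolkit {found} is installed; this project targets 2.3.1")
```

Other NeMo versions may work as well; the check only reports whether the environment differs from the one described here.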

Evaluation & Benchmarking

This model was evaluated on 7 Romanian datasets, covering various domains and accents. The manifests/ folder includes the exact annotation files used for these benchmarks, ensuring reproducible research.
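
To compare another system against these manifests, a corpus-level word error rate can be computed once each manifest entry has a reference and a hypothesis. The sketch below is illustrative and is not the scoring code used in this repository. It assumes the usual NeMo JSON-lines layout (one JSON object per line) with the reference under `text` and the hypothesis under `pred_text`; both key names are assumptions and can be overridden. It applies no text normalization, so casing and punctuation differences count as errors.

```python
import json


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(records, ref_key="text", hyp_key="pred_text"):
    """Total word errors over total reference words, as a percentage."""
    errors = 0
    words = 0
    for rec in records:
        ref = rec[ref_key].split()
        hyp = rec[hyp_key].split()
        errors += edit_distance(ref, hyp)
        words += len(ref)
    return 100.0 * errors / words if words else 0.0


def read_manifest(path):
    """Read a JSON-lines manifest: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `corpus_wer(read_manifest(path))` would score a file written by an evaluation run, provided it carries both fields.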

For example, to evaluate your own model with an N-gram language model on the SSC-eval1 dataset:

cd examples/asr

python3 speech_to_text_eval.py \
dataset_manifest=../../manifests/SSC-eval1_manifest.json \
model_path=... \
output_filename=... \
decoder_type=ctc \
ctc_decoding.strategy=beam \
ctc_decoding.beam.kenlm_path=... \
ctc_decoding.beam.beam_alpha=... \
ctc_decoding.beam.beam_beta=... \
ctc_decoding.beam.beam_size=...
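
In CTC beam search with an external N-gram model, `beam_alpha` and `beam_beta` are commonly the language model weight and a word-count bonus. The sketch below illustrates that common scoring rule on already-generated hypotheses; it is not NeMo's decoder, and the exact formula used internally may differ. The function names, field names, and all numeric values are hypothetical.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha, beta):
    """Acoustic log-probability plus weighted LM log-probability plus a word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def rerank(hypotheses, alpha, beta):
    """Sort hypotheses from best to worst under the fused score."""
    return sorted(
        hypotheses,
        key=lambda h: fused_score(
            h["acoustic_logp"], h["lm_logp"], len(h["text"].split()), alpha, beta
        ),
        reverse=True,
    )


# Hypothetical candidates: the second is slightly preferred acoustically,
# but the language model finds it much less likely.
candidates = [
    {"text": "am fost acolo", "acoustic_logp": -4.0, "lm_logp": -6.0},
    {"text": "am fos acolo", "acoustic_logp": -3.8, "lm_logp": -12.0},
]

best_without_lm = rerank(candidates, alpha=0.0, beta=0.0)[0]["text"]
best_with_lm = rerank(candidates, alpha=0.5, beta=0.0)[0]["text"]
```

With `alpha=0.0` the acoustically preferred candidate ranks first; with `alpha=0.5` the language model penalty moves the other candidate to the top. In practice these weights are tuned on held-out data.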

Results

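| Architecture | Decoding | RSC-eval | SSC-eval1 | SSC-eval2 | CDEP-eval | CV-21 | Fleurs-RO | USPDATRO | RTFx |
|---|---|---|---|---|---|---|---|---|---|
| Parakeet Ro 110M TDT (ours) | Greedy | 2.16 | 9.08 | 10.85 | 4.20 | 3.57 | 10.61 | 24.08 | 126.15 |
| Parakeet Ro 110M TDT (ours) | ALSD | 2.05 | 8.64 | 10.88 | 4.17 | 3.38 | 10.16 | 24.30 | 66.63 |
| Parakeet Ro 110M CTC (ours) | Greedy | 2.57 | 10.10 | 12.65 | 4.80 | 4.20 | 11.85 | 27.80 | 130.55 |
| Parakeet Ro 110M CTC (ours) | Beam Token N-gram | 1.73 | 8.12 | 10.75 | 3.92 | 3.29 | 8.85 | 23.40 | 109.46 |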
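
The RTFx column is a throughput measure. It is commonly defined as the inverse real-time factor: seconds of audio processed per second of compute, so higher is faster. The sketch below shows that common definition. It is an assumption about the metric rather than the measurement code used for the table, and `rtfx`, `measure_rtfx`, and `transcribe_fn` are hypothetical names.

```python
import time


def rtfx(total_audio_seconds, processing_seconds):
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return total_audio_seconds / processing_seconds


def measure_rtfx(transcribe_fn, durations):
    """Time one call to transcribe_fn and relate it to the summed audio durations."""
    start = time.perf_counter()
    transcribe_fn()
    elapsed = time.perf_counter() - start
    return rtfx(sum(durations), elapsed)
```

For example, one hour of audio transcribed in 36 seconds gives `rtfx(3600.0, 36.0) == 100.0`. Measured values depend heavily on hardware and batch size.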

Citation

If you use this repository, the pretrained models, or the provided manifests in your work, please cite:

@misc{pirlogeanu2025opensourcestateoftheartsolution,
      title={Open Source State-Of-the-Art Solution for Romanian Speech Recognition}, 
      author={Gabriel Pirlogeanu and Alexandru-Lucian Georgescu and Horia Cucu},
      year={2025},
      eprint={2511.03361},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2511.03361}, 
}

Also consider citing the original NVIDIA NeMo framework and KenLM:

@article{kuchaiev2019nemo,
  title={NeMo: a toolkit for building AI applications using Neural Modules},
  author={Kuchaiev, Oleksii and Ginsburg, Boris and others},
  journal={arXiv preprint arXiv:1909.09577},
  year={2019}
}

@inproceedings{heafield-2011-kenlm,
    title = "{K}en{LM}: Faster and Smaller Language Model Queries",
    author = "Heafield, Kenneth",
    editor = "Callison-Burch, Chris  and
      Koehn, Philipp  and
      Monz, Christof  and
      Zaidan, Omar F.",
    booktitle = "Proceedings of the Sixth Workshop on Statistical Machine Translation",
    month = jul,
    year = "2011",
    address = "Edinburgh, Scotland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W11-2123/",
    pages = "187--197"
}

License

  • Portions of this code are derived from NVIDIA NeMo under the Apache License 2.0.

  • Additional modifications and contributions © 2025 Gabriel Pirlogeanu.

  • Evaluation manifests are released for research use.


Contact

For questions or collaborations: gabriel.pirlogeanu@gmail.com
