This repository contains the codebase and resources developed for the paper:
Open Source State-Of-the-Art Solution for Romanian Speech Recognition
Presented at SpeD 2025 – International Conference on Speech Technology and Human–Computer Dialogue
This project is built on top of the NVIDIA NeMo framework (version 2.3.1) and focuses on developing a Romanian Offline ASR model based on the FastConformer Hybrid TDT-CTC architecture. We further enhance decoding with an external KenLM N-gram language model for improved accuracy.
Our system achieves state-of-the-art (SOTA) performance across 7 different Romanian evaluation datasets.
- 🏆 Model: FastConformer Hybrid TDT-CTC
- 🇷🇴 Language: Romanian
- 📊 Benchmark: 7 public evaluation datasets
- 🪄 Features:
- Offline ASR for Romanian
- N-gram language model trained on Romanian text
- Fully reproducible evaluation setup
This repository provides:
- The model training and inference pipeline (adapted from NeMo)
- Evaluation scripts and annotations used for benchmarking
- Ready-to-use pretrained models hosted on Hugging Face

The `manifests/` folder provides the annotation manifests used to evaluate our ASR model on the 7 datasets. This allows other researchers to benchmark their own systems under identical conditions, ensuring fair comparison in future studies.
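Each manifest follows NeMo's JSON-lines convention: one object per line with `audio_filepath`, `duration`, and `text` fields. As a minimal sketch (the file paths and transcripts below are invented for illustration), reading and writing such a manifest needs only the standard library:

```python
import io
import json

# Hypothetical entries; field names follow the NeMo manifest convention.
entries = [
    {"audio_filepath": "audio/clip_0001.wav", "duration": 3.42,
     "text": "exemplu de transcriere în limba română"},
    {"audio_filepath": "audio/clip_0002.wav", "duration": 5.10,
     "text": "a doua propoziție de test"},
]

def write_manifest(entries, fp):
    """Serialize entries as JSON lines, one object per line."""
    for entry in entries:
        fp.write(json.dumps(entry, ensure_ascii=False) + "\n")

def read_manifest(fp):
    """Parse a JSON-lines manifest back into a list of dicts."""
    return [json.loads(line) for line in fp if line.strip()]

# Round-trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_manifest(entries, buf)
buf.seek(0)
parsed = read_manifest(buf)
print(len(parsed), parsed[0]["audio_filepath"])
```

In practice you would write to a real `*.json` file and pass its path as `dataset_manifest` to the evaluation script.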
- ASR Model: SpeD_ParakeetRo_110M_TDT-CTC
- N-gram Language Model: SpeD-Ro_6gram-tokens-prune0135
These models can be directly integrated with the pipeline in this repository.
This project builds on top of NVIDIA NeMo.
We recommend installing it via Conda + Pip for most use cases.
```bash
# Create and activate a fresh environment
conda create --name nemo python==3.10.12
conda activate nemo

# Install NeMo Toolkit
pip install "nemo_toolkit[all]"
```

If you want a specific NeMo version:

```bash
git clone https://github.com/NVIDIA/NeMo
cd NeMo
git checkout ${REF:-'main'}
pip install '.[all]'
```

This project primarily uses the ASR domain, but you may install other domains if needed.

```bash
pip install "nemo_toolkit[asr]"        # ASR domain (required)
pip install "nemo_toolkit[nlp]"        # Optional
pip install "nemo_toolkit[tts]"        # Optional
pip install "nemo_toolkit[vision]"     # Optional
pip install "nemo_toolkit[multimodal]" # Optional
```

This model was evaluated on 7 Romanian datasets, covering various domains and accents.
The manifests/ folder includes the exact annotation files used for these benchmarks, ensuring reproducible research.
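The benchmark scores reported below are word error rates (WER). As a self-contained illustration of the metric, here is a minimal WER computation using the standard Levenshtein formulation (the Romanian sentence pair is invented; in practice the evaluation script reports this for you):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two word sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance normalized by reference length."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

# One deleted word out of four reference words -> WER of 0.25.
score = wer("salut ce mai faci", "salut ce faci")
print(score)  # → 0.25
```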
E.g., to evaluate your own model with an N-gram model on the SSC-eval1 dataset:

```bash
cd examples/asr
python3 speech_to_text_eval.py \
    dataset_manifest=../../manifests/SSC-eval1_manifest.json \
    model_path=... \
    output_filename=... \
    decoder_type=ctc \
    ctc_decoding.strategy=beam \
    ctc_decoding.beam.kenlm_path=... \
    ctc_decoding.beam.beam_alpha=... \
    ctc_decoding.beam.beam_beta=... \
    ctc_decoding.beam.beam_size=...
```

| Architecture | Decoding | RSC-eval | SSC-eval1 | SSC-eval2 | CDEP-eval | CV-21 | Fleurs-RO | USPDATRO | RTFx |
|---|---|---|---|---|---|---|---|---|---|
| Parakeet Ro 110M TDT (ours) | Greedy | 2.16 | 9.08 | 10.85 | 4.20 | 3.57 | 10.61 | 24.08 | 126.15 |
| | ALSD | 2.05 | 8.64 | 10.88 | 4.17 | 3.38 | 10.16 | 24.30 | 66.63 |
| Parakeet Ro 110M CTC (ours) | Greedy | 2.57 | 10.10 | 12.65 | 4.80 | 4.20 | 11.85 | 27.80 | 130.55 |
| | Beam Token N-gram | 1.73 | 8.12 | 10.75 | 3.92 | 3.29 | 8.85 | 23.40 | 109.46 |
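The `beam_alpha` and `beam_beta` hyperparameters in the evaluation command implement standard shallow fusion: each beam hypothesis is rescored as its acoustic log-probability plus `alpha` times the LM log-probability plus `beta` times a word-count bonus. A toy sketch of that rescoring (the candidate texts and scores are invented; real decoding applies this incrementally over the full beam):

```python
def fused_score(log_p_am, log_p_lm, n_words, alpha, beta):
    """Shallow-fusion score: acoustic + alpha * LM + beta * word count."""
    return log_p_am + alpha * log_p_lm + beta * n_words

# Hypothetical candidates: (text, acoustic log-prob, LM log-prob).
# The LM heavily penalizes the malformed second hypothesis.
candidates = [
    ("bună ziua tuturor", -4.1, -7.2),
    ("buna zi ua tuturor", -3.9, -15.8),
]

alpha, beta = 0.7, 1.0  # example values; tune both on a dev set
best = max(candidates,
           key=lambda c: fused_score(c[1], c[2], len(c[0].split()), alpha, beta))
print(best[0])
```

With these weights the LM overrides the slightly better acoustic score of the malformed hypothesis, which is exactly the effect behind the CTC beam-search gains in the table above.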
If you use this repository, the pretrained models, or the provided manifests in your work, please cite:
@misc{pirlogeanu2025opensourcestateoftheartsolution,
title={Open Source State-Of-the-Art Solution for Romanian Speech Recognition},
author={Gabriel Pirlogeanu and Alexandru-Lucian Georgescu and Horia Cucu},
year={2025},
eprint={2511.03361},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2511.03361},
}
Also consider citing the original NVIDIA NeMo framework and KenLM:
@article{kuchaiev2019nemo,
title={NeMo: a toolkit for building AI applications using Neural Modules},
author={Kuchaiev, Oleksii and Ginsburg, Boris and others},
journal={arXiv preprint arXiv:1909.09577},
year={2019}
}
@inproceedings{heafield-2011-kenlm,
title = "{K}en{LM}: Faster and Smaller Language Model Queries",
author = "Heafield, Kenneth",
editor = "Callison-Burch, Chris and
Koehn, Philipp and
Monz, Christof and
Zaidan, Omar F.",
booktitle = "Proceedings of the Sixth Workshop on Statistical Machine Translation",
month = jul,
year = "2011",
address = "Edinburgh, Scotland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W11-2123/",
pages = "187--197"
}
- Portions of this code are derived from NVIDIA NeMo under the Apache License 2.0.
- Additional modifications and contributions © 2025 Gabriel Pirlogeanu.
- Evaluation manifests are released for research use.
For questions or collaborations: gabriel.pirlogeanu@gmail.com