Purpose : Evaluate NeMo ASR models on early grade reading assessments (EGRA) for Kiswahili child speech. The pipeline
- transcribes audio with a supplied NeMo model and
- computes EGRA KPIs (annotator-based and ASR-based) plus ASR quality metrics.
Quick summary — what you must provide
- A NeMo ASR model file (place under `nemo_inference/models/`).
- One dataset folder under `input_output_data/input/<dataset>/` containing:
  - `0_Audio/` (learner subfolders with WAVs)
  - `2_TextGrid/` (annotator TextGrid files)
  - `Student_Full_Canonical_EGRA_*.csv` and `Student_MetaData_EGRA_*.csv`
- The oral passages CSV: `input_output_data/input/oral_passages.csv` (used for passage tasks).
What the pipeline produces
- `input_output_data/output/experiments/<experiment>/egra_eval_detailed.csv` — per-item metrics and counts.
- `egra_eval_summary.txt` — high-level 6-line snapshot (EGRA & ASR averages).
- Per-pair summary folders: `can_ref/`, `can_hyp/`, `ref_hyp/` with per-speaker and per-category CSVs.
Prerequisites
- Docker (recommended). Optional: NVIDIA Container Toolkit for GPU runs.
Recommended quick steps (full details below)
- Build the container image (CPU): `docker compose build`
  (GPU: `docker compose build --build-arg TORCH_CUDA=cu121` and run with `--gpus all`.)
- Prepare files:
  - Place the model at `nemo_inference/models/<model>.nemo`.
  - Copy a single dataset folder into `input_output_data/input/<dataset>/` and put `oral_passages.csv` at `input_output_data/input/oral_passages.csv`.
- Run inference (creates `transcriptions.jsonl`):

  ```
  ./run_inference.sh \
    --dataset_root input_output_data/input/<dataset> \
    --output_dir input_output_data/output/<dataset>/nemo_asr_output \
    --model nemo_inference/models/<model>.nemo
  ```

- Run evaluation (attach the manifest from inference):

  ```
  ./run_eval.sh \
    --dataset_root input_output_data/input/<dataset> \
    --output_root input_output_data/output/experiments/<experiment> \
    --passages_csv input_output_data/input/oral_passages.csv \
    --nemo_manifest input_output_data/output/<dataset>/nemo_asr_output/transcriptions.jsonl
  ```
Notes and important behaviour
- TextGrid matching: the evaluator searches `2_TextGrid/` recursively and picks the first `.TextGrid` whose stem matches the audio file.
- Normalization: all texts are normalized (lowercase, punctuation removed, Unicode NFC) before scoring.
- Metrics: WER and ACC are reported as percentages (0–100). Count fields (S/D/I/C/N) are integers.
- If `N == 0` for an item, ratio metrics will be `NaN` in the outputs.
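The stem-matching rule above can be sketched as follows. This is a simplified illustration, not the project's implementation (the real lookup lives in `egra_eval/data/textgrid_io.py`); the function name `find_textgrid` is hypothetical.

```python
from pathlib import Path

def find_textgrid(textgrid_root, audio_file):
    """Return the first .TextGrid under textgrid_root whose stem matches the audio stem."""
    stem = Path(audio_file).stem
    # rglob walks every annotator subfolder; sorted() makes "first match" deterministic
    for tg in sorted(Path(textgrid_root).rglob("*.TextGrid")):
        if tg.stem == stem:
            return tg
    return None
```

Because the search keys on the filename stem alone, an audio file and its TextGrid must share a name even when they live in different annotator subfolders.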
More details and extended configuration are provided below.
Context
This repository evaluates an ASR (Automatic Speech Recognition) system for Swahili children completing EGRA-style speech tests.
Each child performs 42 tests, grouped into 7 categories (T1–T7). The mapping from test → category is defined in the project README.
For each test item, we have three string forms:
Canonical — the target / intended word shown to the child.
Reference (REF) — what the child actually said, human-annotated.
Hypothesis (HYP) — what the ASR system predicted the child said.
The current evaluation pipeline already produces an egra_eval_summary.txt file with several metrics.
Add the additional metrics (described in the README under Task for New Metrics) and ensure they appear in the generated egra_eval_summary.txt, aggregated per test category (T1–T7) and overall.
New Metrics to Add
The README defines multiple phonological metrics that must now be computed using the canonical, reference, and hypothesis forms.
A typical example — substitution precision:
- True substitutions: differences between reference and canonical.
- Predicted substitutions: differences between hypothesis and canonical.
- Metric: how well HYP predicts the same substitutions that REF made.

Each metric follows this pattern:
- Compare REF vs CANONICAL → the child's true phonological process.
- Compare HYP vs CANONICAL → the system's predicted phonological process.
- Compute true positives, false positives, and false negatives.
- Derive precision, recall, F1-score, and counts as needed (TP, FP, FN).
These metrics must be computed inside each test category and optionally aggregated across all tests.
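A minimal sketch of the substitution precision/recall/F1 pattern above. It assumes a simple per-position comparison of equal-length token sequences; the README's exact alignment rules may differ, and both function names are hypothetical.

```python
def substitutions(canonical, produced):
    """Token triples (position, canonical, produced) that differ under a per-position alignment."""
    return {(i, c, p) for i, (c, p) in enumerate(zip(canonical, produced)) if c != p}

def substitution_prf(canonical, reference, hypothesis):
    """Score how well HYP's substitutions (vs CAN) match REF's substitutions (vs CAN)."""
    true_subs = substitutions(canonical, reference)   # REF vs CAN: the child's real substitutions
    pred_subs = substitutions(canonical, hypothesis)  # HYP vs CAN: the system's predicted ones
    tp = len(true_subs & pred_subs)
    fp = len(pred_subs - true_subs)
    fn = len(true_subs - pred_subs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else float("nan")
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall, "F1": f1}
```

For example, if the child substituted `b`→`p` at one position and the ASR predicted that substitution plus a spurious one, precision is 0.5 and recall 1.0.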
- Build the Docker image (CPU by default): `docker compose build`
- Prepare the dataset and model:
  - Copy the dataset (including `0_Audio/`, `2_TextGrid/`, `Student_Full_Canonical_EGRA_*.csv`, `Student_MetaData_EGRA_*.csv`) into `input_output_data/input/<dataset_name>/`. Use the oral passages file from this link and place it in `input_output_data/input/oral_passages.csv`.
  - Download your NeMo ASR model (the default scripts expect `Swahili_exp1_100epochs.nemo`) and place it in `nemo_inference/models/`.
- Run inference:

  ```
  ./run_inference.sh \
    --dataset_root input_output_data/input/<dataset_name> \
    --output_dir input_output_data/output/<dataset_name>/nemo_asr_output \
    --model nemo_inference/models/<model>.nemo
  ```

  Example:

  ```
  ./run_inference.sh \
    --dataset_root input_output_data/input/1_Batch2_Data_16spk_subset \
    --output_dir input_output_data/output/1_Batch2_Data_16spk_subset/nemo_asr_output \
    --model nemo_inference/models/Swahili_exp1_100epochs.nemo
  ```
- Run evaluation:

  ```
  ./run_eval.sh \
    --dataset_root input_output_data/input/<dataset_name> \
    --passages_csv input_output_data/input/oral_passages.csv \
    --nemo_manifest input_output_data/output/<dataset_name>/nemo_asr_output/transcriptions.jsonl
  ```

  Example:

  ```
  ./run_eval.sh \
    --dataset_root input_output_data/input/1_Batch2_Data_16spk_subset \
    --passages_csv input_output_data/input/oral_passages.csv \
    --nemo_manifest input_output_data/output/1_Batch2_Data_16spk_subset/nemo_asr_output/transcriptions.jsonl
  ```
- Inspect the outputs under `input_output_data/output/experiments/<experiment>/`:
  - `egra_eval_detailed.csv` (detailed evaluation, all metrics for each audio file)
  - `egra_eval_summary.txt` (6-line global metrics summary)
  - Summary folders: `can_ref/`, `can_hyp/`, `ref_hyp/`
- Explore results interactively:
  - Dependencies: `pip install streamlit pandas numpy` (preferably inside a virtualenv). For example:

    ```
    python3 -m venv .venv_streamlit && . .venv_streamlit/bin/activate && pip install --upgrade pip setuptools wheel && pip install streamlit pandas numpy
    ```

  - Run: `streamlit run egra_dashboard.py -- --csv <path/to/egra_eval_detailed.csv>`. For example:

    ```
    . .venv_streamlit/bin/activate && streamlit run egra_dashboard.py -- --csv input_output_data/output/experiments/exp1/egra_eval_detailed.csv
    ```

  - Open the browser tab (Streamlit serves on http://localhost:8501 by default) to sort, group, and aggregate metrics.
Everything runs inside Docker (CPU-only or GPU-enabled).
- Straightforward steps
- Contents
- Project structure
- What the pipeline does
- Input data format
- How to run (Docker)
- Outputs & how to interpret them
- Metrics & definitions
- Configuration knobs
- Troubleshooting
- Source files
.
├── docker/
│ └── Dockerfile # Base image with PyTorch, NeMo, audio libs, pandas, jiwer, praatio, librosa, etc.
├── docker-compose.yml # Compose with two services: nemo-asr (inference), egra-eval (evaluation)
├── evaluation.py # Main entrypoint for evaluation & summaries
├── infer.py # Main entrypoint for NeMo-based transcription
├── run_eval.sh # Wrapper script for evaluation (dataset_root + output_root mandatory)
├── run_inference.sh # Wrapper script for inference (dataset_root + output_root + model mandatory)
├── egra_eval/
│ ├── data/
│ │ ├── linking.py # Build keys, attach HYPs to EGRA rows
│ │ ├── nemo_manifest.py # Load NeMo manifests (JSONL)
│ │ ├── passage_merge.py # Fill missing canonical passages from CSV
│ │ └── textgrid_io.py # Read REF text from TextGrid tiers (recursive search, filler-tag filtering)
│ ├── eval/
│ │ └── run_eval.py # Core scoring module (CAN/REF/HYP)
│ ├── metrics/
│ │ └── scoring.py # Normalization + WER counts + ACC, P/R/F1
│ ├── normalize/
│ │ └── textnorm.py # Simple text normalization (lowercase, remove punctuation, collapse spaces)
│ └── report/
│ └── summarize.py # Summaries: macro, per-learner, overall, etc.
├── tools/ # Helper scripts (NeMo manifest prep, comparisons, etc.)
├── input_output_data/
│ ├── input/ # place each dataset folder for every experiment here
│ └── output/ # experiment results (one subfolder per run)
└── nemo_inference/
├── models/ # NeMo .nemo models (mounted read-only in container)
└── tmp/ # Temporary 16kHz segments dumped during inference (passage slicing). Use "debug" argument for inference (infer.py) in order to keep them for inspection.
Inference (infer.py)
- Recursively scans the dataset root (either `--dataset_root` or `--root_audio_dir`) for `.wav` files.
- Resamples audio to 16 kHz as needed and, if a matching TextGrid exists (default tier `child`), slices the audio according to the intervals before transcription.
- Emits a NeMo-style JSONL manifest containing `audio_filepath`, `duration`, and `pred_text`.
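A NeMo-style manifest is one JSON object per line, so reading it back takes only a few lines. The field names below are the ones the manifest carries; the example path and value are illustrative only.

```python
import json

def load_manifest(path):
    """Yield one dict per non-empty line of a NeMo-style JSONL manifest."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Each entry looks roughly like (path and text are made up):
# {"audio_filepath": "/io/input/.../learner1_item3.wav", "duration": 1.92, "pred_text": "baba"}
```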
Evaluation (evaluation.py)
- Discovers the student CSVs, audio, and TextGrid folders from `--dataset_root` (or explicit `--egra_csv`, `--meta_csv`, etc.).
- Recursively searches `2_TextGrid/` for `.TextGrid` files, selects the first match for each audio stem, and strips filler tags such as `<unk>`, `<noise>`, `<um>`, etc. from the REF transcript.
- Automatically normalizes canonical letter prompts so consonants receive a trailing `a` (e.g., `g -> ga`) prior to scoring.
- Attaches ASR hypotheses from the provided manifest(s) and computes metrics for:
  - CAN vs REF (annotator-based EGRA).
  - CAN vs HYP (ASR-based EGRA).
  - REF vs HYP (ASR quality vs human).
- Produces a detailed CSV, a 6-line text summary, and per-alignment summary folders (`can_ref/`, `can_hyp/`, `ref_hyp/`).
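The letter-prompt rule can be sketched as below. This is only an illustration of the stated behaviour, assuming the five Swahili vowels a/e/i/o/u; the actual rule (and any exceptions) lives in `evaluation.py`, and the function name is hypothetical.

```python
SWAHILI_VOWELS = set("aeiou")  # assumption: these are the prompts left unchanged

def normalize_letter_prompt(prompt: str) -> str:
    """Give single-consonant prompts a trailing 'a' (g -> ga); vowels and longer prompts pass through."""
    prompt = prompt.strip().lower()
    if len(prompt) == 1 and prompt.isalpha() and prompt not in SWAHILI_VOWELS:
        return prompt + "a"
    return prompt
```

This keeps scoring consistent, since children read an isolated consonant aloud as its syllable ("ga"), not its letter name.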
input_output_data/input/ is intentionally empty. For each experiment copy exactly one dataset folder here. The scripts run inside the container, so use the mounted path prefix /io/input/<dataset> when supplying --dataset_root. A typical layout looks like this:
input_output_data/input/1_Batch2_Data-v2/
└── 1_Batch2_Data
├── 0_IAR
│ ├── 0_Audio/ # learner_id subfolders containing WAV files
│ └── 2_TextGrid/ # annotator folders
├── Student_Full_Canonical_EGRA_*.csv
└── Student_MetaData_EGRA_*.csv
- Only `0_Audio/`, `2_TextGrid/`, the two `Student_*` CSVs, and the passages CSV are consumed; other folders (for example `1_Annotation`) are ignored.
- Evaluation walks every subdirectory under `2_TextGrid/` and chooses the first `.TextGrid` whose stem matches the audio; no annotator flag is required. Inference still accepts `--dataset_annotator` if you want to limit slicing to a specific folder.
- If the dataset root already contains `nemo_asr_output/transcriptions.jsonl`, `evaluation.py` will attach it automatically unless you override with `--nemo_manifest`.
You need Docker and (optionally) NVIDIA Container Toolkit for GPU.
CPU-only (default):

```
docker compose build
```

GPU-enabled build (CUDA 12.1 wheels):

```
docker compose build --build-arg TORCH_CUDA=cu121
```

At runtime, enable GPU by uncommenting `gpus: "all"` in `docker-compose.yml` (service `nemo-asr`) or pass `--gpus all` to `docker compose run`.
We provide run_inference.sh. It will:
- Discover audio/TextGrid folders from `--dataset_root` (or use the explicit paths you supply).
- Resample/slice audio as needed and run the NeMo model (`--model`).
- Write a manifest (`transcriptions.jsonl`) under the chosen output folder. The script runs `docker compose run` with your current `uid:gid`, so all generated files inside `input_output_data` are owned by the host user.

Usage:

```
./run_inference.sh \
  --dataset_root /io/input/<dataset> \
  --output_dir /io/output/<dataset>/nemo_asr_output \
  --model /models/<model>.nemo
```

To use GPU at run time, add `--gpus all` after `docker compose run` or enable `gpus: "all"` in the compose file.
We provide run_eval.sh. It will:
- Locate the canonical/meta CSVs plus the audio/TextGrid folders (via `--dataset_root` or explicit paths).
- Attach ASR hypotheses from the given manifest(s).
- Read reference transcripts by searching all TextGrid folders and matching on audio stem.
- Produce the detailed CSV, the text summary, and per-pair summary folders in the chosen output directory. Like the inference wrapper, it executes the container with your user ID so the resulting CSVs and summaries remain writable without sudo.

Usage:

```
./run_eval.sh \
  --dataset_root /io/input/<dataset> \
  --output_root /io/output/<experiment> \
  --passages_csv /io/input/<dataset>/oral_passages.csv \
  --nemo_manifest /io/output/<dataset>/nemo_asr_output/transcriptions.jsonl
```

run_nemo_offline_eval.sh normalizes the dataset into NeMo manifests and invokes NVIDIA's speech_to_text_eval.py for both REF↔HYP and CAN↔HYP scoring.

Usage example:

```
./run_nemo_offline_eval.sh \
  --dataset_root /io/input/1_Batch2_Data \
  --dataset_annotator Flora \
  --output_dir /io/output/1_Batch2_Data/nemo_asr_output \
  --nemo_hyp_manifest /io/output/1_Batch2_Data-v2/nemo_asr_output/transcriptions.jsonl
```

This writes ref_manifest_norm.jsonl and can_manifest_norm.jsonl alongside the supplied output directory and enriches them with per-sample NeMo WER scores. Pass either --output_dir or --dataset_root; without one of these the script aborts.
All evaluation outputs land in the experiment folder you pass as <output_root>. Each experiment
folder contains:
- `egra_eval_detailed.csv` — one row per EGRA item with:
  - Keys: `learner_id`, `audio_type`, `audio_file`.
  - Texts: `CAN` (canonical), `REF` (annotator), `HYP` (ASR).
  - CAN vs REF metrics: `WER_can_ref`, `ACC_can_ref (EGRA_ACC)` plus counts `S_can_ref`, `D_can_ref`, `I_can_ref`, `C_can_ref (EGRA_COR)`, `N_can_ref`.
  - CAN vs HYP metrics: `WER_can_hyp`, `ACC_can_hyp (ASR_EGRA_ACC)` plus counts `S_can_hyp`, `D_can_hyp`, `I_can_hyp`, `C_can_hyp (ASR_EGRA_COR)`, `N_can_hyp`.
  - REF vs HYP metrics: `WER_ref_hyp`, `ACC_ref_hyp` plus counts `S_ref_hyp`, `D_ref_hyp`, `I_ref_hyp`, `C_ref_hyp`, `N_ref_hyp`.
  - Per-row aggregates: `EGRA-COR`, `EGRA-ACC`, `ASR-EGRA-COR`, `ASR-EGRA-ACC`, `MAE_EGRA_COR`, `ASR_WER`.
  - Column names that include aliases (e.g., `ACC_can_ref (EGRA_ACC)`) expose both the base metric and the specific EGRA naming.
  - WER and ACC values are percentages (0–100); the raw counts are absolute integers.
  - Agreement: `MAE_EGRA_COR = |EGRA_COR − ASR_EGRA_COR|`, the absolute difference in the number of correct tokens between annotator-based and ASR-based evaluations.
  - All learner metadata merged in (e.g., `gender`, `age`).

- `egra_eval_summary.txt` — six-line global snapshot with the metrics `EGRA-COR`, `EGRA-ACC`, `ASR-EGRA-COR`, `ASR-EGRA-ACC`, `MAE_EGRA_COR`, and `ASR_WER` (averages where applicable), rounded to two decimals.

- Pair-specific summary folders — within the same experiment directory you will find three subfolders:

  | Folder | Alignment pair | Files inside |
  |---|---|---|
  | `can_ref/` | Canonical vs Reference (annotator EGRA) | `egra_eval_summary_per_speaker_global.csv`, `egra_eval_summary_per_speaker_macro.csv`, `egra_eval_summary_per_speaker_subcat.csv` |
  | `can_hyp/` | Canonical vs ASR hypothesis (automated EGRA) | same filenames as above |
  | `ref_hyp/` | Reference vs ASR hypothesis (ASR quality) | same filenames as above |

  Each summary file reports micro-averages derived from the raw counts:
  - `*_per_speaker_global.csv` — one row per `learner_id` plus a leading `__GLOBAL__` row aggregating every sample.
  - `*_per_speaker_macro.csv` — per learner × macro category (letters / syllables / nonwords / passage).
  - `*_per_speaker_subcat.csv` — per learner × macro category × subcategory (e.g., `letters+isolated`).

  The columns mirror the metric block in the detailed CSV (WER, ACC, counts). Use them to compare annotator vs ASR EGRA scores or inspect performance by task type.
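Micro-averaging here means summing the raw counts per group before forming ratios, rather than averaging per-item percentages. A minimal sketch over rows shaped like the detailed CSV (the real aggregation is in `egra_eval/report/summarize.py`; `micro_average` is a hypothetical name):

```python
def micro_average(rows, prefix):
    """Sum the raw S/D/I/C/N counts for one alignment pair, then derive WER/ACC as percentages."""
    totals = {k: sum(r[f"{k}_{prefix}"] for r in rows) for k in ("S", "D", "I", "C", "N")}
    n = totals["N"]
    totals["WER"] = 100.0 * (totals["S"] + totals["D"] + totals["I"]) / n if n else float("nan")
    totals["ACC"] = 100.0 * totals["C"] / n if n else float("nan")
    return totals
```

Summing counts first means long items weigh more than short ones, which is usually what you want for speaker-level EGRA scores.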
Use these artifacts to track:
- Human annotator performance (`can_ref`).
- Automated EGRA performance (`can_hyp`).
- ASR quality with respect to the human reference (`ref_hyp`).
- Agreement between automated and human EGRA via `MAE_EGRA_COR` (closer to 0 is better).
All metrics are computed after text normalization (normalize/textnorm.py): NFC Unicode, lowercase, punctuation removed, whitespace collapsed.
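The normalization steps can be sketched as below. This is an illustration of the listed steps only, not the code in `normalize/textnorm.py` (which may differ in details such as which characters count as punctuation).

```python
import re
import unicodedata

def normalize_text(text: str) -> str:
    """NFC-normalize, lowercase, strip punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)        # drop punctuation, keep letters/digits
    return re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
```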
We compute standard ASR alignment counts via jiwer:
- S — substitutions
- D — deletions
- I — insertions
- C — correct matches
- N — total number of reference tokens (ground truth)
From those we derive:
- WER = (S + D + I) / N → reported in the CSVs as a percentage (value × 100).
- ACC = C / N → also reported as a percentage in the detailed and summary files.
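The pipeline computes these counts with jiwer; for illustration, the same quantities can be derived from a stdlib token alignment. This is a sketch only: `difflib.SequenceMatcher` is not guaranteed to find a minimum-edit alignment, so its counts can occasionally differ from jiwer's Levenshtein-based ones.

```python
from difflib import SequenceMatcher

def alignment_counts(ref_tokens, hyp_tokens):
    """S/D/I/C/N counts plus WER/ACC (as percentages) from a token-level alignment."""
    s = d = i = c = 0
    for op, r1, r2, h1, h2 in SequenceMatcher(None, ref_tokens, hyp_tokens, autojunk=False).get_opcodes():
        if op == "equal":
            c += r2 - r1
        elif op == "replace":
            # an unequal-length replace block mixes substitutions with deletions/insertions
            s += min(r2 - r1, h2 - h1)
            d += max(0, (r2 - r1) - (h2 - h1))
            i += max(0, (h2 - h1) - (r2 - r1))
        elif op == "delete":
            d += r2 - r1
        elif op == "insert":
            i += h2 - h1
    n = len(ref_tokens)
    return {"S": s, "D": d, "I": i, "C": c, "N": n,
            "WER": 100.0 * (s + d + i) / n if n else float("nan"),
            "ACC": 100.0 * c / n if n else float("nan")}
```

Note that insertions make WER exceed 100% when `S + D + I > N`, which is why the outputs report WER on an open-ended percentage scale.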
We apply the same counts to derive EGRA-style KPIs:

- EGRA (annotator-based), from CAN vs REF:
  - `C_can_ref (EGRA_COR) = N_ref − S_can_ref − D_can_ref` (number of correct tokens)
  - `ACC_can_ref (EGRA_ACC) = EGRA_COR / N_ref`
- ASR-based EGRA, from CAN vs HYP:
  - `C_can_hyp (ASR_EGRA_COR) = N_can − S_can_hyp − D_can_hyp`
  - `ACC_can_hyp (ASR_EGRA_ACC) = ASR_EGRA_COR / N_can`
- Agreement between annotator- and ASR-based correctness:
  - `MAE_EGRA_COR = |EGRA_COR − ASR_EGRA_COR|`
- ASR quality snapshot:
  - `ASR_WER = WER_ref_hyp` (the same computation, exposed for convenience in the detailed CSV and summary text).
- ASR quality vs human, from REF vs HYP:
  - `WER_ref_hyp`, `ACC_ref_hyp` and the count fields `S_ref_hyp`, `D_ref_hyp`, `I_ref_hyp`, `C_ref_hyp`, `N_ref_hyp`.

WER and ACC values are emitted as percentages (0.0–100.0). Count-based columns (S/D/I/C/N) remain raw integers.
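The KPI formulas above reduce to a few lines once the counts exist. A sketch using the detailed-CSV column names (`egra_kpis` is a hypothetical helper, not a function in the codebase):

```python
def egra_kpis(row):
    """Derive the EGRA KPI block from one detailed-CSV row's alignment counts."""
    egra_cor = row["N_can_ref"] - row["S_can_ref"] - row["D_can_ref"]      # C_can_ref
    asr_cor = row["N_can_hyp"] - row["S_can_hyp"] - row["D_can_hyp"]       # C_can_hyp
    return {
        "EGRA_COR": egra_cor,
        "EGRA_ACC": 100.0 * egra_cor / row["N_can_ref"] if row["N_can_ref"] else float("nan"),
        "ASR_EGRA_COR": asr_cor,
        "ASR_EGRA_ACC": 100.0 * asr_cor / row["N_can_hyp"] if row["N_can_hyp"] else float("nan"),
        "MAE_EGRA_COR": abs(egra_cor - asr_cor),
    }
```

Insertions deliberately do not reduce the correctness counts: a child who says extra words still gets credit for every canonical token read correctly.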
| Metric | Description | Typical Range / Unit | Interpretation |
|---|---|---|---|
| WER_can_ref, WER_can_hyp, WER_ref_hyp | Word Error Rate (substitutions + deletions + insertions) / N | 0.0–100.0 (%); can exceed 100 with many insertions | Lower is better |
| ACC_can_ref (EGRA_ACC), ACC_can_hyp (ASR_EGRA_ACC), ACC_ref_hyp | Accuracy = C / N | 0.0–100.0 (%) | Higher is better |
| C_can_ref (EGRA_COR), C_can_hyp (ASR_EGRA_COR) | Correctness count = N − S − D | Integer ≥ 0 | Count of correct tokens |
| S_*, D_*, I_*, C_*, N_* | Alignment counts (Substitutions, Deletions, Insertions, Correct, Total) | Integers ≥ 0 | Raw counts |
| MAE_EGRA_COR | Absolute difference between EGRA_COR and ASR_EGRA_COR per row | Integer ≥ 0 | Lower indicates better agreement |
| ASR_WER | Word error rate from REF vs HYP (duplicate of WER_ref_hyp) | 0.0–100.0 (%) | Lower is better |
Note:
If the canonical or reference text has N = 0, ratio-based metrics (WER, ACC) are undefined and will appear as NaN in the output CSVs.
- Model path: `--model /models/your_model.nemo`
- Dataset root: `--dataset_root /io/input/<dataset>` — required.
- Annotator: `--dataset_annotator <annotatorName>` to pick a specific annotator (defaults to the first alphabetically).
- Output root: `--output_root /io/output/<dataset>/nemo_asr_output` — where `transcriptions.jsonl` is written.
- TextGrid tier: `--tier_name child`; change this if your intervals live on a different tier.
- CPU workers: `--cpu_workers N` sets the number of CPU threads used when no GPU is available.
- Temp segments: `--tmp_dir /work/nemo_inference/tmp` lets you keep the 16 kHz segments around for debugging.

`run_inference.sh` requires the named options `--dataset_root`, `--output_dir`, and `--model`; add any extra flags after those.
No implicit defaults are applied to dataset/output paths — provide them explicitly.
Run `python3 evaluation.py --help` to see available options. Highlights:
- `--dataset_root /io/input/<dataset>` — required; automatically discovers the `Student_*` CSVs plus `0_Audio/` and `2_TextGrid/`.
- `--passages_csv /io/input/<dataset>/oral_passages.csv` — required; supplies the passage text mapping for passage tasks.
- `--output_root /io/output/<experiment>` — required; directory where results are written.
- `--nemo_manifest /path/to/transcriptions.jsonl` — attach one or more ASR manifests.
- `--summary_can_ref_dir`, `--summary_can_hyp_dir`, `--summary_ref_hyp_dir` — optional overrides for the summary output destinations.
- No GPU used: ensure the image was built with `--build-arg TORCH_CUDA=cu121` and you run with `--gpus all` or `gpus: "all"` in compose.
- Empty or short `pred_text`: check that the model matches the language/domain. Also verify sample rate conversion (the script resamples to 16 kHz automatically).
- Missing REF text: ensure a `.TextGrid` with the same stem as the audio exists somewhere under `2_TextGrid/`; the evaluator searches recursively but still needs matching filenames.
- Passage text missing: double-check that `--passages_csv` points to the oral passages file bundled with the dataset.
- Passage segmentation not applied: make sure the TextGrid files contain the `child` tier and that audio/TextGrid names align; if needed, point `--tier_name` to the tier that carries the spoken intervals.
- Manifests don't match: joins default to the file stem; switch `--match_on` to `name` or `path` (or rename files consistently) if the stems differ.
- Permissions: the repo root and `input_output_data` are mounted read-write. Models are mounted read-only from `nemo_inference/models`.
- `infer.py` — automatically discovers `0_Audio/` and `2_TextGrid/` under `--dataset_root`, resamples to 16 kHz, slices by TextGrid intervals when present, and writes `transcriptions.jsonl` to the output folder.
- `evaluation.py` — orchestrates the evaluation pipeline: loads the `Student_*` CSVs, attaches the ASR manifest, adds canonical passage and letter adjustments, searches `2_TextGrid/` recursively for matching `.TextGrid` files, computes metrics, and writes the detailed CSV, text summary, and per-pair summaries.
- `egra_eval/metrics/scoring.py` — wraps `jiwer` to produce counts (S, D, I, C, N), WER, and ACC (all expressed as percentages in downstream outputs). Uses `normalize/textnorm.py` for simple text normalization.
- `egra_eval/data/textgrid_io.py` — finds the requested tier case-insensitively (default `child`), gathers labeled intervals, strips filler tags (`<unk>`, `<noise>`, etc.), and concatenates labels to form REF per item while searching recursively across annotator folders.
- `egra_eval/data/linking.py` — builds join keys from the EGRA CSV (`audio_name`, `audio_stem`) and attaches ASR HYPs by the chosen key (`stem` by default).
- `egra_eval/data/nemo_manifest.py` — loads one or many NeMo manifests (JSONL), extracting `audio_path`, `audio_name`, `audio_stem`, and `hyp_text`.
- `egra_eval/data/dataset_layout.py` — utility helpers that discover dataset packages containing `0_Audio/`, `2_TextGrid/`, and the `Student_*` CSVs.
- `egra_eval/data/passage_merge.py` — parses the passages CSV (various encodings handled), extracts `passage_num`, and fills missing `canonical_text` for `passage_numX` rows.
- `egra_eval/report/summarize.py` — builds micro-averaged summaries for each alignment pair:
  - `summary_for_pair(df, prefix, by=None)` — aggregates metrics for one of `can_ref`, `can_hyp`, or `ref_hyp` (optionally grouped by columns).
  - `summary_per_speaker(df, prefix)` — per learner.
  - `summary_per_speaker_macro(df, prefix)` — per learner × macro category.
  - `summary_per_speaker_subcategory(df, prefix)` — per learner × macro category × subcategory.
- `docker/Dockerfile` — Debian 12 base with PyTorch (CPU or CUDA), NeMo ASR 2.4.1, and all Python dependencies pinned for reproducibility.
- `docker-compose.yml` — two services:
  - `nemo-asr`: runs inference (`infer.py`).
  - `egra-eval`: runs evaluation (`evaluation.py`).
  Mounts the repo as `/work`, data as `/io`, models as `/models`, and temp segments as `/tmp_segments`.
- `run_inference.sh` / `run_eval.sh` — thin wrappers that run the right compose service with the right command (evaluation now requires `--passages_csv`).
- `run_nemo_offline_eval.sh` — generates normalized REF/CAN manifests and runs NVIDIA NeMo's own `speech_to_text_eval.py` script for REF↔HYP and CAN↔HYP scoring. Handy for cross-checking the internal metrics against the official NeMo implementation.