Tenepal

Phoneme-based language identification for Nahuatl-first evaluation in film audio.

Tenepal identifies what language is being spoken in audio, with the current public-facing evaluation focused on Nahuatl vs. Spanish. It works by analyzing the raw phoneme stream via universal phoneme recognition and matching against phonotactic profiles, with prosodic fusion as a second evidence channel. Maya support remains exploratory and is not yet presented here as a release-ready benchmark.
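As a rough illustration of matching a phoneme stream against phonotactic profiles, the sketch below scores a segment by its average phoneme-bigram log-probability under each language. It is not Tenepal's actual scoring code: the function names `bigram_profile` and `score`, the add-k smoothing, and the toy training sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_profile(phoneme_sequences, smoothing=1.0):
    """Build an add-k smoothed phoneme-bigram log-probability function."""
    counts = Counter()
    inventory = set()
    for seq in phoneme_sequences:
        inventory.update(seq)
        counts.update(zip(seq, seq[1:]))
    total = sum(counts.values())
    n_types = max(len(inventory), 1) ** 2  # possible bigrams over the inventory

    def logp(bigram):
        # Unseen bigrams fall back to the smoothing mass only.
        return math.log((counts[bigram] + smoothing) / (total + smoothing * n_types))

    return logp


def score(phonemes, logp):
    """Average bigram log-probability of a phoneme sequence under a profile."""
    bigrams = list(zip(phonemes, phonemes[1:]))
    if not bigrams:
        return float("-inf")
    return sum(logp(b) for b in bigrams) / len(bigrams)


# Toy training data, invented for illustration only.
nah = bigram_profile([["t", "ɬ", "a", "k", "a", "t", "ɬ"],
                      ["k", "ʷ", "a", "w", "i", "t", "ɬ"]])
spa = bigram_profile([["p", "e", "r", "o"], ["k", "a", "s", "a"], ["t", "r", "e", "s"]])

segment = ["a", "t", "ɬ"]
profiles = {"NAH": nah, "SPA": spa}
best = max(profiles, key=lambda lang: score(segment, profiles[lang]))
print(best)
```

With these toy profiles the segment ending in the lateral affricate scores higher under the Nahuatl profile, because that bigram never occurs in the Spanish training sequences.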

Tenepal is associated with Nahuatl senses around "the tongue," eloquence, or facility with words. The project name follows the Malintzin Tenepal discussion summarized on the Wikipedia page for La Malinche and the scholarship cited there, especially Frances Karttunen and James Lockhart. The historical tenepal mediated between languages and cultures in conquest-era Mesoamerica.

Annotator

Segment-level review of Nahuatl/Spanish predictions on La Otra Conquista. The annotator tool is in tools/annotator/.

Key Results

| Metric | Value | Description |
|---|---|---|
| Hernán subset accuracy | 85.7% | Best configuration on 551 annotated NAH+SPA segments from `Hernán-1-3` |
| Phoneme-only baseline | 65.7% | Same 551-segment subset, without Whisper or speaker priors |
| Cross-film LOC accuracy | 84.4% raw / 81.7% balanced | 244 annotated NAH+SPA segments from La Otra Conquista (minutes 14-44) |
| Nahuatl ASR CER | 108% → 70% | Whisper-large-v3 baseline vs. LoRA finetune on OpenSLR-92 test sample |
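The gap between the raw and balanced figures is what one would expect when one language dominates the annotated segments. The sketch below assumes "balanced" means the mean of per-language recall, which this README does not define; the labels are invented and are not the benchmark data.

```python
from collections import defaultdict


def raw_accuracy(gold, pred):
    """Fraction of segments whose predicted label matches the annotation."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def balanced_accuracy(gold, pred):
    """Mean of per-class recall, so each language counts equally."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += g == p
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Invented labels: 8 Spanish segments, 2 Nahuatl segments, one Nahuatl miss.
gold = ["SPA"] * 8 + ["NAH"] * 2
pred = ["SPA"] * 8 + ["SPA", "NAH"]

print(raw_accuracy(gold, pred))       # 0.9
print(balanced_accuracy(gold, pred))  # 0.75
```

A single error on the minority language costs one tenth of raw accuracy but a quarter of balanced accuracy in this toy case.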

See PAPER.md for the full technical write-up, docs/AMITH_CORPORA.md for corpus access instructions, and EVOLUTION.md for the research journal.

Metric provenance:

85.7%, 65.7%, and the 551-segment Hernán subset are the current audited benchmark numbers documented in EVOLUTION.md and exported via benchmarks/annotations/ plus benchmarks/reports/eq_comparison_gt.json.
84.4% raw / 81.7% balanced comes from the annotated La Otra Conquista subset exported under benchmarks/annotations/ and discussed in PAPER.md.
108% → 70% CER is the current public draft finetuning result from PAPER.md, based on OpenSLR-92 test sampling.
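A CER above 100% can look like a mistake, but it follows from the usual definition when the hypothesis is much longer than the reference, for example when an ASR model hallucinates text. The sketch below assumes the standard definition (character-level Levenshtein distance divided by reference length); PAPER.md may apply different normalization, and the strings are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# A short reference with a long, mostly hallucinated hypothesis.
print(cer("atl", "a tall hat") > 1.0)  # True: 7 insertions over 3 reference characters
```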

Installation

pip install -e .

Optional dependencies for the full pipeline:

pip install -e ".[diarization]"    # pyannote.audio speaker diarization
pip install -e ".[g2p]"            # epitran grapheme-to-phoneme
pip install -e ".[transcription]"  # faster-whisper ASR
pip install praat-parselmouth      # prosody analysis
pip install demucs                 # vocal isolation

Quick Start

# Process audio file → SRT with language tags
tenepal batch input.wav --output output.srt

# Process with speaker diarization
tenepal batch input.wav --output output.srt --diarize

# Full pipeline with Whisper transcription
tenepal process input.wav --output output.srt --whisper-model small

# Live system audio capture
tenepal live

# Diagnostic: analyze phoneme distribution
tenepal analyze input.wav

# Docker GPU setup (for pyannote diarization)
tenepal setup-docker
tenepal doctor

Cloud Backends

Tenepal supports pluggable cloud compute for GPU-intensive pipeline stages:

# Modal (default)
modal run tenepal_modal.py::main --input audio.wav

# RunPod
TENEPAL_RUNTIME=runpod tenepal process audio.wav

# Docker (local GPU)
tenepal batch audio.wav --docker

The runtime provider abstraction (src/tenepal/runtime/) auto-detects available backends, or a backend can be selected explicitly via the TENEPAL_RUNTIME environment variable.
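The selection logic described above can be pictured roughly as follows. This is a hypothetical sketch, not the code in src/tenepal/runtime/: the function `select_runtime`, the backend name list, the detection order, and the `"local"` fallback are all assumptions. Only the `TENEPAL_RUNTIME` variable and the `runpod` value come from this README.

```python
import os

# Assumed backend names; only "runpod" is confirmed as a TENEPAL_RUNTIME value above.
KNOWN_RUNTIMES = ("modal", "runpod", "docker")


def select_runtime(env=None, available=()):
    """Pick a backend: an explicit environment override wins, else the first available."""
    env = os.environ if env is None else env
    requested = env.get("TENEPAL_RUNTIME", "").strip().lower()
    if requested:
        if requested not in KNOWN_RUNTIMES:
            raise ValueError(f"unknown runtime: {requested!r}")
        return requested
    for name in KNOWN_RUNTIMES:
        if name in available:
            return name
    return "local"  # hypothetical fallback when nothing is detected


print(select_runtime({"TENEPAL_RUNTIME": "runpod"}))  # runpod
print(select_runtime({}, available=("docker",)))       # docker
```

Passing the environment as a plain mapping keeps the function easy to test without touching the real process environment.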

Architecture

src/tenepal/
├── audio/            # Audio loading, format conversion, preprocessing
├── phoneme/          # Universal phoneme recognition (Allosaurus + text-to-IPA)
├── language/         # Language identification (profiles, scoring, smoothing)
├── transcription/    # Whisper integration and transcription routing
├── speaker/          # Speaker diarization (host + Docker backends)
├── prosody/          # Prosodic feature extraction (F0, rhythm)
├── fusion/           # Multi-evidence score fusion
├── preprocessing/    # Demucs vocal isolation, VAD segmentation
├── pronunciation/    # IPA → familiar-spelling rendering
├── morphology/       # Morphological analysis
├── orchestration/    # End-to-end pipeline orchestration
├── runtime/          # Cloud provider abstraction (Modal, RunPod)
├── validation/       # Scoring and evaluation framework
├── subtitle/         # SRT generation with language tags
├── docker/           # Docker GPU container utilities
├── data/             # Language profiles, epitran maps, lexicons
├── pipeline.py       # Core pipeline orchestration
└── cli.py            # Command-line interface

Pipeline Flow

Audio → ffmpeg → Demucs (vocal isolation) → Silero-VAD (segmentation)
  → per segment:
      Allosaurus → IPA → Phonotactic scoring ──┐
      Parselmouth → Prosody features ──────────┤→ Score fusion → Language ID
      Whisper → text (if language known) ──────┘
  → Speaker diarization (pyannote) → Speaker-level smoothing
  → SRT with [NAH|85%], [SPA|92%], [MAY|78%] language tags
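Downstream tools may want to read the language tags back out of the generated subtitles. The sketch below assumes the tags appear literally as `[LANG|NN%]` in the cue text, as shown in the flow above; where exactly Tenepal places them within a cue is not specified here, and the sample lines are invented.

```python
import re

# Three uppercase letters, a pipe, and an integer percentage, e.g. [NAH|85%].
TAG = re.compile(r"\[(?P<lang>[A-Z]{3})\|(?P<conf>\d{1,3})%\]")


def parse_tags(text):
    """Return (language, confidence) pairs for every tag found in the text."""
    return [(m["lang"], int(m["conf"]) / 100) for m in TAG.finditer(text)]


sample = "[NAH|85%] first cue text\n[SPA|92%] second cue text"
print(parse_tags(sample))  # [('NAH', 0.85), ('SPA', 0.92)]
```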

EQ Configuration

The pipeline is tunable via JSON EQ config files that control scoring thresholds, fusion weights, and detection gates:

tenepal process audio.wav --eq eq_custom.json

See eq_default.json for all available parameters.
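To make the idea of fusion weights concrete, the sketch below overlays a partial custom config on a set of defaults and then combines two evidence scores as a weighted average. The key names (`fusion_weights`, `phonotactic`, `prosody`, `min_confidence`) and the merge behavior are invented for illustration. Consult eq_default.json for the real parameter names, and note that Tenepal may require a complete config file rather than a partial overlay.

```python
import json


def deep_merge(base, override):
    """Return a copy of base with override applied recursively to nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def fuse(scores, weights):
    """Weighted average of per-channel scores, normalized by the weights used."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


# Hypothetical config contents, not the actual EQ schema.
defaults = json.loads(
    '{"fusion_weights": {"phonotactic": 0.7, "prosody": 0.3}, "min_confidence": 0.5}'
)
custom = json.loads('{"fusion_weights": {"prosody": 0.5}}')

config = deep_merge(defaults, custom)
fused = fuse({"phonotactic": 0.8, "prosody": 0.4}, config["fusion_weights"])
print(config["fusion_weights"], round(fused, 3))
```

Normalizing by the sum of the weights keeps the fused score in the same range as its inputs even when the weights do not sum to one.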

Tools

| Tool | Path | Description |
|---|---|---|
| Annotator | `tools/annotator/` | Web-based annotation tool for ground-truth labeling |
| Evaluation | `evaluate.py` + `tools/corpus/` | Scoring scripts, corpus evaluation, hallucination statistics |
| Regression | `tools/regression/` | Clip-based regression test harness |
| Corpus | `tools/corpus/` | AMITH Zacatlan-Nahuatl corpus integration |

Public-facing support docs:

docs/AMITH_CORPORA.md — where to get the Amith corpora and how local download works
docs/ANNOTATOR_SCREENSHOTS.md — guidance for safe public annotator screenshots
docs/OPEN_PROBLEMS.md — current highest-priority unresolved technical problems

Research

PAPER.md — Technical paper: methodology, experiments, results
EVOLUTION.md — Research journal: experiments, failures, and lessons learned chronologically

Datasets

Current release emphasis:

Hernán (2019) — the original project trigger and the main ablation benchmark; access is region-dependent and clips are not redistributed
La Otra Conquista (1999) — the primary public-facing Nahuatl/Spanish reference, told from an indigenous perspective
OpenSLR 92 Puebla-Nahuatl — training/evaluation corpus for Whisper finetuning
OpenSLR 147/148 + related lexical sources — auxiliary Nahuatl corpus and lexicon sources
Apocalypto / Maya materials — exploratory only; not yet a release-ready benchmark

Important release note: this repository does not grant rights to redistribute film audio, video, or subtitle files. The exact scenes used in experiments are documented in PAPER.md for reproduction using lawfully obtained source media.

Project framing note: Marina/Malinche remains the symbolic avatar of the project because of her role as a linguistic mediator in the conquest era, even though she does not directly appear in La Otra Conquista.

Source notes:

Malintzin / Tenepal naming discussion: https://en.wikipedia.org/wiki/La_Malinche
OpenSLR 92: https://openslr.org/92/
OpenSLR 147: https://openslr.org/147/
OpenSLR 148: https://openslr.org/148/

Acknowledgments

The Nahuatl lexicon and Whisper finetuning in this project build on speech corpora recorded and published by Jonathan D. Amith and collaborators (OpenSLR 92, 147, 148; Mozilla Common Voice Nahuatl). Without that fieldwork, none of this would work. See docs/AMITH_CORPORA.md for corpus details.

License

MIT License. See LICENSE.

Citation

@software{tenepal2026,
  title={Tenepal: Phoneme-Based Language Identification for Endangered Languages},
  author={Dresch, Markus},
  year={2026},
  url={https://github.com/markusd/tenepal}
}

Per ardua ad astra. Built for languages the world forgot to support.
