Thousands of indigenous and low‑resource languages remain predominantly oral. They lack: (1) a stable orthography, (2) documented semantic nuance across dialectal / contextual variation, and (3) morphosyntactic descriptions that machines can generalize. This scarcity blocks inclusive digital communication: voice messages cannot be transcribed reliably; phrases are mistranslated or flattened; synthesized speech fails to respect prosody, phonotactics, and cultural intent. Conventional supervised NLP pipelines assume sizable, curated parallel corpora—unavailable for spoken‑only or minimally documented languages. Without a principled way to bootstrap text, structure, and evaluation signals, speakers face a widening digital divide and risk accelerated language attrition.
Iteratively generate and evaluate synthetic, culturally‑aligned multimodal (text + speech) corpora—grounded in emergent orthographic conventions—then fine‑tune ASR, translation, and speech synthesis models in a closed feedback loop using Azure AI services.
The platform implements a three‑fold learning arc—orthographic, semantic, and morphosyntactic acquisition—via staged Azure‑powered pipelines that turn sparse oral inputs into a structured, evaluable, and expandable dataset.
Raw materials (field audio, elicitation notes, community‑provided transcripts, researcher glosses) are ingested and chunked (ingest_data/) into manageable dialogue or utterance windows. Chunking standardizes context units for prompting and evaluation while preserving speaker / turn boundaries.
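To make the chunking contract concrete, here is a minimal sketch of a turn-preserving chunker; the `Utterance` shape and window size are illustrative assumptions, not the actual `ingest_data/` implementation:

```python
from collections.abc import Iterator
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    speaker: str  # kept on every record so turn boundaries survive chunking
    text: str

def chunk_utterances(utterances: list[Utterance], window: int = 8) -> Iterator[list[Utterance]]:
    """Yield dialogue windows of roughly `window` utterances.

    A chunk is only closed at a speaker change, so no turn is split
    across two context units.
    """
    buffer: list[Utterance] = []
    for prev, utt in zip([None, *utterances], utterances):
        if len(buffer) >= window and prev is not None and prev.speaker != utt.speaker:
            yield buffer
            buffer = []
        buffer.append(utt)
    if buffer:
        yield buffer
```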
Because many target languages have fluid or competing writing systems, we start by proposing candidate orthographic forms. Prompt templates (prompts.py) drive Azure OpenAI deployments (e.g. GPT‑4.1 family or Mistral 3B on Azure) to produce transliteration hypotheses conditioned on phonetic hints, example pairs, and evolving conventions. Treatments (controlled variations of parameters or prompt phrasing) systematically explore orthography space. Outputs are persisted as structured artefacts in transliterations/, versioned for reproducibility.
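For illustration, one transliteration call under a single treatment might look like the sketch below, using the Azure OpenAI SDK with Entra ID auth via `DefaultAzureCredential`; the prompt wording and function name are assumptions, not the contents of `prompts.py`:

```python
import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_ad_token_provider=token_provider,
)

def propose_transliteration(phonetic_hint: str, exemplars: str, temperature: float) -> str:
    """One treatment = one controlled (prompt, temperature) variant."""
    response = client.chat.completions.create(
        model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
        temperature=temperature,
        messages=[
            {"role": "system", "content": "Propose an orthographic form following the conventions shown."},
            {"role": "user", "content": f"Examples:\n{exemplars}\nPhonetic hint: {phonetic_hint}"},
        ],
    )
    return response.choices[0].message.content
```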
Generated transliterations are scored for internal consistency (character inventory adherence, syllable pattern regularity), cross‑speaker stability, and semantic fidelity. Evaluation prompts and heuristics form a feedback layer—accepting / rejecting or ranking candidates. High‑quality pairs (audio ↔ emergent text) accumulate into a seed corpus. This synthetic corpus becomes a stand‑in parallel dataset where none previously existed.
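One of the cheapest internal-consistency signals, character inventory adherence, can be computed directly. This sketch is a simplified stand-in for the fuller evaluation heuristics:

```python
def inventory_adherence(candidate: str, inventory: set[str]) -> float:
    """Fraction of non-space characters drawn from the agreed character inventory.

    Real scoring would also cover syllable pattern regularity and
    cross-speaker stability.
    """
    chars = [c for c in candidate if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in inventory for c in chars) / len(chars)

# Example: a candidate fully inside the emergent inventory scores 1.0.
# inventory_adherence("paka'tu", set("pakt'u"))  -> 1.0
```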
Dialogues are expanded with glosses, part‑of‑speech conjectures, and morphological segmentation (when attainable) using iterative prompting + rule‑based post‑passes. The enriched artefacts supply latent grammatical structure, informing downstream translation disambiguation (tense, evidentiality, person markers) and improving language modeling robustness.
The curated audio/transliteration pairs feed fine‑tuning of Whisper (or Azure Cognitive Services Speech models where supported). Managed identity + DefaultAzureCredential enable secure dataset registration in Azure ML or Speech resource contexts. Fine‑tuning improves phoneme → grapheme mapping aligned with emergent orthography, raising transcription accuracy for future raw audio.
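Dataset registration with the `azure-ai-ml` SDK might look like the following sketch; the asset name, identifiers, and local path are placeholders:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Register the curated audio/transliteration pairs as a versioned data asset.
dataset = Data(
    name="emergent-orthography-pairs",   # hypothetical asset name
    version="1",
    type=AssetTypes.URI_FOLDER,
    path="./transliterations/curated/",  # hypothetical local folder
)
ml_client.data.create_or_update(dataset)
```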
With stabilized transcription, the translation layer (Azure OpenAI GPT‑4.1 or Mistral 3B deployment) transforms source sentences into a bridge language (e.g. Portuguese) and optionally a secondary lingua franca (e.g. English). Techniques include:
- Few‑shot exemplars derived from highest‑confidence evaluation outputs.
- Back‑translation to detect semantic drift.
- Style / register controls ensuring culturally respectful phrasing.
Parallel text triples (original audio → emergent text → Portuguese translation) enhance semantic robustness and allow future alignment tasks or embedding indexing.
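The back-translation drift check can be expressed independently of any particular model client. In this sketch the similarity measure is a crude character-level stand-in; embedding cosine similarity would give a stronger signal in practice:

```python
import difflib
from typing import Callable

def back_translation_check(
    source: str,
    forward: Callable[[str], str],   # emergent text -> Portuguese
    backward: Callable[[str], str],  # Portuguese -> emergent text
) -> tuple[str, float]:
    """Translate forward, translate back, and score agreement with the source.

    A low score flags semantic drift; the pair is routed to human review
    instead of entering the few-shot exemplar pool.
    """
    bridged = forward(source)
    roundtrip = backward(bridged)
    score = difflib.SequenceMatcher(None, source, roundtrip).ratio()
    return bridged, score
```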
Final translations are rendered back into audio using Azure Cognitive Services Speech with SSML markup capturing prosodic intent (pauses, emphasis) and phonetic approximations for indigenous phonemes. This enables bidirectional communicative flows: indigenous language → Portuguese (text/audio) and Portuguese → indigenous (synthetic audio), fostering inclusion and field validation loops.
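A minimal synthesis sketch with the Speech SDK is shown below. The voice, SSML body, and `AZURE_SPEECH_KEY` variable are illustrative assumptions, and emphasis/phoneme support varies by voice:

```python
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],       # hypothetical variable name
    region=os.environ["AZURE_COG_SERVICES_REGION"],
)
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# SSML carrying prosodic intent: a deliberate pause, emphasis, and an IPA
# phoneme approximation for a sound absent from the voice's language.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="pt-BR">
  <voice name="pt-BR-FranciscaNeural">
    Bom dia. <break time="400ms"/>
    <emphasis level="strong">Bem-vindos</emphasis> ao projeto.
    <phoneme alphabet="ipa" ph="ʔaˈtʃi">ati</phoneme>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis failed:", result.reason)
```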
Human validators (native speakers, linguists) review synthesized outputs, annotate misalignments (orthographic drift, semantic mismatch, morphological errors). These annotations seed incremental retraining cycles, refining transliteration prompts, evaluation heuristics, and translation exemplars. The system steadily converges toward a stable orthography + semantic mapping without large pre‑existing corpora.
Infrastructure is provisioned with Bicep (primary) or Terraform (alternative) under infra/. Modular templates instantiate: Azure OpenAI, Speech, AI Search (for future retrieval augmentation), Blob Storage (artefact persistence), Cosmos DB (optional metadata), VNet isolation, and observability (Log Analytics). Choosing one IaC stack per environment avoids drift. Secrets are never hard‑coded; DefaultAzureCredential orchestrates secure access paths.
- Language Models: Accessed only through Azure SDKs with `DefaultAzureCredential`.
- Domain Modeling: Dataclasses capture artefacts (dialogues, transliterations, scoring) for purity & testability.
- Boundaries: Pydantic used sparingly for configuration (`configurations.py`) and external I/O.
- Pipelines: Composable, side‑effect isolated functions enabling deterministic unit tests (`tests/`).
- Treatment Abstraction: Experiment wrapper allowing systematic hyperparameter / prompt variant exploration (sketched below).
- Storage Layer: `blob.py` provides unified artefact upload / retrieval to central containers for team collaboration.
- Reproducibility: Versioned JSONL artefacts + logged evaluation metrics ensure traceable model improvement steps.
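The treatment pattern can be pictured as a small frozen dataclass plus a grid enumerator; the actual `treatment.py` wrapper may differ:

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class Treatment:
    """One controlled variant of an experiment (hypothetical shape)."""
    prompt_id: str
    temperature: float
    top_p: float = 1.0

def grid(prompt_ids: list[str], temperatures: list[float]) -> list[Treatment]:
    """Enumerate the full cross-product of prompt and sampling variants."""
    return [
        Treatment(prompt_id=p, temperature=t)
        for p, t in itertools.product(prompt_ids, temperatures)
    ]

# grid(["baseline", "phonetic_hints"], [0.2, 0.7]) yields 4 treatments.
```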
Localization choices (orthography, semantic framing) are made with community consultation. Synthetic data generation respects data sovereignty principles: locally sensitive artefacts can remain in secure storage partitions; public releases undergo redaction & consent review. The goal is augmentation, not replacement, of human linguistic stewardship.
The platform reduces barrier‑to‑entry for low‑resource language technology by bootstrapping a high‑quality multimodal corpus from near zero, enabling:
- Accurate transcription of emergent orthography.
- Faithful, culturally sensitive translation.
- Natural speech synthesis for bidirectional communication.
- An extensible foundation for future retrieval augmentation, morphological analyzers, and educational tools.
Applied AI pipelines for building, evaluating, and deploying indigenous / low‑resource language assets using Azure OpenAI & Azure Cognitive Services.
Modules for text generation, transliteration, translation, speech (STT / TTS) and infra-as-code (Bicep / Terraform) to stand up an end‑to‑end experimentation & deployment environment.
Many indigenous and low‑resource languages lack robust digital corpora and tooling. This repository provides a modular, testable sandbox to:
- Ingest & chunk documents / transcripts
- Generate candidate transliterations & evaluations
- Run text generation prompts with systematic treatment control
- Translate and synthesize speech (planned / in-progress modules)
- Persist intermediate artefacts (JSONL / Blob) for reproducibility
- Deploy required Azure resources via Bicep or Terraform
All application code favors:
- Python 3.11+
- Azure SDKs with `DefaultAzureCredential`
- Small, composable, side‑effect isolated functions
- Dataclasses for domain models; Pydantic only at boundaries (settings / I/O)
```
apps/
  speech_to_text/   # (Placeholder / future expansion for STT pipeline)
  text_generation/  # Active Python app for transliteration & prompt pipelines
  text_to_speech/   # (Placeholder / future TTS synthesis modules)
  translation/      # (Placeholder / future translation pipelines)
configs/            # App‑level configuration (override / environment specific)
infra/
  bicep/            # Azure Bicep templates (authoritative infra-as-code)
  terraform/        # Terraform templates (alternative IaC flavor)
  src/              # Rust helper tooling (if any future infra utilities)
libs/               # (Reserved for shared Python / Rust libraries)
scripts/            # Helper scripts (data, ops, maintenance)
```
Key active code today lives in `apps/text_generation/src`:

| File / Dir | Purpose |
|---|---|
| `main.py` | Entry point / CLI orchestration for generation tasks |
| `configurations.py` | Settings & environment loading (Pydantic / dataclass hybrids) |
| `blob.py` | Azure Blob Storage adapter utilities |
| `prompts.py` | Prompt templates & few‑shot scaffolds |
| `treatment.py` | Experimental “treatment” abstraction for controlled prompt variants |
| `utils.py` | Cross‑cutting helpers (logging, file IO, etc.) |
| `domain/` | Domain model dataclasses (dialogue, transliteration artefacts, scoring) |
| `ingest_data/` | Ingestion / chunking scripts (PDF, text corpora) |
| `pipelines/` | Orchestrated multi‑step pipelines (generation, evaluation) |
| `tests/` | Pytest-based unit tests (add here!) |
| `transliterations/` | Produced JSON artefacts and progressive conversation logs |
- Chunk – A window of pages or sentences used for context retrieval & indexing.
- Dialogue – Ordered role-based utterances (system / user / assistant) for generation.
- Transliteration – Produced candidate in target writing system / orthography.
- Evaluation – Numeric + qualitative judgement of produced transliterations.
- Treatment – A controlled variant of a prompt or parameter setting under test.
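These glossary concepts map naturally onto small frozen dataclasses; the field names below are hypothetical, not the actual `domain/` models:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Dialogue:
    """Ordered role-based utterances (roles: system / user / assistant)."""
    dialogue_id: str
    turns: list[tuple[str, str]]  # (role, content) pairs

@dataclass(frozen=True)
class Transliteration:
    """A candidate rendering in the emergent orthography."""
    dialogue_id: str
    candidate: str
    treatment_id: str  # which prompt/parameter variant produced it

@dataclass(frozen=True)
class Evaluation:
    """Numeric + qualitative judgement of a transliteration."""
    candidate_id: str
    score: float
    notes: str = ""
```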
| Component | Requirement |
|---|---|
| Python | 3.11+ (recommend 3.11.x) |
| Package Manager | Poetry (preferred) OR editable install fallback |
| Azure Subscription | Required for deploying infra + using Azure OpenAI / Cognitive Services |
| Azure CLI | az login for local credential chain |
| Permissions | Ability to create resource group, storage, Azure OpenAI, AI Search (optional) |
Ensure you can obtain a token via `DefaultAzureCredential`:

```bash
az login
az account show
```

Clone & bootstrap:

```bash
git clone https://github.com/Azure-Samples/language-creation.git
cd language-creation/apps/text_generation
poetry install   # or: pip install -e .
```

Activate environment (Poetry):

```bash
poetry shell
```

If using pip fallback at repo root:

```bash
pip install -e apps/text_generation
```

The application resolves settings from (priority order):
- Environment variables
- `.env` file (if implemented) / local overrides in `configs/text_generation/`
- Sensible defaults inside `configurations.py`
Typical environment variables:
| Variable | Description |
|---|---|
| `AZURE_OPENAI_ENDPOINT` | Your Azure OpenAI endpoint URL |
| `AZURE_OPENAI_API_VERSION` | Supported API version (e.g. `2024-05-01-preview`) |
| `AZURE_OPENAI_DEPLOYMENT` | Model deployment name (e.g. `gpt-4o-mini`) |
| `AZURE_STORAGE_ACCOUNT` | Blob Storage account name |
| `AZURE_STORAGE_CONTAINER` | Blob container for artefacts |
| `AZURE_COG_SERVICES_REGION` | Region for cognitive services endpoints |
Create a local `.env` (do not commit secrets):

```bash
AZURE_OPENAI_ENDPOINT=https://<your-openai>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_STORAGE_ACCOUNT=<acct>
AZURE_STORAGE_CONTAINER=transliterations
```

`DefaultAzureCredential` will automatically try (in order) environment, managed identity, Visual Studio / VS Code, Azure CLI, etc.
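As a sketch of how `configurations.py` might surface these variables (the real settings class may differ), a frozen dataclass with environment-first resolution is enough:

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class OpenAISettings:
    """Hypothetical settings holder mirroring the env vars above."""
    endpoint: str
    deployment: str
    api_version: str

def load_openai_settings() -> OpenAISettings:
    # Environment variables win; defaults mirror the sample .env values.
    return OpenAISettings(
        endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini"),
        api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2024-05-01-preview"),
    )
```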
While a unified CLI runner is evolving, a common pattern from the `apps/text_generation` root:

```bash
poetry run python -m src.main --help
```

Examples (illustrative; adjust to actual implemented arguments):

```bash
# Generate transliterations for a dialogue JSONL
poetry run python -m src.main transliterate \
  --input data/dialogues/example.jsonl \
  --output transliterations/run_2025_09_24.jsonl

# Evaluate previously generated transliterations
poetry run python -m src.main evaluate --input transliterations/run_2025_09_24.jsonl
```

Artifacts accumulate in `transliterations/` with versioned naming (`transliteration_<session>.<step>.json`).
Fine-tuning prerequisite: Before running the translation fine-tuning runner, enable managed identity access for the workspace system datastores so dataset uploads succeed:

```bash
az ml workspace update \
  --resource-group <resource-group> \
  --name <workspace-name> \
  --system-datastores-auth-mode identity
```

Replace `<resource-group>` and `<workspace-name>` with your values. The flag is currently in preview; rerun the command after recreating the workspace.

After enabling identity mode, grant the workspace’s managed identity access to the backing storage account:

```powershell
$workspaceId = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
$principal = (az resource show --ids $workspaceId --query identity.principalId -o tsv)
$storageScope = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
az role assignment create --assignee $principal --role "Storage Blob Data Contributor" --scope $storageScope
az role assignment create --assignee $principal --role "Storage Blob Data Owner" --scope $storageScope
az role assignment create --assignee-object-id (az ad signed-in-user show --query id -o tsv) --role "Storage Blob Data Contributor" --scope $storageScope
```

Swap in your identifiers; wait for RBAC propagation (typically < 1 minute) before retrying fine-tuning. If you run the fine-tuning script locally, grant your signed-in user at least contributor rights as shown above so uploads succeed with `DefaultAzureCredential`.
```bash
cd apps/text_generation
poetry run pytest -q
```

Add new tests under `apps/text_generation/src/tests/` following the existing style; keep them fast and deterministic (mock network I/O). For integration tests requiring real Azure resources, gate with an env var (e.g. `RUN_INTEGRATION=1`), as in the sketch below.
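A minimal gating pattern for such integration tests, assuming a hypothetical `test_blob_roundtrip`:

```python
import os

import pytest

# Opt-in marker: integration tests run only when RUN_INTEGRATION=1 is set.
requires_azure = pytest.mark.skipif(
    os.environ.get("RUN_INTEGRATION") != "1",
    reason="set RUN_INTEGRATION=1 to exercise real Azure resources",
)

@requires_azure
def test_blob_roundtrip():
    # Hypothetical test body; would upload and re-download an artefact
    # via the blob.py helpers against a real storage account.
    ...
```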
| Location | Purpose |
|---|---|
| `transliterations/*.json` | Generated intermediate & final outputs |
| `transliterations/conversation_progress.jsonl` | Progressive logging of dialogues across steps |
| `app.log` | Rotating / append log (ensure `.gitignore` if large) |

Consider exporting curated datasets to Blob Storage via `blob.py` utilities for collaboration.
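A minimal upload sketch with the Blob SDK and `DefaultAzureCredential` (not necessarily the exact `blob.py` API):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def upload_artifact(account: str, container: str, local_path: str, blob_name: str) -> None:
    """Upload a local artefact (e.g. a JSONL run) to the shared container."""
    service = BlobServiceClient(
        account_url=f"https://{account}.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    blob = service.get_blob_client(container=container, blob=blob_name)
    with open(local_path, "rb") as fh:
        blob.upload_blob(fh, overwrite=True)

# Example (identifiers are placeholders):
# upload_artifact("<acct>", "transliterations",
#                 "transliterations/run_2025_09_24.jsonl",
#                 "runs/run_2025_09_24.jsonl")
```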
Two parallel stacks are provided; pick one per deployment to avoid drift:

`infra/bicep/main.bicep` composes modules under `infra/bicep/modules/`.

Deploy (example):

```powershell
$rg = "lang-create-rg"
$loc = "eastus"
az group create -n $rg -l $loc
az deployment group create -g $rg -f infra/bicep/main.bicep -p environment=dev
```

In `infra/terraform/`:

```bash
cd infra/terraform
terraform init
terraform plan -out tfplan
terraform apply tfplan
```

Keep state secure (Azure Storage backend recommended); rotate secrets via Key Vault.
- Never commit secrets—use Azure Key Vault + `DefaultAzureCredential`.
- Enable audit logging on Storage & OpenAI usage.
- Consider content filtering / abuse monitoring for generated text.
- Isolate experimentation vs production resource groups.
- Keep functions pure (no network / disk inside core logic where feasible).
- Dataclasses for domain objects; serialize via helper functions.
- Adopt “treatment” pattern for any experiment with prompt variants or temperature changes.
- Limit PRs to < 400 changed lines (scaffold exceptions allowed).
- Prefer incremental refactors (scaffold → adapters → pipelines → tests → hardening).
| Phase | Focus |
|---|---|
| 1 | Solidify text generation + transliteration evaluation loop |
| 2 | Add translation pipeline with back‑translation quality checks |
| 3 | Introduce STT/TTS (batch mode) with caching & alignment metadata |
| 4 | Retrieval augmentation / vector indexing for context grounding |
| 5 | Web or CLI UX polish & model evaluation dashboards |
- Fork & branch from `main` (naming: `feature/<slug>` or `fix/<slug>`)
- Keep commits focused & descriptive
- Add / update tests for behavioral changes
- Run `pytest -q` before opening PR
- Provide brief rationale & screenshots / sample artefacts if relevant
See CONTRIBUTING.md for fuller guidance (if present / to be expanded).
This project is licensed under the terms of the MIT License. See LICENSE.md.
| Question | Answer |
|---|---|
| Why both Bicep & Terraform? | Demonstrates equivalent IaC approaches—choose one to avoid drift. |
| Where are prompts stored? | Centralized in prompts.py (consider externalizing to YAML later). |
| How do I add a new model? | Add deployment in Azure OpenAI, expose its name via env var / config, update settings class. |
| How do I persist artefacts? | Use blob.py helpers with DefaultAzureCredential; keep local copies under transliterations/. |
| Symptom | Likely Cause | Fix |
|---|---|---|
| Credential error | Not logged into Azure CLI | az login then retry |
| 403 on OpenAI call | Missing role / assignment | Assign Cognitive Services OpenAI User role |
| Slow runs | Large context / high iteration prompts | Reduce chunk size or parallelize treatments |
| Terraform drift | Mixed IaC usage | Standardize on Bicep OR Terraform per environment |
You now have the scaffolding to ingest, generate, evaluate, and iterate on low‑resource language artefacts with reproducible infrastructure. Extend by adding domain models, refining evaluation metrics, or integrating retrieval augmentation.
Feel free to open issues for missing docs or suggested improvements.
Maintained by the community with ❤️. Contributions welcome.