
Language Creation Platform

🧠 Business Problem

Thousands of indigenous and low‑resource languages remain predominantly oral. They lack: (1) a stable orthography, (2) documented semantic nuance across dialectal / contextual variation, and (3) morphosyntactic descriptions that machines can generalize. This scarcity blocks inclusive digital communication: voice messages cannot be transcribed reliably; phrases are mistranslated or flattened; synthesized speech fails to respect prosody, phonotactics, and cultural intent. Conventional supervised NLP pipelines assume sizable, curated parallel corpora—unavailable for spoken‑only or minimally documented languages. Without a principled way to bootstrap text, structure, and evaluation signals, speakers face a widening digital divide and risk accelerated language attrition.

🎯 One‑Sentence Approach

Iteratively generate and evaluate synthetic, culturally‑aligned multimodal (text + speech) corpora—grounded in emergent orthographic conventions—then fine‑tune ASR, translation, and speech synthesis models in a closed feedback loop using Azure AI services.

⚙️ Technical Solution (End‑to‑End Narrative)

The platform implements a three‑fold learning arc—orthographic, semantic, and morphosyntactic acquisition—via staged Azure‑powered pipelines that turn sparse oral inputs into a structured, evaluable, and expandable dataset.

1. Data Ingestion & Normalization

Raw materials (field audio, elicitation notes, community‑provided transcripts, researcher glosses) are ingested and chunked (ingest_data/) into manageable dialogue or utterance windows. Chunking standardizes context units for prompting and evaluation while preserving speaker / turn boundaries.
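To make the chunking contract concrete, here is a minimal sketch assuming a simple sliding window with overlap; the function and parameter names are illustrative, not the actual ingest_data/ API:

# Illustrative chunking sketch; names are hypothetical, not the ingest_data/ API.
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str

def chunk_dialogue(utterances: list[Utterance], window: int = 8, overlap: int = 2) -> list[list[Utterance]]:
    # Slide a fixed window over whole utterances so speaker / turn boundaries survive.
    step = max(window - overlap, 1)
    return [utterances[i:i + window] for i in range(0, len(utterances), step)]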

2. Orthographic Emergence & Transliteration Generation

Because many target languages have fluid or competing writing systems, we start by proposing candidate orthographic forms. Prompt templates (prompts.py) drive Azure OpenAI deployments (e.g. GPT‑4.1 family or Mistral 3B on Azure) to produce transliteration hypotheses conditioned on phonetic hints, example pairs, and evolving conventions. Treatments (controlled variations of parameters or prompt phrasing) systematically explore orthography space. Outputs are persisted as structured artefacts in transliterations/, versioned for reproducibility.
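For illustration, a sketch of one generation call through the Azure OpenAI SDK with Entra ID auth; endpoint, API version, and deployment come from the environment variables documented below, and the prompt content is invented:

# Minimal transliteration-hypothesis call (sketch; prompt text is illustrative).
import os
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_ad_token_provider=token_provider,
)
response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT"],
    messages=[
        {"role": "system", "content": "Propose a transliteration following the conventions below."},
        {"role": "user", "content": "Phonetic hint: [mbaɾaka]; example pairs: ..."},
    ],
    temperature=0.7,  # a treatment may vary this systematically
)
print(response.choices[0].message.content)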

3. Evaluation Loop & Quality Signals

Generated transliterations are scored for internal consistency (character inventory adherence, syllable pattern regularity), cross‑speaker stability, and semantic fidelity. Evaluation prompts and heuristics form a feedback layer—accepting / rejecting or ranking candidates. High‑quality pairs (audio ↔ emergent text) accumulate into a seed corpus. This synthetic corpus becomes a stand‑in parallel dataset where none previously existed.
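As a flavor of those signals, a toy character-inventory adherence score (the inventory here is invented):

# Hypothetical consistency heuristic: share of letters drawn from the agreed character inventory.
def inventory_adherence(candidate: str, inventory: set[str]) -> float:
    letters = [ch.lower() for ch in candidate if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch in inventory for ch in letters) / len(letters)

inventory_adherence("mbaraka", set("abehikmnoprstuwy"))  # 1.0 under this toy inventory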

4. Semantic & Morphosyntactic Enrichment

Dialogues are expanded with glosses, part‑of‑speech conjectures, and morphological segmentation (when attainable) using iterative prompting + rule‑based post‑passes. The enriched artefacts supply latent grammatical structure, informing downstream translation disambiguation (tense, evidentiality, person markers) and improving language modeling robustness.

5. ASR Fine‑Tuning (Whisper or Azure Speech Models)

The curated audio/transliteration pairs feed fine‑tuning of Whisper (or Azure Cognitive Services Speech models where supported). Managed identity + DefaultAzureCredential enable secure dataset registration in Azure ML or Speech resource contexts. Fine‑tuning improves phoneme → grapheme mapping aligned with emergent orthography, raising transcription accuracy for future raw audio.
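For orientation, a sketch of registering curated pairs as a data asset with the Azure ML SDK under DefaultAzureCredential; the asset name and angle-bracket placeholders are ours, not prescribed by the repo:

# Register an audio/transliteration dataset in Azure ML (sketch; identifiers are placeholders).
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)
pairs = Data(
    name="asr-finetune-pairs",                     # hypothetical asset name
    path="transliterations/run_2025_09_24.jsonl",  # local JSONL of audio/text pairs
    type=AssetTypes.URI_FILE,
    description="Curated audio/transliteration pairs for Whisper fine-tuning",
)
ml_client.data.create_or_update(pairs)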

6. Translation Pipeline

With stabilized transcription, the translation layer (Azure OpenAI GPT‑4.1 or Mistral 3B deployment) transforms source sentences into a bridge language (e.g. Portuguese) and optionally a secondary lingua franca (e.g. English). Techniques include:

  • Few‑shot exemplars derived from highest‑confidence evaluation outputs.
  • Back‑translation to detect semantic drift.
  • Style / register controls ensuring culturally respectful phrasing.

Parallel text triples (original audio → emergent text → Portuguese translation) enhance semantic robustness and allow future alignment tasks or embedding indexing.
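A minimal sketch of the back-translation drift check, assuming a simple token-overlap metric and an invented 0.5 threshold:

# Back-translation drift check (sketch): compare the original emergent text with its
# round trip through the bridge language using a crude token-overlap ratio.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

original = "mba'éichapa reiko"         # emergent text (illustrative)
round_trip = "mba'éichapa nde reiko"   # result of translating to Portuguese and back
if token_overlap(original, round_trip) < 0.5:
    print("semantic drift suspected; route to human review")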

7. Speech Synthesis (SSML‑Driven)

Final translations are rendered back into audio using Azure Cognitive Services Speech with SSML markup capturing prosodic intent (pauses, emphasis) and phonetic approximations for indigenous phonemes. This enables bidirectional communicative flows: indigenous language → Portuguese (text/audio) and Portuguese → indigenous (synthetic audio), fostering inclusion and field validation loops.
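For illustration, SSML-driven synthesis with the Azure Speech SDK; the voice and markup are invented, and AZURE_SPEECH_KEY is an assumed variable used here for brevity where keyless auth is not wired up:

# SSML synthesis sketch (voice name, phrasing, and AZURE_SPEECH_KEY are assumptions).
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_COG_SERVICES_REGION"],
)
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='pt-BR'>
  <voice name='pt-BR-FranciscaNeural'>
    Bom dia. <break time='300ms'/> <emphasis level='moderate'>Bem-vindo.</emphasis>
  </voice>
</speak>
"""
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
result = synthesizer.speak_ssml_async(ssml).get()  # plays to the default audio output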

8. Feedback & Continuous Improvement

Human validators (native speakers, linguists) review synthesized outputs and annotate misalignments (orthographic drift, semantic mismatch, morphological errors). These annotations seed incremental retraining cycles, refining transliteration prompts, evaluation heuristics, and translation exemplars. The system steadily converges toward a stable orthography + semantic mapping without large pre‑existing corpora.

9. Infrastructure as Code & Operational Backbone

Infrastructure is provisioned with Bicep (primary) or Terraform (alternative) under infra/. Modular templates instantiate: Azure OpenAI, Speech, AI Search (for future retrieval augmentation), Blob Storage (artefact persistence), Cosmos DB (optional metadata), VNet isolation, and observability (Log Analytics). Choosing one IaC stack per environment avoids drift. Secrets are never hard‑coded; DefaultAzureCredential orchestrates secure access paths.

10. Architecture & Implementation Practices

  • Language Models: Accessed only through Azure SDKs with DefaultAzureCredential.
  • Domain Modeling: Dataclasses capture artefacts (dialogues, transliterations, scoring) for purity & testability.
  • Boundaries: Pydantic used sparingly for configuration (configurations.py) and external I/O.
  • Pipelines: Composable, side‑effect isolated functions enabling deterministic unit tests (tests/).
  • Treatment Abstraction: Experiment wrapper allowing systematic hyperparameter / prompt variant exploration.
  • Storage Layer: blob.py provides unified artefact upload / retrieval to central containers for team collaboration.
  • Reproducibility: Versioned JSONL artefacts + logged evaluation metrics ensure traceable model improvement steps.

11. Ethical & Community Considerations

Localization choices (orthography, semantic framing) are made with community consultation. Synthetic data generation respects data sovereignty principles: locally sensitive artefacts can remain in secure storage partitions; public releases undergo redaction & consent review. The goal is augmentation, not replacement, of human linguistic stewardship.

12. Outcomes

The platform reduces barrier‑to‑entry for low‑resource language technology by bootstrapping a high‑quality multimodal corpus from near zero, enabling:

  • Accurate transcription of emergent orthography.
  • Faithful, culturally sensitive translation.
  • Natural speech synthesis for bidirectional communication.
  • An extensible foundation for future retrieval augmentation, morphological analyzers, and educational tools.

Applied AI pipelines for building, evaluating, and deploying indigenous / low‑resource language assets using Azure OpenAI & Azure Cognitive Services.

Modules for text generation, transliteration, translation, speech (STT / TTS) and infra-as-code (Bicep / Terraform) to stand up an end‑to‑end experimentation & deployment environment.


🚀 Overview

Many indigenous and low‑resource languages lack robust digital corpora and tooling. This repository provides a modular, testable sandbox to:

  • Ingest & chunk documents / transcripts
  • Generate candidate transliterations & evaluations
  • Run text generation prompts with systematic treatment control
  • Translate and synthesize speech (planned / in-progress modules)
  • Persist intermediate artefacts (JSONL / Blob) for reproducibility
  • Deploy required Azure resources via Bicep or Terraform

All application code favors:

  • Python 3.11+
  • Azure SDKs with DefaultAzureCredential
  • Small, composable, side‑effect isolated functions
  • Dataclasses for domain models; Pydantic only at boundaries (settings / I/O)

📁 Repository Structure

apps/
  speech_to_text/        # (Placeholder / future expansion for STT pipeline)
  text_generation/       # Active Python app for transliteration & prompt pipelines
  text_to_speech/        # (Placeholder / future TTS synthesis modules)
  translation/           # (Placeholder / future translation pipelines)
configs/                 # App‑level configuration (override / environment specific)
infra/
  bicep/                 # Azure Bicep templates (authoritative infra-as-code)
  terraform/             # Terraform templates (alternative IaC flavor)
  src/                   # Rust helper tooling (reserved for future infra utilities)
libs/                    # (Reserved for shared Python / Rust libraries)
scripts/                 # Helper scripts (data, ops, maintenance)

Key active code today lives in apps/text_generation/src:

  • main.py: Entry point / CLI orchestration for generation tasks
  • configurations.py: Settings & environment loading (Pydantic / dataclass hybrids)
  • blob.py: Azure Blob Storage adapter utilities
  • prompts.py: Prompt templates & few‑shot scaffolds
  • treatment.py: Experimental “treatment” abstraction for controlled prompt variants
  • utils.py: Cross‑cutting helpers (logging, file I/O, etc.)
  • domain/: Domain model dataclasses (dialogue, transliteration artefacts, scoring)
  • ingest_data/: Ingestion / chunking scripts (PDF, text corpora)
  • pipelines/: Orchestrated multi‑step pipelines (generation, evaluation)
  • tests/: Pytest-based unit tests (add here!)
  • transliterations/: Produced JSON artefacts and progressive conversation logs

🧩 Core Concepts

  • Chunk – A window of pages or sentences used for context retrieval & indexing.
  • Dialogue – Ordered role-based utterances (system / user / assistant) for generation.
  • Transliteration – Produced candidate in target writing system / orthography.
  • Evaluation – Numeric + qualitative judgement of produced transliterations.
  • Treatment – A controlled variant of a prompt or parameter setting under test.
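As a sketch, those concepts might map onto dataclasses like the following (illustrative shapes; the real definitions live under domain/):

# Illustrative domain dataclasses; field choices are assumptions, not the repo's actual models.
from dataclasses import dataclass, field

@dataclass
class Dialogue:
    turns: list[dict]  # ordered {"role": ..., "content": ...} messages

@dataclass
class Transliteration:
    source_id: str
    text: str
    convention: str  # which emergent orthographic convention produced it

@dataclass
class Evaluation:
    transliteration_id: str
    score: float
    notes: str = ""

@dataclass
class Treatment:
    name: str
    params: dict = field(default_factory=dict)  # e.g. {"temperature": 0.7}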

🔧 Prerequisites

  • Python: 3.11+ (3.11.x recommended)
  • Package manager: Poetry (preferred) or an editable pip install fallback
  • Azure subscription: required to deploy infra and use Azure OpenAI / Cognitive Services
  • Azure CLI: az login for the local credential chain
  • Permissions: ability to create resource groups, storage, Azure OpenAI, and AI Search (optional)

Ensure you can obtain a token via DefaultAzureCredential:

az login
az account show

🛠️ Installation (Text Generation App)

Clone & bootstrap:

git clone https://github.com/Azure-Samples/language-creation.git
cd language-creation/apps/text_generation
poetry install  # or: pip install -e .

Activate environment (Poetry):

poetry shell

If using pip fallback at repo root:

pip install -e apps/text_generation

⚙️ Configuration

The application resolves settings from (priority order):

  1. Environment variables
  2. .env file (if implemented) / local overrides in configs/text_generation/
  3. Sensible defaults inside configurations.py
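A minimal sketch of that resolution order using pydantic-settings (the actual configurations.py may be shaped differently): environment variables win, then .env, then the declared defaults.

# Settings-resolution sketch; field names mirror the env vars documented below.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    azure_openai_endpoint: str = ""
    azure_openai_deployment: str = "gpt-4o-mini"
    azure_openai_api_version: str = "2024-05-01-preview"
    azure_storage_account: str = ""
    azure_storage_container: str = "transliterations"

settings = Settings()  # reads env vars first, then .env, then falls back to defaults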

Typical environment variables:

  • AZURE_OPENAI_ENDPOINT: your Azure OpenAI endpoint URL
  • AZURE_OPENAI_API_VERSION: supported API version (e.g. 2024-05-01-preview)
  • AZURE_OPENAI_DEPLOYMENT: model deployment name (e.g. gpt-4o-mini)
  • AZURE_STORAGE_ACCOUNT: Blob Storage account name
  • AZURE_STORAGE_CONTAINER: Blob container for artefacts
  • AZURE_COG_SERVICES_REGION: region for Cognitive Services endpoints

Create a local .env (do not commit secrets):

AZURE_OPENAI_ENDPOINT=https://<your-openai>.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_STORAGE_ACCOUNT=<acct>
AZURE_STORAGE_CONTAINER=transliterations

DefaultAzureCredential will automatically try (in order) environment, managed identity, Visual Studio / VS Code, Azure CLI, etc.
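For example, a minimal artefact upload riding that chain (file and blob names are illustrative):

# Upload a local artefact with DefaultAzureCredential (sketch; names are illustrative).
import os
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=f"https://{os.environ['AZURE_STORAGE_ACCOUNT']}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
container = service.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])
with open("transliterations/run.jsonl", "rb") as fh:
    container.upload_blob("run.jsonl", fh, overwrite=True)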


▶️ Running Pipelines

A unified CLI runner is still evolving; in the meantime, a common pattern from the apps/text_generation root is:

poetry run python -m src.main --help

Examples (illustrative; adjust to actual implemented arguments):

# Generate transliterations for a dialogue JSONL
poetry run python -m src.main transliterate \
  --input data/dialogues/example.jsonl \
  --output transliterations/run_2025_09_24.jsonl

# Evaluate previously generated transliterations
poetry run python -m src.main evaluate --input transliterations/run_2025_09_24.jsonl

Artefacts accumulate in transliterations/ with versioned naming (transliteration_<session>.<step>.json).
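A hypothetical helper matching that naming scheme:

# Hypothetical; the repo may construct these names elsewhere.
def artefact_name(session: str, step: int) -> str:
    return f"transliteration_{session}.{step}.json"

artefact_name("2025_09_24", 3)  # -> "transliteration_2025_09_24.3.json"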


Fine-tuning prerequisite: Before running the translation fine-tuning runner, enable managed identity access for the workspace system datastores so dataset uploads succeed:

az ml workspace update --resource-group <resource-group> --name <workspace-name> --system-datastores-auth-mode identity

Replace <resource-group> and <workspace-name> with your values. The flag is currently in preview; if you recreate the workspace, rerun the command.

After enabling identity mode, grant the workspace’s managed identity access to the backing storage account:

$workspaceId = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.MachineLearningServices/workspaces/<workspace-name>"
$principal = (az resource show --ids $workspaceId --query identity.principalId -o tsv)
$storageScope = "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
az role assignment create --assignee $principal --role "Storage Blob Data Contributor" --scope $storageScope
az role assignment create --assignee $principal --role "Storage Blob Data Owner" --scope $storageScope
az role assignment create --assignee-object-id (az ad signed-in-user show --query id -o tsv) --role "Storage Blob Data Contributor" --scope $storageScope

Swap in your identifiers; wait for RBAC propagation (typically < 1 minute) before retrying fine-tuning. If you run the fine-tuning script locally, grant your signed-in user at least the Storage Blob Data Contributor role as shown above so uploads succeed with DefaultAzureCredential.


🧪 Testing

cd apps/text_generation
poetry run pytest -q

Add new tests under apps/text_generation/src/tests/ following the existing style; keep them fast and deterministic (mock network I/O). For integration tests requiring real Azure resources, gate with an env var (e.g. RUN_INTEGRATION=1).
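One way to express that gate (a pattern sketch, not existing repo code):

# Skip integration tests unless explicitly enabled via RUN_INTEGRATION=1.
import os
import pytest

requires_azure = pytest.mark.skipif(
    os.getenv("RUN_INTEGRATION") != "1",
    reason="set RUN_INTEGRATION=1 to run tests against real Azure resources",
)

@requires_azure
def test_blob_roundtrip():
    ...  # exercise blob.py against a real storage account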


🗂️ Data & Logs

  • transliterations/*.json: Generated intermediate & final outputs
  • transliterations/conversation_progress.jsonl: Progressive logging of dialogues across steps
  • app.log: Rotating / append log (add to .gitignore if it grows large)

Consider exporting curated datasets to Blob Storage via blob.py utilities for collaboration.


🏗️ Infrastructure as Code

Two parallel stacks are provided; pick one per deployment to avoid drift:

Bicep (Primary)

infra/bicep/main.bicep composes modules under infra/bicep/modules/.

Deploy (example):

$rg = "lang-create-rg"
$loc = "eastus"
az group create -n $rg -l $loc
az deployment group create -g $rg -f infra/bicep/main.bicep -p environment=dev

Terraform (Alternative)

In infra/terraform/:

cd infra/terraform
terraform init
terraform plan -out tfplan
terraform apply tfplan

Keep state secure (Azure Storage backend recommended); rotate secrets via Key Vault.


🔐 Security & Compliance

  • Never commit secrets—use Azure Key Vault + DefaultAzureCredential.
  • Enable audit logging on Storage & OpenAI usage.
  • Consider content filtering / abuse monitoring for generated text.
  • Isolate experimentation vs production resource groups.

🧭 Development Guidelines

  • Keep functions pure (no network / disk inside core logic where feasible).
  • Dataclasses for domain objects; serialize via helper functions.
  • Adopt “treatment” pattern for any experiment with prompt variants or temperature changes.
  • Limit PRs to < 400 changed lines (scaffold exceptions allowed).
  • Prefer incremental refactors (scaffold → adapters → pipelines → tests → hardening).

🗺️ Roadmap (Indicative)

  1. Solidify the text generation + transliteration evaluation loop
  2. Add a translation pipeline with back‑translation quality checks
  3. Introduce STT / TTS (batch mode) with caching & alignment metadata
  4. Retrieval augmentation / vector indexing for context grounding
  5. Web or CLI UX polish & model evaluation dashboards

🤝 Contributing

  1. Fork & branch from main (naming: feature/<slug> or fix/<slug>)
  2. Keep commits focused & descriptive
  3. Add / update tests for behavioral changes
  4. Run pytest -q before opening PR
  5. Provide brief rationale & screenshots / sample artefacts if relevant

See CONTRIBUTING.md for fuller guidance (if present / to be expanded).


📝 License

This project is licensed under the terms of the MIT License. See LICENSE.md.


🙋 FAQ (Quick Picks)

  • Why both Bicep & Terraform? Demonstrates equivalent IaC approaches—choose one to avoid drift.
  • Where are prompts stored? Centralized in prompts.py (consider externalizing to YAML later).
  • How do I add a new model? Add a deployment in Azure OpenAI, expose its name via env var / config, and update the settings class.
  • How do I persist artefacts? Use blob.py helpers with DefaultAzureCredential; keep local copies under transliterations/.

🔍 Troubleshooting

  • Credential error: not logged into the Azure CLI. Fix: run az login, then retry.
  • 403 on an OpenAI call: missing role assignment. Fix: assign the Cognitive Services OpenAI User role.
  • Slow runs: large context / high‑iteration prompts. Fix: reduce chunk size or parallelize treatments.
  • Terraform drift: mixed IaC usage. Fix: standardize on Bicep OR Terraform per environment.

✅ Summary

You now have the scaffolding to ingest, generate, evaluate, and iterate on low‑resource language artefacts with reproducible infrastructure. Extend by adding domain models, refining evaluation metrics, or integrating retrieval augmentation.

Feel free to open issues for missing docs or suggested improvements.


Maintained by the community with ❤️. Contributions welcome.
