YomnaWaleed/egyptian-rag-translator

🏛️ Egyptian RAG Translator

Translate Earlier Egyptian transliterations to English using a large language model and Retrieval-Augmented Generation (RAG).

Python 3.8+ License: MIT

📖 What is This?

This tool translates Ancient Egyptian transliterations (like ḥtp dj njswt) into English through a four-step pipeline:

  1. Normalizes the Egyptian text
  2. Searches a database of 9,000 expert translations for similar examples
  3. Translates to German using a large language model with context
  4. Converts the German to English
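
Step 2 can be pictured as a nearest-neighbour search over embedding vectors. The sketch below is only an illustration of that idea, assuming cosine similarity and a brute-force scan. The project's actual vector store and scoring function are not described here, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: List[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k corpus vectors closest to query."""
    scored = [(i, cosine_similarity(query, vec)) for i, vec in enumerate(corpus)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (hypothetical values, for illustration only).
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
query = [1.0, 0.05, 0.0]

for index, score in top_k(query, corpus, k=2):
    print(index, round(score, 3))
```

In a real setup the vectors would come from the embedding model and `k` would correspond to the number of retrieved examples passed to the LLM as context.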

Example:

Input:  ḥtp dj njswt
Output: A sacrifice given by the King.

⚡ Quick Start

Prerequisites

  • Python 3.8 or higher
  • An Ollama API key (Get one here)
  • 5GB free disk space

Ollama Model Setup (Required)

This project uses Ollama Cloud for LLM-based translation. Before running the system, you must download and enable the required model.

Step 1: Install Ollama

Download and install Ollama from: https://ollama.com

Verify installation:

ollama --version
Step 2: Pull the Required Model

Run the following command to download the model:

ollama pull qwen3-vl:235b-instruct-cloud

Installation

  1. Clone the repository:
git clone https://github.com/YomnaWaleed/egyptian-rag-translator.git
cd egyptian-rag-translator
  2. Create a virtual environment:
# Using uv (recommended - faster)
uv init
uv venv

# OR using standard Python
python -m venv .venv
  3. Activate the environment:

Windows:

.venv\Scripts\activate

Linux/Mac:

source .venv/bin/activate
  4. Install dependencies:
# Using uv (faster)
uv pip install -r requirements.txt

# OR using pip
pip install -r requirements.txt
  5. Configure the API key:

Create a .env file in the project root:

OLLAMA_API_KEY=your_api_key_here
  6. Set up the system (one command):
python setup.py

This will automatically:

  • Download the Egyptian dataset (~9,000 texts)
  • Process and clean the data
  • Generate AI embeddings (~30 minutes)
  • Build the search database

Note: The setup script skips any download or processing step whose output files already exist, so re-running it is safe.
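
The skip-if-present behaviour described in the note is a common pattern. The following is a minimal sketch of it, not the code in setup.py: `run_step_once`, `fake_download`, and the file path are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_step_once(output: Path, step: Callable[[Path], None]) -> bool:
    """Run step(output) only if output does not exist yet.

    Returns True if the step ran and False if it was skipped.
    """
    if output.exists():
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    step(output)
    return True


def fake_download(target: Path) -> None:
    # Stand-in for a real download: writes a small placeholder file.
    target.write_text("placeholder dataset\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "data" / "dataset.txt"  # hypothetical path
    first = run_step_once(dataset, fake_download)   # file missing, step runs
    second = run_step_once(dataset, fake_download)  # file present, step skipped
    print(first, second)
```

Checking only for the file's existence is the simplest choice. It does not detect a partially written file, so a more careful version might write to a temporary name and rename on success.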

🚀 Usage

Command Line

# Basic translation
python main.py "ḥtp dj njswt"

# Quick mode (hide processing details)
python main.py "ḥtp dj njswt" --no-details

Example output:

======================================================================
✅ TRANSLATION COMPLETE
======================================================================
🏛️ Egyptian:  ḥtp dj njswt
🔤 Normalized: htp dj njswt
🇩🇪 German:    Ein Opfer, das der König gibt.
🇬🇧 English:   A sacrifice given by the King.
======================================================================
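
The "Normalized" line in the output above shows the dotted h reduced to a plain h. One way to get that result, sketched here as an assumption about the normalization step rather than the project's actual code, is Unicode decomposition followed by removal of combining marks. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\u1e25" is h with dot below, the first letter of the example phrase.
print(strip_diacritics("\u1e25tp dj njswt"))  # prints: htp dj njswt
```

Letters that have no canonical decomposition pass through this function unchanged, so a full normalizer for transliteration would likely need an explicit mapping table as well.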

Python API

from src.pipeline.rag_pipeline import RAGPipeline

# Initialize the translator
pipeline = RAGPipeline()

# Translate
result = pipeline.translate("ḥtp dj njswt", show_details=False)

if result['success']:
    print(f"English: {result['english']}")
    print(f"German:  {result['german']}")
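
To translate several phrases in one go, the call above can be wrapped in a small loop. This is a sketch under assumptions: it relies only on the `translate(text, show_details=False)` call and the `success`, `english`, and `german` keys shown above, and it assumes a failed result may lack the translation keys. `translate_all` and `StubPipeline` are hypothetical names, and the stub exists only so the example runs without the real model or database.

```python
from typing import Any, Dict, Iterable, List


def translate_all(pipeline: Any, texts: Iterable[str]) -> List[Dict[str, Any]]:
    """Translate each text and record None for any translation that fails."""
    rows = []
    for text in texts:
        result = pipeline.translate(text, show_details=False)
        ok = bool(result.get("success"))
        rows.append({
            "input": text,
            "english": result["english"] if ok else None,
            "german": result["german"] if ok else None,
        })
    return rows


class StubPipeline:
    """Stand-in for RAGPipeline; returns tagged echoes instead of translations."""

    def translate(self, text: str, show_details: bool = True) -> Dict[str, Any]:
        if not text.strip():
            return {"success": False}
        return {"success": True, "english": f"<en:{text}>", "german": f"<de:{text}>"}


for row in translate_all(StubPipeline(), ["htp dj njswt", ""]):
    print(row)
```

With the real pipeline, passing `RAGPipeline()` in place of `StubPipeline()` should work as long as the result dictionary has the shape shown in the example above.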

Web User Interface

For a more user-friendly experience, launch the Gradio web UI:

python ui/app_gradio.py

Access at: http://localhost:7860

Features:

  • 🎹 Egyptian keyboard - Click to type special characters
  • 🔄 Real-time translation - Instant results
  • 🔍 Retrieved examples - See which similar texts were used
  • ⚙️ Integrated setup - Run setup from the UI
  • 📖 Example phrases - Try common Egyptian texts

Quick workflow:

  1. Open the UI in your browser
  2. Enter text: ḥtp dj njswt (type or use the keyboard)
  3. Click "🔄 Translate"
  4. View the German and English translations
  5. Expand "Retrieved Examples" to see the RAG context

See UI Guide for detailed instructions.

📊 Performance

The RAG system outperforms direct LLM translation on every metric measured:

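| Metric       | RAG System | LLM-Only | Difference (pp) | Relative Improvement |
|--------------|------------|----------|-----------------|----------------------|
| BLEU         | 23.70%     | 3.22%    | +20.48          | +636%                |
| ROUGE-1      | 53.93%     | 22.08%   | +31.85          | +144%                |
| ROUGE-2      | 36.53%     | 5.51%    | +31.02          | +563%                |
| ROUGE-L      | 52.31%     | 19.77%   | +32.54          | +165%                |
| METEOR       | 39.32%     | 12.83%   | +26.49          | +206%                |
| chrF         | 45.35%     | 17.34%   | +28.01          | +162%                |
| Exact Match  | 9.89%      | 0.00%    | +9.89           | ∞                    |
| Word Overlap | 43.36%     | 18.43%   | +24.93          | +135%                |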

Tested on 91 samples from the TLA dataset
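
The last two columns of the table follow from the first two: the difference is the RAG score minus the LLM-only score in percentage points, and the improvement is that difference relative to the LLM-only score. A short sketch of the arithmetic, using three rows from the table (the function name `compare` is hypothetical):

```python
import math
from typing import Tuple


def compare(rag: float, baseline: float) -> Tuple[float, float]:
    """Return (difference in percentage points, relative improvement in %)."""
    diff = rag - baseline
    if baseline == 0:
        relative = math.inf  # no finite ratio against a zero baseline
    else:
        relative = round(diff / baseline * 100)
    return round(diff, 2), relative


scores = {
    "BLEU": (23.70, 3.22),
    "ROUGE-L": (52.31, 19.77),
    "Exact Match": (9.89, 0.00),
}

for name, (rag, baseline) in scores.items():
    print(name, compare(rag, baseline))
```

The Exact Match row shows why the relative figure is reported as infinite: the LLM-only baseline scored 0.00%, so only the absolute difference is meaningful there.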

Why RAG is Better

  • ✅ Scores 9.89 to 32.54 percentage points higher on all eight metrics
  • ✅ Contextual understanding from 9,000 reference translations
  • ✅ Grammatical consistency through example matching
  • ✅ Reduced risk of hallucination, since output is grounded in real expert translations

🔧 Configuration

Edit .env to customize:

# Required
OLLAMA_API_KEY=your_key

# Optional (defaults shown)
LLM_MODEL=qwen3-vl:235b-instruct-cloud
EMBEDDING_MODEL=BAAI/bge-m3
TOP_K_RESULTS=30
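
A minimal sketch of how these settings could be read with the defaults listed above. It assumes the values from .env have already been loaded into the process environment (for example by a dotenv loader), and `load_settings` is a hypothetical helper, not a function from this repository.

```python
import os
from typing import Any, Dict, Mapping

# Defaults copied from the optional settings listed above.
DEFAULTS = {
    "LLM_MODEL": "qwen3-vl:235b-instruct-cloud",
    "EMBEDDING_MODEL": "BAAI/bge-m3",
    "TOP_K_RESULTS": "30",
}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the required key and optional overrides from an environment mapping."""
    api_key = env.get("OLLAMA_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OLLAMA_API_KEY not found")
    settings: Dict[str, Any] = {
        name: env.get(name, default) for name, default in DEFAULTS.items()
    }
    settings["TOP_K_RESULTS"] = int(settings["TOP_K_RESULTS"])  # stored as text
    settings["OLLAMA_API_KEY"] = api_key
    return settings


# Example with an explicit mapping instead of the real environment.
print(load_settings({"OLLAMA_API_KEY": "your_key", "TOP_K_RESULTS": "50"}))
```

Passing the environment as a parameter keeps the function easy to test, since a plain dictionary can stand in for `os.environ`.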

📚 Dataset

Uses the Thesaurus Linguae Aegyptiae (TLA) dataset:

  • 9,000+ Earlier Egyptian texts
  • Old Egyptian & Early Middle Egyptian periods
  • Expert-curated translations
  • Linguistic annotations (lemmas, POS tags, glossing)

Source: thesaurus-linguae-aegyptiae

❓ Troubleshooting

"OLLAMA_API_KEY not found"

Make sure you created a .env file with your API key.

"Dataset download failed"

Check your internet connection. The dataset is ~50MB.

"Embedding generation is slow"

This is normal - generating 9,000 embeddings takes ~30 minutes. It only runs once.

"Translation quality is poor"

  • Make sure setup.py completed successfully
  • Try increasing TOP_K_RESULTS in .env (default: 30)
  • Check that your Ollama API key is valid

🆘 Support

📄 License

MIT License - see LICENSE file for details.

🙏 Acknowledgments


Note: This is a research tool. For critical academic work, always verify translations with Egyptology experts.