Authors: Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain
- [2025.09] Paper accepted to ArabicNLP 2025 (Workshop at EMNLP 2025)!
- [2025.09] Released DiDeMo-AR dataset with 40,144 Arabic captions
- [2025.09] AutoArabic released
- 🌍 DiDeMo-AR: First Arabic video-text retrieval benchmark with 40,144 fluent MSA descriptions
- 🚀 High accuracy in automatic error detection for translation validation
- ⚡ 4x reduction in manual revision effort through intelligent LLM-based workflow
- 🔧 Modular framework supporting multiple LLM providers and languages
Video-text retrieval has been dominated by English benchmarks (DiDeMo, MSR-VTT, VATEX), leaving Arabic severely underserved. AutoArabic addresses this gap with a three-stage framework that:
- Translates non-Arabic benchmarks to Modern Standard Arabic using state-of-the-art LLMs
- Detects translation errors automatically
- Routes flagged samples to expert annotators for post-editing
Applied to DiDeMo, we created DiDeMo-AR, the first Arabic video retrieval benchmark.
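The three stages above can be sketched as a single loop: translate, check, and route. This is an illustrative sketch only — `localize_benchmark`, `translate`, and `detect_error` are placeholder names, not the actual AutoArabic API.

```python
# Illustrative sketch of the three-stage flow (function names are
# placeholders, not the actual AutoArabic API).
def localize_benchmark(captions, translate, detect_error):
    """Translate each caption, flag suspect outputs for human post-editing."""
    auto_accepted, needs_review = [], []
    for cap in captions:
        ar = translate(cap)                 # Stage 1: LLM translation to MSA
        if detect_error(cap, ar):           # Stage 2: automatic error detection
            needs_review.append((cap, ar))  # Stage 3: route to expert annotators
        else:
            auto_accepted.append((cap, ar))
    return auto_accepted, needs_review

# Toy example with mock stages:
mock_translate = lambda s: f"AR({s})"
mock_detect = lambda en, ar: "??" in en     # pretend ambiguous sources fail
ok, flagged = localize_benchmark(
    ["a dog runs", "what is this??"], mock_translate, mock_detect
)
```

Only the flagged subset reaches human annotators, which is where the reduction in manual revision effort comes from.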
📊 Comparison with Existing Benchmarks
| Dataset | Videos | Clip Length | Languages | Moment-level | Arabic |
|---|---|---|---|---|---|
| MSR-VTT | 10,000 | 15s | EN | ❌ | ❌ |
| VATEX | 41,250 | 10s | EN/ZH | ❌ | ❌ |
| DiDeMo | 10,464 | 30s | EN | ✅ | ❌ |
| RUDDER | 100k/lang | 5-10s | EN/ZH/FR/DE/RU | ❌ | ❌ |
| DiDeMo-AR | 10,464 | 30s | AR | ✅ | ✅ |
- Python ≥ 3.9; CUDA recommended.
- Install dependencies from `requirements.txt` in the repo root.
```bash
# Clone the repository
git clone https://github.com/Tahaalshatiri/AutoArabic.git
cd AutoArabic

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download DiDeMo videos (optional, ~50GB)
bash scripts/download_videos.sh --dataset didemo --output ./videos
```

Set API keys in the environment (or pass them as arguments): `OPENAI_API_KEY`, `GEMINI_API_KEY`.
Input format (CSV/JSON):
- CSV: one row per caption. Default columns are `desc` (EN text) and `video_id`; override with `desc_column` and `id_column`.
- JSON: a list of objects like `{"video_id": "...", "desc": "..."}`; override the JSON keys with the same `desc_column`/`id_column` arguments.
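Both formats normalize to the same (id, description) records. A minimal sketch of that normalization, assuming only the column-override behavior described above (`load_captions` is a hypothetical helper, not part of the AutoArabic API):

```python
import csv
import io
import json

def load_captions(text, fmt, desc_column="desc", id_column="video_id"):
    """Normalize CSV or JSON caption data to (id, desc) pairs.
    Sketch only; the real loader lives inside the pipeline."""
    if fmt == "json":
        rows = json.loads(text)                      # list of objects
    else:
        rows = list(csv.DictReader(io.StringIO(text)))  # one row per caption
    return [(r[id_column], r[desc_column]) for r in rows]

pairs = load_captions('[{"video_id": "v1", "desc": "a cat"}]', "json")
```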
The tool checkpoints progress every *k* samples (`save_every`) and can resume by passing the intermediate file again. If an ID column/key is provided, it is used to skip already-processed rows on resume.
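The checkpoint/resume behavior can be sketched as follows — skip IDs already present in the intermediate file, and flush partial results every `save_every` samples. This is a simplified sketch under the assumptions above, not the actual implementation; `translate_with_checkpoints` and `desc_ar` are hypothetical names.

```python
import json
import os

def translate_with_checkpoints(rows, translate, out_path,
                               save_every=5, id_key="video_id"):
    """Sketch of checkpoint/resume (not the actual AutoArabic code):
    previously written rows are skipped, and partial results are
    flushed to disk every `save_every` samples."""
    done = {}
    if os.path.exists(out_path):                 # resume from intermediate file
        with open(out_path) as f:
            done = {r[id_key]: r for r in json.load(f)}
    results = list(done.values())
    pending = (r for r in rows if r[id_key] not in done)
    for i, row in enumerate(pending):
        row = dict(row, desc_ar=translate(row["desc"]))
        results.append(row)
        if (i + 1) % save_every == 0:            # periodic checkpoint
            with open(out_path, "w") as f:
                json.dump(results, f, ensure_ascii=False)
    with open(out_path, "w") as f:               # final write
        json.dump(results, f, ensure_ascii=False)
    return results
```

Re-running with the same intermediate file performs no new translation calls for rows that were already processed.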
```python
from autoarabic import TranslationPipeline

pipeline = TranslationPipeline(
    provider="gemini",            # "gemini" | "openai" | "hf"
    model="gemini-2.0-flash",     # provider-specific model name when provider != "hf"
    hf_model=None,                # e.g., "unsloth/gemma-3-(1/4/12/27)b-it" when provider="hf"
    error_detector="gpt-4o",      # optional: "gpt-4o" | "gemini-2.0-flash" | "unsloth/gemma-3-(1/4/12/27)b-it"
    openai_key="...",             # or set via env OPENAI_API_KEY
    gemini_key="...",             # or set via env GEMINI_API_KEY
    batch_size=1,
    temperature=0.7,
    device="cuda",                # "cuda" | "cpu"; used when provider="hf"
    save_every=5,                 # checkpoint every k samples
    resume_from=None,             # resume from an intermediate JSON/CSV
)

results = pipeline.translate_dataset(
    input_file="path/to/english_captions.json",  # accepts CSV or JSON
    output_file="path/to/arabic_captions.json",
    enable_error_detection=True,
    desc_column="desc",       # default "desc"
    id_column="video_id",     # optional; used for resume/skip
)
print("processed:", results["total_processed"])
print("errors flagged:", results["errors_flagged"])
```

```python
from autoarabic.datasets import DiDeMoAR

dataset = DiDeMoAR(
    root_dir="./dataset/json",  # path to JSON files
    video_root="./videos",      # path to video files
    split="train",              # "train" | "val" | "test"
    language="ar",              # "ar" (localized) or "en" (original)
)
sample = dataset[0]
print(sample)
```

- Ensure DiDeMo videos exist at `--video_root` and the JSON data exists at `--data_root`.
- Run the full pipeline (training + evaluation):

```bash
python scripts/run_paper.py --data_root ./dataset --video_root ./videos --language ar --vit_size 16
```

- Configure language (ar/en), ViT size (16/32), epochs, and checkpoint paths as needed.
- Example hardware: 1× A100 (~20 min/epoch)
If you need the original DiDeMo download utilities, see the dataset’s official repository (contains JSONs and helper scripts). (link)
If you use AutoArabic or DiDeMo-AR in your research, please cite:
```bibtex
@misc{eltahir2025autoarabic,
  title         = {AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks},
  author        = {Mohamed Eltahir and Osamah Sarraj and Abdulrahman Alfrihidi and Taha Alshatiri and Mohammed Khurd and Mohammed Bremoo and Tanveer Hussain},
  year          = {2025},
  eprint        = {2509.16438},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2509.16438},
  url           = {https://arxiv.org/abs/2509.16438}
}
```