[ArabicNLP 2025] AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks

Authors: Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain

📰 News

  • [2025.09] Paper accepted to ArabicNLP 2025 (Workshop at EMNLP 2025)!
  • [2025.09] Released DiDeMo-AR dataset with 40,144 Arabic captions
  • [2025.09] AutoArabic released

🎯 Highlights

  • 🌍 DiDeMo-AR: First Arabic video-text retrieval benchmark with 40,144 fluent MSA descriptions
  • 🚀 High accuracy in automatic error detection for translation validation
  • 4x reduction in manual revision effort through intelligent LLM-based workflow
  • 🔧 Modular framework supporting multiple LLM providers and languages

🔍 Introduction

Video-text retrieval has been dominated by English benchmarks (DiDeMo, MSR-VTT, VATEX), leaving Arabic severely underserved. AutoArabic addresses this gap with a three-stage framework that:

  1. Translates non-Arabic benchmarks to Modern Standard Arabic using state-of-the-art LLMs
  2. Detects translation errors automatically
  3. Routes flagged samples to expert annotators for post-editing

Applied to DiDeMo, we created DiDeMo-AR, the first Arabic video retrieval benchmark.
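The three stages above can be sketched as a simple translate–detect–route loop. This is a toy illustration with hypothetical function names, not the released AutoArabic API; in practice stages 1 and 2 are LLM calls, stood in for here by placeholders:

```python
# Hypothetical sketch of the three-stage flow; names are illustrative,
# not the AutoArabic package API.

def translate(caption: str) -> str:
    """Stage 1: translate an English caption to MSA (an LLM call in practice)."""
    return f"[AR] {caption}"

def has_error(arabic: str) -> bool:
    """Stage 2: flag suspect translations (an LLM judge in practice)."""
    return len(arabic) < 8  # toy heuristic standing in for the detector

def route(captions):
    """Stage 3: split outputs into auto-accepted vs. sent-to-annotators."""
    accepted, flagged = [], []
    for c in captions:
        ar = translate(c)
        (flagged if has_error(ar) else accepted).append(ar)
    return accepted, flagged

accepted, flagged = route(["a dog runs across the yard", "hi"])
```

Only the flagged subset goes to human post-editing, which is where the reduction in manual revision effort comes from.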

📊 Comparison with Existing Benchmarks
| Dataset    | Videos    | Clip Length | Languages      | Moment-level | Arabic |
|------------|-----------|-------------|----------------|--------------|--------|
| MSR-VTT    | 10,000    | 15s         | EN             | —            | —      |
| VATEX      | 41,250    | 10s         | EN/ZH          | —            | —      |
| DiDeMo     | 10,464    | 30s         | EN             | ✓            | —      |
| RUDDER     | 100k/lang | 5-10s       | EN/ZH/FR/DE/RU | ✓            | —      |
| DiDeMo-AR  | 10,464    | 30s         | AR             | ✓            | ✓      |

⚙️ Installation

Requirements

  • Python ≥ 3.9, CUDA (recommended).
  • Install from requirements.txt in the repo root.

Setup

# Clone the repository
git clone https://github.com/Tahaalshatiri/AutoArabic.git
cd AutoArabic

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download DiDeMo videos (optional, ~50GB)
bash scripts/download_videos.sh --dataset didemo --output ./videos

🚀 Quick Start

A) Translation & Error Detection on new data

Set API keys in environment (or pass as args): OPENAI_API_KEY, GEMINI_API_KEY.

Input format (CSV/JSON):

  • CSV: one row per caption. Default columns are desc (EN text) and video_id. You can override with desc_column and id_column.
  • JSON: list of objects like {"video_id": "...", "desc": "..."}. You can override the JSON keys with the same desc_column / id_column arguments.
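A loader honoring the defaults above might normalize both formats into one row shape. This is a sketch of the documented input contract (`desc`/`video_id` defaults, overridable column names), not the package's internal reader:

```python
import csv
import json
from pathlib import Path

def load_captions(path, desc_column="desc", id_column="video_id"):
    """Normalize CSV or JSON input into a list of {'video_id', 'desc'} dicts.
    Sketch only; mirrors the documented defaults, not AutoArabic internals."""
    p = Path(path)
    if p.suffix.lower() == ".json":
        rows = json.loads(p.read_text(encoding="utf-8"))
    else:  # assume CSV with a header row
        with p.open(newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
    return [{"video_id": r.get(id_column), "desc": r[desc_column]} for r in rows]
```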

The tool checkpoints progress every k samples and can resume by passing the intermediate file again. If an ID column/key is provided, we use it to skip already-processed rows on resume.

from autoarabic import TranslationPipeline

pipeline = TranslationPipeline(
    provider="gemini",                 # "gemini" | "openai" | "hf"
    model="gemini-2.0-flash",         # provider-specific model name when provider != "hf"
    hf_model=None,                    # e.g., "unsloth/gemma-3-(1/4/12/27)b-it" when provider="hf"
    error_detector="gpt-4o",          # optional: "gpt-4o" | "gemini-2.0-flash" | unsloth/gemma-3-(1/4/12/27)b-it
    openai_key="...",                 # or set via env OPENAI_API_KEY
    gemini_key="...",                 # or set via env GEMINI_API_KEY
    batch_size=1,
    temperature=0.7,
    device="cuda",                    # "cuda" | "cpu", used when provider="hf"
    save_every=5,                     # checkpoint every k samples
    resume_from=None                  # resume from intermediate JSON/CSV
)

results = pipeline.translate_dataset(
    input_file="path/to/english_captions.json",  # accepts CSV or JSON
    output_file="path/to/arabic_captions.json",
    enable_error_detection=True,
    desc_column="desc",               # default "desc"
    id_column="video_id"              # optional; used for resume/skip
)

print("processed:", results["total_processed"])
print("errors flagged:", results["errors_flagged"])
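The checkpoint-and-resume behaviour described above (skip rows whose ID already appears in the intermediate file) can be sketched as follows; these helpers are hypothetical, not the package's internals:

```python
import json
from pathlib import Path

def resume_ids(intermediate_file, id_key="video_id"):
    """Collect IDs already present in an intermediate JSON output so a
    rerun can skip them. Sketch of the documented resume behaviour."""
    p = Path(intermediate_file)
    if not p.exists():
        return set()
    return {row[id_key] for row in json.loads(p.read_text(encoding="utf-8"))}

def pending(samples, done_ids, id_key="video_id"):
    """Return only the samples not yet processed."""
    return [s for s in samples if s[id_key] not in done_ids]
```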

B) Using DiDeMo/DiDeMo-AR dataset

from autoarabic.datasets import DiDeMoAR

dataset = DiDeMoAR(
    root_dir="./dataset/json",       # path to JSON files
    video_root="./videos",           # path to video files
    split="train",                   # "train" | "val" | "test"
    language="ar",                   # "ar" (localized) or "en" (original)
)

sample = dataset[0]
print(sample)
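Since DiDeMoAR is indexed like a standard map-style dataset (anything with `__len__` and `__getitem__`), simple batching needs nothing framework-specific. A generic sketch, with a toy stand-in for the dataset class:

```python
def batched(dataset, batch_size):
    """Yield lists of consecutive samples from any map-style dataset.
    Generic sketch; works for anything with __len__ and __getitem__."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

class ToyDataset:
    """Toy stand-in for DiDeMoAR, showing only the indexing interface."""
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

batches = list(batched(ToyDataset(list(range(5))), 2))
```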

C) Reproducing the Paper’s CLIP Results

  1. Ensure the DiDeMo videos exist under "video_root" and the JSON-format annotations exist under "data_root"
  2. Run the full pipeline (training + evaluation):

python scripts/run_paper.py --data_root ./dataset --video_root ./videos --language ar --vit_size 16

  • Configure the language (ar/en), ViT size (16/32), epochs, and checkpoint paths as needed.
  • Example hardware: 1× A100 (~20 min/epoch).

If you need the original DiDeMo download utilities, see the dataset’s official repository (contains JSONs and helper scripts). (link)

✅ Performance

(figure: retrieval performance results)

📝 Citation

If you use AutoArabic or DiDeMo-AR in your research, please cite:

@misc{eltahir2025autoarabic,
  title        = {AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks},
  author       = {Mohamed Eltahir and Osamah Sarraj and Abdulrahman Alfrihidi and Taha Alshatiri and Mohammed Khurd and Mohammed Bremoo and Tanveer Hussain},
  year         = {2025},
  eprint       = {2509.16438},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  doi          = {10.48550/arXiv.2509.16438},
  url          = {https://arxiv.org/abs/2509.16438}
}


ArabicNLP 2025     KAUST Academy     KAUST
