Authors: Mohamed Eltahir, Osamah Sarraj, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammed Khurd, Mohammed Bremoo, Tanveer Hussain
- [2025.09] Paper accepted to ArabicNLP 2025 (Workshop at EMNLP 2025)!
- [2025.09] Released DiDeMo-AR dataset with 40,144 Arabic captions
- [2025.09] AutoArabic released
- 🌍 DiDeMo-AR: First Arabic video-text retrieval benchmark with 40,144 fluent MSA descriptions
- 🚀 High accuracy in automatic error detection for translation validation
- ⚡ 4x reduction in manual revision effort through intelligent LLM-based workflow
- 🔧 Modular framework supporting multiple LLM providers and languages
Video-text retrieval has been dominated by English benchmarks (DiDeMo, MSR-VTT, VATEX), leaving Arabic severely underserved. AutoArabic addresses this gap with a three-stage framework that:
- Translates non-Arabic benchmarks to Modern Standard Arabic using state-of-the-art LLMs
- Detects translation errors automatically
- Routes flagged samples to expert annotators for post-editing
Applied to DiDeMo, we created DiDeMo-AR, the first Arabic video retrieval benchmark.
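The three stages above can be sketched as a single loop: translate, check, and route. This is an illustrative sketch only — `localize_benchmark`, `translate`, and `detect_error` are placeholder names, not the actual AutoArabic API.

```python
# Illustrative sketch of the three-stage flow (function names are
# placeholders, not the actual AutoArabic API).
def localize_benchmark(captions, translate, detect_error):
    """Translate each caption, flag suspect outputs for human post-editing."""
    auto_accepted, needs_review = [], []
    for cap in captions:
        ar = translate(cap)                 # Stage 1: LLM translation to MSA
        if detect_error(cap, ar):           # Stage 2: automatic error detection
            needs_review.append((cap, ar))  # Stage 3: route to expert annotators
        else:
            auto_accepted.append((cap, ar))
    return auto_accepted, needs_review

# Toy example with mock stages:
mock_translate = lambda s: f"AR({s})"
mock_detect = lambda en, ar: "??" in en     # pretend ambiguous sources fail
ok, flagged = localize_benchmark(
    ["a dog runs", "what is this??"], mock_translate, mock_detect
)
```

Only the flagged subset reaches human annotators, which is where the reduction in manual revision effort comes from.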
📊 Comparison with Existing Benchmarks
| Dataset | Videos | Clip Length | Languages | Moment-level | Arabic |
|---|---|---|---|---|---|
| MSR-VTT | 10,000 | 15s | EN | ❌ | ❌ |
| VATEX | 41,250 | 10s | EN/ZH | ❌ | ❌ |
| DiDeMo | 10,464 | 30s | EN | ✅ | ❌ |
| RUDDER | 100k/lang | 5-10s | EN/ZH/FR/DE/RU | ❌ | ❌ |
| DiDeMo-AR | 10,464 | 30s | AR | ✅ | ✅ |
- Python ≥ 3.9; CUDA recommended.
- Install dependencies from `requirements.txt` in the repo root.
```bash
# Clone the repository
git clone https://github.com/Tahaalshatiri/AutoArabic.git
cd AutoArabic

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download DiDeMo videos (optional, ~50GB)
bash scripts/download_videos.sh --dataset didemo --output ./videos
```

Set API keys in the environment (or pass them as arguments): `OPENAI_API_KEY`, `GEMINI_API_KEY`.
Input format (CSV/JSON):
- CSV: one row per caption. Default columns are `desc` (EN text) and `video_id`; override with `desc_column` and `id_column`.
- JSON: a list of objects like `{"video_id": "...", "desc": "..."}`; override the JSON keys with the same `desc_column`/`id_column` arguments.
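Both formats normalize to the same (id, description) records. A minimal sketch of that normalization, assuming only the column-override behavior described above (`load_captions` is a hypothetical helper, not part of the AutoArabic API):

```python
import csv
import io
import json

def load_captions(text, fmt, desc_column="desc", id_column="video_id"):
    """Normalize CSV or JSON caption data to (id, desc) pairs.
    Sketch only; the real loader lives inside the pipeline."""
    if fmt == "json":
        rows = json.loads(text)                      # list of objects
    else:
        rows = list(csv.DictReader(io.StringIO(text)))  # one row per caption
    return [(r[id_column], r[desc_column]) for r in rows]

pairs = load_captions('[{"video_id": "v1", "desc": "a cat"}]', "json")
```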
The tool checkpoints progress every *k* samples (`save_every`) and can resume by passing the intermediate file again. If an ID column/key is provided, it is used to skip already-processed rows on resume.
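The checkpoint/resume behavior can be sketched as follows — skip IDs already present in the intermediate file, and flush partial results every `save_every` samples. This is a simplified sketch under the assumptions above, not the actual implementation; `translate_with_checkpoints` and `desc_ar` are hypothetical names.

```python
import json
import os

def translate_with_checkpoints(rows, translate, out_path,
                               save_every=5, id_key="video_id"):
    """Sketch of checkpoint/resume (not the actual AutoArabic code):
    previously written rows are skipped, and partial results are
    flushed to disk every `save_every` samples."""
    done = {}
    if os.path.exists(out_path):                 # resume from intermediate file
        with open(out_path) as f:
            done = {r[id_key]: r for r in json.load(f)}
    results = list(done.values())
    pending = (r for r in rows if r[id_key] not in done)
    for i, row in enumerate(pending):
        row = dict(row, desc_ar=translate(row["desc"]))
        results.append(row)
        if (i + 1) % save_every == 0:            # periodic checkpoint
            with open(out_path, "w") as f:
                json.dump(results, f, ensure_ascii=False)
    with open(out_path, "w") as f:               # final write
        json.dump(results, f, ensure_ascii=False)
    return results
```

Re-running with the same intermediate file performs no new translation calls for rows that were already processed.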
```python
from autoarabic import TranslationPipeline

pipeline = TranslationPipeline(
    provider="gemini",            # "gemini" | "openai" | "hf"
    model="gemini-2.0-flash",     # provider-specific model name when provider != "hf"
    hf_model=None,                # e.g., "unsloth/gemma-3-(1/4/12/27)b-it" when provider="hf"
    error_detector="gpt-4o",      # optional: "gpt-4o" | "gemini-2.0-flash" | "unsloth/gemma-3-(1/4/12/27)b-it"
    openai_key="...",             # or set via env OPENAI_API_KEY
    gemini_key="...",             # or set via env GEMINI_API_KEY
    batch_size=1,
    temperature=0.7,
    device="cuda",                # "cuda" | "cpu"; used when provider="hf"
    save_every=5,                 # checkpoint every k samples
    resume_from=None,             # resume from an intermediate JSON/CSV
)

results = pipeline.translate_dataset(
    input_file="path/to/english_captions.json",  # accepts CSV or JSON
    output_file="path/to/arabic_captions.json",
    enable_error_detection=True,
    desc_column="desc",       # default "desc"
    id_column="video_id",     # optional; used for resume/skip
)
print("processed:", results["total_processed"])
print("errors flagged:", results["errors_flagged"])
```

```python
from autoarabic.datasets import DiDeMoAR

dataset = DiDeMoAR(
    root_dir="./dataset/json",  # path to JSON files
    video_root="./videos",      # path to video files
    split="train",              # "train" | "val" | "test"
    language="ar",              # "ar" (localized) or "en" (original)
)
sample = dataset[0]
print(sample)
```

- Ensure DiDeMo videos exist at `--video_root` and the JSON data exists at `--data_root`.
- Run the full pipeline (training + evaluation):

```bash
python scripts/run_paper.py --data_root ./dataset --video_root ./videos --language ar --vit_size 16
```

- Configure language (ar/en), ViT size (16/32), epochs, and checkpoint paths as needed.
- Example hardware: 1× A100 (~20 min/epoch)
If you need the original DiDeMo download utilities, see the dataset’s official repository (contains JSONs and helper scripts). (link)
If you use AutoArabic or DiDeMo-AR in your research, please cite:
```bibtex
@misc{eltahir2025autoarabic,
  title         = {AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks},
  author        = {Mohamed Eltahir and Osamah Sarraj and Abdulrahman Alfrihidi and Taha Alshatiri and Mohammed Khurd and Mohammed Bremoo and Tanveer Hussain},
  year          = {2025},
  eprint        = {2509.16438},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2509.16438},
  url           = {https://arxiv.org/abs/2509.16438}
}
```