## Short description

This repository implements a compact, efficient Retrieval-Augmented Generation (RAG) pipeline that answers Arabic multiple-choice questions on Islamic knowledge (theology, jurisprudence, biography, ethics). It combines domain-specific embeddings, a vector-store retriever, and a lightweight Large Language Model (LLM) generator to produce grounded, high-accuracy answers.
## Paper / Reference

The approach and experimental results are described in the ACL ArabicNLP 2025 proceedings. You can read the full paper here: Paper Link
- Preprocessing for Arabic classical texts (diacritics removal, Tatweel removal, normalization, chunking)
- Embedding-based retrieval using Muffakir_Embedding (768-dim)
- Vector store: Qdrant (HNSW, cosine similarity)
- Generator: Gemini 2.5 Flash Lite (or any compatible LLM)
- Simple, reproducible configuration that achieved 87% accuracy on the QIAS 2025 test set (ranked 5th / 10 teams)
- Source documents: curated collection of classical Islamic reference works (Usul al-Fiqh, Al-Itqan, Sirah, Tafsir collections, etc.).
- Cleaning steps: remove diacritics, Tatweel, emails, and URLs; normalize whitespace and punctuation.
- Chunking: 400-character chunks with 100-character overlap (configurable).
- Ingestion: load canonical texts, clean, chunk, annotate with metadata.
- Embedding: compute Muffakir_Embedding (768-dim) per chunk.
- Indexing: store vectors in Qdrant using an HNSW index with the recommended parameters.
- Retrieval: embed the query, then run a Top-K (K=10) cosine-similarity search.
- Prompt construction: assemble Top-K chunks with the question and choices into a structured prompt.
- Generation: feed prompt to the LLM (Gemini 2.5 Flash Lite) and parse the returned answer.
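The cleaning, chunking, and prompt-construction steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: function names, the diacritic regex ranges, and the prompt wording are all assumptions.

```python
import re

# Unicode ranges are illustrative: Arabic tashkeel marks and the kashida.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = re.compile(r"\u0640")
NOISE = re.compile(r"\S+@\S+|https?://\S+")  # emails and URLs

def clean(text: str) -> str:
    """Apply the cleaning steps listed above."""
    text = DIACRITICS.sub("", text)
    text = TATWEEL.sub("", text)
    text = NOISE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def chunk(text: str, size: int = 400, overlap: int = 100) -> list[str]:
    """Split into fixed-size character chunks with the given overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_prompt(question: str, choices: list[str], chunks: list[str]) -> str:
    """Assemble retrieved chunks, the question, and the choices into one prompt."""
    context = "\n\n".join(chunks)
    options = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return (f"السياق:\n{context}\n\nالسؤال: {question}\n{options}\n"
            "أجب بحرف الخيار الصحيح فقط.")
```

With size=400 and overlap=100, consecutive chunks share their last/first 100 characters, so a fact split at a chunk boundary still appears whole in at least one chunk.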
- Chunk size: 400 characters
- Overlap: 100 characters
- Top-K retrieval: 10
- Embedding dim: 768 (Muffakir_Embedding)
- Vector count (example): ~15,000
- Qdrant / HNSW parameters: `m=64, ef_construct=1024, full_scan_threshold=0`
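Under these settings, creating the Qdrant collection and running the Top-K search might look like the configuration sketch below (using the `qdrant-client` Python package). The collection name, client location, and zero query vector are placeholders; in the real pipeline the query vector comes from Muffakir_Embedding.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(":memory:")  # swap for QdrantClient(url=...) in deployment

client.create_collection(
    collection_name="islamic_rag",  # placeholder name
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=64, ef_construct=1024, full_scan_threshold=0),
)

# Retrieval: embed the query with the same model, then Top-K cosine search.
query_vector = [0.0] * 768  # placeholder for the Muffakir embedding of the query
hits = client.search(
    collection_name="islamic_rag",
    query_vector=query_vector,
    limit=10,  # Top-K = 10
)
top_chunks = [h.payload["text"] for h in hits]
```

Setting `full_scan_threshold=0` forces HNSW graph search even for small collections, and the relatively high `m`/`ef_construct` trade indexing time for recall, which matters more than build speed at ~15,000 vectors.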
- Test set accuracy: 87.00% (official test set, QIAS 2025 Subtask 2)
- Ranking: 5th out of 10 teams
- Performance by difficulty: Beginner (≈89.1%), Intermediate (≈83.4%), Advanced (≈75.4%).
- Muffakir embeddings were a major factor in performance; try alternative domain-tuned embeddings if available.
- Reranking did not always help; direct retrieval + LLM gave more reliable results in the reported experiments.
- Use overlap and chunk-size tuning to trade off retrieval granularity vs. context size.
Original paper authors: Mohamed Samy, Mayar Boghdady, Marwan El Adawi, Mohamed Nassar, Ensaf Hussein.
For questions about this implementation or the paper, contact: mohamedsamyy02@gmail.com.