Implementation of "Pre-Trained Transformer-Based Approach for Arabic Question Answering" using AraBERT and AraELECTRA models.
- Project Overview
- Project Structure
- Models
- Datasets
- Data Pipeline
- Training Configuration
- Evaluation Metrics
- Usage
- Requirements
This project fine-tunes Arabic pre-trained transformer models for the Extractive Question Answering task. Given a question and a context passage, the model predicts the answer span within the context.
Task Type: Extractive QA (Span Prediction)
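For illustration, a single (hypothetical) example of the task looks like the sketch below: the answer is always a contiguous span copied verbatim from the context.

```python
# Hypothetical extractive-QA example (values invented for illustration)
question = "متى تأسست الجامعة؟"                 # "When was the university founded?"
context  = "تأسست الجامعة عام 1962 في الرباط."  # "The university was founded in 1962 in Rabat."
answer   = "عام 1962"                            # a contiguous span of the context
```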
```
projet3_Q&A/
├── dataset/ # Processed dataset files
├── doc/ # Documentation files
├── models/ # Saved models
├── notebooks/ # Jupyter notebooks for training and analysis
│ ├── AraELECTRA.ipynb
│ ├── bert-base-arabertv2.ipynb
│ ├── cleaning.ipynb
│ ├── data_pipeline.ipynb
│ ├── exploration.ipynb
│ └── preprocessing.ipynb
├── unprocessed_data/ # Raw source data
├── utils/ # Utility scripts
│ ├── clean_text.py
│ ├── extract_text_from_file.py
│ └── merged_data.py
├── app.py # Main application entry point
├── requirements.txt # Project dependencies
└── README.md # Project documentation
```
AraBERTv2-base:

| Attribute | Value |
|---|---|
| Architecture | BERT-base (Bidirectional Transformer Encoder) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Parameters | ~110M |
| Max Sequence Length | 512 |
| Preprocessing | Requires Farasa Segmentation |
| Checkpoint | aubmindlab/bert-base-arabertv2 |
AraBERTv0.2-large:

| Attribute | Value |
|---|---|
| Architecture | BERT-large |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Parameters | ~336M |
| Max Sequence Length | 512 |
| Preprocessing | None |
| Checkpoint | aubmindlab/bert-large-arabertv02 |
AraELECTRA-base:

| Attribute | Value |
|---|---|
| Architecture | ELECTRA (Discriminator) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Parameters | ~136M |
| Max Sequence Length | 512 |
| Training Method | Replaced Token Detection (RTD) |
| Preprocessing | None |
| Checkpoint | aubmindlab/araelectra-base-discriminator |
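All three checkpoints are published on the Hugging Face Hub, so each can be loaded with a span-prediction head in the same way; a minimal sketch using the `transformers` Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Any of the three checkpoints listed above can be substituted here
checkpoint = "aubmindlab/araelectra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)  # adds a QA head on top of the encoder
```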
The project merges multiple Arabic QA datasets:
| Dataset | Format | Description |
|---|---|---|
| Arabic-SQuAD | JSON | Arabic translation of SQuAD |
| ARCD | JSON | Arabic Reading Comprehension Dataset |
| AQAD | JSON | Arabic Question Answering Dataset |
| TyDiQA-GoldP | JSONL | Typologically Diverse QA (Arabic subset) |
All datasets are normalized to:
| Column | Description |
|---|---|
| `id` | Unique identifier |
| `question` | Question text |
| `answer` | Answer text |
| `context` | Passage/context containing the answer |
| `title` | Document title (optional) |
| `category` | Category/label (optional) |
| `source_file` | Original source file |
Combines all source files into `master_data.csv` (a minimal sketch follows the list):
- Loads JSON, JSONL, CSV formats
- Normalizes column names
- Extracts answers from nested structures
- Removes duplicates
- Removes null values
- Drops metadata columns (keeping question, answer)
- Outputs `data_cleaned.csv`
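A minimal sketch of the merge step, assuming each raw file has already been parsed into a Polars dataframe that uses the normalized columns listed above:

```python
import polars as pl

# Hypothetical merge step over already-normalized source dataframes
COLUMNS = ["id", "question", "answer", "context", "title", "category", "source_file"]

def merge_sources(frames: list[pl.DataFrame]) -> pl.DataFrame:
    merged = pl.concat([f.select(COLUMNS) for f in frames])
    merged = merged.unique(subset=["question", "context", "answer"])      # remove duplicates
    merged = merged.drop_nulls(subset=["question", "context", "answer"])  # remove null values
    return merged

# merge_sources([...]).write_csv("master_data.csv")
```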
Text preprocessing function:
```python
import re
import emoji

def preprocess_arabic_text(text):
    # 1. Remove emojis
    text = emoji.replace_emoji(text, replace="")
    # 2. Remove HTML tags (exception: TyDiQA preserves HTML)
    text = re.sub(r'<.*?>', '', text)
    # 3. Replace URLs, emails, and mentions with placeholder tokens
    #    (simplified patterns; the notebook may use stricter regexes)
    text = re.sub(r'http[s]?://\S+', '[URL]', text)
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
    text = re.sub(r'@\w+', '[MENTION]', text)
    # 4. Remove Arabic diacritics & tatweel
    text = re.sub(r'[\u064B-\u0652\u0640]', '', text)
    # 5. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
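Assuming the merged data is held in a Polars dataframe `df`, the function can be applied to each text column; a hypothetical sketch (recent Polars versions; older ones use `.apply` instead of `.map_elements`):

```python
# Hypothetical column-wise application of the cleaning function
for col in ("question", "answer", "context"):
    df = df.with_columns(
        pl.col(col).map_elements(preprocess_arabic_text, return_dtype=pl.Utf8)
    )
```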
Filters garbage patterns:

```python
import polars as pl

garbage_pattern = r"array\(|dtype=|\{'text':|\\[|\\]"
df = df.filter(~pl.col("answer").str.contains(garbage_pattern))
```

Outputs: `data_preprocessed.csv`
| Parameter | Value |
|---|---|
| Learning Rate | 3e-5 |
| Batch Size | 4 |
| Epochs | 3 (merged dataset) / 4 (single dataset) |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Max Sequence Length | 384 |
| Max Question Length | 64 |
| Doc Stride | 128 |
| Max Answer Length | 30 |
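These hyper-parameters correspond roughly to the following Hugging Face `TrainingArguments` (a sketch, not the exact notebook code; the output directory is a placeholder, and AdamW with weight decay is the `Trainer` default optimizer):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="final_models/AraELECTRA-base",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,        # 4 when fine-tuning on a single dataset
    weight_decay=0.01,
)
```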
[CLS] Question [SEP] Context [SEP]
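A sketch of how a question/context pair is encoded into that layout, using the sliding-window settings from the table above (max sequence length 384, doc stride 128):

```python
# Long contexts are split into overlapping windows; only the context side is truncated
encoded = tokenizer(
    question,
    context,
    max_length=384,
    truncation="only_second",
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,   # maps tokens back to character positions for span extraction
    padding="max_length",
)
```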
AraBERTv2-base Only:
```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv2")
text = prep.preprocess(text)  # Applies Farasa segmentation
```

Binary metric: 1 if prediction exactly matches ground truth, else 0.
```python
def compute_exact(prediction, ground_truth):
    return int(normalize(prediction) == normalize(ground_truth))
```
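Here `normalize` is the usual SQuAD-style answer normalization; a minimal Arabic-aware sketch (the exact normalization used in the notebooks may differ):

```python
import re

def normalize(text):
    # Strip diacritics/tatweel, drop punctuation, collapse whitespace, lowercase Latin characters
    text = re.sub(r'[\u064B-\u0652\u0640]', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return ' '.join(text.split()).lower()
```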
Token-level harmonic mean of precision and recall:

```python
from collections import Counter

def compute_f1(prediction, ground_truth):
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    precision = num_same / len(pred_tokens) if pred_tokens else 0
    recall = num_same / len(gold_tokens) if gold_tokens else 0
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
```

Note: For multiple ground truth answers, take the maximum score.
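When an example has several reference answers, the per-example score is therefore the maximum over all of them; a small helper sketch (`metric_max_over_ground_truths` is a hypothetical name):

```python
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    # metric_fn is compute_exact or compute_f1
    return max(metric_fn(prediction, gt) for gt in ground_truths)
```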
```bash
pip install transformers datasets torch polars arabert emoji farasapy accelerate
```

Run notebooks in order:
1. `preprocessing.ipynb` → Creates `data_cleaned.csv`
2. `cleaning.ipynb` → Creates `data_preprocessed.csv`
Open main.ipynb and run all cells. To switch models, modify:
```python
# Choose one:
train_pipeline("AraBERTv2-base")      # Requires Farasa
train_pipeline("AraBERTv0.2-large")   # Large model, needs GPU memory
train_pipeline("AraELECTRA-base")     # Fast training
```

Trained models saved to:
```
final_models/
├── AraBERTv2-base/
├── AraBERTv0.2-large/
└── AraELECTRA-base/
```
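A saved checkpoint can then be loaded for inference with the standard `question-answering` pipeline; a sketch assuming the directory layout above (for AraBERTv2-base, run the question and context through the same `ArabertPreprocessor` first):

```python
from transformers import pipeline

# Path assumed from the saved-model layout above; question/context are illustrative
qa = pipeline("question-answering", model="final_models/AraELECTRA-base")
result = qa(question="متى تأسست الجامعة؟", context="تأسست الجامعة عام 1962 في الرباط.")
print(result["answer"], result["score"])
```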
Contents of `requirements.txt`:

```
torch>=1.10
transformers>=4.20
datasets>=2.0
polars>=0.15
arabert
emoji
farasapy
accelerate
scikit-learn
```
Hardware:
- GPU recommended (CUDA)
- AraBERT-large requires ~16GB VRAM
This project is for educational purposes.