Implementation of "Pre-Trained Transformer-Based Approach for Arabic Question Answering" using AraBERT and AraELECTRA models.
- Project Overview
- Project Structure
- Models
- Datasets
- Data Pipeline
- Training Configuration
- Evaluation Metrics
- Usage
- Requirements
This project fine-tunes Arabic pre-trained transformer models for the Extractive Question Answering task. Given a question and a context passage, the model predicts the answer span within the context.
Task Type: Extractive QA (Span Prediction)
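For illustration, a single (hypothetical) example of the task looks like the sketch below: the answer is always a contiguous span copied verbatim from the context.

```python
# Hypothetical extractive-QA example (values invented for illustration)
question = "متى تأسست الجامعة؟"                 # "When was the university founded?"
context  = "تأسست الجامعة عام 1962 في الرباط."  # "The university was founded in 1962 in Rabat."
answer   = "عام 1962"                            # a contiguous span of the context
```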
```
projet3_Q&A/
├── dataset/ # Processed dataset files
├── doc/ # Documentation files
├── models/ # Saved models
├── notebooks/ # Jupyter notebooks for training and analysis
│ ├── AraELECTRA.ipynb
│ ├── bert-base-arabertv2.ipynb
│ ├── cleaning.ipynb
│ ├── data_pipeline.ipynb
│ ├── exploration.ipynb
│ └── preprocessing.ipynb
├── unprocessed_data/ # Raw source data
├── utils/ # Utility scripts
│ ├── clean_text.py
│ ├── extract_text_from_file.py
│ └── merged_data.py
├── app.py # Main application entry point
├── requirements.txt # Project dependencies
└── README.md # Project documentation
```
AraBERTv2-base:

| Attribute | Value |
|---|---|
| Architecture | BERT-base (Bidirectional Transformer Encoder) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Parameters | ~110M |
| Max Sequence Length | 512 |
| Preprocessing | Requires Farasa Segmentation |
| Checkpoint | aubmindlab/bert-base-arabertv2 |
AraBERTv0.2-large:

| Attribute | Value |
|---|---|
| Architecture | BERT-large |
| Layers | 24 |
| Hidden Size | 1024 |
| Attention Heads | 16 |
| Parameters | ~336M |
| Max Sequence Length | 512 |
| Preprocessing | None |
| Checkpoint | aubmindlab/bert-large-arabertv02 |
AraELECTRA-base:

| Attribute | Value |
|---|---|
| Architecture | ELECTRA (Discriminator) |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Parameters | ~136M |
| Max Sequence Length | 512 |
| Training Method | Replaced Token Detection (RTD) |
| Preprocessing | None |
| Checkpoint | aubmindlab/araelectra-base-discriminator |
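All three checkpoints are published on the Hugging Face Hub, so each can be loaded with a span-prediction head in the same way; a minimal sketch using the `transformers` Auto classes:

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Any of the three checkpoints listed above can be substituted here
checkpoint = "aubmindlab/araelectra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)  # adds a QA head on top of the encoder
```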
The project merges multiple Arabic QA datasets:
| Dataset | Format | Description |
|---|---|---|
| Arabic-SQuAD | JSON | Arabic translation of SQuAD |
| ARCD | JSON | Arabic Reading Comprehension Dataset |
| AQAD | JSON | Arabic Question Answering Dataset |
| TyDiQA-GoldP | JSONL | Typologically Diverse QA (Arabic subset) |
All datasets are normalized to:
| Column | Description |
|---|---|
| `id` | Unique identifier |
| `question` | Question text |
| `answer` | Answer text |
| `context` | Passage/context containing the answer |
| `title` | Document title (optional) |
| `category` | Category/label (optional) |
| `source_file` | Original source file |
Combines all source files into `master_data.csv` (a minimal sketch follows the list):
- Loads JSON, JSONL, CSV formats
- Normalizes column names
- Extracts answers from nested structures
- Removes duplicates
- Removes null values
- Drops metadata columns (keeping question, answer)
- Outputs `data_cleaned.csv`
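A minimal sketch of the merge step, assuming each raw file has already been parsed into a Polars dataframe that uses the normalized columns listed above:

```python
import polars as pl

# Hypothetical merge step over already-normalized source dataframes
COLUMNS = ["id", "question", "answer", "context", "title", "category", "source_file"]

def merge_sources(frames: list[pl.DataFrame]) -> pl.DataFrame:
    merged = pl.concat([f.select(COLUMNS) for f in frames])
    merged = merged.unique(subset=["question", "context", "answer"])      # remove duplicates
    merged = merged.drop_nulls(subset=["question", "context", "answer"])  # remove null values
    return merged

# merge_sources([...]).write_csv("master_data.csv")
```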
Text preprocessing function:
```python
import re
import emoji

def preprocess_arabic_text(text):
    # 1. Remove emojis
    text = emoji.replace_emoji(text, replace="")
    # 2. Remove HTML tags (exception: TyDiQA preserves HTML)
    text = re.sub(r'<.*?>', '', text)
    # 3. Replace URLs, emails, and mentions with placeholder tokens
    #    (simplified patterns; the notebook may use stricter regexes)
    text = re.sub(r'http[s]?://\S+', '[URL]', text)
    text = re.sub(r'\S+@\S+\.\S+', '[EMAIL]', text)
    text = re.sub(r'@\w+', '[MENTION]', text)
    # 4. Remove Arabic diacritics & tatweel
    text = re.sub(r'[\u064B-\u0652\u0640]', '', text)
    # 5. Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
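Assuming the merged data is held in a Polars dataframe `df`, the function can be applied to each text column; a hypothetical sketch (recent Polars versions; older ones use `.apply` instead of `.map_elements`):

```python
# Hypothetical column-wise application of the cleaning function
for col in ("question", "answer", "context"):
    df = df.with_columns(
        pl.col(col).map_elements(preprocess_arabic_text, return_dtype=pl.Utf8)
    )
```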
Filters garbage patterns:

```python
import polars as pl

garbage_pattern = r"array\(|dtype=|\{'text':|\\[|\\]"
df = df.filter(~pl.col("answer").str.contains(garbage_pattern))
```

Outputs: `data_preprocessed.csv`
| Parameter | Value |
|---|---|
| Learning Rate | 3e-5 |
| Batch Size | 4 |
| Epochs | 3 (merged dataset) / 4 (single dataset) |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Max Sequence Length | 384 |
| Max Question Length | 64 |
| Doc Stride | 128 |
| Max Answer Length | 30 |
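These hyper-parameters correspond roughly to the following Hugging Face `TrainingArguments` (a sketch, not the exact notebook code; the output directory is a placeholder, and AdamW with weight decay is the `Trainer` default optimizer):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="final_models/AraELECTRA-base",  # placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,        # 4 when fine-tuning on a single dataset
    weight_decay=0.01,
)
```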
[CLS] Question [SEP] Context [SEP]
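A sketch of how a question/context pair is encoded into that layout, using the sliding-window settings from the table above (max sequence length 384, doc stride 128):

```python
# Long contexts are split into overlapping windows; only the context side is truncated
encoded = tokenizer(
    question,
    context,
    max_length=384,
    truncation="only_second",
    stride=128,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,   # maps tokens back to character positions for span extraction
    padding="max_length",
)
```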
AraBERTv2-base Only:
```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="aubmindlab/bert-base-arabertv2")
text = prep.preprocess(text)  # Applies Farasa segmentation
```

Binary metric: 1 if prediction exactly matches ground truth, else 0.
```python
def compute_exact(prediction, ground_truth):
    return int(normalize(prediction) == normalize(ground_truth))
```
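Here `normalize` is the usual SQuAD-style answer normalization; a minimal Arabic-aware sketch (the exact normalization used in the notebooks may differ):

```python
import re

def normalize(text):
    # Strip diacritics/tatweel, drop punctuation, collapse whitespace, lowercase Latin characters
    text = re.sub(r'[\u064B-\u0652\u0640]', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    return ' '.join(text.split()).lower()
```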
Token-level harmonic mean of precision and recall:

```python
from collections import Counter

def compute_f1(prediction, ground_truth):
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    precision = num_same / len(pred_tokens) if pred_tokens else 0
    recall = num_same / len(gold_tokens) if gold_tokens else 0
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)
```

Note: For multiple ground truth answers, take the maximum score.
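When an example has several reference answers, the per-example score is therefore the maximum over all of them; a small helper sketch (`metric_max_over_ground_truths` is a hypothetical name):

```python
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    # metric_fn is compute_exact or compute_f1
    return max(metric_fn(prediction, gt) for gt in ground_truths)
```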
```bash
pip install transformers datasets torch polars arabert emoji farasapy accelerate
```

Run notebooks in order:
1. `preprocessing.ipynb` → Creates `data_cleaned.csv`
2. `cleaning.ipynb` → Creates `data_preprocessed.csv`
Open main.ipynb and run all cells. To switch models, modify:
```python
# Choose one:
train_pipeline("AraBERTv2-base")      # Requires Farasa
train_pipeline("AraBERTv0.2-large")   # Large model, needs GPU memory
train_pipeline("AraELECTRA-base")     # Fast training
```

Trained models saved to:
```
final_models/
├── AraBERTv2-base/
├── AraBERTv0.2-large/
└── AraELECTRA-base/
```
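A saved checkpoint can then be loaded for inference with the standard `question-answering` pipeline; a sketch assuming the directory layout above (for AraBERTv2-base, run the question and context through the same `ArabertPreprocessor` first):

```python
from transformers import pipeline

# Path assumed from the saved-model layout above; question/context are illustrative
qa = pipeline("question-answering", model="final_models/AraELECTRA-base")
result = qa(question="متى تأسست الجامعة؟", context="تأسست الجامعة عام 1962 في الرباط.")
print(result["answer"], result["score"])
```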
Contents of `requirements.txt`:

```
torch>=1.10
transformers>=4.20
datasets>=2.0
polars>=0.15
arabert
emoji
farasapy
accelerate
scikit-learn
```
Hardware:
- GPU recommended (CUDA)
- AraBERT-large requires ~16GB VRAM
This project is for educational purposes.