Spelling Corrector to CODA / MSA standard

This is a Python script for converting Arabic text to the CODA or MSA standard using pre-trained transformer models.

Introduction

This is a spelling corrector that maps Arabic text to the CODA or MSA standard. It is trained on the CODA MADARA corpus and QALB. The corrector is based on the transformer architecture and is built by fine-tuning the AraBART and AraT5 models.

Requirements

This script requires the following Python packages:

transformers
torch
numpy
pandas
sentencepiece
wandb

To install these packages, run the following command:

pip install -r requirements.txt

Usage

To run the script, open a terminal and navigate to the directory containing the script. Then, run the following command:

python main.py [OPTIONS]

The following options are available:

--wandb: If this flag is set, the script logs the training process to WandB.
--test: If this flag is set, the script runs the test function instead of the train function.
--model_name: The name of the pre-trained transformer model to use. Default is "moussaKam/AraBART".
--hidden_size: The number of hidden units in the transformer model. Default is 300.
--num_layers: The number of layers in the transformer model. Default is 2.
--learning_rate: The learning rate to use for training the model. Default is 0.00003.
--num_epochs: The number of epochs to train the model for. Default is 10.
--optimizer: The optimizer to use for training the model. Default is "adam".
--batch_size: The batch size to use for training the model. Default is 8.
--sentence: The Arabic sentence to convert to the target standard (CODA or MSA). If this option is not set, the script enters interactive mode and prompts the user to enter a sentence.
--path: The path to the directory containing the CODA corpus. Default is "coda-corpus".
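
The options above could be declared with argparse roughly as follows. This is only a minimal sketch built from the names and defaults listed in this README; the actual parser in main.py may differ, and the function name build_parser, the help strings, and the argument types are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(
        description="Spelling corrector to the CODA / MSA standard"
    )
    # Boolean flags: present means True, absent means False.
    parser.add_argument("--wandb", action="store_true",
                        help="log the training process to WandB")
    parser.add_argument("--test", action="store_true",
                        help="run the test function instead of the train function")
    # Model and hyperparameters, with the defaults documented above.
    parser.add_argument("--model_name", default="moussaKam/AraBART")
    parser.add_argument("--hidden_size", type=int, default=300)
    parser.add_argument("--num_layers", type=int, default=2)
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--num_epochs", type=int, default=10)
    parser.add_argument("--optimizer", default="adam")
    parser.add_argument("--batch_size", type=int, default=8)
    # Inference input and corpus location.
    parser.add_argument("--sentence", default=None,
                        help="sentence to convert; omit for interactive mode")
    parser.add_argument("--path", default="coda-corpus",
                        help="directory containing the corpus")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--num_epochs", "20", "--wandb"])
print(args.num_epochs, args.wandb, args.model_name)
```

Passing an explicit list to parse_args keeps the example independent of how it is launched; a real entry point would normally call parse_args() with no arguments.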

For example, to run the script with the default hyperparameters and log the training process to WandB, run the following command:

python main.py --wandb

To train the model for 20 epochs, run the following command:

python main.py --num_epochs 20

To run the script in test mode, run the following command:

python main.py --test

To specify a different model, run the following command:

python main.py --model_name UBC-NLP/AraT5-base

To target the MSA standard instead, point --path at the QALB corpus directory:

python main.py --path qalb-corpus

To convert a specific sentence, run the following command:

python main.py --sentence "الجملة العربية التي تريد تحويلها او تصحيحها"
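
The choice between a single --sentence and interactive mode can be pictured as a small dispatch function. The sketch below is illustrative only and is not the code in main.py: run_inference, the correct callable (a stand-in for the fine-tuned model), the prompt text, and the rule that an empty line ends the session are all assumptions.

```python
from typing import Callable, Optional


def run_inference(
    sentence: Optional[str],
    correct: Callable[[str], str],
    read_line: Callable[[str], str] = input,
    write_line: Callable[[str], None] = print,
) -> None:
    """Correct one sentence, or prompt repeatedly when none is given.

    `correct` is a hypothetical stand-in for the fine-tuned model's
    text-to-text correction function.
    """
    if sentence is not None:
        # --sentence was provided: correct it once and return.
        write_line(correct(sentence))
        return

    # Interactive mode: keep prompting until an empty line or end of input.
    while True:
        try:
            line = read_line("Enter a sentence (empty line to quit): ")
        except EOFError:
            break
        if not line.strip():
            break
        write_line(correct(line))


# Demonstration with a placeholder corrector that only trims whitespace.
run_inference("  مرحبا  ", correct=str.strip)
```

Injecting read_line and write_line keeps the loop easy to exercise without a terminal; a real script could simply rely on the input and print defaults.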

