This project implements a sequence-to-sequence Transformer model for translating Finnish to English. The code is structured into separate modules for data preparation, training, and inference.
- `EUbookshop/`: Contains the raw parallel corpus data.
- `prepare_tokenizer.py`: Script to train BPE tokenizers for the source and target languages.
- `prepare_dataset.py`: Script to split the corpus into train, validation, and test sets.
- `model.py`: Defines the `Seq2SeqTransformer` architecture.
- `utils.py`: Contains shared utilities such as the `TranslationDataset` class, decoding functions, and BLEU score calculation.
- `train.py`: The main script to train the model and evaluate it on the test set.
- `inference.py`: A dedicated script for interactive, command-line translation using a trained model.
- `requirements.txt`: Project dependencies.
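For reference, the BLEU metric computed in `utils.py` combines modified n-gram precisions with a brevity penalty. The sketch below is an illustrative stdlib-only version, not the project's actual implementation (which may use a library such as sacreBLEU):

```python
import math
from collections import Counter

def sentence_bleu(reference, hypothesis, max_n=4):
    """Minimal sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Illustrative sketch only; serious evaluation should use sacreBLEU."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped overlap: each reference n-gram can be matched at most
        # as often as it occurs in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing keeps the geometric mean nonzero for short sentences.
        precisions.append((overlap + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * geo_mean
```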
- Create a virtual environment (recommended):

  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

- Install the dependencies:

  pip install -r requirements.txt
Follow these steps in order.
Run the tokenizer script to create `bpe_tokenizer_fi.json` and `bpe_tokenizer_en.json`:

python prepare_tokenizer.py
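Conceptually, BPE training starts from individual characters and repeatedly merges the most frequent adjacent symbol pair. The toy loop below illustrates that idea in plain Python; the actual `prepare_tokenizer.py` presumably relies on a tokenizer library rather than this sketch:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: each word is a tuple of symbols; repeatedly
    merge the most frequent adjacent pair across the corpus."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```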
Run the dataset script to split the data. This creates a `data_splits/` directory containing `train`, `val`, and `test` files for both languages:

python prepare_dataset.py
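The essential constraint when splitting a parallel corpus is that source and target sentences must be shuffled and partitioned together so alignment survives. A minimal sketch of that logic (the function name and split fractions here are illustrative, not taken from `prepare_dataset.py`):

```python
import random

def split_parallel_corpus(src_lines, tgt_lines,
                          val_frac=0.01, test_frac=0.01, seed=42):
    """Shuffle aligned sentence pairs together, then carve off
    validation and test sets. Fractions are illustrative; see
    prepare_dataset.py for the actual split sizes."""
    assert len(src_lines) == len(tgt_lines), "corpus must stay aligned"
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed -> reproducible split
    n = len(pairs)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val = pairs[:n_val]
    test = pairs[n_val:n_val + n_test]
    train = pairs[n_val + n_test:]
    return train, val, test
```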
Run the main training script. It trains the model, saves the checkpoint with the best validation loss to `best_transformer_model.pth`, and finally evaluates it on the test set.
# Train with sinusoidal PE and then evaluate using default greedy decoding
python train.py --pos-encoding sinusoidal
# Train with RoPE and then evaluate using default greedy decoding
python train.py --pos-encoding rope
# Train with T5 relative position bias and then evaluate using default greedy decoding
python train.py --pos-encoding t5
# To skip training and just evaluate a pre-existing model with beam search
python train.py --skip-training --pos-encoding rope --decoding-strategy beam --beam-width 5
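Greedy decoding keeps only the single most likely token at each step, while beam search keeps the `--beam-width` highest-scoring partial translations, which can recover sequences a greedy decoder misses. A toy sketch over a hypothetical next-token log-probability function (the real decoders in `utils.py` operate on model logits):

```python
import math

def beam_search(step_logprobs, bos, eos, beam_width=5, max_len=10):
    """Toy beam search: step_logprobs(prefix) returns {token: log_prob}
    for the next token. Keeps the beam_width highest-scoring prefixes;
    greedy decoding is the special case beam_width=1."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```

In the toy model below, greedy commits to the locally best first token 'a', while a width-2 beam finds the globally better sequence through 'b'.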
Alternatively, download the pretrained models and place them at the root of the project folder. They are available at: https://www.dropbox.com/scl/fo/p57lrgzydnuu9ueuub3eo/ABh9YvM4SqM1gaQP2Zd3oys?rlkey=fyonel0esz9n6q3uylulyq64k&st=yu4wcb26&dl=0
Once the model is trained (i.e. `best_transformer_model.pth` exists), you can use the interactive script to translate sentences.
# Run with default greedy decoding
python inference.py
# Run with beam search
python inference.py --decoding-strategy beam --beam-width 5
# Run with top-k sampling
python inference.py --decoding-strategy topk --top-k 15
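Top-k sampling restricts the draw to the `--top-k` most likely next tokens and renormalizes their probabilities, trading some accuracy for more varied output. A stdlib-only sketch of one sampling step (the token names and function are illustrative, not the project's implementation):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample the next token from the k highest-scoring entries of a
    {token: logit} dict, renormalizing with a softmax over that subset."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = max(v for _, v in top)  # subtract the max for numerical stability
    tokens = [t for t, _ in top]
    weights = [math.exp(v - m) for _, v in top]
    return rng.choices(tokens, weights=weights, k=1)[0]
```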
Type a Finnish sentence and press Enter, and the model will print the English translation. Type `quit` or `exit` to close the session.