This project implements a sequence-to-sequence Transformer model for translating Finnish to English. The code is structured into separate modules for data preparation, training, and inference.
- `EUbookshop/`: Contains the raw parallel corpus data.
- `prepare_tokenizer.py`: Script to train BPE tokenizers for the source and target languages.
- `prepare_dataset.py`: Script to split the corpus into train, validation, and test sets.
- `model.py`: Defines the `Seq2SeqTransformer` architecture.
- `utils.py`: Contains shared utilities such as the `TranslationDataset` class, decoding functions, and BLEU score calculation.
- `train.py`: The main script to train the model and evaluate it on the test set.
- `inference.py`: A dedicated script for interactive, command-line translation using a trained model.
- `requirements.txt`: Project dependencies.
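For reference, the BLEU metric computed in `utils.py` combines modified n-gram precisions with a brevity penalty. The sketch below is an illustrative stdlib-only version, not the project's actual implementation (which may use a library such as sacreBLEU):

```python
import math
from collections import Counter

def sentence_bleu(reference, hypothesis, max_n=4):
    """Minimal sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty.
    Illustrative sketch only; serious evaluation should use sacreBLEU."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        # Clipped overlap: each reference n-gram can be matched at most
        # as often as it occurs in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        # Add-one smoothing keeps the geometric mean nonzero for short sentences.
        precisions.append((overlap + 1) / (total + 1))
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return bp * geo_mean
```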
- Create a virtual environment (recommended):

  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

- Install the dependencies:

  pip install -r requirements.txt
Follow these steps in order.
Run the tokenizer script to create `bpe_tokenizer_fi.json` and `bpe_tokenizer_en.json`:

python prepare_tokenizer.py
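Conceptually, BPE training starts from individual characters and repeatedly merges the most frequent adjacent symbol pair. The toy loop below illustrates that idea in plain Python; the actual `prepare_tokenizer.py` presumably relies on a tokenizer library rather than this sketch:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy BPE trainer: each word is a tuple of symbols; repeatedly
    merge the most frequent adjacent pair across the corpus."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```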
Run the dataset script to split the data. This creates a `data_splits/` directory containing `train`, `val`, and `test` files for both languages:

python prepare_dataset.py
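The essential constraint when splitting a parallel corpus is that source and target sentences must be shuffled and partitioned together so alignment survives. A minimal sketch of that logic (the function name and split fractions here are illustrative, not taken from `prepare_dataset.py`):

```python
import random

def split_parallel_corpus(src_lines, tgt_lines,
                          val_frac=0.01, test_frac=0.01, seed=42):
    """Shuffle aligned sentence pairs together, then carve off
    validation and test sets. Fractions are illustrative; see
    prepare_dataset.py for the actual split sizes."""
    assert len(src_lines) == len(tgt_lines), "corpus must stay aligned"
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)  # fixed seed -> reproducible split
    n = len(pairs)
    n_val, n_test = int(n * val_frac), int(n * test_frac)
    val = pairs[:n_val]
    test = pairs[n_val:n_val + n_test]
    train = pairs[n_val + n_test:]
    return train, val, test
```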
Run the main training script. It trains the model, saves the checkpoint with the best validation loss to `best_transformer_model.pth`, and finally evaluates it on the test set.
# Train with sinusoidal PE and then evaluate using default greedy decoding
python train.py --pos-encoding sinusoidal
# Train with RoPE and then evaluate using default greedy decoding
python train.py --pos-encoding rope
# Train with T5 relative position bias and then evaluate using default greedy decoding
python train.py --pos-encoding t5
# To skip training and just evaluate a pre-existing model with beam search
python train.py --skip-training --pos-encoding rope --decoding-strategy beam --beam-width 5
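Greedy decoding keeps only the single most likely token at each step, while beam search keeps the `--beam-width` highest-scoring partial translations, which can recover sequences a greedy decoder misses. A toy sketch over a hypothetical next-token log-probability function (the real decoders in `utils.py` operate on model logits):

```python
import math

def beam_search(step_logprobs, bos, eos, beam_width=5, max_len=10):
    """Toy beam search: step_logprobs(prefix) returns {token: log_prob}
    for the next token. Keeps the beam_width highest-scoring prefixes;
    greedy decoding is the special case beam_width=1."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                candidates.append((seq, score))  # finished hypotheses carry over
                continue
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]
```

In the toy model below, greedy commits to the locally best first token 'a', while a width-2 beam finds the globally better sequence through 'b'.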
Alternatively, download the pretrained models and place them at the root of the project folder. They are available at: https://www.dropbox.com/scl/fo/p57lrgzydnuu9ueuub3eo/ABh9YvM4SqM1gaQP2Zd3oys?rlkey=fyonel0esz9n6q3uylulyq64k&st=yu4wcb26&dl=0
Once the model is trained (i.e. `best_transformer_model.pth` exists), you can use the interactive script to translate sentences.
# Run with default greedy decoding
python inference.py
# Run with beam search
python inference.py --decoding-strategy beam --beam-width 5
# Run with top-k sampling
python inference.py --decoding-strategy topk --top-k 15
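Top-k sampling restricts the draw to the `--top-k` most likely next tokens and renormalizes their probabilities, trading some accuracy for more varied output. A stdlib-only sketch of one sampling step (the token names and function are illustrative, not the project's implementation):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample the next token from the k highest-scoring entries of a
    {token: logit} dict, renormalizing with a softmax over that subset."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = max(v for _, v in top)  # subtract the max for numerical stability
    tokens = [t for t, _ in top]
    weights = [math.exp(v - m) for _, v in top]
    return rng.choices(tokens, weights=weights, k=1)[0]
```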
Type a Finnish sentence and press Enter, and the model will print the English translation. Type `quit` or `exit` to close the session.