
Transformer NMT Project

This project implements a sequence-to-sequence Transformer model for translating Finnish to English. The code is structured into separate modules for data preparation, training, and inference.

Project Structure

  • EUbookshop/: Contains the raw parallel corpus data.
  • prepare_tokenizer.py: Script to train BPE tokenizers for source and target languages.
  • prepare_dataset.py: Script to split the corpus into train, validation, and test sets.
  • model.py: Defines the Seq2SeqTransformer architecture (see the sketch after this list).
  • utils.py: Contains shared utilities like the TranslationDataset class, decoding functions, and BLEU score calculation.
  • train.py: The main script to train the model and evaluate it on the test set.
  • inference.py: A dedicated script for interactive, command-line translation using a trained model.
  • requirements.txt: Project dependencies.
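
For orientation, the Seq2SeqTransformer in model.py most likely follows the standard PyTorch encoder-decoder pattern built on nn.Transformer. The sketch below is illustrative only; the constructor signature and hyperparameters (d_model, heads, layers) are assumptions, not the repo's actual values.

    import math
    import torch.nn as nn

    class Seq2SeqTransformer(nn.Module):
        """Minimal encoder-decoder skeleton built on nn.Transformer.
        All hyperparameter defaults here are assumptions."""
        def __init__(self, src_vocab, tgt_vocab, d_model=512, nhead=8,
                     num_layers=3, dim_ff=2048, dropout=0.1):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, d_model)
            self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=nhead,
                num_encoder_layers=num_layers, num_decoder_layers=num_layers,
                dim_feedforward=dim_ff, dropout=dropout, batch_first=True)
            self.generator = nn.Linear(d_model, tgt_vocab)
            self.d_model = d_model

        def forward(self, src, tgt, tgt_mask=None,
                    src_padding_mask=None, tgt_padding_mask=None):
            # The real model.py injects positional information here
            # (sinusoidal, RoPE, or T5 bias); omitted in this skeleton.
            src = self.src_emb(src) * math.sqrt(self.d_model)
            tgt = self.tgt_emb(tgt) * math.sqrt(self.d_model)
            out = self.transformer(src, tgt, tgt_mask=tgt_mask,
                                   src_key_padding_mask=src_padding_mask,
                                   tgt_key_padding_mask=tgt_padding_mask)
            return self.generator(out)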

Setup

  1. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  2. Install dependencies:

    pip install -r requirements.txt

Usage Workflow

Follow these steps in order.

Step 1: Prepare the Tokenizers

Run the script to create bpe_tokenizer_fi.json and bpe_tokenizer_en.json.

python prepare_tokenizer.py
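
For context, training a BPE tokenizer with the Hugging Face tokenizers library usually looks like the sketch below. The corpus filename, vocabulary size, and special tokens are assumptions about this repo; check prepare_tokenizer.py for the real values.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Corpus path, vocab size, and special tokens are assumptions.
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=16000,
                         special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"])
    tokenizer.train(files=["EUbookshop/EUbookshop.fi"], trainer=trainer)
    tokenizer.save("bpe_tokenizer_fi.json")  # repeat with the .en file for English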

Step 2: Prepare the Dataset

Run the script to split the data. This will create a data_splits/ directory containing train, val, and test files for both languages.

python prepare_dataset.py
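
The split itself is a simple operation; a minimal sketch is below. The corpus filenames, the 80/10/10 ratios, and the output naming are assumptions; defer to prepare_dataset.py for the real values.

    import random
    from pathlib import Path

    # Filenames and split ratios are assumptions about this repo.
    random.seed(42)  # fixed seed keeps the splits reproducible
    src = Path("EUbookshop/EUbookshop.fi").read_text(encoding="utf-8").splitlines()
    tgt = Path("EUbookshop/EUbookshop.en").read_text(encoding="utf-8").splitlines()
    pairs = list(zip(src, tgt))
    random.shuffle(pairs)

    n = len(pairs)
    splits = {"train": pairs[:int(0.8 * n)],
              "val":   pairs[int(0.8 * n):int(0.9 * n)],
              "test":  pairs[int(0.9 * n):]}

    out = Path("data_splits")
    out.mkdir(exist_ok=True)
    for name, rows in splits.items():
        (out / f"{name}.fi").write_text("\n".join(s for s, _ in rows), encoding="utf-8")
        (out / f"{name}.en").write_text("\n".join(t for _, t in rows), encoding="utf-8")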

Step 3: Train the Model

Run the main training script. This will train the model, save the best checkpoint to best_transformer_model.pth based on validation loss, and finally evaluate it on the test set.

# Train with sinusoidal PE and then evaluate using default greedy decoding
python train.py --pos-encoding sinusoidal

# Train with RoPE PE and then evaluate using default greedy decoding
python train.py --pos-encoding rope

# Train with T5 relative position bias and then evaluate using default greedy decoding
python train.py --pos-encoding t5

# To skip training and just evaluate a pre-existing model with beam search
python train.py --skip-training --pos-encoding rope --decoding-strategy beam --beam-width 5
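
For reference, the sinusoidal option presumably implements the fixed encoding from the original Transformer paper; a minimal PyTorch version looks like this (batch-first shapes and the dropout default are assumptions):

    import math
    import torch
    import torch.nn as nn

    class SinusoidalPositionalEncoding(nn.Module):
        """Fixed sinusoidal encoding; assumes batch-first (B, T, d_model) inputs."""
        def __init__(self, d_model, max_len=5000, dropout=0.1):
            super().__init__()
            self.dropout = nn.Dropout(dropout)
            pos = torch.arange(max_len).unsqueeze(1)
            div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
            pe = torch.zeros(max_len, d_model)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            self.register_buffer("pe", pe.unsqueeze(0))  # (1, max_len, d_model)

        def forward(self, x):
            # Add the precomputed encoding for the first x.size(1) positions.
            return self.dropout(x + self.pe[:, :x.size(1)])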

Pretrained models

Pretrained models are available at https://www.dropbox.com/scl/fo/p57lrgzydnuu9ueuub3eo/ABh9YvM4SqM1gaQP2Zd3oys?rlkey=fyonel0esz9n6q3uylulyq64k&st=yu4wcb26&dl=0. Download them and place them at the root of the project folder.
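
To sanity-check a downloaded checkpoint, it can be loaded in the usual PyTorch way. The constructor arguments below are placeholders and must match whatever train.py used; it also assumes the .pth file stores a plain state_dict.

    import torch
    from model import Seq2SeqTransformer  # defined in this repo's model.py

    # Vocab sizes are placeholders; they must match the trained tokenizers.
    model = Seq2SeqTransformer(src_vocab=16000, tgt_vocab=16000)
    model.load_state_dict(torch.load("best_transformer_model.pth", map_location="cpu"))
    model.eval()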

Step 4: Run Interactive Inference

Once the model is trained (best_transformer_model.pth exists), you can use the interactive script to translate sentences.

# Run with default greedy decoding
python inference.py

# Run with beam search
python inference.py --decoding-strategy beam --beam-width 5

# Run with top-k sampling
python inference.py --decoding-strategy topk --top-k 15

Type a Finnish sentence, press Enter, and the model will provide the English translation. Type quit or exit to close the session.
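
For context, greedy decoding (the default strategy) typically works as sketched below: encode the source once, then extend the target one argmax token at a time until EOS. The function name, token IDs, and call signature are assumptions, not the actual utils.py API.

    import torch

    def greedy_decode(model, src_ids, bos_id, eos_id, max_len=128):
        """Extend the target one argmax token at a time until <eos>.
        Token IDs and the model call signature are assumptions."""
        model.eval()
        with torch.no_grad():
            ys = torch.tensor([[bos_id]])  # (1, 1) running target sequence
            for _ in range(max_len - 1):
                t = ys.size(1)
                # Causal mask so position i cannot attend to positions > i.
                tgt_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
                logits = model(src_ids, ys, tgt_mask=tgt_mask)  # (1, t, vocab)
                next_id = logits[0, -1].argmax().item()
                ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
                if next_id == eos_id:
                    break
        return ys.squeeze(0).tolist()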
