This repository contains the code for training a transformer model for language translation. The model is based on the paper "Attention Is All You Need" by Vaswani et al. (2017).
To run this code, you will need the following:
- Python 3.6 or later
- PyTorch 1.10 or later
- Transformers library 4.25 or later
- Hugging Face Tokenizers library 0.11 or later
- TensorBoard
To install the required dependencies, run the following command:
pip install -r requirements.txt
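The exact version pins live in requirements.txt; based on the dependency list above, its contents look roughly like the following (the pins shown are illustrative, not necessarily the exact file):

torch>=1.10
transformers>=4.25
tokenizers>=0.11
tensorboard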
The code uses the WMT14 English-German dataset for training. The dataset can be downloaded from the following link:
Once downloaded, extract the dataset to a folder named data.
To train the model, run the following command:
python train.py
The training process will take several hours. The model will be saved to the models folder.
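Since TensorBoard is listed as a dependency, training metrics are presumably logged there. Assuming the log directory is named runs (an assumption; check the configuration for the actual path), progress can be monitored with:

tensorboard --logdir runs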
To evaluate the model, run the following command:
python evaluate.py
The evaluation process will take a few minutes. The results will be printed to the console.
The trained model can be used to translate text from English to German. To do this, run the following command:
python translate.py
The translated text will be printed to the console.
The code is organized into the following modules:
- model.py: the definition of the transformer model.
- dataset.py: the definition of the dataset class.
- config.py: the definition of the configuration class.
- train.py: the code for training the model.
- evaluate.py: the code for evaluating the model.
- translate.py: the code for using the model to translate text.
The transformer model is defined in the model.py module. The model consists of an encoder and a decoder. The encoder converts the input sequence of tokens into a sequence of vectors. The decoder then uses the output of the encoder to generate the target sequence of tokens.
The encoder and decoder are both composed of a stack of identical layers. Each layer consists of a multi-head self-attention mechanism followed by a position-wise fully connected feed-forward network. The self-attention mechanism allows the model to consider the context of each token in the input sequence when encoding it, and the feed-forward network lets the model learn complex non-linear transformations of the resulting vectors.
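As a rough sketch of what a single encoder layer looks like, here is a minimal PyTorch version built on the library's nn.MultiheadAttention; the names and hyperparameter defaults are illustrative, and the actual model.py may implement the attention differently:

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One encoder block: multi-head self-attention followed by a
    # position-wise feed-forward network, each wrapped with a residual
    # connection and layer normalization.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        # Self-attention: every position attends to every other position.
        attn_out, _ = self.self_attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Position-wise feed-forward network.
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x

The decoder layers additionally apply a causal mask and a cross-attention step over the encoder output, which is omitted here for brevity.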
The transformer model also includes several other components, such as positional encoding, which is added to the token embeddings to give the model information about the position of each token in the sequence. The model also maintains a source and a target vocabulary, which map tokens to indices; embedding layers then turn those indices into the vectors the model operates on.
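The sinusoidal positional encoding from the paper can be sketched as follows (a simplified version; the real model.py may organize this differently):

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    # Adds sinusoidal position information to the token embeddings,
    # as described in "Attention Is All You Need".
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))   # shape (1, max_len, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) token embeddings
        return x + self.pe[:, : x.size(1)]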
The dataset class is defined in the dataset.py module. The dataset class is responsible for loading and preprocessing the training and evaluation data: it tokenizes the source and target sentences, adds special tokens and padding, and batches the data for training.
The dataset class also includes methods for loading the data from a variety of sources, such as text files, CSV files, or SQL databases. This allows the model to be trained on a wide range of data types and formats.
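A simplified sketch of what such a dataset class might look like, assuming a Hugging Face Tokenizers tokenizer and [SOS]/[EOS]/[PAD] special tokens (the actual names and fields in dataset.py may differ):

import torch
from torch.utils.data import Dataset

class TranslationDataset(Dataset):
    # Wraps parallel English-German sentence pairs and turns them into
    # fixed-length tensors of token ids for the encoder and decoder.
    def __init__(self, pairs, src_tokenizer, tgt_tokenizer, max_len=128):
        self.pairs = pairs                  # list of (src_text, tgt_text) pairs
        self.src_tokenizer = src_tokenizer  # Hugging Face Tokenizers object
        self.tgt_tokenizer = tgt_tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.pairs)

    def _encode(self, tokenizer, text):
        ids = tokenizer.encode(text).ids[: self.max_len - 2]
        # Surround with start/end tokens and pad to a fixed length.
        ids = [tokenizer.token_to_id("[SOS]")] + ids + [tokenizer.token_to_id("[EOS]")]
        ids += [tokenizer.token_to_id("[PAD]")] * (self.max_len - len(ids))
        return torch.tensor(ids, dtype=torch.long)

    def __getitem__(self, idx):
        src_text, tgt_text = self.pairs[idx]
        return {
            "encoder_input": self._encode(self.src_tokenizer, src_text),
            "decoder_input": self._encode(self.tgt_tokenizer, tgt_text),
        }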
The configuration class is defined in the config.py module. The configuration class is used to store the hyperparameters and other configuration options for the model. This includes options such as the number of layers in the encoder and decoder, the size of the hidden state, and the type of attention mechanism to use.
The configuration class also includes methods for loading and saving the configuration to a file. This allows the configuration to be easily shared and reused across different experiments.
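A minimal sketch of such a configuration class, assuming the file format is JSON and using illustrative field names:

import json
from dataclasses import dataclass, asdict

@dataclass
class Config:
    # Illustrative hyperparameters; the real config.py may expose more options.
    num_layers: int = 6
    d_model: int = 512
    num_heads: int = 8
    d_ff: int = 2048
    dropout: float = 0.1
    lr: float = 1e-4
    batch_size: int = 32
    num_epochs: int = 20

    def save(self, path):
        # Write the configuration to a JSON file for sharing and reuse.
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path):
        # Recreate a Config object from a previously saved JSON file.
        with open(path) as f:
            return cls(**json.load(f))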
The training code is defined in the train.py module. The training code includes methods for training the model, evaluating the model, and saving the trained model to a file.
The training code uses the dataset class to load and preprocess the training data. The training code also includes methods for computing the loss and gradients, and for updating the model parameters.
The training code also includes methods for loading a previously saved model from a file, which makes it easy to reuse the trained model in other applications.
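In outline, the training loop amounts to something like the following sketch; the model interface, batch field names, and padding id are assumptions rather than the exact code in train.py:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, dataset, config, device="cuda"):
    loader = DataLoader(dataset, batch_size=config.batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=config.lr)
    # Ignore padding positions when computing the loss (assumes pad id 0).
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    model.to(device).train()
    for epoch in range(config.num_epochs):
        for batch in loader:
            src = batch["encoder_input"].to(device)
            tgt = batch["decoder_input"].to(device)
            # Teacher forcing: the decoder sees the target shifted right
            # and is trained to predict the next token at each position.
            logits = model(src, tgt[:, :-1])
            loss = criterion(logits.reshape(-1, logits.size(-1)),
                             tgt[:, 1:].reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Checkpoint to the models folder; the file name pattern is illustrative.
        torch.save(model.state_dict(), f"models/epoch_{epoch}.pt")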
The evaluation code is defined in the evaluate.py module. It uses the dataset class to load and preprocess the held-out evaluation data, runs the model over it, and computes translation quality metrics such as the BLEU score.
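For the BLEU computation, one common choice is the sacrebleu package (not part of the dependency list above, so treat this as an illustration rather than what evaluate.py necessarily uses):

import sacrebleu

def compute_bleu(hypotheses, references):
    # hypotheses: list of detokenized model translations (str)
    # references: list of detokenized reference translations, one per hypothesis
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    return bleu.score

# Example: compute_bleu(["das ist ein Test"], ["das ist ein Test"]) returns 100.0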
The translation code is defined in the translate.py module. The translation code includes methods for using the trained model to translate text from one language to another.
The translation code uses the dataset class to tokenize the input text, feeds the encoded input through the model to generate the target sequence of tokens, and then detokenizes the output sequence and postprocesses the translation to improve fluency and accuracy.
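The simplest decoding strategy, greedy decoding, can be sketched as follows; the model interface and token ids here are assumptions, not necessarily how translate.py generates output:

import torch

def greedy_decode(model, src_ids, sos_id, eos_id, max_len=128, device="cuda"):
    # src_ids: (1, src_len) tensor of source token ids for a single sentence.
    model.eval()
    ys = torch.tensor([[sos_id]], device=device)  # decoder input starts with [SOS]
    with torch.no_grad():
        for _ in range(max_len - 1):
            logits = model(src_ids.to(device), ys)       # (1, tgt_len, vocab_size)
            next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
            ys = torch.cat([ys, next_id], dim=1)
            if next_id.item() == eos_id:                 # stop once [EOS] is produced
                break
    return ys.squeeze(0).tolist()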
Overall, this repository, built on PyTorch and the Hugging Face Transformers and Tokenizers libraries, provides a flexible framework for building and training transformer models. It includes components for loading and preprocessing data, training and evaluating the model, and using the trained model for translation. The Hugging Face libraries it depends on also offer a large number of pre-trained models that can be fine-tuned for specific tasks, making it easy to get started with transformer models.