This repository contains a from-scratch implementation of a scaled-down GPT-2 model, built entirely in PyTorch. The primary goal of this project is to demonstrate the inner workings of a modern transformer-based language model by building one from the ground up. The model is trained on the Persian Snappfood Comments dataset to perform sentiment-controlled text generation. By conditioning the model on special tokens (`<POSITIVE>` and `<NEGATIVE>`), it can generate new, synthetic Persian comments with a specified sentiment. This project was developed as a practical assignment for the Deep Learning course (Spring 2025).
- From-Scratch Implementation: The core GPT-2 architecture (Causal Self-Attention, MLP, Transformer Blocks) is built from scratch using pure PyTorch.
- Sentiment Control: The model is conditioned on special tokens (`<POSITIVE>`, `<NEGATIVE>`) to control the sentiment of the generated text.
- Persian Text Generation: Trained on a large corpus of Persian comments to generate coherent, in-domain text.
- Advanced Decoding: Implements temperature, top-k, and nucleus (top-p) sampling for flexible and high-quality text generation.
- Transformer Architecture (Decoder-Only): Implements the decoder-only architecture popularized by GPT.
- Causal Self-Attention: Uses masked multi-head attention to ensure tokens can only attend to preceding tokens, which is essential for auto-regressive generation.
- Learnable Positional Embeddings: Uses a learnable embedding layer (`wpe`) to inject positional information, as opposed to fixed sinusoidal encodings.
- Sentiment Conditioning: The model learns to associate the special prefix tokens with the sentiment of the following text, allowing for controlled generation by providing the desired token as a prompt.
- Text Generation & Decoding: The `generate` method demonstrates auto-regressive sampling strategies to decode text from the model's probability distributions.
This project is a from-scratch implementation of the GPT-2 architecture, demonstrating the core mechanics of a decoder-only transformer. The entire system is built to be modular, explainable, and controllable.
The model's architecture, defined in `src/model.py`, follows the standard GPT-2 design at a smaller scale so that training remains feasible.
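For reference, the key hyperparameters can be pictured as a small configuration object. The values below match the training defaults used in this project (`n_embd=192`, `n_layer=3`, `n_head=3`); the remaining fields and the exact field names are assumptions, so check `src/config.py` for the real `GPT2Config`:

```python
from dataclasses import dataclass


@dataclass
class GPT2Config:
    # Field names are illustrative; see src/config.py for the actual class.
    vocab_size: int = 50257   # tokenizer vocabulary size (assumed; depends on the tokenizer)
    block_size: int = 128     # maximum context length (assumed)
    n_layer: int = 3          # number of transformer blocks
    n_head: int = 3           # attention heads per block
    n_embd: int = 192         # embedding / hidden dimension
    dropout: float = 0.1      # dropout probability (assumed)
```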
- Input Embeddings: An input sequence of token IDs (e.g., `[128000, 503, 201, 89]`) is passed through two embedding layers simultaneously:
  - Token Embeddings (`wte`): Converts each token ID into a dense vector (size `n_embd=192`). This vector represents "what" the token is.
  - Positional Embeddings (`wpe`): A separate, learnable embedding layer creates a vector for each position (0, 1, 2, 3...) in the sequence. This represents "where" the token is.
- Initial Representation: The token and positional embeddings are summed element-wise. This combined tensor, which now contains both "what" and "where", is passed through a dropout layer and then fed into the main transformer stack.
- Transformer Blocks (`Block`): The model is a stack of $N=3$ identical transformer blocks. Each block refines the text representation by gathering and processing information from other tokens. We use a Pre-Norm architecture for stability:
  - Step 1. Causal Self-Attention: The input `x` is first normalized (`self.ln_1`). This normalized output is then fed into the `CausalSelfAttention` module. The module's output is added back to the original `x` (a residual connection).
  - Step 2. Feed-Forward Network (MLP): The output from the attention step is normalized again (`self.ln_2`). This is passed through a two-layer `MLP`. The MLP's output is added back to its input (a second residual connection).
- Final Output: After passing through all $N$ blocks, the final representation is normalized (`ln_f`) and passed through the final linear layer (`lm_head`). This projects each vector from the embedding dimension $C$ up to the full vocabulary size $V$, producing the logits (shape `[B, T, V]`): the raw, un-normalized scores for every possible next token.
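To make this data flow concrete, here is a minimal, self-contained sketch of the same pipeline. It uses `nn.MultiheadAttention` with a causal mask as a stand-in for the repository's from-scratch `CausalSelfAttention`, and the module names (`wte`, `wpe`, `h`, `ln_f`, `lm_head`) mirror the description above; the real `src/model.py` may differ in details:

```python
import torch
import torch.nn as nn


class MiniBlock(nn.Module):
    """Pre-Norm transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, n_embd: int, n_head: int, dropout: float):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        # Stand-in for the repo's from-scratch CausalSelfAttention module.
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(),
            nn.Linear(4 * n_embd, n_embd), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # True above the diagonal = "not allowed to attend" (future positions).
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                    # residual connection around attention
        x = x + self.mlp(self.ln_2(x))      # residual connection around the MLP
        return x


class MiniGPT2(nn.Module):
    def __init__(self, vocab_size=50257, block_size=128, n_embd=192, n_layer=3, n_head=3, dropout=0.1):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd)   # token embeddings ("what")
        self.wpe = nn.Embedding(block_size, n_embd)   # positional embeddings ("where")
        self.drop = nn.Dropout(dropout)
        self.h = nn.ModuleList([MiniBlock(n_embd, n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        B, T = input_ids.shape
        pos = torch.arange(T, device=input_ids.device)
        x = self.drop(self.wte(input_ids) + self.wpe(pos))   # "what" + "where", then dropout
        for block in self.h:
            x = block(x)
        return self.lm_head(self.ln_f(x))                    # logits of shape (B, T, vocab_size)


logits = MiniGPT2()(torch.randint(0, 50257, (2, 16)))
print(logits.shape)   # torch.Size([2, 16, 50257])
```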
This is the most critical component for auto-regressive generation. Its purpose is to prevent a token from "seeing" tokens that come after it in the sequence.
- Score Calculation: For each token, the model calculates a Query vector ($Q$). It also calculates a Key vector ($K$) and a Value vector ($V$) for all tokens in the context.
- Relevance Scores: To decide "how much attention" a token at position $i$ should pay to a token at position $j$, the model computes a score: $Score_{i,j} = Q_i \cdot K_j$. This is done in parallel for all tokens using matrix multiplication: $Scores = QK^T$.
- Masking: This is the "causal" part. We apply a mask $M$ to the scores before the softmax step. This mask is a matrix that sets all values above the main diagonal to $-\infty$:
  - $Score_{i,j}$ (where $j > i$) is set to $-\infty$.
  - This means a token at position 2 (e.g., "food") cannot see the token at position 3 (e.g., "was"). It can only see itself (pos 2) and the tokens before it (pos 0, 1).
- Softmax: When we apply $softmax(Scores)$, any score of $-\infty$ becomes $0$ (since $e^{-\infty} = 0$). This effectively zeroes out all "future" positions.
- Final Output: The resulting attention weights are multiplied by the Value vectors ($V$) to create a new representation for each token, now enriched with context from its past.
The full formula is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$

Where $d_k$ is the dimensionality of the key vectors (dividing by $\sqrt{d_k}$ keeps the dot-product scores in a numerically stable range) and $M$ is the causal mask, with $M_{i,j} = -\infty$ for $j > i$ and $0$ otherwise.
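These steps map almost line-for-line onto code. Below is a small, self-contained sketch of masked scaled dot-product attention for a single head; the repository's `CausalSelfAttention` runs the same computation across multiple heads in parallel, with learned Q/K/V projections:

```python
import math

import torch
import torch.nn.functional as F


def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Masked scaled dot-product attention for a single head.
    q, k, v: tensors of shape (B, T, d_k)."""
    T, d_k = q.size(1), q.size(2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)           # (B, T, T): Q.K^T / sqrt(d_k)
    mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))            # future positions -> -inf
    weights = F.softmax(scores, dim=-1)                         # -inf -> 0 after softmax
    return weights @ v                                          # context-enriched representations


# A token at position i can only attend to positions 0..i.
B, T, d_k = 1, 4, 8
q, k, v = (torch.randn(B, T, d_k) for _ in range(3))
print(causal_attention(q, k, v).shape)   # torch.Size([1, 4, 8])
```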
The model isn't just generating text; it's generating text given a condition. This is not a hard-coded rule but an emergent property of the training.
- Training Data: The model is trained on tens of thousands of examples like:
  - `"<POSITIVE> عالی بود ممنون"` ("It was great, thanks")
  - `"<NEGATIVE> خیلی سرد بود"` ("It was very cold")
- Learning Associations: Through backpropagation, the model learns that the token `<POSITIVE>` is a strong statistical predictor for sequences of words like "عالی" (great), "خوب" (good), and "ممنون" (thanks). Conversely, `<NEGATIVE>` predicts "بد" (bad), "سرد" (cold), and "افتضاح" (terrible).
- Steering Generation: The token embedding for `<POSITIVE>` (a learnable vector) essentially acts as a "control signal." When the model sees this token, it "steers" the subsequent computations, pushing the probability distribution at each step towards the part of its learned vocabulary associated with positive reviews.
- Generation: When we want to generate a positive comment, we simply feed the model the token `<POSITIVE>` as the starting prompt. The model's training has taught it that the most likely next tokens after this prompt are the beginnings of a positive comment.
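In code, this conditioning amounts to nothing more than prepending the control token to each raw comment before tokenization. A hypothetical sketch (the actual preprocessing and label handling live in `src/dataset.py` and may differ):

```python
def build_training_text(comment: str, is_positive: bool) -> str:
    """Prepend the sentiment control token so the model learns the association."""
    control_token = "<POSITIVE>" if is_positive else "<NEGATIVE>"
    return f"{control_token} {comment}"


print(build_training_text("عالی بود ممنون", True))    # -> "<POSITIVE> عالی بود ممنون"
print(build_training_text("خیلی سرد بود", False))     # -> "<NEGATIVE> خیلی سرد بود"
```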
The model learns by being trained on a next-token prediction task using Cross-Entropy Loss. For any given sequence, we use the logits at position $t$ to predict the token at position $t+1$. For example:
- Input: `["<POSITIVE>", "غذا", "خوب"]`
- Target: `["غذا", "خوب", "بود"]` (the same sentence, "the food was good", shifted by one token)

The `forward` pass in our `GPT2` model handles this "shifting" automatically by comparing `logits[..., :-1, :]` (all logits except the last position) with `labels[..., 1:]` (all labels except the first position). This teaches the model to answer the question: "Given the text so far, what is the most likely next word?"
This is where the project comes to life. Generation is an auto-regressive loop, meaning we generate one token at a time, feed it back into the model, and then generate the next.
- Prompt: We start by providing a prompt, which is just our special control token (e.g., `input_ids = ["<POSITIVE>"]`).
- Prediction: The model takes this prompt and produces logits for the next token.
- Sampling: We now have a probability distribution over the entire vocabulary. Instead of just picking the most likely token (greedy decoding), which tends to be repetitive, we sample from this distribution using techniques like `temperature`, `top_k`, and `top_p` to control the creativity and coherence of the output (see the sketch after this list).
- Append: A new token (e.g., `"خیلی"`, "very") is sampled.
- Loop: The new token is appended to our input sequence. The new sequence (`input_ids = ["<POSITIVE>", "خیلی"]`) is fed back into the model. This loop repeats $N$ times to generate a complete comment.
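Below is a simplified sketch of such a decoding loop with `temperature`, `top_k`, and `top_p` filtering. It is not the repository's `generate` method, just an illustration under the assumption that `model(input_ids)` returns logits of shape `[B, T, V]`:

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def sample(model, input_ids, max_new_tokens=40, temperature=1.0, top_k=None, top_p=None):
    """Auto-regressive loop: predict -> filter -> sample -> append -> repeat.
    (Cropping the context to the model's block size is omitted for brevity.)"""
    for _ in range(max_new_tokens):
        logits = model(input_ids)[:, -1, :] / temperature           # next-token logits, (B, V)

        if top_k is not None:                                       # keep only the k largest logits
            kth_best = torch.topk(logits, top_k, dim=-1).values[..., -1, None]
            logits = logits.masked_fill(logits < kth_best, float("-inf"))

        probs = F.softmax(logits, dim=-1)                           # (B, V) distribution

        if top_p is not None:                                       # nucleus (top-p) filtering
            sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
            cumulative = torch.cumsum(sorted_probs, dim=-1)
            outside = (cumulative - sorted_probs) > top_p           # mass accumulated *before* this token
            sorted_probs = sorted_probs.masked_fill(outside, 0.0)
            sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
            choice = torch.multinomial(sorted_probs, num_samples=1)
            next_token = sorted_idx.gather(-1, choice)              # map back to vocabulary ids
        else:
            next_token = torch.multinomial(probs, num_samples=1)    # (B, 1)

        input_ids = torch.cat([input_ids, next_token], dim=1)       # append and continue the loop
    return input_ids


# Usage with the MiniGPT2 sketch above, starting from a single (hypothetical) control-token id:
# generated = sample(MiniGPT2(), torch.tensor([[5]]), temperature=0.9, top_k=50, top_p=0.95)
```

Feeding the loop a prompt that contains only the encoded `<POSITIVE>` token then yields a positive comment, one sampled token at a time.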
```
pytorch-gpt2-persian-sentiment-generation/
├── .gitignore           # Ignores data, models, logs, and Python cache
├── LICENSE              # MIT License file
├── README.md            # This file
├── requirements.txt     # Project dependencies
├── notebooks/
│   └── demo.ipynb       # A guided notebook to run training and generation
├── scripts/
│   ├── train.py         # Main script to train the model
│   └── generate.py      # Script to generate text with a trained model
└── src/
    ├── __init__.py
    ├── config.py        # GPT2Config class
    ├── dataset.py       # CommentDataset and dataloader functions
    ├── model.py         # CausalSelfAttention, MLP, Block, and GPT2 classes
    └── utils.py         # Logging setup, plotting, and data download helpers
```
- Clone the Repository:

  ```bash
  git clone https://github.com/msmrexe/pytorch-gpt2-persian-sentiment-generation.git
  cd pytorch-gpt2-persian-sentiment-generation
  ```
- Setup Environment and Install Dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate   # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```
- Authenticate (One-Time Setup): You must authenticate with Hugging Face to download the tokenizer and Kaggle to download the dataset.

  ```bash
  # 1. Log in to Hugging Face
  huggingface-cli login

  # 2. Set up your Kaggle API token:
  #    download 'kaggle.json' from your Kaggle account
  #    and place it in ~/.kaggle/kaggle.json
  mkdir -p ~/.kaggle
  cp /path/to/your/kaggle.json ~/.kaggle/
  chmod 600 ~/.kaggle/kaggle.json
  ```
- Train the Model: Run the `train.py` script. The script will automatically download the dataset to the `data/` directory. You can customize all hyperparameters via arguments.

  ```bash
  python scripts/train.py \
      --epochs 5 \
      --batch_size 32 \
      --lr 1e-4 \
      --n_embd 192 \
      --n_layer 3 \
      --n_head 3
  ```

  The best model will be saved to `models/best_gpt2_model.pt`.
- Generate Text: Use the `generate.py` script to generate text with your trained model.

  ```bash
  # Generate positive comments
  python scripts/generate.py --sentiment positive --num_samples 5

  # Generate negative comments with different parameters
  python scripts/generate.py \
      --sentiment negative \
      --num_samples 3 \
      --temperature 1.2 \
      --top_k 50
  ```
- Run the Demo Notebook: For a guided, step-by-step walkthrough, open and run the `notebooks/demo.ipynb` notebook.

  ```bash
  jupyter notebook notebooks/demo.ipynb
  ```
Feel free to connect or reach out if you have any questions!
- Maryam Rezaee
- GitHub: @msmrexe
- Email: [email protected]
This project is licensed under the MIT License. See the LICENSE file for full details.