CustomGPT is a from-scratch re-implementation of the 124-million-parameter GPT-2 model, built block by block in PyTorch and closely following the original design specifications. ✨
HuggingFace Repository: https://huggingface.co/NamrataThakur
The model uses a transformer architecture with self-attention to capture contextual relationships in text. Version 1 implements three core fine-tuning strategies: supervised classification fine-tuning (SFT), instruction fine-tuning (IFT), and preference fine-tuning (PFT) with the Direct Preference Optimization (DPO) loss.
Every component was built entirely from scratch, with the goal of deeply understanding the inner workings of the so-called "black box": the Large Language Model.
A fully modular training pipeline was also developed following Object-Oriented Programming (OOP) principles, enabling flexible integration of various training strategies and streamlined construction of custom data loaders.
CustomGPT uses the standard 124M decoder-only GPT architecture, with the specifications given in the paper "Language Models are Unsupervised Multitask Learners":
- 12 transformer blocks
- 12 attention heads
- 768 embedding dimensions
- Vocabulary size of 50,257 tokens
- Context window of 1024 tokens
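As a quick sanity check on the "124M" figure, the sketch below counts the parameters implied by the list above. The config keys and the helper `count_gpt2_params` are illustrative names, not taken from the repository. It assumes a 4x feed-forward expansion, biases on all linear layers, and an output head that shares its weights with the token embedding, as in the original GPT-2. An implementation that keeps a separate output head would have about 38.6M more parameters.

```python
# Hypothetical config mirroring the specification list above.
GPT2_124M = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
}

def count_gpt2_params(cfg, tie_output_head=True):
    """Count trainable parameters of a GPT-2 style decoder (assumed layout)."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d
    pos_emb = cfg["context_length"] * d
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # two linear layers, 4x expansion
    norms = 2 * (2 * d)                           # two LayerNorms (scale + shift)
    block = attn + mlp + norms
    final_norm = 2 * d
    head = 0 if tie_output_head else cfg["vocab_size"] * d
    return tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + head

total = count_gpt2_params(GPT2_124M)
print(f"{total:,} parameters")  # 124,439,808 parameters
```

Note that the number of attention heads does not change the count: the heads split the 768-dimensional projection into 12 slices of 64 dimensions each.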
The version 1 models, fine-tuned on the datasets provided in the book Build a Large Language Model (From Scratch), are:
- A classification model that classifies messages as spam or ham.
- An instruct model (fine-tuned on GPT-2 355M) that follows instructions using the Alpaca prompt template.
- A preference-tuned model (fine-tuned on the above instruct model) that gives polite responses to queries.
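For readers unfamiliar with the Alpaca prompt template, here is a minimal sketch of how an instruction can be wrapped before it is passed to the instruct model. The wording follows the original Stanford Alpaca format; the repository may use a slightly different variant, and the function name `format_alpaca_prompt` is hypothetical.

```python
def format_alpaca_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt; the model's answer is generated after '### Response:'."""
    if input_text:
        header = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request."
        )
        return (
            f"{header}\n\n### Instruction:\n{instruction}"
            f"\n\n### Input:\n{input_text}\n\n### Response:\n"
        )
    header = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
    )
    return f"{header}\n\n### Instruction:\n{instruction}\n\n### Response:\n"

prompt = format_alpaca_prompt("What is the molecular formula for salt?")
print(prompt)
```

During instruction fine-tuning, the reference answer would be appended after the `### Response:` marker. At inference time the prompt ends there and the model continues the text.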
To run the training pipeline, follow these steps:
# Clone the repository and move into it
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation
# Create and activate a virtual environment (on Windows, run env\Scripts\activate instead)
python -m venv env
source env/bin/activate
# Install the required packages
pip install -r requirements.txt
# Keep your training data in the "data" folder
# Set the training parameters for the chosen training type (SFT, IFT, or PFT) in the run_gpt2_train.sh file
# Run the bash file containing the training arguments
bash run_gpt2_train.sh
The easiest way to interact with the fine-tuned models is through the Chainlit interface:
# Clone the repository and move into it
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation
# Create and activate a virtual environment (on Windows, run env\Scripts\activate instead)
python -m venv env
source env/bin/activate
# Install the required packages
pip install -r requirements.txt
# Run the chainlit command to start the interface on localhost
chainlit run main.py
This launches a web application where you can enter text and see the response of the chosen model type: Chat Model or Classification Model. For the Chat Model, you can also change text generation settings such as Temperature, Top-K, and Max New Tokens.
The models were trained using PyTorch. The training process involved:
- Tokenizing the input text.
- Creating sliding windows of fixed block size.
- Training the model with cross-entropy loss (for SFT and IFT) and Direct Preference Optimization (DPO) loss (for PFT).
- Applying learning rate scheduling with warmup and cosine decay.
- Applying gradient clipping with a maximum norm of 1 to prevent excessively large gradient updates.
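The steps above can be sketched in a few small functions. This is an illustrative outline written in plain Python so that it runs without PyTorch; it is not the repository's code. All function names, the `stride`, `peak_lr`, `min_lr`, and `warmup_steps` parameters, and the default `beta=0.1` are assumptions. In a PyTorch pipeline, clipping is typically done with `torch.nn.utils.clip_grad_norm_`, and the DPO loss would be computed on batched tensors of log-probabilities.

```python
import math

def sliding_windows(token_ids, block_size, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - block_size, stride):
        inputs = token_ids[start : start + block_size]
        targets = token_ids[start + 1 : start + block_size + 1]
        pairs.append((inputs, targets))
    return pairs

def lr_at_step(step, peak_lr, min_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * max_norm / norm for g in grads]

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities of each response."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

pairs = sliding_windows(list(range(10)), block_size=4, stride=4)
print(pairs[0])  # ([0, 1, 2, 3], [1, 2, 3, 4])
```

The DPO loss falls below log 2 (about 0.693) as soon as the policy favours the chosen response over the rejected one more strongly than the frozen reference model does, which is the signal that drives preference fine-tuning.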
- Chat Model (IFT / PFT):
Correct Responses:
Prompt: definition of haughty
Output:
A haughty person is someone who is arrogant and conceited.
Prompt: What is the molecular formula for salt?
Output:
The molecular formula for salt is NaCl.
Incorrect Responses:
Prompt: Name the colors present in a rainbow?
Output:
The colors present in a rainbow are red, orange, yellow, and blue.
Prompt: opposite of haughty
Output:
Hughty is a type of arrogant.
- Classification Model (SFT):
Correct Responses:
Input: Amazon is hiring! Work from home and earn ₹5,000/day. No experience needed.
Register now: [joblink.xyz] - Limited slots!
Output:
Spam
Input: Your SBI account will be blocked!
Update your KYC immediately at: [sbi-update-kyc.info]
Output:
Spam
Incorrect Responses:
Input: Final notice: Your car insurance policy will be cancelled today. Renew instantly
Output:
Ham
Input: Dear user, your electricity bill payment failed. Pay now to avoid disconnection:
Output:
Ham
The models in this version were fine-tuned on a relatively small dataset, which limited their ability to generalize and led to some incorrect responses. However, high predictive performance was not the primary objective at this stage. Instead, the central aim of Version 1 was to design and implement the complete training pipeline from the ground up, covering data preprocessing, model integration, fine-tuning strategies, and evaluation. By focusing on a fully functional, end-to-end pipeline, this version served as a foundation in which modularity, scalability, and clarity of design took precedence over the performance of individual models.
In future iterations, our focus will shift toward systematically enhancing the accuracy of the models through refined training strategies and expanded datasets.
During inference, the instruction fine-tuned model uses several techniques to produce high-quality text:
- Temperature scaling to control randomness
- Top-k sampling, which restricts each step to the k most likely tokens to balance focus and diversity
- Autoregressive generation, one token at a time, until the specified maximum number of new tokens is reached
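The decoding techniques listed above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the repository's implementation: `sample_next_token`, `generate`, and the `next_logits_fn` callback (a stand-in for a forward pass that returns the logits for the next token) are hypothetical names, and treating a temperature of zero as greedy decoding is a design choice made here.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick the next token id from raw logits using top-k filtering and temperature."""
    if temperature <= 0.0:
        # Treat zero temperature as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]  # keep only the k most likely tokens
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]

def generate(next_logits_fn, prompt_ids, max_new_tokens, context_size=1024, **sampling):
    """Append one sampled token at a time until max_new_tokens is reached."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_logits_fn(ids[-context_size:])  # crop to the context window
        ids.append(sample_next_token(logits, **sampling))
    return ids

def toy_model(ids):
    # Stand-in for the real network: favours the token after the last one (vocab of 5)
    return [5.0 if i == (ids[-1] + 1) % 5 else 0.0 for i in range(5)]

print(generate(toy_model, [0], max_new_tokens=4, top_k=1))  # [0, 1, 2, 3, 4]
```

Lower temperatures sharpen the distribution toward the most likely token, while higher temperatures flatten it. With `top_k=1` the sampler always returns the most likely token, which is why the toy run above is deterministic.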
We are deeply grateful to the following teachers who helped, taught, and inspired us to start this project:
- Sebastian Raschka for his excellent book Build a Large Language Model (From Scratch)
- Umar Jamil for his in-depth videos on topics like RLHF, DPO, and LoRA
This project is licensed under the MIT license - see the LICENSE file for details.
If you find CustomGPT useful, please consider starring the repository ⭐
