mittapallynitin/causal-languge-modeling

Causal Language Modeling for Code Generation

This repository contains a GPT-2-based causal language model trained on the CodeSearchNet Python dataset. The model is designed to generate Python code given a natural language docstring as input.
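Inference follows the standard causal-generation loop: tokenize the docstring prompt, then autoregressively sample code tokens. A minimal sketch of that loop is below; it uses a tiny randomly initialized GPT-2 as a stand-in, since the actual fine-tuned checkpoint and tokenizer names are not specified here (substitute your trained weights in practice):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny randomly initialized GPT-2 as a placeholder for the fine-tuned model.
# In practice you would load your checkpoint with GPT2LMHeadModel.from_pretrained(...).
config = GPT2Config(vocab_size=256, n_positions=128, n_embd=64, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config).eval()

# Token IDs standing in for an encoded docstring prompt.
prompt_ids = torch.tensor([[1, 2, 3]])

# Greedy causal generation: each new token is conditioned on all previous ones.
out = model.generate(prompt_ids, max_new_tokens=8, do_sample=False, pad_token_id=0)
```

With a real checkpoint, the generated IDs would then be decoded back to Python source with the project's tokenizer.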

📌 Features

  • Custom Tokenizer: Built from scratch instead of using the default GPT-2 tokenizer.
  • Fine-tuned GPT-2: Trained on Python-specific code snippets.
  • Docstring-to-Code Generation: Generates function implementations from textual descriptions.
  • Efficient Training Pipeline: Implements dataset preprocessing, model training, and evaluation using PyTorch and Hugging Face’s Transformers.
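A from-scratch tokenizer like the one described above can be built with Hugging Face's `tokenizers` library. The sketch below is an assumption about the approach (byte-level BPE, illustrative vocabulary size and special tokens), not the repository's exact configuration:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE trained from scratch on a (toy) Python corpus.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# Vocabulary size and special tokens are illustrative choices.
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],
)

corpus = [
    "def add(a, b):\n    return a + b\n",
    "def sub(a, b):\n    return a - b\n",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

ids = tokenizer.encode("def add(a, b):").ids
```

In the real pipeline the iterator would stream code snippets from the CodeSearchNet training split instead of an in-memory list.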

Training Data

The model is trained on the Python subset of CodeSearchNet, a dataset containing roughly 800K function-docstring pairs.
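For causal language modeling, each pair is typically serialized into a single training string: docstring prompt first, then the code completion. A minimal sketch (the `<bos>`/`<sep>`/`<eos>` markers are hypothetical, not necessarily the tokens this repository uses):

```python
# Hypothetical special tokens delimiting prompt and completion.
BOS, SEP, EOS = "<bos>", "<sep>", "<eos>"

def build_example(docstring: str, code: str) -> str:
    """Join a docstring prompt and its code completion into one training string."""
    return f"{BOS}{docstring.strip()}{SEP}{code.strip()}{EOS}"

example = build_example(
    " Add two numbers. ",
    "def add(a, b):\n    return a + b",
)
```

At generation time the model is fed everything up to and including the separator, so it learns to continue with code conditioned on the docstring.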

🛠 Technologies Used

  • PyTorch
  • Transformers (Hugging Face)
  • Tokenizers
  • Datasets (Hugging Face)

📜 License

This project is licensed under the MIT License. See LICENSE for details.
