This repository contains a GPT-2-based causal language model trained on the CodeSearchNet Python dataset. The model is designed to generate Python code given a natural language docstring as input.
- Custom Tokenizer: Built from scratch on Python source rather than reusing the default GPT-2 tokenizer (see the training sketch after this list).
- Fine-tuned GPT-2: Trained on Python-specific code snippets.
- Docstring-to-Code Generation: Generates function implementations from natural-language descriptions (see the generation example below).
- Efficient Training Pipeline: Dataset preprocessing, model training, and evaluation built on PyTorch and Hugging Face's Transformers.
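As a sketch of the custom-tokenizer step, the snippet below trains a byte-level BPE tokenizer with Hugging Face's `tokenizers` library. The corpus file name is hypothetical, and the vocabulary size and special token are chosen here to mirror GPT-2's defaults; the repository's actual tokenizer settings may differ.

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on raw Python source.
# "python_corpus.txt" is a hypothetical dump of the training functions.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["python_corpus.txt"],
    vocab_size=50257,                    # mirrors GPT-2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],    # GPT-2's end-of-text token
)

# Writes vocab.json and merges.txt, loadable via GPT2TokenizerFast.
tokenizer.save_model("custom_tokenizer")
```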
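And as a usage sketch for docstring-to-code generation, the snippet below loads a fine-tuned checkpoint and samples a completion. The checkpoint path and the docstring-first prompt format are assumptions; adjust both to match how the model was actually trained.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Hypothetical path to the fine-tuned checkpoint.
ckpt = "checkpoints/gpt2-codesearchnet-python"
tokenizer = GPT2TokenizerFast.from_pretrained(ckpt)
model = GPT2LMHeadModel.from_pretrained(ckpt)
model.eval()

# Assumed prompt format: the docstring, followed by the code to complete.
prompt = '"""Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        top_p=0.95,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```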
The model is trained on the Python subset of CodeSearchNet, a dataset containing roughly 800K function-docstring pairs.
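For reference, a minimal way to pull the same data through Hugging Face's `datasets` library (the repository may load it differently, and recent `datasets` versions may require `trust_remote_code=True` for this script-based dataset):

```python
from datasets import load_dataset

# Load the Python subset of CodeSearchNet from the Hugging Face Hub.
dataset = load_dataset("code_search_net", "python", trust_remote_code=True)

# Each example pairs a natural-language docstring with its implementation.
example = dataset["train"][0]
print(example["func_documentation_string"])  # the docstring
print(example["func_code_string"])           # the corresponding Python code
```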
- PyTorch
- Transformers (Hugging Face)
- Tokenizers (Hugging Face)
- Datasets (Hugging Face)
This project is licensed under the MIT License. See LICENSE for details.