Write-up:
- https://dudeperf3ct.github.io/projects/train_llm_part0/ (data)
- https://dudeperf3ct.github.io/projects/train_llm_part1/ (tokenizer)
- https://dudeperf3ct.github.io/projects/train_llm_part2/ (pretraining)
- `codellm_data`: Parses and downloads the datasets.
- `codellm_tokenizer`: Trains a custom byte-level BPE tokenizer on a subset of the `tokyotech-llm/swallow-code-v2` dataset.
- `codellm_pretrain`: Trains a Llama 3.2 model using the custom tokenizer and data. Two implementations:
  - `torch_titan`: Uses the `torchtitan` library for training.
  - NeMo (WIP)
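For a rough idea of what the tokenizer step involves, here is a minimal sketch using the Hugging Face `tokenizers` library. This is an illustration, not the project's actual training script: the tiny in-memory corpus, vocab size, and special token below stand in for the real swallow-code-v2 subset and configuration.

```python
# Hedged sketch: byte-level BPE training with Hugging Face `tokenizers`.
# The corpus and hyperparameters are placeholders, not the project's real ones.
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus; the real project streams a subset of
# tokyotech-llm/swallow-code-v2 instead.
corpus = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=512,          # tiny vocab for illustration only
    min_frequency=1,
    special_tokens=["<|endoftext|>"],
)

encoding = tokenizer.encode("def add(a, b):")
print(encoding.tokens)
```

Byte-level BPE starts from the 256 possible byte values, so any input string can be encoded and losslessly decoded even with a small vocabulary.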
This project is licensed under the MIT License - see the LICENSE file for details.