Write-up:
- https://dudeperf3ct.github.io/projects/train_llm_part0/ (data)
- https://dudeperf3ct.github.io/projects/train_llm_part1/ (tokenizer)
- https://dudeperf3ct.github.io/projects/train_llm_part2/ (pretraining)
- `codellm_data`: Parses and downloads the datasets.
- `codellm_tokenizer`: Trains a custom byte-level BPE tokenizer on a subset of the `tokyotech-llm/swallow-code-v2` dataset.
- `codellm_pretrain`: Trains a Llama 3.2 model using the custom tokenizer and data. Two implementations:
  - `torch_titan`: Uses the `torchtitan` library for training.
  - NeMo (WIP)
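For a rough idea of what the tokenizer step involves, here is a minimal sketch using the Hugging Face `tokenizers` library. This is an illustration, not the project's actual training script: the tiny in-memory corpus, vocab size, and special token below stand in for the real swallow-code-v2 subset and configuration.

```python
# Hedged sketch: byte-level BPE training with Hugging Face `tokenizers`.
# The corpus and hyperparameters are placeholders, not the project's real ones.
from tokenizers import ByteLevelBPETokenizer

# Stand-in corpus; the real project streams a subset of
# tokyotech-llm/swallow-code-v2 instead.
corpus = [
    "def add(a, b):\n    return a + b",
    "def sub(a, b):\n    return a - b",
]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    corpus,
    vocab_size=512,          # tiny vocab for illustration only
    min_frequency=1,
    special_tokens=["<|endoftext|>"],
)

encoding = tokenizer.encode("def add(a, b):")
print(encoding.tokens)
```

Byte-level BPE starts from the 256 possible byte values, so any input string can be encoded and losslessly decoded even with a small vocabulary.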
This project is licensed under the MIT License - see the LICENSE file for details.