Create your venv:
python -m venv .venv
source .venv/bin/activate
Install torch:
pip install torch --index-url https://download.pytorch.org/whl/cu128
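To sanity-check that the CUDA-enabled build was installed, a quick check like the following should print the torch version, its CUDA version, and whether a GPU is visible:
# optional: verify the CUDA build of torch
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"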
Install dependencies & torchtitan:
cd torchtitan
pip install -r requirements.txt
pip install torchao
pip install -e .
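If you want to confirm the editable install is picked up, an import check along these lines should print the package location:
# optional: verify torchtitan is importable from the editable install
python -c "import torchtitan; print(torchtitan.__file__)"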
Set the GPU vboost level, enable expandable segments for the PyTorch CUDA allocator, and launch training:
sudo nvidia-smi boost-slider --vboost 1
export PYTORCH_ALLOC_CONF=expandable_segments:True
torchrun --nproc-per-node=gpu -m torchtitan.train <config file>
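As an illustration, here is the same launch with one of the optimized configs described below substituted in (pick the config that matches your hardware):
# example launch; uses the 16x B200, 64k-sequence-length Llama 3 70B config listed below
torchrun --nproc-per-node=gpu -m torchtitan.train ./configs/llama3_70b-16xb200-64k.toml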
Optimized config files can be found under ./configs.
The config files in this directory generally follow this naming scheme:
<model type>_<size>-<num gpus>x<gpu type>-<seq len>.toml
Here are a couple of examples:
./configs/llama3_405b-32xgb300-8k.toml: the Llama 3 405B recipe for 32 GB300 GPUs at 8k sequence length
./configs/llama3_70b-16xb200-64k.toml: the Llama 3 70B recipe for 16 B200 GPUs at 64k sequence length
We provide train.sbatch to launch jobs on Slurm clusters. Here's a command to launch the training job:
sbatch --nodes <num nodes> train.sbatch --job.config-file <config file>
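For example, assuming nodes with 8 GPUs each (an assumption; adjust the node count to your cluster), the 16-GPU Llama 3 70B recipe would be launched roughly like this:
# illustrative: 16 GPUs / 8 GPUs per node = 2 nodes
sbatch --nodes 2 train.sbatch --job.config-file ./configs/llama3_70b-16xb200-64k.toml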