Important

Our multi-node cluster training product is in early preview and not yet generally available. Please contact us for access.


Modal Multinode Training Guide

Well-documented examples of running distributed training jobs on Modal. Use this repository to learn how to build your own.

Examples

  • benchmark/ contains performance and reliability tests, using AWS EFA by default.
  • lightning/ contains a simple lightning.ai Fabric example.
  • nanoGPT/ trains Karpathy's nanoGPT reproduction of OpenAI's GPT-2.
  • resnet50/ trains a ResNet50 model on the ImageNet dataset.
  • starcoder/ runs accelerated finetuning of Llama-2-7B on Rust and Go code, supporting either torchrun or accelerate.
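
Several of these examples launch their training scripts with torchrun. As a rough illustration of what a multi-node launch involves, the sketch below builds the per-node torchrun command line. It is not taken from this repository and does not use Modal's API: the helper name `build_torchrun_command`, the script name `train.py`, and the address `10.0.0.1` are hypothetical, and only the standard torchrun flags are assumed.

```python
import shlex


def build_torchrun_command(
    script,
    num_nodes,
    gpus_per_node,
    node_rank,
    master_addr,
    master_port=29500,
    script_args=(),
):
    """Return the torchrun argument list for one node of a multi-node job.

    Hypothetical helper for illustration; every node runs the same command
    and only node_rank differs between nodes.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank {node_rank} is outside [0, {num_nodes})")
    return [
        "torchrun",
        f"--nnodes={num_nodes}",               # total number of nodes in the job
        f"--nproc_per_node={gpus_per_node}",   # worker processes (usually one per GPU)
        f"--node_rank={node_rank}",            # index of this node, 0-based
        f"--master_addr={master_addr}",        # address of the rank-0 node
        f"--master_port={master_port}",        # rendezvous port on the rank-0 node
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Assumed values: two nodes with eight GPUs each, launching a placeholder script.
    cmd = build_torchrun_command(
        "train.py",
        num_nodes=2,
        gpus_per_node=8,
        node_rank=0,
        master_addr="10.0.0.1",
    )
    print(shlex.join(cmd))
```

On a real cluster, the node rank and the rank-0 address would come from whatever the platform reports about the cluster rather than from hard-coded values.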

Documentation

The multi-node training guide is currently available on Notion: modal-com.notion.site/Multi-node-docs.

Other relevant documentation in our guide:

Demo

Video: multi-node ResNet50 training demo (multinode-resnet50.online-video-cutter.com.mp4)

License

The MIT license.