code
cover
images
01Introduction.md
01Introduction.pdf
01Introduction.pptx
02DeepSpeed.pdf
02DeepSpeed.pptx
02SPTD.md
03CP.md
04EP.md
1220Code01DDP.ipynb
1220Code01DDP.md
Code01DDP.ipynb
Code01DDP.md
Code02MP.ipynb
Code02MP.md
README.md

Distributed Parallelism Basics

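Distributed training splits a model training job across multiple compute nodes, which speeds up training and makes it possible to handle larger datasets. A model is an organic whole, however, so simply adding machines does not by itself increase usable compute; efficient parallel training requires both a parallelism strategy and a communication design. This section focuses on the core multi-dimensional parallelism features of the mainstream industry frameworks DeepSpeed and Megatron-LM and explains the principles behind them.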
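To make the interplay between a parallelism strategy and communication concrete, here is a minimal sketch of one synchronous data-parallel step, simulated in plain Python. It is an illustration under stated assumptions, not how DeepSpeed or Megatron-LM implement it: the one-parameter model `y_hat = w * x`, the helper names (`local_gradient`, `all_reduce_mean`, `data_parallel_step`), and the toy batch are all hypothetical, and the "all-reduce" is an ordinary in-process average rather than a real collective.

```python
# Toy simulation of data parallelism (no GPUs, no real network communication).
# Assumed model: y_hat = w * x with mean squared error loss.

def local_gradient(w, shard):
    """Gradient of the mean squared error w.r.t. w on one worker's data shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def data_parallel_step(w, batch, num_workers, lr):
    """One synchronous SGD step; assumes the batch divides evenly across workers."""
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    grads = [local_gradient(w, shard) for shard in shards]  # independent compute
    synced = all_reduce_mean(grads)                         # communication step
    # Each replica applies the same averaged gradient, so all weights stay identical.
    return [w - lr * g for g in synced]

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
replicas = data_parallel_step(w=0.0, batch=batch, num_workers=2, lr=0.01)
single = 0.0 - 0.01 * local_gradient(0.0, batch)  # same step on a single worker

print("replica weights:", replicas)
print("single-worker weight:", single)
```

With equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient, so the replicas match the single-worker result. The cost of the scheme is the synchronization step, which is why communication design matters as the number of workers grows.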

Outline

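Topic	Section	Links	Status
Distributed parallelism	01 Introduction to distributed parallel frameworks	PPT, Video
Distributed parallelism	02 Introduction to DeepSpeed	PPT, Video
💖	🌟	💖
Parallelism in practice 💻	CODE 01: Building PyTorch DDP from scratch	Markdown, Jupyter, Article	✅
Parallelism in practice 💻	CODE 02: Model parallelism in PyTorch	Markdown, Jupyter, Article	✅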

Notes

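The written course content is being added and updated section by section, with further updates to AI Infra made in spare time each evening. Your encouragement and participation are very welcome!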

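The written course is open-sourced at AI Infra, the video series is hosted on Bilibili and YouTube, and the slides (PPT) are open-sourced on GitHub. Feel free to cite them!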
