
Commit e269e20

Merge branch 'sc21' into 'main'

scripts for sc21

See merge request ADLR/megatron-lm!298

2 parents: 6a68098 + de7dc40

File tree

13 files changed: +669 −0 lines changed

examples/sc21/CONFIG.sh

Lines changed: 57 additions & 0 deletions

```bash
#!/bin/bash


# SLURM options.
export SLURM_PARTITION=<slurm partition, used to feed -p option in slurm>
export SLURM_ACCOUNT=<slurm account, used to feed -A option in slurm>


# Source code.
export MEGATRON_CODE_DIR=<megatron source code directory>


# This variable is used to mount the relevant part of the filesystem
# inside the docker container. Note that the `MEGATRON_CODE_DIR` and the
# launch directory already get mounted; this variable should be used to
# mount the directories that contain the data and tokenizer files.
export DOCKER_MOUNT_DIR=<megatron dataset and bpe tokenizer vocab path>


# Data and tokenizer files.
MEGATRON_DATA=<path to megatron processed data>
BPE_VOCAB_FILE=<path to bpe vocab file>
BPE_MERGE_FILE=<path to bpe merges file>


# Megatron input parameters.
# `MEGATRON_EXTRA_PARAMS` can be used to provide any extra parameters
# that are not listed here.
export MEGATRON_PARAMS=" ${MEGATRON_EXTRA_PARAMS} \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --micro-batch-size ${MBS} \
        --global-batch-size ${GBS} \
        --num-layers ${NLS} \
        --hidden-size ${HS} \
        --num-attention-heads ${NAH} \
        --DDP-impl ${DDP} \
        --data-path ${MEGATRON_DATA} \
        --vocab-file ${BPE_VOCAB_FILE} \
        --merge-file ${BPE_MERGE_FILE} \
        --log-interval 5 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --train-iters 500 \
        --lr-decay-iters 320 \
        --lr 0.0001 \
        --min-lr 0.00001 \
        --lr-decay-style cosine \
        --lr-warmup-fraction 0.01 \
        --split 969,30,1 \
        --eval-iters 100 \
        --eval-interval 1000 \
        --clip-grad 1.0 \
        --fp16 \
        --loss-scale 8192 "
```
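`CONFIG.sh` expects the parallelism knobs (`TP`, `PP`, `MBS`, and so on) to already be set by whichever per-figure run script sources it. A minimal sketch of that variable flow, using hypothetical values and only an excerpt of the flag list:

```bash
#!/bin/bash
# Sketch (hypothetical values): a run script sets the knobs first, then
# the CONFIG.sh-style expansion folds them into one parameter string.
TP=8
PP=1
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
MEGATRON_PARAMS=" ${MEGATRON_EXTRA_PARAMS} \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} "
echo "${MEGATRON_PARAMS}"
```

Because the run scripts source `CONFIG.sh` with `.`, these variables expand in the same shell that later sources `SBATCH.sh`.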

examples/sc21/README.md

Lines changed: 45 additions & 0 deletions

# Reproducing Figures in SC21 Paper

This directory contains some of the scripts that were used to produce the
results in the [Megatron paper](https://arxiv.org/pdf/2104.04473.pdf) that is
to appear at [SuperComputing 2021](https://sc21.supercomputing.org/). These
scripts use [Slurm](https://slurm.schedmd.com/documentation.html) with the
[pyxis plugin](https://github.com/NVIDIA/pyxis), but can be modified for other
schedulers as well.


## Setup

All the cluster-dependent variables are in [`CONFIG.sh`](./CONFIG.sh). Please
update the unspecified values (in angle brackets `<...>`) before launching any
scripts.


## Scripts

Below is a list of scripts that can be used to reproduce various figures in our
[paper](https://arxiv.org/pdf/2104.04473.pdf):

* [run_table_1.sh](./run_table_1.sh): Table 1 showing weak-scaling throughput
  for GPT models ranging from 1 billion to 1 trillion parameters.
* [run_figure_11.sh](./run_figure_11.sh): Figure 11 showing the weak-scaling
  performance of pipeline parallelism.
* [run_figure_12.sh](./run_figure_12.sh): Figure 12 showing the effect of
  the interleaved schedule on a 175B GPT model.
* [run_figure_13.sh](./run_figure_13.sh): Figure 13 showing the effect of
  different degrees of pipeline and tensor model parallelism on a model with
  162.2 billion parameters.
* [run_figure_14.sh](./run_figure_14.sh): Figure 14 showing the effect of
  different degrees of data and pipeline model parallelism on a model with
  5.9 billion parameters.
* [run_figure_15.sh](./run_figure_15.sh): Figure 15 showing the effect of
  different degrees of data and tensor model parallelism on a model with
  5.9 billion parameters.
* [run_figure_16.sh](./run_figure_16.sh): Figure 16 showing the effect of
  microbatch size.
* [run_figure_17.sh](./run_figure_17.sh): Figure 17 showing the effect of
  activation recomputation.
* [run_figure_18.sh](./run_figure_18.sh): Figure 18 showing the effect of
  the scatter-gather communication optimization.

examples/sc21/SBATCH.sh

Lines changed: 13 additions & 0 deletions

```bash
#!/bin/bash


sbatch -p ${SLURM_PARTITION} \
       -A ${SLURM_ACCOUNT} \
       --job-name=${JOB_NAME} \
       --nodes=${NNODES} \
       --export=MEGATRON_CODE_DIR,MEGATRON_PARAMS,DOCKER_MOUNT_DIR SRUN.sh

exit 0
```
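The `--export=` list restricts which variables from the submitting shell reach the batch job, which is why `SBATCH.sh` names `MEGATRON_CODE_DIR`, `MEGATRON_PARAMS`, and `DOCKER_MOUNT_DIR` explicitly. A stand-in sketch of that filtering, using `env -i` (with hypothetical values) rather than sbatch itself:

```bash
#!/bin/bash
# Stand-in sketch for export filtering: only explicitly forwarded
# variables appear in the child environment; everything else is unset.
export JOB_NAME=demo
export NNODES=2
export UNRELATED=should_not_leak
RESULT=$(env -i JOB_NAME="${JOB_NAME}" NNODES="${NNODES}" \
    sh -c 'echo "${JOB_NAME} ${NNODES} ${UNRELATED:-unset}"')
echo "${RESULT}"   # -> demo 2 unset
```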

examples/sc21/SRUN.sh

Lines changed: 18 additions & 0 deletions

```bash
#!/bin/bash

#SBATCH -t 0:30:00 --exclusive --mem=0 --overcommit --ntasks-per-node=8


THIS_DIR=`pwd`
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
mkdir -p ${THIS_DIR}/logs


CMD="python -u ${MEGATRON_CODE_DIR}/pretrain_gpt.py ${MEGATRON_PARAMS}"


srun -l \
     --container-image "nvcr.io#nvidia/pytorch:20.12-py3" \
     --container-mounts "${THIS_DIR}:${THIS_DIR},${MEGATRON_CODE_DIR}:${MEGATRON_CODE_DIR},${DOCKER_MOUNT_DIR}:${DOCKER_MOUNT_DIR}" \
     --output=${THIS_DIR}/logs/%x_%j_$DATETIME.log sh -c "${CMD}"
```

examples/sc21/run_figure_11.sh

Lines changed: 46 additions & 0 deletions

```bash
#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [1, 2, 4, 8].
PP=1

# Batch size (global batch size) options = [8, 128].
GBS=8


# Set pipeline-parallel size options.
NLS=$((3*PP))
NNODES=${PP}


# Other params.
TP=8
MBS=1
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "


# Name of the job.
export JOB_NAME=results_figure_11_pipeline_parallel_size_${PP}_batch_size_${GBS}


# Import the configs.
. `pwd`/CONFIG.sh


# Submit the job.
. `pwd`/SBATCH.sh


exit 0
```
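Since `SRUN.sh` runs 8 tasks per node and `TP` is fixed at 8, each pipeline stage spans exactly one node, so the script scales both the layer count and the node count with `PP`. The derived values for the documented options can be sketched as:

```bash
#!/bin/bash
# Derived settings for each documented PP option in run_figure_11.sh:
# 3 transformer layers per pipeline stage, one node per stage.
for PP in 1 2 4 8; do
    NLS=$((3*PP))
    NNODES=${PP}
    echo "PP=${PP} NLS=${NLS} NNODES=${NNODES}"
done
```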

examples/sc21/run_figure_12.sh

Lines changed: 54 additions & 0 deletions

```bash
#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Interleaved schedule options = [YES, NO].
INTERLEAVED=YES

# Batch size (global batch size) options = [12, 24, 36, ..., 60].
GBS=12


# Set interleaved schedule options.
if [ ${INTERLEAVED} == "YES" ]; then
    MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${INTERLEAVED} == "NO" ]; then
    MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
else
    echo "Invalid configuration"
    exit 1
fi


# Other params.
TP=8
PP=12
MBS=1
NLS=96
HS=12288
NAH=96
DDP=local
NNODES=12


# Name of the job.
export JOB_NAME=results_figure_12_interleaved_${INTERLEAVED}_batch_size_${GBS}


# Import the configs.
. `pwd`/CONFIG.sh


# Submit the job.
. `pwd`/SBATCH.sh


exit 0
```
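With `NLS=96` and `PP=12`, each pipeline stage holds 8 layers; `--num-layers-per-virtual-pipeline-stage 2` then splits each stage into 4 model chunks for the interleaved schedule. A sketch of that arithmetic:

```bash
#!/bin/bash
# Interleaving arithmetic for the run_figure_12.sh settings.
NLS=96
PP=12
LAYERS_PER_VSTAGE=2
LAYERS_PER_STAGE=$((NLS / PP))                     # layers on each pipeline stage
VSTAGES=$((LAYERS_PER_STAGE / LAYERS_PER_VSTAGE))  # model chunks per stage
echo "layers_per_stage=${LAYERS_PER_STAGE} virtual_stages=${VSTAGES}"
```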

examples/sc21/run_figure_13.sh

Lines changed: 46 additions & 0 deletions

```bash
#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2

# Batch size (global batch size) options = [32, 128].
GBS=32


# Set pipeline-parallel and tensor-parallel size options.
TP=$((64/PP))


# Other params.
MBS=1
NLS=32
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8


# Name of the job.
export JOB_NAME=results_figure_13_pipeline_parallel_size_${PP}_tensor_parallel_size_${TP}_batch_size_${GBS}


# Import the configs.
. `pwd`/CONFIG.sh


# Submit the job.
. `pwd`/SBATCH.sh


exit 0
```
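This script holds the total model-parallel footprint at TP × PP = 64 GPUs (8 nodes × 8 GPUs per node), deriving `TP` from the chosen `PP`. The invariant can be checked over the documented options:

```bash
#!/bin/bash
# TP is derived so that TP*PP stays at 64 for every documented PP option.
for PP in 2 4 8 16 32; do
    TP=$((64/PP))
    echo "PP=${PP} TP=${TP} product=$((TP*PP))"
done
```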

examples/sc21/run_figure_14.sh

Lines changed: 47 additions & 0 deletions

```bash
#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2

# Batch size (global batch size) options = [32, 512].
GBS=32


# Set pipeline-parallel and data-parallel size options.
DP=$((64/PP))


# Other params.
TP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8


# Name of the job.
export JOB_NAME=results_figure_14_pipeline_parallel_size_${PP}_data_parallel_size_${DP}_batch_size_${GBS}


# Import the configs.
. `pwd`/CONFIG.sh


# Submit the job.
. `pwd`/SBATCH.sh


exit 0
```

examples/sc21/run_figure_15.sh

Lines changed: 47 additions & 0 deletions

```bash
#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Tensor-parallel size options = [2, 4, 8, 16, 32].
TP=2

# Batch size (global batch size) options = [32, 128, 512].
GBS=32


# Set tensor-parallel and data-parallel size options.
DP=$((64/TP))


# Other params.
PP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8


# Name of the job.
export JOB_NAME=results_figure_15_tensor_parallel_size_${TP}_data_parallel_size_${DP}_batch_size_${GBS}


# Import the configs.
. `pwd`/CONFIG.sh


# Submit the job.
. `pwd`/SBATCH.sh


exit 0
```
