meg_gpt2_perf_n16.out
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
using world size: 64, data-parallel-size: 1, tensor-model-parallel size: 4, pipeline-model-parallel size: 16
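The data-parallel size reported above follows from the other three numbers: the ranks left over after tensor and pipeline parallelism are used for data parallelism. A minimal sketch of that arithmetic, where the function name is hypothetical and not taken from Megatron-LM:

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Ranks left for data parallelism after model parallelism (hypothetical helper)."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline parallel size")
    return world_size // model_parallel


# Values from the log line above: 64 GPUs, TP=4, PP=16.
print(data_parallel_size(64, 4, 16))  # 1
```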
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
adam_beta2 ...................................... 0.999
adam_eps ........................................ 1e-08
adlr_autoresume ................................. False
adlr_autoresume_interval ........................ 1000
apply_query_key_layer_scaling ................... True
apply_residual_connection_post_layernorm ........ False
attention_dropout ............................... 0.1
attention_softmax_in_fp32 ....................... False
bert_binary_head ................................ True
bert_load ....................................... None
bf16 ............................................ False
bias_dropout_fusion ............................. True
bias_gelu_fusion ................................ True
biencoder_projection_dim ........................ 0
biencoder_shared_query_context_model ............ False
block_data_path ................................. None
checkpoint_activations .......................... True
checkpoint_num_layers ........................... 1
clip_grad ....................................... 1.0
consumed_train_samples .......................... 0
consumed_valid_samples .......................... 0
data_impl ....................................... mmap
data_parallel_size .............................. 1
data_path ....................................... ['/gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document']
dataloader_type ................................. single
DDP_impl ........................................ local
decoder_seq_length .............................. None
distribute_checkpointed_activations ............. False
distributed_backend ............................. nccl
embedding_path .................................. None
encoder_seq_length .............................. 1024
eod_mask_loss ................................... False
eval_interval ................................... 100
eval_iters ...................................... 10
evidence_data_path .............................. None
exit_duration_in_mins ........................... None
exit_interval ................................... None
ffn_hidden_size ................................. 32768
finetune ........................................ False
fp16 ............................................ True
fp16_lm_cross_entropy ........................... False
fp32_residual_connection ........................ False
global_batch_size ............................... 1024
hidden_dropout .................................. 0.1
hidden_size ..................................... 8192
hysteresis ...................................... 2
ict_head_size ................................... None
ict_load ........................................ None
img_dim ......................................... 224
indexer_batch_size .............................. 128
indexer_log_interval ............................ 1000
init_method_std ................................. 0.02
init_method_xavier_uniform ...................... False
initial_loss_scale .............................. 4294967296
kv_channels ..................................... 256
layernorm_epsilon ............................... 1e-05
lazy_mpu_init ................................... None
load ............................................ /gpfsscratch/rech/eha/commun/checkpoints/gpt2-1-node
local_rank ...................................... 0
log_batch_size_to_tensorboard ................... False
log_interval .................................... 1
log_learning_rate_to_tensorboard ................ True
log_loss_scale_to_tensorboard ................... True
log_num_zeros_in_grad ........................... False
log_params_norm ................................. False
log_timers_to_tensorboard ....................... False
log_validation_ppl_to_tensorboard ............... False
loss_scale ...................................... None
loss_scale_window ............................... 1000
lr .............................................. 0.00015
lr_decay_iters .................................. 800
lr_decay_samples ................................ None
lr_decay_style .................................. cosine
lr_warmup_fraction .............................. 0.01
lr_warmup_iters ................................. 0
lr_warmup_samples ............................... 0
make_vocab_size_divisible_by .................... 128
mask_prob ....................................... 0.15
masked_softmax_fusion ........................... True
max_position_embeddings ......................... 1024
merge_file ...................................... /gpfsscratch/rech/eha/commun/models-custom/megatron-gpt2/megatron_lm_345m_v0.0/release/gpt2-merges.txt
micro_batch_size ................................ 4
min_loss_scale .................................. 1.0
min_lr .......................................... 1e-05
mmap_warmup ..................................... False
no_load_optim ................................... None
no_load_rng ..................................... None
no_save_optim ................................... None
no_save_rng ..................................... None
num_attention_heads ............................. 32
num_channels .................................... 3
num_classes ..................................... 1000
num_layers ...................................... 64
num_layers_per_virtual_pipeline_stage ........... None
num_workers ..................................... 2
onnx_safe ....................................... None
openai_gelu ..................................... False
optimizer ....................................... adam
override_lr_scheduler ........................... False
params_dtype .................................... torch.float16
patch_dim ....................................... 16
pipeline_model_parallel_size .................... 16
query_in_block_prob ............................. 0.1
rampup_batch_size ............................... None
rank ............................................ 0
reset_attention_mask ............................ False
reset_position_ids .............................. False
retriever_report_topk_accuracies ................ []
retriever_score_scaling ......................... False
retriever_seq_length ............................ 256
sample_rate ..................................... 1.0
save ............................................ /gpfsscratch/rech/eha/commun/checkpoints/gpt2-1-node
save_interval ................................... 500
scatter_gather_tensors_in_pipeline .............. True
seed ............................................ 1234
seq_length ...................................... 1024
sgd_momentum .................................... 0.9
short_seq_prob .................................. 0.1
split ........................................... 949,50,1
tensor_model_parallel_size ...................... 4
tensorboard_dir ................................. None
tensorboard_log_interval ........................ 1
tensorboard_queue_size .......................... 1000
titles_data_path ................................ None
tokenizer_type .................................. GPT2BPETokenizer
train_iters ..................................... 1000
train_samples ................................... None
use_checkpoint_lr_scheduler ..................... False
use_contiguous_buffers_in_ddp ................... False
use_cpu_initialization .......................... None
use_one_sent_docs ............................... False
virtual_pipeline_model_parallel_size ............ None
vocab_extra_ids ................................. 0
vocab_file ...................................... /gpfsscratch/rech/eha/commun/models-custom/megatron-gpt2/megatron_lm_345m_v0.0/release/gpt2-vocab.json
weight_decay .................................... 0.01
world_size ...................................... 64
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 256
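The constant 256 is consistent with the arguments printed above (global_batch_size 1024, micro_batch_size 4, data_parallel_size 1). A small sketch of the relationship, assuming the global batch is split evenly across data-parallel replicas and then into micro-batches; the function name is illustrative only:

```python
def num_micro_batches(global_batch_size: int, micro_batch_size: int, data_parallel_size: int) -> int:
    """Micro-batches each pipeline processes per iteration (illustrative helper)."""
    per_step = micro_batch_size * data_parallel_size
    if global_batch_size % per_step != 0:
        raise ValueError("global batch size must be divisible by micro batch size * data parallel size")
    return global_batch_size // per_step


# Values from the argument dump in this log.
print(num_micro_batches(1024, 4, 1))  # 256
```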
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 50257) with 431 dummy tokens (new size: 50688)
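The padded size matches rounding the vocabulary up to a multiple of make_vocab_size_divisible_by (128) times the tensor-model-parallel size (4), so that each tensor-parallel rank gets an equal slice of the embedding table. A sketch under that assumption, with a hypothetical function name:

```python
def pad_vocab_size(vocab_size: int, divisible_by: int, tensor_parallel: int) -> int:
    """Round vocab_size up to a multiple of divisible_by * tensor_parallel (sketch)."""
    multiple = divisible_by * tensor_parallel
    return ((vocab_size + multiple - 1) // multiple) * multiple


padded = pad_vocab_size(50257, 128, 4)
print(padded, padded - 50257)  # 50688 431
```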
> initializing torch distributed ...
> initializing tensor model parallel with size 4
> initializing pipeline model parallel with size 16
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling dataset index builder ...
make: Entering directory '/gpfsdswork/projects/rech/eha/commun/code/megatron-lm/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/gpfsdswork/projects/rech/eha/commun/code/megatron-lm/megatron/data'
>>> done with dataset index builder. Compilation time: 0.109 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /gpfsdswork/projects/rech/eha/commun/code/megatron-lm/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /gpfsdswork/projects/rech/eha/commun/code/megatron-lm/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /gpfsdswork/projects/rech/eha/commun/code/megatron-lm/megatron/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 12.395 seconds
time to initialize megatron (seconds): -27.152
[after megatron is initialized] datetime: 2021-05-26 03:20:36
building GPT model ...
> number of parameters on (tensor, pipeline) model parallel rank (2, 1): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 13): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 2): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 11): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 6): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 1): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 2): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 13): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 6): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 6): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 14): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 10): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 14): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 7): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 5): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 5): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 7): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 4): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 5): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 4): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 4): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 1): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 13): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 13): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 1): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 2): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 2): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 6): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 5): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 12): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 3): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 12): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 3): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 10): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 10): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 4): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 7): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 14): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 14): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 10): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 7): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 11): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 11): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 11): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 3): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 9): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 3): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 12): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 12): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 9): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 9): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 8): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 8): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (1, 8): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (0, 8): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (2, 9): 805560320
> number of parameters on (tensor, pipeline) model parallel rank (3, 15): 909385728
> number of parameters on (tensor, pipeline) model parallel rank (3, 0): 917757952
> number of parameters on (tensor, pipeline) model parallel rank (1, 15): 909385728
> number of parameters on (tensor, pipeline) model parallel rank (0, 15): 909385728
> number of parameters on (tensor, pipeline) model parallel rank (1, 0): 917757952
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 917757952
> number of parameters on (tensor, pipeline) model parallel rank (2, 0): 917757952
> number of parameters on (tensor, pipeline) model parallel rank (2, 15): 909385728
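The three distinct per-rank counts above can be reproduced from the printed arguments. The sketch below assumes a conventional Megatron-style layer layout (two LayerNorms, column-parallel QKV and first MLP projection, row-parallel output projections whose biases are not split), that the first stage holds the word and position embeddings, and that the last stage holds a copy of the word embeddings plus the final LayerNorm. These are assumptions about the layout, not read from the code, and all function names are hypothetical.

```python
def layer_params(hidden: int, ffn_hidden: int, tp: int) -> int:
    """Parameters of one transformer layer on a single tensor-parallel rank (assumed layout)."""
    layernorms = 2 * (2 * hidden)                           # two LayerNorms, weight + bias
    qkv = hidden * 3 * hidden // tp + 3 * hidden // tp      # column-parallel weight + bias
    attn_out = hidden * hidden // tp + hidden               # row-parallel weight, full bias
    mlp_in = hidden * ffn_hidden // tp + ffn_hidden // tp   # column-parallel weight + bias
    mlp_out = ffn_hidden * hidden // tp + hidden            # row-parallel weight, full bias
    return layernorms + qkv + attn_out + mlp_in + mlp_out


def stage_params(stage: int, num_layers: int = 64, pp: int = 16, tp: int = 4,
                 hidden: int = 8192, ffn_hidden: int = 32768,
                 padded_vocab: int = 50688, max_positions: int = 1024) -> int:
    """Parameters on one (tensor, pipeline) rank for the given pipeline stage."""
    total = (num_layers // pp) * layer_params(hidden, ffn_hidden, tp)
    word_embeddings = (padded_vocab // tp) * hidden
    if stage == 0:
        total += word_embeddings + max_positions * hidden   # word + position embeddings
    if stage == pp - 1:
        total += word_embeddings + 2 * hidden               # embedding copy + final LayerNorm
    return total


print(stage_params(0))   # 917757952
print(stage_params(1))   # 805560320
print(stage_params(15))  # 909385728
```

Under these assumptions the results agree with the logged values for the first, intermediate, and last pipeline stages.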
> learning rate decay style: cosine
WARNING: could not find the metadata file /gpfsscratch/rech/eha/commun/checkpoints/gpt2-1-node/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
time (ms) | load-checkpoint: 0.47
[after model, optimizer, and learning rate scheduler are built] datetime: 2021-05-26 03:20:37
> building train, validation, and test datasets ...
> datasets target sizes (minimum size):
train: 1024000
validation: 112640
test: 10240
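The target sizes above are consistent with the arguments train_iters 1000, eval_interval 100, eval_iters 10, and global_batch_size 1024. A sketch of arithmetic that reproduces them, assuming one evaluation pass per eval_interval plus one final pass; the function name is illustrative:

```python
def dataset_target_sizes(train_iters: int, eval_interval: int, eval_iters: int,
                         global_batch_size: int) -> tuple[int, int, int]:
    """Minimum number of samples needed for train, validation and test (sketch)."""
    train = train_iters * global_batch_size
    eval_passes = train_iters // eval_interval + 1   # periodic evals plus one at the end
    valid = eval_passes * eval_iters * global_batch_size
    test = eval_iters * global_batch_size
    return train, valid, test


print(dataset_target_sizes(1000, 100, 10, 1024))  # (1024000, 112640, 10240)
```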
> building train, validation, and test datasets for GPT ...
> building dataset index ...
reading sizes...
reading pointers...
reading document index...
creating numpy buffer of mmap...
creating memory view of numpy buffer...
> finished creating indexed dataset in 0.001563 seconds
number of documents: 10000
> dataset split:
train:
document indices in [0, 9490) total of 9490 documents
validation:
document indices in [9490, 9990) total of 500 documents
test:
document indices in [9990, 10000) total of 10 documents
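The document ranges above correspond to the split argument 949,50,1 applied to 10000 documents. The following sketch reproduces the boundaries with exact integer arithmetic; it is a simplified stand-in for illustration, not the rounding logic Megatron-LM itself uses, and the function name is hypothetical.

```python
def split_boundaries(split: str, num_documents: int) -> list[tuple[int, int]]:
    """Half-open document index ranges for each weight in a comma-separated split (sketch)."""
    weights = [int(w) for w in split.split(",")]
    total = sum(weights)
    ranges, cumulative, start = [], 0, 0
    for w in weights:
        cumulative += w
        end = cumulative * num_documents // total   # exact integer boundary
        ranges.append((start, end))
        start = end
    return ranges


for name, (lo, hi) in zip(("train", "validation", "test"), split_boundaries("949,50,1", 10000)):
    print(f"{name}: [{lo}, {hi}) total of {hi - lo} documents")
```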
> loading doc-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_train_indexmap_1024000ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_train_indexmap_1024000ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_train_indexmap_1024000ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.013 seconds
total number of samples: 1024856
total number of epochs: 99
> loading doc-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_valid_indexmap_112640ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_valid_indexmap_112640ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_valid_indexmap_112640ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.002 seconds
total number of samples: 113200
total number of epochs: 182
> loading doc-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_test_indexmap_10240ns_1024sl_1234s_doc_idx.npy
> loading sample-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_test_indexmap_10240ns_1024sl_1234s_sample_idx.npy
> loading shuffle-idx mapping from /gpfsscratch/rech/eha/commun/datasets-custom/openwebtext-10k/meg-gpt2_text_document_test_indexmap_10240ns_1024sl_1234s_shuffle_idx.npy
loaded indexed file in 0.001 seconds
total number of samples: 10255
total number of epochs: 672
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2021-05-26 03:20:38
done with setup ...
training ...
time (ms) | model-and-optimizer-setup: 336.31 | train/valid/test-data-iterators-setup: 494.63
[before the start of training step] datetime: 2021-05-26 03:20:38
iteration 1/ 1000 | consumed samples: 1024 | elapsed time per iteration (ms): 155483.8 | learning rate: 0.000E+00 | global batch size: 1024 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
time (ms) | forward-compute: 32054.16 | forward-recv: 19581.85 | backward-compute: 87380.53 | backward-send: 2.90 | backward-send-forward-recv: 11878.08 | backward-params-all-reduce: 26.06 | backward-embedding-all-reduce: 4181.82 | optimizer-copy-to-main-grad: 8.18 | optimizer-unscale-and-check-inf: 333.72 | optimizer: 342.01 | batch-generator: 182.63
iteration 2/ 1000 | consumed samples: 2048 | elapsed time per iteration (ms): 129028.8 | learning rate: 0.000E+00 | global batch size: 1024 | loss scale: 2147483648.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
time (ms) | forward-compute: 31749.74 | forward-recv: 1493.80 | backward-compute: 88911.73 | backward-send: 2.63 | backward-send-forward-recv: 2267.37 | backward-params-all-reduce: 26.04 | backward-embedding-all-reduce: 4492.62 | optimizer-copy-to-main-grad: 8.16 | optimizer-unscale-and-check-inf: 42.71 | optimizer: 51.07 | batch-generator: 184.46
iteration 3/ 1000 | consumed samples: 3072 | elapsed time per iteration (ms): 128891.5 | learning rate: 0.000E+00 | global batch size: 1024 | loss scale: 1073741824.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
time (ms) | forward-compute: 31888.01 | forward-recv: 1498.23 | backward-compute: 89444.82 | backward-send: 2.80 | backward-send-forward-recv: 1766.94 | backward-params-all-reduce: 26.45 | backward-embedding-all-reduce: 4179.82 | optimizer-copy-to-main-grad: 8.17 | optimizer-unscale-and-check-inf: 42.35 | optimizer: 50.66 | batch-generator: 187.52
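To compare runs it can help to turn the per-iteration lines above into throughput numbers. The sketch below parses one `iteration` line and derives samples per second and tokens per second, taking the sequence length (1024 in this run) as an input. The regular expression and function name are assumptions written against the line format shown in this log.

```python
import re

# Matches the fields needed from an "iteration N/ M | ..." log line.
ITERATION_RE = re.compile(
    r"iteration\s+(\d+)/\s*(\d+)\s*\|.*?"
    r"elapsed time per iteration \(ms\):\s*([\d.]+).*?"
    r"global batch size:\s*(\d+)"
)


def throughput(line: str, seq_length: int):
    """Return (iteration, samples/s, tokens/s) for an iteration line, or None if it does not match."""
    match = ITERATION_RE.search(line)
    if match is None:
        return None
    iteration = int(match.group(1))
    seconds = float(match.group(3)) / 1000.0
    samples_per_s = int(match.group(4)) / seconds
    return iteration, samples_per_s, samples_per_s * seq_length


line = ("iteration 3/ 1000 | consumed samples: 3072 | elapsed time per iteration (ms): 128891.5 "
        "| learning rate: 0.000E+00 | global batch size: 1024 | loss scale: 1073741824.0 |")
iteration, samples_per_s, tokens_per_s = throughput(line, seq_length=1024)
print(f"iteration {iteration}: {samples_per_s:.2f} samples/s, {tokens_per_s:.0f} tokens/s")
```

Note that all three iterations shown were skipped by the loss scaler, so these timings reflect forward and backward cost without an applied optimizer update.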