
Commit 66667f1

Merge pull request #769 from NVIDIA/gh/release
[WideAndDeep/TF] Update for 20.10
2 parents: f3c6bdf + 478d565

3 files changed: +50 additions, −34 deletions


TensorFlow/Recommendation/WideAndDeep/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.06-tf1-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.10-tf1-py3
 
 FROM ${FROM_IMAGE_NAME}
 
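The only change here is the base-image tag. Rebuilding against the new base is the usual `docker build` invocation; for example (a hedged sketch, assuming the `TensorFlow/Recommendation/WideAndDeep` directory as build context and an arbitrary image tag): `docker build . -t wide_deep`.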
TensorFlow/Recommendation/WideAndDeep/README.md

Lines changed: 23 additions & 27 deletions
@@ -52,7 +52,7 @@ The differences between this Wide & Deep Recommender Model and the model from th
 
 The model enables you to train a recommender model that combines the memorization of the Wide part and generalization of the Deep part of the network.
 
-This model is trained with mixed precision using Tensor Cores on NVIDIA Volta, Turing and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 1.43 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
+This model is trained with mixed precision using Tensor Cores on NVIDIA Volta, Turing and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 1.49 times faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.
 
 ### Model architecture
 
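(For reference, the updated 1.49× figure is consistent with the single-GPU DGX-1 time-to-train numbers updated later in this diff: 654 minutes with FP32 versus 440 minutes with mixed precision, and 654 / 440 ≈ 1.49.)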
@@ -168,7 +168,7 @@ The following section lists the requirements that you need to meet in order to s
 
 This repository contains Dockerfile which extends the TensorFlow NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:
 - [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-- [20.06-tf1-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
+- [20.10-tf1-py3](https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow) NGC container
 - Supported GPUs:
   - [NVIDIA Volta architecture](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/)
   - [NVIDIA Turing architecture](https://www.nvidia.com/en-us/geforce/turing/)
@@ -283,9 +283,8 @@ These are the important parameters in the `trainer/task.py` script:
 --linear_l1_regularization: L1 regularization for the wide part of the model
 --linear_l2_regularization: L2 regularization for the wide part of the model
 --deep_learning_rate: Learning rate for the deep part of the model
---deep_l1_regularization: L1 regularization for the deep part of the model
---deep_l2_regularization: L2 regularization for the deep part of the model
 --deep_dropout: Dropout probability for deep model
+--deep_warmup_epochs: Number of epochs with linear learning rate warmup
 --predict: Perform only the prediction on the validation set, do not train
 --evaluate: Perform only the evaluation on the validation set, do not train
 --gpu: Run computations on GPU
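
Taken together with the flag list above, a run exercising the newly documented warmup flag might look like `python -m trainer.task --gpu --amp --num_epochs 20 --deep_warmup_epochs 4`. This is a hedged example: the flag names appear in this diff, but the values and the full set of required arguments are assumptions; `python -m trainer.task --help` prints the authoritative list.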
@@ -321,7 +320,7 @@ The original data is stored in several separate files:
 - `promoted_content.csv` - metadata about the ads
 - `document_meta.csv`, `document_topics.csv`, `document_entities.csv`, `document_categories.csv` - metadata about the documents
 
-During the preprocessing stage the data is transformed into 55M rows tabular data of 54 features and eventually saved in pre-batched TFRecord format.
+During the preprocessing stage the data is transformed into 59M rows tabular data of 54 features and eventually saved in pre-batched TFRecord format.
 
 
 #### Spark preprocessing
@@ -357,7 +356,7 @@ For more information about Spark, please refer to the
 ### Training process
 
 The training can be started by running the `trainer/task.py` script. By default the script is in train mode. Other training related
-configs are also present in the `trainer/task.py` and can be seen using the command `python -m trainer.task --help`. Training happens for `--num_epochs` epochs with a custom estimator for the model. The model has a wide linear part and a deep feed forward network, and the networks are built according to the default configuration.
+configs are also present in the `trainer/task.py` and can be seen using the command `python -m trainer.task --help`. Training happens for `--num_epochs` epochs with a DNNLinearCombinedClassifier estimator for the model. The model has a wide linear part and a deep feed forward network, and the networks are built according to the default configuration.
 
 Two separate optimizers are used to optimize the wide and the deep part of the network:
 
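Since the README now names the estimator explicitly, here is a minimal sketch of how `tf.estimator.DNNLinearCombinedClassifier` pairs a wide linear part and a deep feed-forward part, each with its own optimizer. The feature columns, hidden-unit sizes, and optimizer settings below are illustrative assumptions, not the repository's actual configuration:

    import tensorflow as tf

    # Illustrative columns; the real ones are built in trainer/features.py.
    wide_columns = [tf.feature_column.categorical_column_with_hash_bucket(
        'example_categorical', hash_bucket_size=100000)]
    deep_columns = [tf.feature_column.numeric_column('example_numeric')]

    estimator = tf.estimator.DNNLinearCombinedClassifier(
        # Wide part: memorization, trained by its own optimizer.
        linear_feature_columns=wide_columns,
        linear_optimizer=tf.compat.v1.train.FtrlOptimizer(learning_rate=0.1),
        # Deep part: generalization, with a separate optimizer and dropout.
        dnn_feature_columns=deep_columns,
        dnn_optimizer=tf.compat.v1.train.AdagradOptimizer(learning_rate=0.05),
        dnn_hidden_units=[1024, 512, 256],
        dnn_dropout=0.1,
    )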
@@ -401,23 +400,23 @@ accuracy in training.
 
 ##### Training accuracy: NVIDIA DGX A100 (8x A100 40GB)
 
-Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX A100 with (8x A100 40GB) GPUs.
+Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX A100 with (8x A100 40GB) GPUs.
 
-|**GPUs**|**Batch size / GPU**|**Accuracy - TF32 (MAP@12)**|**Accuracy - mixed precision (MAP@12)**|**Time to train - TF32 (minutes)**|**Time to train - mixed precision (minutes)**|**Time to train speedup (FP32 to mixed precision)**|
+|**GPUs**|**Batch size / GPU**|**Accuracy - TF32 (MAP@12)**|**Accuracy - mixed precision (MAP@12)**|**Time to train - TF32 (minutes)**|**Time to train - mixed precision (minutes)**|**Time to train speedup (TF32 to mixed precision)**|
 |-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
-| 1 | 131,072 | 0.67683 | 0.67632 | 312 | 325 | [-](#known-issues) |
-| 8 | 16,384 | 0.67709 | 0.67721 | 178 | 188 | [-](#known-issues) |
+| 1 | 131,072 | 0.67683 | 0.67632 | 341 | 359 | [-](#known-issues) |
+| 8 | 16,384 | 0.67709 | 0.67721 | 93 | 107 | [-](#known-issues) |
 
 To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
 ##### Training accuracy: NVIDIA DGX-1 (8x V100 16GB)
 
-Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
+Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs.
 
 |**GPUs**|**Batch size / GPU**|**Accuracy - FP32 (MAP@12)**|**Accuracy - mixed precision (MAP@12)**|**Time to train - FP32 (minutes)**|**Time to train - mixed precision (minutes)**|**Time to train speedup (FP32 to mixed precision)**|
 |-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
-| 1 | 131,072 | 0.67648 | 0.67744 | 609 | 426 | 1.429 |
-| 8 | 16,384 | 0.67692 | 0.67725 | 233 | 232 | [-](#known-issues) |
+| 1 | 131,072 | 0.67648 | 0.67744 | 654 | 440 | 1.49 |
+| 8 | 16,384 | 0.67692 | 0.67725 | 190 | 185 | 1.03 |
 
 To achieve the same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
 
@@ -430,7 +429,7 @@ Models trained with FP32, TF32 and Automatic Mixed Precision (AMP) achieve simil
 ##### Training stability test
 
 The Wide and Deep model was trained for 54,713 training steps, starting
-from 6 different initial random seeds for each setup. The training was performed in the 20.06-tf1-py3 NGC container on
+from 6 different initial random seeds for each setup. The training was performed in the 20.10-tf1-py3 NGC container on
 NVIDIA DGX A100 40GB and DGX-1 16GB machines with and without mixed precision enabled.
 After training, the models were evaluated on the validation set. The following
 table summarizes the final MAP@12 score on the validation set.
@@ -448,32 +447,29 @@ table summarizes the final MAP@12 score on the validation set.
 
 ##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
 
-Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX A100 with (8x A100 40GB) GPUs. Performance numbers (in samples per second) were averaged over 50 training iterations. Improving model scaling for multi-GPU is [under development](#known-issues).
+Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX A100 with (8x A100 40GB) GPUs. Improving model scaling for multi-GPU is [under development](#known-issues).
 
-To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
-
-|**GPUs**|**Batch size / GPU**|**Throughput - TF32 (samples/s)**|**Throughput - mixed precision (samples/s)**|**Strong scaling - FP32**|**Strong scaling - mixed precision**|
+|**GPUs**|**Batch size / GPU**|**Throughput - TF32 (samples/s)**|**Throughput - mixed precision (samples/s)**|**Strong scaling - TF32**|**Strong scaling - mixed precision**|
 |-------:|-------------------:|----------------------------:|---------------------------------------:|----------------------:|---------------------------------:|
-| 1 | 131,072 | 352,904 | 338,356 | 1.00 | 1.00 |
-| 8 | 16,384 | 617,910 | 584,688 | 1.75 | 1.73 |
-
+| 1 | 131,072 | 349,879 | 332,529 | 1.00 | 1.00 |
+| 8 | 16,384 | 1,283,457 | 1,111,976 | 3.67 | 3.34 |
 
 ##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
-Our results were obtained by running the `trainer/task.py` training script in the TensorFlow NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Performance numbers (in samples per second) were averaged over 50 training iterations. Improving model scaling for multi-GPU is planned, see [known issues](#known-issues).
-
-To achieve these same results, follow the steps in the [Quick Start Guide](#quick-start-guide).
+Our results were obtained by running the benchmark scripts from the `scripts` directory in the TensorFlow NGC container on NVIDIA DGX-1 with (8x V100 16GB) GPUs. Improving model scaling for multi-GPU is [under development](#known-issues).
 
 |**GPUs**|**Batch size / GPU**|**Throughput - FP32 (samples/s)**|**Throughput - mixed precision (samples/s)**|**Throughput speedup (FP32 to mixed precision)**|**Strong scaling - FP32**|**Strong scaling - mixed precision**|
 |-------:|-------------------:|----------------------------:|---------------------------------------:|-----------------------------------------------:|----------------------:|---------------------------------:|
-| 1 | 131,072 | 180,561 | 257,995 | 1.429 | 1.00 | 1.00 |
-| 8 | 16,384 | 472,143 | 473,195 | 1.002 | 2.61 | 1.83 |
-
+| 1 | 131,072 | 182,510 | 271,366 | 1.49 | 1.00 | 1.00 |
+| 8 | 16,384 | 626,301 | 643,334 | 1.03 | 3.43 | 2.37 |
 
 ## Release notes
 
 ### Changelog
 
+November 2020
+- Updated performance tables to include numbers from 20.10-tf1-py3 NGC container
+
 June 2020
 - Updated performance tables to include A100 results
 

TensorFlow/Recommendation/WideAndDeep/trainer/task.py

Lines changed: 26 additions & 6 deletions
@@ -24,6 +24,7 @@
 import os
 import tensorflow as tf
 import tensorflow_transform as tft
+from tensorflow.core.protobuf import rewriter_config_pb2
 from trainer import features
 from utils.dataloader import separate_input_fn
 from utils.hooks.benchmark_hooks import BenchmarkLoggingHook
@@ -311,10 +312,21 @@ def main(FLAGS):
         json.dump(vars(FLAGS), f, indent=4)
 
     if FLAGS.gpu:
-        session_config = tf.compat.v1.ConfigProto(log_device_placement=FLAGS.log_device_placement)
+        if FLAGS.amp:
+            rewrite_options = rewriter_config_pb2.RewriterConfig(auto_mixed_precision=True)
+            session_config = tf.compat.v1.ConfigProto(
+                graph_options=tf.compat.v1.GraphOptions(rewrite_options=rewrite_options),
+                log_device_placement=FLAGS.log_device_placement
+            )
+        else:
+            session_config = tf.compat.v1.ConfigProto(
+                log_device_placement=FLAGS.log_device_placement
+            )
     else:
-        session_config = tf.compat.v1.ConfigProto(device_count={'GPU': 0},
-                                                  log_device_placement=FLAGS.log_device_placement)
+        session_config = tf.compat.v1.ConfigProto(
+            device_count={'GPU': 0},
+            log_device_placement=FLAGS.log_device_placement
+        )
 
     if FLAGS.hvd:
         session_config.gpu_options.visible_device_list = str(hvd.local_rank())
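
For context, the pattern this hunk introduces, enabling the automatic mixed precision graph rewrite through the session's rewriter options rather than by wrapping each optimizer, can be reproduced in isolation. A minimal sketch (`log_device_placement` is hard-coded here in place of the flag):

    import tensorflow as tf
    from tensorflow.core.protobuf import rewriter_config_pb2

    # Ask Grappler to run the auto mixed precision rewrite on every graph
    # executed in sessions built from this config.
    rewrite_options = rewriter_config_pb2.RewriterConfig(auto_mixed_precision=True)
    session_config = tf.compat.v1.ConfigProto(
        graph_options=tf.compat.v1.GraphOptions(rewrite_options=rewrite_options),
        log_device_placement=False,
    )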
@@ -332,9 +344,15 @@ def main(FLAGS):
     print('Steps per epoch: {}'.format(steps_per_epoch))
     max_steps = int(FLAGS.num_epochs * steps_per_epoch)
 
+    save_checkpoints_steps = FLAGS.benchmark_steps + 1 if FLAGS.benchmark else \
+        int(FLAGS.eval_epoch_interval * steps_per_epoch)
+    count_steps = FLAGS.benchmark_steps + 1 if FLAGS.benchmark else 100
+
     run_config = tf.estimator.RunConfig(model_dir=model_dir) \
         .replace(session_config=session_config,
-                 save_checkpoints_steps=int(FLAGS.eval_epoch_interval * steps_per_epoch),
+                 save_checkpoints_steps=save_checkpoints_steps,
+                 save_summary_steps=count_steps,
+                 log_step_count_steps=count_steps,
                  keep_checkpoint_max=1)
 
     def wide_optimizer():
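
The effect of the new `save_checkpoints_steps`/`count_steps` logic is to keep checkpoint and summary I/O out of timed benchmark runs: at `benchmark_steps + 1`, the first save would land one step after the benchmark window closes. A toy check of that arithmetic (all values assumed for illustration):

    # Toy reproduction of the scheduling logic added above.
    benchmark = True
    benchmark_steps = 1000       # assumed benchmark window
    eval_epoch_interval = 0.5    # assumed fraction of an epoch between evals
    steps_per_epoch = 5000       # assumed

    save_checkpoints_steps = benchmark_steps + 1 if benchmark else \
        int(eval_epoch_interval * steps_per_epoch)
    count_steps = benchmark_steps + 1 if benchmark else 100

    # No checkpoint, summary, or step-count log can fire inside the timed window.
    assert save_checkpoints_steps > benchmark_steps
    assert count_steps > benchmark_steps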
@@ -345,7 +363,8 @@ def wide_optimizer():
         if FLAGS.hvd:
             opt = hvd.DistributedOptimizer(opt)
         if FLAGS.amp:
-            opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
+            loss_scale = tf.train.experimental.DynamicLossScale()
+            opt = tf.compat.v1.train.experimental.MixedPrecisionLossScaleOptimizer(opt, loss_scale)
         return opt
 
     def deep_optimizer():
@@ -362,7 +381,8 @@ def deep_optimizer():
         if FLAGS.hvd:
             opt = hvd.DistributedOptimizer(opt)
         if FLAGS.amp:
-            opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
+            loss_scale = tf.train.experimental.DynamicLossScale()
+            opt = tf.compat.v1.train.experimental.MixedPrecisionLossScaleOptimizer(opt, loss_scale)
         return opt
 
     # input functions to read data from disk
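
Both optimizer factories now swap `enable_mixed_precision_graph_rewrite` (which couples loss scaling with the graph rewrite) for an explicit `MixedPrecisionLossScaleOptimizer`, since the rewrite itself is now enabled once in the session config above. The isolated pattern looks like this (a minimal sketch; the base optimizer is an arbitrary stand-in for the wide or deep one):

    import tensorflow as tf

    # Arbitrary stand-in for the wide or deep optimizer.
    opt = tf.compat.v1.train.AdagradOptimizer(learning_rate=0.05)

    # Dynamic loss scaling keeps small FP16 gradients from underflowing to zero:
    # the scale grows while steps remain finite and shrinks on overflow.
    loss_scale = tf.train.experimental.DynamicLossScale()
    opt = tf.compat.v1.train.experimental.MixedPrecisionLossScaleOptimizer(opt, loss_scale)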
