
TensorFlow oneDNN build manual for FUJITSU Software Compiler Package (TensorFlow v2.11.0)

kitazawa-yoshit edited this page Mar 22, 2023 · 1 revision

Build Instruction for TensorFlow on Fujitsu Supercomputer PRIMEHPC FX1000/FX700

Table of contents

  1. Introduction
  2. Environment and Prerequisites
  3. Installation Instructions
  4. Troubleshooting
  5. List of Software Version

1. Introduction

This document contains instructions for installing TensorFlow on a Fujitsu Supercomputer PRIMEHPC FX1000 or FX700.
It also provides sample instructions for installing and running several important models optimized for the FX1000 and FX700.

When building TensorFlow, Bazel, the build tool it uses, downloads third-party software from the Internet.
Note that we do not provide so-called "offline installation" instructions for systems in isolated facilities, such as corporate laboratories, that have no Internet access.

1.1. Terminology

The following terms and abbreviations are used in this manual.

| Terms/Abbr. | Meaning |
| --- | --- |
| Online Installation | Install TensorFlow on a system with direct access to the Internet (or via a proxy) |
| Target system | System on which TensorFlow is to be installed |
| TCS | FX1000's job execution scheduler and compiler library environment (Technical Computing Suite) |
| CP | FX700's compiler library environment (Compiler Package) |

2. Environment and prerequisites

2.1. Target system for installation

  • PRIMEHPC FX1000 or FX700
  • For FX700
    • RHEL 8.x or CentOS 8.x must be installed
    • If you want to use FCC, Compiler Package V10L20 must be installed
  • The following packages and commands should be already installed
    make gcc cmake libffi-devel gcc-gfortran numactl git patch unzip tk tcsh tcl lsof python3 pciutils
    (For Mask R-CNN sample model) libxml2 libxslt libxslt-devel libxml2-devel

Please note that building and executing on NFS may cause unexpected problems depending on the performance and configuration of the NFS server.
It is recommended to use locally-attached storage or network storage that is fast enough.

2.2. Directory structure after installation

The directory structure after installation looks like this. The directories PREFIX, VENV_PATH, and TCSDS_PATH are specified in the configuration file env.src. These three directories and TENSORFLOW_TOP must be independent of each other (make sure that no directory is located under another).

  PREFIX (where local binaries are stored)
    +- bin (Python, etc.)
    +- lib

  VENV_PATH (location of python modules needed to run TensorFlow)
    +- bin (activate)
    +- lib (packages to be installed by pip)

  TCSDS_PATH (Fujitsu compiler, *: already installed before the procedure)
    +- bin (fcc, FCC, etc.)
    +- lib64

  TENSORFLOW_TOP (complete TensorFlow source tree, downloaded from https://www.github.com/fujitsu/tensorflow)
    +- tensorflow
    +- third_party
    +- fcc_build_script (TensorFlow Build Scripts)
         +- down (downloaded files will be stored)
         +- sample_script (sources for the ResNet, OpenNMT, BERT, and Mask R-CNN models, and their training data, will be extracted under here)
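
The independence requirement above can be verified mechanically. A minimal POSIX-shell sketch (the paths passed at the end are hypothetical; substitute your own PREFIX, VENV_PATH, TCSDS_PATH, and TENSORFLOW_TOP):

```shell
# Fail if any of the given directories is nested under another.
check_independent() {
    for a in "$@"; do
        for b in "$@"; do
            [ "$a" = "$b" ] && continue
            case "$b/" in
                "$a"/*) echo "error: $b is under $a" >&2; return 1 ;;
            esac
        done
    done
    echo "OK: directories are independent"
}

# Hypothetical example paths:
check_independent /home/user/local /home/user/venv /opt/compiler /home/user/tensorflow
```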

2.3. About proxy settings

If your environment requires a proxy for external access, set the following environment variables.
(Replace "user", "pass", "proxy_url", and "port" with values appropriate for your environment.)

$ export http_proxy=http://user:pass@proxy_url:port
$ export https_proxy=https://user:pass@proxy_url:port

Note: curl, wget, git, and pip3 all recognize the above environment variables, so there is no need to edit rc files or .gitconfig.

3. Installation procedure

The general installation flow is as follows:

  1. Preparation (online installation)
  2. Build (online installation)

3.1. Preliminaries (Detail)

3.1-A. Download the source set

$ git clone https://github.com/fujitsu/tensorflow.git
$ cd tensorflow                  # From now on, we'll call this directory TENSORFLOW_TOP
$ git checkout -b r2.11_for_a64fx origin/r2.11_for_a64fx
$ cd fcc_build_script

In the following examples, /home/user/tensorflow is used as TENSORFLOW_TOP.

3.1-B. Edit env.src

'env.src' is the configuration file, located in $TENSORFLOW_TOP/fcc_build_script.

The configuration is divided into two parts.

  • Control of the Build

    | Flag Name | Default Value | Meaning | Remarks |
    | --- | --- | --- | --- |
    | fjenv_use_venv | true | Use VENV when true | 'false' is not tested. |
    | fjenv_use_fcc | true | Use FCC when true; otherwise, use GCC | 'false' is not tested. |

    Note that these flags are defined as shell variables in 'env.src', but they can also be set as environment variables outside of 'env.src'. In that case, the environment variable takes precedence over the setting in 'env.src'.
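
    This precedence rule can be demonstrated in isolation. A minimal sketch, assuming env.src assigns its defaults with the shell "assign if unset" expansion (the stand-in file /tmp/env_src_demo is hypothetical):

    ```shell
    # Create a stand-in for env.src that sets a default only when the
    # variable is not already set (the ':=' expansion).
    cat > /tmp/env_src_demo <<'EOF'
    : "${fjenv_use_fcc:=true}"
    EOF

    # Without an environment variable, the default from the file applies.
    unset fjenv_use_fcc
    . /tmp/env_src_demo
    echo "default:  $fjenv_use_fcc"

    # An exported environment variable takes precedence over the file.
    export fjenv_use_fcc=false
    . /tmp/env_src_demo
    echo "override: $fjenv_use_fcc"
    ```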

  • Set up the build directories.
    For the directory configuration, refer to the diagram in Section 2.2.

    | Variable name | Meaning | Supplemental information |
    | --- | --- | --- |
    | PREFIX | Directory where the executables generated by this procedure are installed | |
    | VENV_PATH | Directory where the VENV is created | Valid when fjenv_use_venv=true |
    | TCSDS_PATH | Base directory for TCS or CP (the directory containing bin, lib64, etc.) | Valid when fjenv_use_fcc=true |

It is not necessary to alter any settings other than those mentioned above.
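
A hypothetical example of the three path settings in env.src (all values below are illustrative; use directories appropriate for your site, and keep them independent of each other):

```shell
# env.src excerpt (hypothetical values)
PREFIX=/home/user/local            # local binaries (Python, etc.)
VENV_PATH=/home/user/venv          # Python virtual environment for TensorFlow
TCSDS_PATH=/opt/fujitsu/compiler   # base directory of TCS/CP (contains bin, lib64)
```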

3.2. Build (Detail)

3.2-A. Build TensorFlow

Run the shell scripts whose names start with a number, in numerical order, one after another.
The following example shows how to install from an interactive shell. The approximate time is shown as a comment after each command (measured on an FX700, 2.0 GHz, 48 cores).

If you are using a job control system, you can, for example, create a batch script that executes the series of build scripts and then submit it. In that case, it is recommended to make the script terminate on the first error of any build script (for example, with set -e in bash).

[option] is an optional flag to pass to the script. If omitted, the build is executed.

The scripts are designed not to build again when the binary already exists. If you want to build again, run the script with the rebuild argument.

Do not confuse rebuild with clean: if clean is specified, all the downloaded files are deleted as well.

$ pwd
/home/user/tensorflow/fcc_build_script          # $TENSORFLOW_TOP/fcc_build_script

$ bash 01_python_build.sh          [option]     # Build and install Python (6 min.)
$ bash 02_bazel_build.sh           [option]     # Install bazel (18 min.)
$ bash 03_make_venv.sh             [option]     # Create VENV (< 1 min.)
$ bash 04_numpy_scipy.sh           [option]     # Build NumPy and SciPy (90 min.)
$ bash 05-1_build_batchedblas.sh   [option]     # Build BatchedBlas (<1 min.)
$ bash 05_tf_build.sh              [option]     # Build TensorFlow (120 min.)
$ bash 07_horovod_install.sh       [option]     # Install Horovod (4 min.)

To verify the build, run the sample model in sample_script/01_resnet.

3.2-B. (Optional) Build Sample Models

The sample models are located in subdirectories whose names start with a number under the sample_script directory. Run the shell scripts whose names start with a number, in numerical order, one after another.

The details of the build and verification are described below.
When verifying speed, note that the execution speed of deep-learning models can vary by 10-20%. Use the execution speeds given in this manual as a guide; if your results are within that range, your build is OK.

CAUTION: The sample models provided here are slightly modified from the originals for operation checks and performance analysis; for example, the random-number seed may be fixed for profile collection, or the model may be set to abort after a certain number of steps. Please do not use the models as-is for actual training.

Also, please keep in mind that the settings of the sample models are not optimal.

01_resnet

Use the official model (for TensorFlow v1.x) from Google. https://github.com/tensorflow/models/tree/v2.0/official/r1/resnet
Tag: v2.0 (2019/10/15)

$ pwd
/home/user/tensorflow/fcc_build_script/sample_script/01_resnet

$ bash 10_setup_resnet.sh  [option]  # Setup the model (< 1 min.)
$ bash run1proc.sh                   # Run (1 node, 1 proc., 12 cores, use dummy data)
$ bash run1node.sh                   # Run (1 node, 4 proc., 12 cores/proc., use dummy data)

Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.

The following is an example of the output (see the lines marked with the arrow signs).

    $ bash run1proc.sh
	(snip)
    INFO:tensorflow:cross_entropy = 7.4511466, learning_rate = 0.0, train_accuracy = 0.0
    INFO:tensorflow:cross_entropy = 7.4511466, learning_rate = 0.0, train_accuracy = 0.0
    I0318 15:54:36.500785 281473173796960 basic_session_run_hooks.py:265] cross_entropy = 7.4511466, learning_rate = 0.0, train_accuracy = 0.0
    INFO:tensorflow:loss = 8.846681, step = 0
    I0318 15:54:36.504131 281473173796960 basic_session_run_hooks.py:265] loss = 8.846681, step = 0
    INFO:tensorflow:global_step/sec: 0.118669
    I0318 15:54:44.926536 281473173796960 basic_session_run_hooks.py:716] global_step/sec: 0.118669
    INFO:tensorflow:loss = 8.846681, step = 1 (8.423 sec)
    I0318 15:54:44.927290 281473173796960 basic_session_run_hooks.py:263] loss = 8.846681, step = 1 (8.423 sec)
    INFO:tensorflow:global_step/sec: 0.224107
    I0318 15:54:49.388653 281473173796960 basic_session_run_hooks.py:716] global_step/sec: 0.224107
--> INFO:tensorflow:loss = 8.840351, step = 2 (4.462 sec)
    I0318 15:54:49.389361 281473173796960 basic_session_run_hooks.py:263] loss = 8.840351, step = 2 (4.462 sec)
    INFO:tensorflow:global_step/sec: 0.222788
    I0318 15:54:53.877215 281473173796960 basic_session_run_hooks.py:716] global_step/sec: 0.222788
--> INFO:tensorflow:loss = 8.822441, step = 3 (4.489 sec)
	(snip)
    I0318 15:56:19.349525 281473173796960 evaluation.py:250] Starting evaluation at 2023-03-18T15:56:19
    INFO:tensorflow:Graph was finalized.
    I0318 15:56:21.051177 281473173796960 monitored_session.py:240] Graph was finalized.
    INFO:tensorflow:Restoring parameters from /home/user/tensorflow/fcc_build_script/sample_script/01_resnet/run_20230318_155330/model.ckpt-20
    I0318 15:56:21.052141 281473173796960 saver.py:1410] Restoring parameters from /home/user/tensorflow/fcc_build_script/sample_script/01_resnet/run_20230318_155330/model.ckpt-20
    INFO:tensorflow:Running local_init_op.
    I0318 15:56:22.757152 281473173796960 session_manager.py:526] Running local_init_op.
    INFO:tensorflow:Done running local_init_op.
    I0318 15:56:22.860401 281473173796960 session_manager.py:529] Done running local_init_op.
    INFO:tensorflow:step = 1 time = 46.994 [sec]
    I0318 15:57:10.711219 281473173796960 resnet_run_loop.py:760] step = 1 time = 46.994 [sec]
--> INFO:tensorflow:step = 2 time = 45.479 [sec]
    I0318 15:57:56.190541 281473173796960 resnet_run_loop.py:760] step = 2 time = 45.479 [sec]
    INFO:tensorflow:Evaluation [2/20]
    I0318 15:57:56.191018 281473173796960 evaluation.py:163] Evaluation [2/20]
--> INFO:tensorflow:step = 3 time = 45.894 [sec]
	(snip)

The execution time for each step is displayed. First, 10 training steps are performed, followed by 20 inference steps. The first step also performs initialization and therefore takes longer, so check the times from the second step onward.

On an FX700 (2.0 GHz), the expected training time for run1proc.sh and run1node.sh is about 5 seconds per step, and the expected inference time is about 46 seconds per step.

Note that run1node.sh launches four TensorFlow processes, and each process runs the same workload as run1proc.sh, so the overall processing volume is four times larger; this causes each step to take slightly longer.
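
If you save the run output to a file, the per-step training times can be extracted and averaged with standard tools. A sketch (the log file name run.log is hypothetical) that skips the first, initialization-heavy step:

```shell
# Average the "(X sec)" per-step training times, skipping the first step.
grep -o 'step = [0-9]* ([0-9.]* sec)' run.log \
  | awk -F'[()]' 'NR > 1 { sub(/ sec/, "", $2); sum += $2; n++ }
                  END { if (n) printf "avg %.3f sec/step over %d steps\n", sum/n, n }'
```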

02_OpenNMT

Trains a translation model on paired English and German sentences.

https://github.com/OpenNMT/OpenNMT-tf/tree/v2.30.0
Tag: v2.30.0 (2022/12/12)

$ pwd
/home/user/tensorflow/fcc_build_script/sample_script/02_OpenNMT

$ bash 20_setup_OpenNMT.sh   [options]   # Setup (20 min.)
$ bash run1proc.sh                       # Run the model (1 node, 1 proc., 24 cores, en-de)
$ bash run1node.sh                       # Run the model (1 node, 2 proc., 24 cores/proc, en-de)

Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.

The following is an example of the output (see the lines marked with the arrow signs).

    2023-03-18 16:52:44.091000: I runner.py:290] Number of model weights: 260 (trainable = 260, non trainable = 0)
    2023-03-18 16:52:44.784000: I runner.py:290] Step = 1 ; steps/s = 0.00, source words/s = 0, target words/s = 0 ; Learning rate = 0.000000 ; Loss = 10.416251
    2023-03-18 16:52:48.874000: I training.py:174] Saved checkpoint run_20230318_164914/testrun/ckpt-1
    2023-03-18 16:53:06.891000: I runner.py:290] Step = 2 ; steps/s = 0.00, source words/s = 0, target words/s = 0 ; Learning rate = 0.000000 ; Loss = 10.417956
    2023-03-18 16:53:24.360000: I runner.py:290] Step = 3 ; steps/s = 0.06, source words/s = 245, target words/s = 278 ; Learning rate = 0.000000 ;Loss = 10.416798
    2023-03-18 16:53:41.867000: I runner.py:290] Step = 4 ; steps/s = 0.06, source words/s = 251, target words/s = 280 ; Learning rate = 0.000001 ;Loss = 10.415548
    (snip)
--> 2023-03-18 16:55:27.718000: I runner.py:290] Step = 10 ; steps/s = 0.06, source words/s = 250, target words/s = 271 ; Learning rate = 0.000001 ; Loss = 10.405303
--> 2023-03-18 16:55:45.243000: I runner.py:290] Step = 11 ; steps/s = 0.06, source words/s = 243, target words/s = 282 ; Learning rate = 0.000001 ; Loss = 10.401007
--> 2023-03-18 16:56:02.811000: I runner.py:290] Step = 12 ; steps/s = 0.06, source words/s = 243, target words/s = 276 ; Learning rate = 0.000002 ; Loss = 10.394763
--> 2023-03-18 16:56:20.551000: I runner.py:290] Step = 13 ; steps/s = 0.06, source words/s = 234, target words/s = 276 ; Learning rate = 0.000002 ; Loss = 10.393161
--> 2023-03-18 16:56:38.343000: I runner.py:290] Step = 14 ; steps/s = 0.06, source words/s = 254, target words/s = 272 ; Learning rate = 0.000002 ; Loss = 10.389236
--> 2023-03-18 16:56:55.737000: I runner.py:290] Step = 15 ; steps/s = 0.06, source words/s = 247, target words/s = 284 ; Learning rate = 0.000002 ; Loss = 10.380897
--> 2023-03-18 16:57:13.350000: I runner.py:290] Step = 16 ; steps/s = 0.06, source words/s = 247, target words/s = 278 ; Learning rate = 0.000002 ; Loss = 10.376546
--> 2023-03-18 16:57:31.188000: I runner.py:290] Step = 17 ; steps/s = 0.06, source words/s = 247, target words/s = 268 ; Learning rate = 0.000002 ; Loss = 10.375769

Check the target words/s output at each step. Since performance is unstable for the first few steps, look at the 10th step and beyond.

On FX700 (2.0GHz), the expected result of run1proc.sh is about 280 target words/sec, and the expected result of run1node.sh is about 480 target words/sec.

03_Bert

Use the official model from Google.

https://github.com/tensorflow/models/tree/v2.11.3/official/legacy/bert
Tag: v2.11.3 (2022/01/19)

Note: Previously, we provided two tasks, pre-training and fine-tuning, but since the computation is almost the same for both, we now provide only the pre-training task, which is the more computationally demanding of the two.

$ pwd
/home/user/tensorflow/fcc_build_script/sample_script/03_Bert

$ bash 300_setup_bert.sh                [options]    # Setup (5 min.)
$ bash 311_create_pretraining_data.sh                # Prepare pre-training data (1 min.)
$ bash run1proc.sh                                   # Run pre-training task (1 node, 1 proc., 24 cores)
$ bash run1node.sh                                   # Run pre-training task (1 node, 2 proc., 24 cores/proc)

Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.

The following is an example of the output (see the lines marked with the arrow signs).

    I0318 17:13:16.761196 281473232451680 model_training_utils.py:288] Loading from checkpoint file completed
    I0318 17:14:08.086691 281473232451680 model_training_utils.py:518] Train Step: 1/20  / loss = 11.608283996582031  masked_lm_accuracy = 0.000000 lm_example_loss = 10.982115  next_sentence_accuracy = 0.687500  next_sentence_loss = 0.626170
    I0318 17:14:08.088257 281473232451680 keras_utils.py:145] TimeHistory: 51.18 seconds, 0.94 examples/second between steps 0 and 1
    I0318 17:14:13.006247 281473232451680 model_training_utils.py:518] Train Step: 2/20  / loss = 11.594138145446777  masked_lm_accuracy = 0.000000 lm_example_loss = 10.979274  next_sentence_accuracy = 0.687500  next_sentence_loss = 0.614864
    I0318 17:14:13.006942 281473232451680 keras_utils.py:145] TimeHistory: 4.88 seconds, 9.84 examples/second between steps 1 and 2
    I0318 17:14:17.868535 281473232451680 model_training_utils.py:518] Train Step: 3/20  / loss = 11.417460441589355  masked_lm_accuracy = 0.000000 lm_example_loss = 10.816668  next_sentence_accuracy = 0.729167  next_sentence_loss = 0.600792
        (snip)
--> I0318 17:14:47.386056 281473232451680 keras_utils.py:145] TimeHistory: 4.86 seconds, 9.87 examples/second between steps 8 and 9
    I0318 17:14:52.279267 281473232451680 model_training_utils.py:518] Train Step: 10/20  / loss = 8.686444282531738  masked_lm_accuracy = 0.029586  lm_example_loss = 8.016586  next_sentence_accuracy = 0.458333  next_sentence_loss = 0.669858
--> I0318 17:14:52.279922 281473232451680 keras_utils.py:145] TimeHistory: 4.86 seconds, 9.87 examples/second between steps 9 and 10
    I0318 17:14:57.253912 281473232451680 model_training_utils.py:518] Train Step: 11/20  / loss = 8.3251953125  masked_lm_accuracy = 0.052891  lm_example_loss = 7.692281  next_sentence_accuracy = 0.666667  next_sentence_loss = 0.632915
--> I0318 17:14:57.254559 281473232451680 keras_utils.py:145] TimeHistory: 4.94 seconds, 9.71 examples/second between steps 10 and 11
    I0318 17:15:02.157380 281473232451680 model_training_utils.py:518] Train Step: 12/20  / loss = 8.080007553100586  masked_lm_accuracy = 0.056860  lm_example_loss = 7.410184  next_sentence_accuracy = 0.583333  next_sentence_loss = 0.669823
        (snip)

Check the examples/second output at each step. Since performance is unstable for the first few steps, look at the 10th step and beyond.

On FX700 (2.0GHz), the expected training results for run1proc.sh and run1node.sh are 9.5~10 examples/sec.
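
If the run output is saved to a file, the examples/second figures from the 10th step onward can be averaged with standard tools. A sketch (the log file name run.log is hypothetical):

```shell
# Average examples/second from the "between steps 9 and 10" line onward.
grep -o '[0-9.]* examples/second between steps [0-9]* and' run.log \
  | awk '$5 >= 9 { sum += $1; n++ }
         END { if (n) printf "avg %.2f examples/sec over %d steps\n", sum/n, n }'
```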

Note that run1node.sh launches two TensorFlow processes, and each process runs the same workload as run1proc.sh and outputs its own results. The overall throughput is the sum of the two results; because of this, the performance of each process is slightly lower than that of run1proc.sh.

        (snip)
    I0320 09:52:47.262619 281473322301712 keras_utils.py:145] TimeHistory: 4.98 seconds, 9.65 examples/second between steps 11 and 12
    I0320 09:52:47.262978 281473479260432 keras_utils.py:145] TimeHistory: 4.98 seconds, 9.65 examples/second between steps 11 and 12
    I0320 09:52:52.275244 281473322301712 keras_utils.py:145] TimeHistory: 5.00 seconds, 9.60 examples/second between steps 12 and 13
    I0320 09:52:52.276695 281473479260432 keras_utils.py:145] TimeHistory: 5.01 seconds, 9.58 examples/second between steps 12 and 13
        (snip)

04_Mask-R-CNN

Use the official model from Google.

https://github.com/tensorflow/models/tree/master/research/object_detection
Commit id: dc4d11216b (2020/11/8)

$ pwd
/home/user/tensorflow/fcc_build_script/sample_script/04_Mask-R-CNN

$ bash 40_setup_mask-r-cnn.sh              # Setup (25 min.)
$ bash 41_dataset.sh                       # Download the training data (26GB) (3 hours 30 min.)
$ bash run1proc.sh                         # Run (1 node, 1 proc., 24 cores)
$ bash run1node.sh                         # Run (1 node, 2 proc., 24 cores/proc)

Scripts for two or more nodes are not provided. Please create your own based on run1node.sh.

The following is an example of the output (see the lines marked with the arrow signs).

    INFO:tensorflow:Step 1 per-step time 190.637s loss=8.223
    INFO:tensorflow:Step 2 per-step time 8.795s loss=8.062
       (snip)
    INFO:tensorflow:Step 20 per-step time 9.553s loss=4.466
--> INFO:tensorflow:Avg per-step time 9.012s Avg per-step batch 0.222

On FX700 (2.0GHz), the expected result for run1proc.sh is around 0.22 batch/sec, and expected result for run1node.sh is around 0.26 batch/sec.

Note that in run1node.sh, two processes output their respective results, but batch/sec is calculated based on the total number of batches. (The outputs of the two processes are not exactly identical, because each process measures its elapsed time independently.)

    INFO:tensorflow:Step 19 per-step time 15.236s loss=4.568
    INFO:tensorflow:Step 20 per-step time 14.750s loss=4.464
--> INFO:tensorflow:Avg per-step time 15.018s Avg per-step batch 0.266
    INFO:tensorflow:Step 20 per-step time 14.718s loss=4.543
--> INFO:tensorflow:Avg per-step time 15.024s Avg per-step batch 0.266

4. Troubleshooting

An error occurred while building NumPy in '04_numpy_scipy.sh'.

Two causes are possible.

  • If you get an error about _ctypes, libffi-devel is missing; install it with yum install libffi-devel.
  • If a Fortran compiler cannot be found, install gcc-gfortran with yum.

5. List of Software Version

The versions of the major software components are listed in the table below.

| Software | Version | License | Remarks |
| --- | --- | --- | --- |
| Python | 3.9.x (2021/10/4 or later) | GPL | 'x' depends on the installation date (the latest commit in branch 3.9 is used) |
| TensorFlow | 2.11.0 (2022/11/16) | Apache 2.0 | |
| bazel | 5.3.0 (2022/08/23) | Apache 2.0 | |
| oneDNN | v2.7.0 (2022/09/28) | Apache 2.0 | |
| BatchedBlas | 1.0 (2021/2/9) | BSD-3 | |
| Horovod | 0.26.1 (2022/10/14) | Apache 2.0 | |
| NumPy | 1.22.x (2021/12/30 or later) | BSD-3 | 'x' depends on the installation date (the latest commit in branch 1.22 is used) |
| SciPy | 1.7.x (2021/6/19 or later) | BSD-3 | 'x' depends on the installation date (the latest commit in branch 1.7 is used) |

For other software modules, basically the latest versions available at the time of installation are used.

pip3 list

After running the installation scripts, a file named pip3_list.txt is generated. The following is the contents of the file after installing TensorFlow and all sample models (as of 3/18/2023). Note that module versions may differ depending on the installation date.

Package                      Version
---------------------------- ---------
absl-py                      1.4.0
astunparse                   1.6.3
beniget                      0.4.1
cachetools                   5.3.0
certifi                      2022.12.7
cffi                         1.15.1
charset-normalizer           3.1.0
cloudpickle                  2.2.1
Cython                       0.29.33
flatbuffers                  23.3.3
future                       0.18.3
gast                         0.4.0
google-auth                  2.16.2
google-auth-oauthlib         0.4.6
google-pasta                 0.2.0
grpcio                       1.51.3
h5py                         3.8.0
horovod                      0.26.1
idna                         3.4
importlib-metadata           6.0.0
keras                        2.11.0
libclang                     15.0.6.1
Markdown                     3.4.1
MarkupSafe                   2.1.2
numpy                        1.22.4
oauthlib                     3.2.2
opt-einsum                   3.3.0
packaging                    23.0
pip                          23.0.1
ply                          3.11
protobuf                     3.19.6
psutil                       5.9.4
pyasn1                       0.4.8
pyasn1-modules               0.2.8
pybind11                     2.10.4
pycparser                    2.21
pythran                      0.12.1
PyYAML                       6.0
requests                     2.28.2
requests-oauthlib            1.3.1
rsa                          4.9
SciPy                        1.7.3
setuptools                   67.6.0
six                          1.16.0
tensorboard                  2.11.2
tensorboard-data-server      0.6.1
tensorboard-plugin-wit       1.8.1
tensorflow                   2.11.0
tensorflow-estimator         2.11.0
tensorflow-io-gcs-filesystem 0.31.0
termcolor                    2.2.0
typing_extensions            4.5.0
urllib3                      1.26.15
Werkzeug                     2.2.3
wheel                        0.40.0
wrapt                        1.15.0
zipp                         3.15.0

Copyright

Copyright RIKEN, Japan 2021-2023
Copyright FUJITSU LIMITED 2021-2023
