(Alpha release - usage might change later)

The tensorlayer.cli.train module provides the ``tl train`` subcommand.
It helps the user bootstrap a TensorFlow/TensorLayer program for distributed training
using multiple GPU cards or CPUs on a computer.

You first need to set the `CUDA_VISIBLE_DEVICES <http://acceleware.com/blog/cudavisibledevices-masking-gpus>`_
environment variable to tell ``tl train`` which GPUs are available. If CUDA_VISIBLE_DEVICES is not set,
``tl train`` will try its best to discover all available GPUs.
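
For example, the variable can be set for the whole shell session or only for a single run (a minimal shell sketch; the device indices ``0,1`` are illustrative):

.. code-block:: bash

  # make only GPU 0 and GPU 1 visible for the rest of this shell session
  export CUDA_VISIBLE_DEVICES="0,1"

  # or set it for a single invocation only
  CUDA_VISIBLE_DEVICES="0,1" tl train example/tutorial_mnist_distributed.py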

In distributed training, each TensorFlow program needs a ``TF_CONFIG`` environment variable to describe
the cluster. It also needs a master daemon to monitor all trainers. ``tl train`` is responsible for
automatically managing both of these tasks.
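
``tl train`` sets ``TF_CONFIG`` for every process it launches, so you normally never write it by hand. For reference, it follows TensorFlow's standard cluster-spec format; a simplified, illustrative value (the host names, ports, and task assignment below are placeholders) looks like this:

.. code-block:: bash

  # illustrative TF_CONFIG for the first worker of a 1-ps / 2-worker cluster
  export TF_CONFIG='{
    "cluster": {
      "ps":     ["localhost:2222"],
      "worker": ["localhost:2223", "localhost:2224"]
    },
    "task": {"type": "worker", "index": 0}
  }'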

Usage
-----

tl train [-h] [-p NUM_PSS] [-c CPU_TRAINERS] <file> [args [args ...]]

.. code-block:: bash

  # example of using GPU 0 and 1 for training mnist
  CUDA_VISIBLE_DEVICES="0,1" \
  tl train example/tutorial_mnist_distributed.py

Notes
-----
A parallel training program requires multiple parameter servers
to help the parallel trainers exchange intermediate gradients.
The best number of parameter servers is often proportional to the
size of your model as well as to the number of CPUs available.
You can control the number of parameter servers with the ``-p`` parameter.
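
For instance, the following run (a sketch; two parameter servers is an illustrative choice, not a tuned value) trains the MNIST example with two parameter servers:

.. code-block:: bash

  # ask tl train to launch 2 parameter servers alongside the GPU trainers
  CUDA_VISIBLE_DEVICES="0,1" tl train -p 2 example/tutorial_mnist_distributed.py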

If you have a single computer with many CPUs, you can use the ``-c`` parameter
to enable CPU-only parallel training.
We do not support GPU-CPU co-training because GPUs and CPUs run at
different speeds; using them together in training would introduce stragglers.
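
For example, the following invocation (a sketch; 8 trainers is illustrative and should be chosen based on the number of CPU cores available) runs the MNIST example with CPU-only trainers:

.. code-block:: bash

  # launch 8 CPU-only trainers; no GPU is used
  tl train -c 8 example/tutorial_mnist_distributed.py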