
Commit e339799

Update README.md
1 parent 50f5e4b commit e339799

File tree

1 file changed: +71 -1 lines changed
  • examples/new_project_templates/multi_node_examples

examples/new_project_templates/multi_node_examples/README.md

Lines changed: 71 additions & 1 deletion
# Multi-node examples

Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.

## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
To run a hyperparameter search, add the values to search over to the `HyperOptArgumentParser`:
```python
from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model a chance to add its own parameters
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```
The above sets up a grid search on learning rate and drop probability. You can now pass this object to the
cluster object to perform the grid search:
```python
from test_tube.hpc import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 lrs)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```
Running the above will launch 6 jobs, one for each drop prob and learning rate combination.
The `tunable` parameter must be set to `True` to add that argument to the space of options; otherwise
Test-Tube will use the `default` value.
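A grid search simply enumerates the cartesian product of every `tunable` option list. As a minimal illustration (standard library only, not part of Test-Tube), the 6 trials above come from:

```python
import itertools

# the tunable option lists from the parser above
drop_probs = [0.2, 0.5]
learning_rates = [0.0001, 0.0005, 0.001]

# a grid search runs one trial per combination
trials = list(itertools.product(drop_probs, learning_rates))
print(len(trials))  # 6 trials: 2 drop probs * 3 learning rates
```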
## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without these flags, you'll
see an NCCL error instead of the actual error which caused the bug.

```sh
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
```
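`PYTHONFAULTHANDLER=1` enables Python's built-in `faulthandler` module at interpreter startup, so a hard crash (for example a segfault inside NCCL) dumps a Python traceback instead of dying silently. For reference, the same effect can be had from code:

```python
import faulthandler

# equivalent to launching the interpreter with PYTHONFAULTHANDLER=1
faulthandler.enable()
print(faulthandler.is_enabled())  # True
```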
On some clusters you might need to set the network interface with this flag:
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```
You might also need to load the latest version of NCCL:
```sh
module load NCCL/2.4.7-1-cuda.10.0
```
Finally, you must set the master port (usually a random number between 12k and 20k):
```sh
# random port between 12000 and 19999
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
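Every process in the job must agree on this port. As a minimal sketch (not part of this README) of how a worker script might consume it, with a hypothetical fixed fallback for single-machine debugging:

```python
import os

# read the rendezvous port set by the submission script;
# 12910 is an arbitrary fallback for local runs
master_port = int(os.environ.get('MASTER_PORT', '12910'))
assert 12000 <= master_port < 20000, 'expected a port in the 12k-20k range'
print(master_port)
```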
## Simplest example

1. Modify this script with your CoolModel file.
