# Multi-node examples
Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.

## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.

To run a hyperparameter search, we normally add the values to search to the `HyperOptArgumentParser`:
```python
import os

from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model a chance to add its own parameters
root_dir = os.path.dirname(os.path.realpath(__file__))
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```
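Outside of a cluster run, the parsed object behaves like a normal argparse `Namespace`. As a quick illustration of the defaults defined above, here is a plain-argparse equivalent (standard library only, test-tube not required):

```python
import argparse

# plain-argparse sketch of the defaults registered via opt_list above
parser = argparse.ArgumentParser()
parser.add_argument('--drop_prob', default=0.2, type=float)
parser.add_argument('--learning_rate', default=0.001, type=float)

hyperparams = parser.parse_args([])  # no CLI args -> defaults apply
print(hyperparams.drop_prob, hyperparams.learning_rate)
```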

The above sets up a grid search on learning rate and drop probability. You can now pass this
object to the cluster object to perform the grid search:
```python
from test_tube.hpc import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 lrs)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```
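The elided "configure cluster options" step typically sets the per-trial SLURM resources. A sketch using test-tube's `SlurmCluster` attributes (the specific values here are illustrative assumptions, not requirements):

```python
# configure resources requested for each trial's SLURM job
# (values below are placeholders -- tune them for your cluster)
cluster.per_experiment_nb_gpus = 4     # GPUs per trial
cluster.per_experiment_nb_nodes = 1    # nodes per trial
cluster.job_time = '1:00:00'           # SLURM walltime
cluster.memory_mb_per_node = 10000     # memory request per node

# arbitrary sbatch flags can also be injected
cluster.add_slurm_cmd(cmd='partition', value='gpu', comment='which partition to use')
```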

Running the above will launch 6 jobs, each with a different drop prob and learning rate combination.
The `tunable` parameter must be set to `True` to add that argument to the space of options; otherwise
Test-Tube will just use the `default` value.
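The six trials correspond to the Cartesian product of the two option lists. A quick plain-Python sketch of the combinations a grid search enumerates (this mimics the behavior, it is not the library itself):

```python
from itertools import product

drop_probs = [0.2, 0.5]
learning_rates = [0.0001, 0.0005, 0.001]

# grid search runs one trial per (drop_prob, learning_rate) pair
trials = list(product(drop_probs, learning_rates))
print(len(trials))  # 2 * 3 = 6 trials
```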

## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without these flags,
you'll see an NCCL error instead of the actual error that caused the bug.
| 52 | + |
| 53 | +```sh |
| 54 | +export NCCL_DEBUG=INFO |
| 55 | +export PYTHONFAULTHANDLER=1 |
| 56 | +``` |

On some clusters you might need to set the network interface with this flag:
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```

You might also need to load the latest version of NCCL:
```sh
module load NCCL/2.4.7-1-cuda.10.0
```

Finally, you must set the master port (usually a random number between 12k and 20k):
```sh
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
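As a sanity check on the arithmetic: bash's `$RANDOM` is uniform on 0..32767, so a modulus of 8000 is what keeps the result inside the stated 12k-20k window. A small Python mirror of the expression (the function name here is ours, not part of any API):

```python
import random

def random_master_port(rng=random):
    # mirrors: export MASTER_PORT=$((12000 + RANDOM % 8000))
    # $RANDOM spans 0..32767; mod 8000 keeps the port in [12000, 19999]
    return 12000 + rng.randrange(32768) % 8000

print(random_master_port())
```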

## Simplest example
1. Modify this script with your CoolModel file.