# Multi-node examples
Use these templates for multi-node training.
The main complexity around cluster training is how you submit the SLURM jobs.
The minimal demo below launches a job which runs on 2 nodes, each using 2 GPUs.

## Test-tube
Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.

To run a hyperparameter search, add the values you want to search over to the `HyperOptArgumentParser`:
```python
from test_tube import HyperOptArgumentParser

parser = HyperOptArgumentParser(strategy='grid_search')
parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
parser.opt_list('--learning_rate', default=0.001, type=float,
                options=[0.0001, 0.0005, 0.001],
                tunable=True)

# give your model (a LightningModule subclass) a chance to add its own parameters
parser = LightningTemplateModel.add_model_specific_args(parser, root_dir)

# parse args
hyperparams = parser.parse_args()
```

The above sets up a grid search over learning rate and drop probability. You can now pass this object
to the cluster object to perform the grid search:
```python
from test_tube import SlurmCluster

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path='/path/to/log/slurm/files',
)

# ... configure cluster options

# run grid search on cluster
nb_trials = 6  # (2 drop probs * 3 lrs)
cluster.optimize_parallel_cluster_gpu(
    YourMainFunction,
    nb_trials=nb_trials,
    job_name=hyperparams.experiment_name
)
```

Running the above will launch 6 jobs, each with a different drop prob and learning rate combination.
The `tunable` parameter must be set to `True` to add that argument to the space of options; otherwise
test-tube will use the `default` value.
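
The trial count (2 drop probs * 3 learning rates = 6) is simply the size of the Cartesian product of the tunable options. A quick sketch of that bookkeeping, independent of test-tube:

```python
import itertools

# the tunable options from the parser above
drop_probs = [0.2, 0.5]
learning_rates = [0.0001, 0.0005, 0.001]

# every trial is one combination from the Cartesian product
trials = list(itertools.product(drop_probs, learning_rates))
nb_trials = len(trials)  # 6 jobs, one per combination
```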


## SLURM Flags
However you decide to submit your jobs, debugging requires a few flags. Without these flags, you'll
see an NCCL error instead of the actual error which caused the bug.

```sh
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
```
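
If your entry point is a Python script rather than a shell wrapper, you can approximate the same behavior in-process. This is a sketch, not part of the Lightning API: `faulthandler.enable()` gives the traceback-on-crash behavior of `PYTHONFAULTHANDLER=1`, and setting `os.environ` makes `NCCL_DEBUG` visible to worker processes spawned afterwards.

```python
import faulthandler
import os

# dump Python tracebacks on hard crashes
# (equivalent to launching with PYTHONFAULTHANDLER=1)
faulthandler.enable()

# must be set before any NCCL communicator is created;
# inherited by worker processes spawned later
os.environ["NCCL_DEBUG"] = "INFO"
```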

On some clusters you might need to set the network interface with this flag:
```sh
export NCCL_SOCKET_IFNAME=^docker0,lo
```

You might also need to load the latest version of NCCL:
```sh
module load NCCL/2.4.7-1-cuda.10.0
```

Finally, you must set the master port (usually a random number between 12k and 20k):
```sh
# random port between 12k and 20k
export MASTER_PORT=$((12000 + RANDOM % 8000))
```
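
The shell arithmetic above picks a random port in that range. If you set the port from Python instead, one option (an illustration, not something Lightning requires) is to keep retrying until the OS lets you bind the port, which avoids collisions when many jobs share a login node:

```python
import os
import random
import socket

def pick_master_port():
    # try random ports in the 12k-20k range, keeping the
    # first one the OS lets us bind (i.e. not already taken)
    for _ in range(20):
        port = random.randint(12000, 19999)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("", port))
                return port
            except OSError:
                continue  # port in use, try another
    raise RuntimeError("no free port found in 12000-19999")

os.environ["MASTER_PORT"] = str(pick_master_port())
```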

## Simplest example
1. Modify this script with your CoolModel file.
2. Update and submit [this bash script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/minimal_multi_node_demo_script.sh):
```bash
sbatch minimal_multi_node_demo_script.sh
```

## Grid search on a cluster

#### Option 1: Run on cluster using your own SLURM script
The trainer and model will work on a cluster if you configure your SLURM script correctly.

1. Update [this demo SLURM script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/demo_script.sh).
2. Submit the script:
```bash
sbatch demo_script.sh
```

Most people have some way of generating their scripts automatically.
To run a grid search this way, you'd need to generate one script for every combination of
hyperparameters you want to search over.
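
As a sketch of that approach (the file names and SBATCH options here are placeholders, not taken from the demo scripts), you can render one submission script per hyperparameter combination:

```python
import itertools
from pathlib import Path

# hypothetical search space matching the grid above
grid = {"drop_prob": [0.2, 0.5], "learning_rate": [0.0001, 0.0005, 0.001]}

TEMPLATE = """#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --job-name=trial_{idx}

python train.py --drop_prob {drop_prob} --learning_rate {learning_rate}
"""

def write_scripts(out_dir):
    """Write one SLURM submission script per hyperparameter combination."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    keys = sorted(grid)
    scripts = []
    for idx, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        params = dict(zip(keys, values))
        path = out_dir / "trial_{}.sh".format(idx)
        path.write_text(TEMPLATE.format(idx=idx, **params))
        scripts.append(path)
    return scripts  # submit each with: sbatch trial_<idx>.sh
```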

#### Option 2: Use test-tube for the SLURM script
With test-tube we can automatically generate SLURM scripts for different hyperparameter options.
98 | | - |
99 | | -To run this demo: |
100 | | -```bash |
101 | | -source activate YourCondaEnv |
102 | | - |
103 | | -python multi_node_cluster_auto_slurm.py --email [email protected] --gpu_partition your_partition --conda_env YourCondaEnv |
104 | | -``` |

That will submit 6 jobs, each with a specific combination of hyperparameters. Each job will also run on 2 nodes,
where each node has 8 GPUs.
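
For reference, the distributed world size in that setup is just nodes times GPUs per node, since one process runs per GPU. A trivial sanity check you can mirror in your own launch code (the function name is illustrative):

```python
def world_size(num_nodes, gpus_per_node):
    # total number of distributed processes (one per GPU)
    return num_nodes * gpus_per_node

# the grid-search demo's configuration: 2 nodes with 8 GPUs each
print(world_size(2, 8))  # 16 processes in total
```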