Commit e739c79: cleaned up demos

1 parent 94f89e8 commit e739c79

6 files changed: +14 additions, -395 deletions

Lines changed: 4 additions & 104 deletions
@@ -1,107 +1,7 @@
-# Multi-node examples
-Use these templates for multi-node training.
-The main complexity around cluster training is how you submit the SLURM jobs.
+# Multi-node example
 
-## Test-tube
-Lightning uses test-tube to submit SLURM jobs and to run hyperparameter searches on a cluster.
+Run this module to launch a job which runs on 2 nodes each using 2 GPUs.
 
-To run a hyperparameter search, we normally add the values to search to the hyperparameter optimizer:
-```python
-from test_tube import HyperOptArgumentParser
-
-parser = HyperOptArgumentParser(strategy='grid_search')
-parser.opt_list('--drop_prob', default=0.2, options=[0.2, 0.5], type=float, tunable=True)
-parser.opt_list('--learning_rate', default=0.001, type=float,
-                options=[0.0001, 0.0005, 0.001],
-                tunable=True)
-
-# give your model a chance to add its own parameters
-parser = LightningTemplateModel.add_model_specific_args(parent_parser, root_dir)
-
-# parse args
-hyperparams = parser.parse_args()
-```
-
-The above sets up a grid search on learning rate and drop probability. You can now add this object to the
-cluster object to perform the grid search:
-```python
-cluster = SlurmCluster(
-    hyperparam_optimizer=hyperparams,
-    log_path='/path/to/log/slurm/files',
-)
-
-# ... configure cluster options
-
-# run grid search on cluster
-nb_trials = 6  # (2 drop probs * 3 lrs)
-cluster.optimize_parallel_cluster_gpu(
-    YourMainFunction,
-    nb_trials=nb_trials,
-    job_name=hyperparams.experiment_name
-)
-```
-
-Running the above will launch 6 jobs, each with a different drop prob and learning rate combination.
-The `tunable` parameter must be set to True to add that argument to the space of options, otherwise
-Test-Tube will use the `default` value.
-
-## SLURM Flags
-However you decide to submit your jobs, debugging requires a few flags. Without these flags, you'll
-see an NCCL error instead of the actual error which caused the bug.
-
-```sh
-export NCCL_DEBUG=INFO
-export PYTHONFAULTHANDLER=1
-```
-
-On some clusters you might need to set the network interface with this flag.
-```sh
-export NCCL_SOCKET_IFNAME=^docker0,lo
-```
-
-You might also need to load the latest version of NCCL.
-```sh
-module load NCCL/2.4.7-1-cuda.10.0
-```
-
-Finally, you must set the master port (usually a random number between 12k and 20k).
-```sh
-# random port between 12k and 20k
-export MASTER_PORT=$((12000 + RANDOM % 20000))
-```
-
-## Simplest example
-1. Modify this script with your CoolModel file.
-2. Update and submit [this bash script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/minimal_multi_node_demo_script.sh)
-```bash
-sbatch minimal_multi_node_demo_script.sh
-```
-
-## Grid search on a cluster
-
-#### Option 1: Run on cluster using your own SLURM script
-The trainer and model will work on a cluster if you configure your SLURM script correctly.
-
-1. Update [this demo slurm script](https://github.com/williamFalcon/pytorch-lightning/blob/master/examples/new_project_templates/multi_node_examples/demo_script.sh).
-2. Submit the script:
 ```bash
-$ sbatch demo_script.sh
-```
-
-Most people have some way they automatically generate their own scripts.
-To run a grid search this way, you'd need a way to automatically generate scripts using all the combinations of
-hyperparameters to search over.
-
-#### Option 2: Use test-tube for the SLURM script
-With test-tube we can automatically generate SLURM scripts for different hyperparameter options.
-
-To run this demo:
-```bash
-source activate YourCondaEnv
-
-python multi_node_cluster_auto_slurm.py --email [email protected] --gpu_partition your_partition --conda_env YourCondaEnv
-```
-
-That will submit 6 jobs. Each job will have a specific combination of hyperparams. Each job will also run on 2 nodes
-where each node has 8 gpus.
+bash job_submit.sh
+```
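The `MASTER_PORT` export removed in this diff deserves a closer look: bash's `$RANDOM` is uniform over 0..32767, so `12000 + RANDOM % 20000` actually lands anywhere in 12000..31999, not the "12k and 20k" range its comment promises. A minimal sketch; the `% 8000` variant is a suggested correction, not part of the original script:

```shell
#!/bin/bash
# Original expression: the comment claims 12k-20k, but RANDOM % 20000
# spans 0..19999, so the port lands anywhere in 12000..31999.
export MASTER_PORT=$((12000 + RANDOM % 20000))
echo "original: $MASTER_PORT"

# Suggested fix to actually stay inside 12000..19999:
export MASTER_PORT=$((12000 + RANDOM % 8000))
echo "corrected: $MASTER_PORT"
```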

examples/multi_node_examples/demo_script.sh

Lines changed: 0 additions & 66 deletions
This file was deleted.

examples/multi_node_examples/minimal_multi_node_demo_script.sh renamed to examples/multi_node_examples/job_submit.sh

Lines changed: 4 additions & 7 deletions
@@ -1,9 +1,9 @@
 #!/bin/bash -l
 
 # SLURM SUBMIT SCRIPT
-#SBATCH --nodes=4
-#SBATCH --gres=gpu:4
-#SBATCH --ntasks-per-node=4
+#SBATCH --nodes=2
+#SBATCH --gres=gpu:2
+#SBATCH --ntasks-per-node=2
 #SBATCH --mem=0
 #SBATCH --time=0-02:00:00
 
@@ -23,8 +23,5 @@ conda activate my_env
 # module load NCCL/2.4.7-1-cuda.10.0
 # -------------------------
 
-# random port between 12k and 20k
-export MASTER_PORT=$((12000 + RANDOM % 20000))
-
 # run script from above
-python minimal_multi_node_demo.py
+python multi_node_demo.py
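The renamed `job_submit.sh` requests 2 nodes with 2 GPUs and 2 tasks per node; inside the allocation, SLURM exports that geometry as environment variables (`SLURM_NNODES`, `SLURM_NTASKS_PER_NODE`), which is how a training script can derive its world size. A hedged sketch, with the variables stubbed to mirror the `#SBATCH` directives since we are not inside a real allocation:

```shell
#!/bin/bash
# Stub the values SLURM would export inside the job
# (2 nodes x 2 tasks per node, matching the #SBATCH directives above).
: "${SLURM_NNODES:=2}"
: "${SLURM_NTASKS_PER_NODE:=2}"

# Total number of distributed processes across the cluster.
WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))
echo "world size: $WORLD_SIZE"
```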

examples/multi_node_examples/minimal_multi_node_demo.py

Lines changed: 0 additions & 24 deletions
This file was deleted.

0 commit comments