If you `conda activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
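
For illustration, a safe ordering activates the environment first and loads the OpenMPI module afterwards, so that the cluster's `mpicc` shadows Anaconda's when `mpi4py` is built (a minimal sketch; the environment name is a placeholder):

```
module load anaconda3
conda activate my_env          # placeholder environment name
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64

# mpi4py now builds against the cluster's OpenMPI rather than Anaconda's
env MPICC=$(which mpicc) pip install mpi4py
```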
#### Location of the data on Tigress
For batch analysis, make sure to allocate 1 MPI process per GPU. Save the following in a batch script:

```
#SBATCH --mem-per-cpu=0
module load anaconda3
conda activate my_env
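# Restrict OpenMPI's byte transfer layers (BTLs) to TCP, self, and shared memory (vader)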
export OMPI_MCA_btl="tcp,self,vader"
module load cudatoolkit cudnn
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
```
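
Assuming the script above is saved as `batch_analysis.slurm` (the filename is a placeholder), it can be submitted and monitored with the standard Slurm commands:

```
sbatch batch_analysis.slurm   # submit the batch job
squeue -u <netid>             # check its state in the queue
```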

Then, for an interactive run, launch the application from the command line:

```
mpirun -N 4 python mpi_learn.py
```
where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session.
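
For reference, an interactive allocation matching the 1-process-per-GPU layout could be requested along these lines (the node, GPU, and time values are assumptions; adjust them to your cluster and allocation):

```
# Request one node with 4 GPUs and 4 MPI tasks (one task per GPU)
salloc --nodes=1 --ntasks-per-node=4 --gres=gpu:4 --time=01:00:00
# Inside the allocation, launch with mpirun as shown above
```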
[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same.)
[//]: # (HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)
[//]: # (Consistent with https://www.open-mpi.org/faq/?category=slurm ?)
A regular FRNN run will produce several outputs and callbacks.

TensorBoard support currently includes graph visualization; histograms of weights, activations, and biases; and scalar variable summaries of losses and accuracies.
The summaries are written in real time to `/tigress/<netid>/Graph`. On macOS, you can set up an `sshfs` mount of the `/tigress` filesystem and view those summaries in your browser.
To install SSHFS on a macOS system, you could follow the instructions here:
https://github.com/osxfuse/osxfuse/wiki/SSHFS
Or use [Homebrew](https://brew.sh/): `brew cask install osxfuse; brew install sshfs`. Note: to install and/or use `osxfuse`, you may need to enable its kernel extension in System Preferences → Security & Privacy → General.
Then mount the remote filesystem with something like:
```
sshfs -o allow_other,defer_permissions <netid>@<cluster>.princeton.edu:/tigress/<netid>/ <destination folder name on your laptop>/
```
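
Note that the local destination folder must exist before mounting, and the mount can be detached when you are done (paths are illustrative):

```
mkdir -p ~/tigress     # create the local mount point before running sshfs
umount ~/tigress       # detach the mount when finished
```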
Launch TensorBoard locally (assuming that it is installed on your local computer):
```
python -m tensorboard.main --logdir <destination folder name on your laptop>/Graph
```
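
If the `tensorboard` console script is on your `PATH` (it is installed with the TensorBoard Python package), an equivalent invocation is:

```
tensorboard --logdir <destination folder name on your laptop>/Graph
```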
A URL should be emitted to the console output. Navigate to this link in your browser. If the TensorBoard interface does not open, try directing your browser to `localhost:6006`.

This uses the file produced by training the neural network as its input, and produces several `.png` files with plots as output.
In addition, you can check the scalar variable summaries for training loss, validation loss, and validation ROC logged in `/tigress/<netid>/csv_logs` (each run produces a new log file with a timestamp in its name).
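
As a minimal sketch of plotting one of these logs, assuming Keras-style column names such as `epoch`, `loss`, and `val_loss` (the actual columns depend on the FRNN version):

```
import glob

import matplotlib.pyplot as plt
import pandas as pd

# Pick the most recent timestamped log; the path and column names are assumptions
log_file = sorted(glob.glob("/tigress/<netid>/csv_logs/*.csv"))[-1]
df = pd.read_csv(log_file)

plt.plot(df["epoch"], df["loss"], label="training loss")
plt.plot(df["epoch"], df["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.savefig("learning_curve.png")
```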
Sample code for this analysis can be found in `examples/notebooks`.
### Learning curve summaries per mini-batch
To extract per-mini-batch summaries, use the FRNN output logged to standard out (for batch jobs, it is all contained in the Slurm output file). Refer to the following notebook to perform the learning-curve analysis at the mini-batch level: [FRNN_scaling.ipynb](https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb)