Commit c76ce98

Update PrincetonUTutorial.md
- "conda activate" is preferred to "source activate since v4.4 (December 2017) https://www.anaconda.com/how-to-get-ready-for-the-release-of-conda-4-4/ - Replace "python -m tensorflow.tensorboard ..." with "python -m tensorboard.main" (when / which version changed this?)
1 parent 29650c6 commit c76ce98

File tree

1 file changed: +18 -13 lines changed

docs/PrincetonUTutorial.md

Lines changed: 18 additions & 13 deletions
@@ -22,7 +22,7 @@ After that, create an isolated Anaconda environment and load CUDA drivers, an MP
 #cd plasma-python
 module load anaconda3
 conda create --name my_env --file requirements-travis.txt
-source activate my_env
+conda activate my_env
 
 export OMPI_MCA_btl="tcp,self,vader"
 # replace "vader" with "sm" for OpenMPI versions prior to 3.0.0
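
A note on the change above: with conda >= 4.4, `conda activate` relies on shell functions that `conda init` installs; on clusters where `module load anaconda3` only puts `conda` on the `PATH`, sourcing the bundled `conda.sh` first is a common workaround. A minimal sketch, with the path resolved via `conda info --base` (the exact layout varies by installation):

```bash
# Enable "conda activate" in a shell that has not run "conda init";
# the conda.sh location is assumed to follow the standard Anaconda layout.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate my_env
```
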
@@ -42,7 +42,7 @@ Currently Loaded Modulefiles:
 Next, install the `plasma-python` package:
 
 ```bash
-#source activate my_env
+#conda activate my_env
 python setup.py install
 ```
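
After the install step, a quick import check confirms the package landed in the active environment; the top-level module name `plasma` is an assumption here, not stated in the diff:

```bash
# Hypothetical post-install sanity check (import name "plasma" assumed):
python -c "import plasma; print(plasma.__file__)"
```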

@@ -57,7 +57,7 @@ $ which mpicc
 /usr/local/openmpi/cuda-8.0/3.0.0/intel170/x86_64/bin/mpicc
 ```
 
-If you `source activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
+If you `conda activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
 
 #### Location of the data on Tigress
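
To see why the load-then-activate ordering matters, it helps to confirm which MPI toolchain is actually visible before building anything; a small sketch, assuming `mpi4py` is already installed in the environment:

```bash
# Check the compiler wrapper and the MPI library mpi4py was linked against:
which mpicc   # should point at the cluster OpenMPI, not ~/.conda/envs/my_env/bin
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```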

@@ -104,7 +104,7 @@ For batch analysis, make sure to allocate 1 MPI process per GPU. Save the follow
 #SBATCH --mem-per-cpu=0
 
 module load anaconda3
-source activate my_env
+conda activate my_env
 export OMPI_MCA_btl="tcp,self,vader"
 module load cudatoolkit cudann
 module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
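
For context, the fragment above sits inside a complete Slurm script. The skeleton below is an illustrative sketch only: the node count, GPU count, and walltime are assumptions, not values taken from the tutorial:

```bash
#!/bin/bash
# Illustrative skeleton; all #SBATCH values below are assumptions.
#SBATCH -N 1                  # one node
#SBATCH --ntasks-per-node=4   # 1 MPI process per GPU
#SBATCH --gres=gpu:4          # 4 GPUs per node assumed
#SBATCH -t 01:00:00
#SBATCH --mem-per-cpu=0

module load anaconda3
conda activate my_env
export OMPI_MCA_btl="tcp,self,vader"
module load cudatoolkit cudann
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64

mpirun python mpi_learn.py
```
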
@@ -148,7 +148,10 @@ Then, launch the application from the command line:
 mpirun -N 4 python mpi_learn.py
 ```
 where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session.
-[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same. HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)
+
+[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same.)
+
+[//]: # (HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)
 
 [//]: # (Consistent with https://www.open-mpi.org/faq/?category=slurm ?)
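
For reference, `-N` has several equivalent OpenMPI spellings; these are standard OpenMPI 3.x options, listed only to unpack the synonym noted above:

```bash
# Three equivalent ways to request 4 processes per node with OpenMPI:
mpirun -N 4 python mpi_learn.py
mpirun -npernode 4 python mpi_learn.py
mpirun --map-by ppr:4:node python mpi_learn.py
```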

@@ -210,20 +213,23 @@ A regular FRNN run will produce several outputs and callbacks.
 
 Currently supports graph visualization, histograms of weights, activations and biases, and scalar variable summaries of losses and accuracies.
 
-The summaries are written real time to `/tigress/<netid>/Graph`. For MacOS, you can set up the `sshfs` mount of /tigress filesystem and view those summaries in your browser.
+The summaries are written in real time to `/tigress/<netid>/Graph`. For macOS, you can set up the `sshfs` mount of the `/tigress` filesystem and view those summaries in your browser.
 
-For Mac, you could follow the instructions here:
+To install SSHFS on a macOS system, you could follow the instructions here:
 https://github.com/osxfuse/osxfuse/wiki/SSHFS
+Or use [Homebrew](https://brew.sh/), `brew cask install osxfuse; brew install sshfs`. Note, to install and/or use `osxfuse` you may need to enable its kernel extension in: System Preferences → Security & Privacy → General
 
 then do something like:
 ```
-sshfs -o allow_other,defer_permissions netid@tigressdata.princeton.edu:/tigress/netid/ /mnt/<destination folder name on your laptop>/
+sshfs -o allow_other,defer_permissions <netid>@tigressdata.princeton.edu:/tigress/<netid>/ <destination folder name on your laptop>/
 ```
 
-Launch TensorBoard locally:
+Launch TensorBoard locally (assuming that it is installed on your local computer):
 ```
-python -m tensorflow.tensorboard --logdir /mnt/<destination folder name on your laptop>/Graph
+python -m tensorboard.main --logdir <destination folder name on your laptop>/Graph
 ```
+A URL should be emitted to the console output. Navigate to this link in your browser. If the TensorBoard interface does not open, try directing your browser to `localhost:6006`.
+
 You should see something like:
 
 ![tensorboard example](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/tb.png)
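
Putting the new plus-lines together, the local workflow looks roughly like the sketch below; the mount point `~/tigress_mount` is an arbitrary choice and `<netid>` remains a placeholder:

```bash
# Mount the remote filesystem, view the summaries, then unmount.
mkdir -p ~/tigress_mount
sshfs -o allow_other,defer_permissions <netid>@tigressdata.princeton.edu:/tigress/<netid>/ ~/tigress_mount/
python -m tensorboard.main --logdir ~/tigress_mount/Graph
# ...browse to the printed URL (typically localhost:6006), then:
umount ~/tigress_mount        # or: diskutil unmount ~/tigress_mount
```
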
@@ -237,7 +243,7 @@ python performance_analysis.py
 ```
 this uses the resulting file produced as a result of training the neural network as an input, and produces several `.png` files with plots as an output.
 
-In addition, you can check the scalar variable summaries for training loss, validation loss and validation ROC logged at `/tigress/netid/csv_logs` (each run will produce a new log file with a timestamp in name).
+In addition, you can check the scalar variable summaries for training loss, validation loss and validation ROC logged at `/tigress/<netid>/csv_logs` (each run will produce a new log file with a timestamp in name).
 
 A sample code to analyze can be found in `examples/notebooks`. For instance:
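
Since each run writes a fresh timestamped file to `csv_logs`, a one-liner such as the following (a convenience, not part of the tutorial) surfaces the latest log:

```bash
# Print the name of the most recently modified log file:
ls -t /tigress/<netid>/csv_logs | head -1
```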

@@ -266,5 +272,4 @@ show(p, notebook_handle=True)
 
 ### Learning curve summaries per mini-batch
 
-To extract per mini-batch summaries, use the output produced by FRNN logged to the standard out (in case of the batch jobs, it will all be contained in the Slurm output file). Refer to the following notebook to perform the analysis of learning curve on a mini-batch level:
-https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb
+To extract per mini-batch summaries, use the output produced by FRNN logged to the standard out (in case of the batch jobs, it will all be contained in the Slurm output file). Refer to the following notebook to perform the analysis of learning curve on a mini-batch level: [FRNN_scaling.ipynb](https://github.com/PPPLDeepLearning/plasma-python/blob/master/examples/notebooks/FRNN_scaling.ipynb)
