Commit 187f54a

Merge: 2 parents 8ab983f + 6cd21ce

38 files changed: +302 −217 lines

workflows/nt3_mlrMBO/README.md

Lines changed: 52 additions & 6 deletions
@@ -24,10 +24,10 @@ the Swift script to launch a NT3 run, and to
 
 For each run of the benchmark model, the following is produced:
 
-* `run.json` - a json file containing data describing the individual run: the
+* `run.[run_id].json` - a json file containing data describing the individual run: the
 parameters for that run and per epoch details such as the validation loss. This
 file will be written to the output directory for that particular run (e.g.)
-`nt3_mlrMBO/experiments/E1/run_1_1_0/output/run.json`.
+`nt3_mlrMBO/experiments/E1/run_1_1_0/output/run.1.1.0.json`.
 
 
 ## User requirements ##
@@ -286,22 +286,68 @@ cd Supervisor/workflows/nt3_mlrMBO/ext/EQ-R/eqr
 
 Launching the workflow:
 
-Edit
-`cori_workflow3.sh` setting the relevant variables as appropriate. All easily
+1. Make a copy of `cori_workflow3.sh`
+2. Edit the copy, setting the relevant variables there
+as appropriate. All easily
 changed settings are delineated by the `USER SETTINGS START` and `USER SETTINGS END`
 markers. Note that these variables can be easily overwritten from the calling
 environment (use `export` in your shell). By default these are set up for short-ish
 debugging runs and will need to be changed for a production run.
+3. `source cori_settings.sh`
+4. Run your workflow script, passing an experiment ID.
 
 An example:
 
 ```
 cd Supervisor/workflows/nt3_mlrMBO/swift
+cp cori_workflow3.sh my_cori_workflow.sh
+# edit my_cori_workflow.sh
 source cori_settings.sh
-./cori_workflow.sh T1
+./my_cori_workflow.sh T1
 ```
 where T1 is the experiment ID.
 
 ### Running on Theta ###
 
-TODO
+* Download and install the user requirements listed at the top of this
+document.
+
+All the system requirements (see above) have been installed on Theta except
+for the EQ/R swift extension.
+
+* Compile the EQ/R swift-t extension:
+```
+cd Supervisor/workflows/nt3_mlrMBO/ext/EQ-R/eqr
+./bootstrap
+source ./theta_build_settings.sh
+./configure
+make install
+```
+
+Launching the workflow:
+
+1. Make a copy of `theta_workflow.sh`
+2. Edit the copy, setting the relevant variables there
+as appropriate. All easily
+changed settings are delineated by the `USER SETTINGS START` and `USER SETTINGS END`
+markers. Note that these variables can be easily overwritten from the calling
+environment (use `export` in your shell). By default these are set up for short-ish
+debugging runs and will need to be changed for a production run.
+3. Run your workflow script, passing an experiment ID.
+
+An example:
+
+```
+cd Supervisor/workflows/nt3_mlrMBO/swift
+cp theta_workflow.sh my_theta_workflow.sh
+# edit my_theta_workflow.sh if necessary
+./theta_workflow.sh T1
+```
+
+where T1 is the experiment ID.
+
+Note that Theta uses the _ai_ version of the workflow. The benchmark is launched
+using Supervisor/workflows/nt3_mlrMBO/scripts/theta_run_model.sh. In there, the
+`PYTHONHOME` shell variable can be changed to specify a different python installation to
+run the model with. If you do change the python installation, the python
+system requirements mentioned above will need to be satisfied.
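The `run.[run_id].json` files described in the README changes above can be inspected after a run. A minimal sketch, assuming the file holds a parameter map plus a per-epoch list containing the validation loss — the field names here (`parameters`, `epochs`, `val_loss`) are assumptions, not confirmed by the source:

```python
import json

def summarize_run(path):
    # Load one run's JSON record and report its parameters and best
    # validation loss. The "parameters"/"epochs"/"val_loss" keys are
    # guesses based on the README's description of the file contents.
    with open(path) as f:
        run = json.load(f)
    losses = [e["val_loss"] for e in run.get("epochs", [])]
    return run.get("parameters", {}), (min(losses) if losses else None)

# Hypothetical usage:
# params, best = summarize_run("experiments/E1/run_1_1_0/output/run.1.1.0.json")
```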

workflows/nt3_mlrMBO/swift/ai_workflow.sh

Lines changed: 2 additions & 1 deletion
@@ -57,7 +57,8 @@ fi
 #export TURBINE_LOG=1 TURBINE_DEBUG=1 ADLB_DEBUG=1
 
 export EXPID=$1
-export TURBINE_OUTPUT=$EMEWS_PROJECT_ROOT/experiments/$EXPID
+export TURBINE_OUTPUT_ROOT=${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}
+export TURBINE_OUTPUT=$TURBINE_OUTPUT_ROOT/$EXPID
 check_directory_exists
 
 export TURBINE_JOBNAME="${EXPID}_job"
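The `${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}` idiom introduced here uses shell default-value expansion: a value exported by the caller wins, otherwise the project's `experiments` directory is used. A minimal Python sketch of the same resolution logic (the function name is illustrative, not from the source):

```python
import os

def resolve_turbine_output(exp_id, project_root):
    # Mirrors `${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}`:
    # a caller-exported TURBINE_OUTPUT_ROOT wins; otherwise fall back
    # to the project's experiments directory. Using `or` matches the
    # shell's `:-`, which also substitutes when the variable is set
    # but empty.
    root = os.environ.get("TURBINE_OUTPUT_ROOT") or \
        os.path.join(project_root, "experiments")
    return os.path.join(root, exp_id)

# Hypothetical usage:
# resolve_turbine_output("T1", "/home/user/nt3_mlrMBO")
```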

workflows/nt3_mlrMBO/swift/cori_workflow.sh

Lines changed: 2 additions & 1 deletion
@@ -55,7 +55,8 @@ fi
 #export TURBINE_LOG=1 TURBINE_DEBUG=1 ADLB_DEBUG=1
 
 export EXPID=$1
-export TURBINE_OUTPUT=$EMEWS_PROJECT_ROOT/experiments/$EXPID
+export TURBINE_OUTPUT_ROOT=${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}
+export TURBINE_OUTPUT=$TURBINE_OUTPUT_ROOT/$EXPID
 check_directory_exists
 
 export TURBINE_JOBNAME="${EXPID}_job"

workflows/nt3_mlrMBO/swift/cori_workflow3.sh

Lines changed: 2 additions & 2 deletions
@@ -55,8 +55,8 @@ fi
 #export TURBINE_LOG=1 TURBINE_DEBUG=1 ADLB_DEBUG=1
 
 export EXPID=$1
-#export TURBINE_OUTPUT=$EMEWS_PROJECT_ROOT/experiments/$EXPID
-export TURBINE_OUTPUT=/project/projectdirs/m2759/pbalapra/experiments/$EXPID
+export TURBINE_OUTPUT_ROOT=${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}
+export TURBINE_OUTPUT=$TURBINE_OUTPUT_ROOT/$EXPID
 check_directory_exists
 
 export TURBINE_JOBNAME="${EXPID}_job"

workflows/nt3_mlrMBO/swift/theta_workflow.sh

Lines changed: 3 additions & 4 deletions
@@ -23,8 +23,7 @@ export PROCS=${PROCS:-320}
 # Cori has 32 cores per node, 128GB per node
 export PPN=${PPN:-1}
 
-#export QUEUE="default"
-export QUEUE="R.candle_res"
+export QUEUE=${QUEUE:-default}
 export WALLTIME=${WALLTIME:-05:00:00}
 
 
@@ -59,8 +58,8 @@ fi
 export TURBINE_LOG=1 TURBINE_DEBUG=1 ADLB_DEBUG=1
 
 export EXPID=$1
-export TURBINE_OUTPUT=/lus/theta-fs0/projects/Candle_ECP/experiments/$EXPID
-#export TURBINE_OUTPUT=$EMEWS_PROJECT_ROOT/experiments/$EXPID
+export TURBINE_OUTPUT_ROOT=${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}
+export TURBINE_OUTPUT=$TURBINE_OUTPUT_ROOT/$EXPID
 check_directory_exists
 
 export TURBINE_JOBNAME="${EXPID}_job"

workflows/nt3_mlrMBO/swift/workflow.sh

Lines changed: 3 additions & 2 deletions
@@ -57,7 +57,8 @@ fi
 #export TURBINE_LOG=1 TURBINE_DEBUG=1 ADLB_DEBUG=1
 
 export EXPID=$1
-export TURBINE_OUTPUT=$EMEWS_PROJECT_ROOT/experiments/$EXPID
+export TURBINE_OUTPUT_ROOT=${TURBINE_OUTPUT_ROOT:-$EMEWS_PROJECT_ROOT/experiments}
+export TURBINE_OUTPUT=$TURBINE_OUTPUT_ROOT/$EXPID
 check_directory_exists
 
 export TURBINE_JOBNAME="${EXPID}_job"
@@ -80,7 +81,7 @@ export RESIDENT_WORK_RANKS=$(( PROCS - 2 ))
 EQR=$EMEWS_PROJECT_ROOT/ext/EQ-R
 
 CMD_LINE_ARGS="$* -pp=$PROPOSE_POINTS -mi=$MAX_ITERATIONS -mb=$MAX_BUDGET -ds=$DESIGN_SIZE "
-CMD_LINE_ARGS+="-param_set_file=$PARAM_SET_FILE -model_name=$MODEL_NAME -script_file=$SCRIPT_FILE"
+CMD_LINE_ARGS+="-param_set_file=$PARAM_SET_FILE -model_name=$MODEL_NAME -script_file=$SCRIPT_FILE -exp_id=$EXPID"
 
 if [ -n "$MACHINE" ]; then
 MACHINE="-m $MACHINE"
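The hunk above appends the new `-exp_id` flag to the single argument string handed to the swift script. A sketch of that concatenation as a function, with hypothetical values (the shell does this with plain string `+=`):

```python
def build_cmd_line_args(extra, pp, mi, mb, ds,
                        param_set_file, model_name, script_file, exp_id):
    # Mirrors the CMD_LINE_ARGS assembly in workflow.sh, including
    # the newly appended -exp_id flag.
    args = "%s -pp=%s -mi=%s -mb=%s -ds=%s " % (extra, pp, mi, mb, ds)
    args += "-param_set_file=%s -model_name=%s -script_file=%s -exp_id=%s" % (
        param_set_file, model_name, script_file, exp_id)
    return args

# Hypothetical usage:
# build_cmd_line_args("", 5, 3, 100, 10, "params.json", "nt3", "run.sh", "T1")
```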
Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 {
 "parameters":
 {
-"1": [2,4,6],
-"2": [15, 25,50,75],
-"3": [2000, 1000],
-"4": [600, 400]
+"epochs": [2, 4, 8],
+"batch_size": [20, 40],
+"N1": [1000, 2000],
+"NE": [500]
 }
 }
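The renamed keys above make the grid definition self-describing: the four lists define a 3 × 2 × 2 × 1 = 12-point parameter grid. A minimal sketch of loading the settings and enumerating the grid (the inline JSON below just reproduces the new file contents):

```python
import itertools
import json

settings_json = """
{ "parameters": { "epochs": [2, 4, 8], "batch_size": [20, 40],
                  "N1": [1000, 2000], "NE": [500] } }
"""

params = json.loads(settings_json)["parameters"]
# Every combination of the four lists: 3 * 2 * 2 * 1 = 12 grid points.
grid = list(itertools.product(params["epochs"], params["batch_size"],
                              params["N1"], params["NE"]))
print(len(grid))  # 12
```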

workflows/p1b1_grid/python/computeStats.py

Lines changed: 6 additions & 0 deletions
@@ -15,6 +15,12 @@ def computeStats(swiftArrayAsString):
     for a in A:
         vals += [A[a]]
     print('%d values, with min=%f, max=%f, avg=%f\n'%(len(vals),min(vals),max(vals),sum(vals)/float(len(vals))))
+
+    filename = os.environ['TURBINE_OUTPUT'] + "/final_stats.txt"
+    # writing the summary stats to the output file
+    with open(filename, 'w') as the_file:
+        the_file.write('%d values, with min=%f, max=%f, avg=%f\n'%(len(vals),min(vals),max(vals),sum(vals)/float(len(vals))))
+
 
 
 if (len(sys.argv) < 2):
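The added lines write the same summary string that `computeStats` prints into `$TURBINE_OUTPUT/final_stats.txt`. A standalone sketch of just the formatting (the helper name is illustrative):

```python
def stats_line(vals):
    # The summary line computeStats prints and now also writes to
    # $TURBINE_OUTPUT/final_stats.txt: count, min, max, and mean.
    return '%d values, with min=%f, max=%f, avg=%f\n' % (
        len(vals), min(vals), max(vals), sum(vals) / float(len(vals)))
```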

workflows/p1b1_grid/python/determineParameters.py

Lines changed: 11 additions & 12 deletions
@@ -12,11 +12,15 @@ def loadSettings(settingsFilename):
         print("PWD is: '%s'" % os.getcwd())
         sys.exit(1)
     try:
-        params = settings['parameters']
+        epochs = settings['parameters']["epochs"]
+        batch_size = settings['parameters']["batch_size"]
+        N1 = settings['parameters']["N1"]
+        NE = settings['parameters']["NE"]
+
     except KeyError as e:
         print("Settings file (%s) does not contain key: %s" % (settingsFilename, str(e)))
         sys.exit(1)
-    return(params)
+    return (epochs, batch_size, N1, NE)
 
 def expand(Vs, fr, to, soFar):
     soFarNew = []
@@ -40,16 +44,11 @@ def expand(Vs, fr, to, soFar):
 settingsFilename = sys.argv[1]
 paramsFilename = sys.argv[2]
 
-params = loadSettings(settingsFilename)
-values = {}
-for i in range(1, len(params)+1):
-    try:
-        As = params[str(i)]
-    except:
-        print('Did not find parameter %i in settings file'%i)
-        sys.exit(1)
-    values[i] = As
-results = expand(values, 1, len(params), [''])
+epochs, batch_size, N1, NE = loadSettings(settingsFilename)
+
+values = {1: epochs, 2: batch_size, 3: N1, 4: NE}
+print(values)
+results = expand(values, 1, len(values), [''])
 result = ':'.join(results)
 
 with open(paramsFilename, 'w') as the_file:
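The recursive `expand` above turns the four value lists into one string per combination, which are then colon-joined. An equivalent sketch with `itertools.product` — the comma-joining of each combination is an assumption, since `expand`'s separator handling is not shown in this diff:

```python
import itertools

def expand_grid(values):
    # Sketch of what determineParameters produces: one comma-joined
    # string per parameter combination, colon-joined into the single
    # string written to paramsFilename. (Comma separators are an
    # assumption; the original expand()'s body is not in the diff.)
    keys = sorted(values)
    combos = itertools.product(*(values[k] for k in keys))
    return ':'.join(','.join(str(v) for v in c) for c in combos)

# Hypothetical usage:
# expand_grid({1: [2, 4], 2: [20]})
```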
Lines changed: 15 additions & 6 deletions
@@ -1,7 +1,7 @@
 import sys
 import p1b1_runner
-import json
-
+import json, os
+import socket
 
 if (len(sys.argv) < 3):
     print('requires arg1=param and arg2=filename')
@@ -12,7 +12,7 @@
 
 # print (parameterString)
 print ("filename is " + filename)
-
+print (socket.gethostname())
 
 integs = [int(x) for x in parameterString.split(',')]
 print (integs)
@@ -21,16 +21,25 @@
 hyper_parameter_map['framework'] = 'keras'
 hyper_parameter_map['batch_size'] = integs[1]
 hyper_parameter_map['dense'] = [integs[2], integs[3]]
-hyper_parameter_map['save'] = './output'
-
+hyper_parameter_map['run_id'] = parameterString
+# hyper_parameter_map['instance_directory'] = os.environ['TURBINE_OUTPUT']
+hyper_parameter_map['save'] = os.environ['TURBINE_OUTPUT'] + "/output-" + os.environ['PMI_RANK']
+sys.argv = ['p1b1_runner']
 val_loss = p1b1_runner.run(hyper_parameter_map)
 print (val_loss)
+
+sfn = os.environ['TURBINE_OUTPUT'] + "/output-" + os.environ['PMI_RANK'] + "/procname-" + parameterString
+with open(sfn, 'w') as sfile:
+    sfile.write(socket.getfqdn())
+    proc_id = "-" + str(os.getpid())
+    sfile.write(proc_id)
+
 # works around this error:
 # https://github.com/tensorflow/tensorflow/issues/3388
 from keras import backend as K
 K.clear_session()
 
-# writing the val loss to the output file
+# writing the val loss to the output file (result-*)
 with open(filename, 'w') as the_file:
     the_file.write(repr(val_loss))
 