# reproduce_experiments.py
"""
Reproduce AutoMon's experiments. The script downloads the code from https://github.com/hsivan/automon, downloads the
external datasets, runs the experiments, generates the paper's figures, and finally compiles the paper's LaTeX source
with the newly generated figures.
Run this script on Linux (Ubuntu 18.04 or later).
Estimated simulation runtimes are based on measurements on an Intel i9-7900X at 3.3GHz with 64GB RAM, running Ubuntu
18.04 with MKL 2019 Update 3.
Requirements:
(1) Python >= 3.8
(2) Docker engine (see https://docs.docker.com/engine/install/ubuntu, and make sure running
'sudo docker run hello-world' works as expected)
The script uses only libraries from the Python standard library, so no external packages need to be installed.
It installs TeX Live for the compilation of the paper's LaTeX source files.
Note: if the script is called with the --aws option, it installs the AWS CLI and configures it.
After running the simulations, it runs the distributed experiments on AWS.
It requires the user to have an AWS account; after opening the account, the user must create an AWS IAM user with
AdministratorAccess permissions. The full steps to create the IAM user:
1. In https://console.aws.amazon.com/iam/ select "Users" on the left and press "Add users".
2. Provide a user name, e.g. automon, mark the "Access key" checkbox, and press "Next: Permissions".
3. Expand the "Set permissions boundary" section and select "Use a permissions boundary to control the maximum user permissions".
Use the Filter to search for AdministratorAccess and then select AdministratorAccess from the list and press "Next: Tags".
4. There is no need to add tags, so press "Next: Review".
5. In the next page press "Create user".
6. Press "Download.csv" to download the new_user_credentials.csv file.
7. Download the project (if you haven't already) using the --dd flag, and place the new_user_credentials.csv file in
<project_root>/aws_experiments/new_user_credentials.csv.
After completing these steps, re-run this script.
Running the AWS experiments will cost a few hundred dollars!
After completion of the experiments and verification of the results, the user should clean AWS resources allocated by
this script by running the cleanup script <project_root>/aws_experiments/aws_cleanup.py.
"""
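# A minimal sketch of typical invocations (the flags below are the ones defined by the
# argparse parser at the bottom of this script; paths and environment are up to the user):
#   python3 reproduce_experiments.py          # run the simulation experiments
#   python3 reproduce_experiments.py --dd     # only download the repository and datasets, then exit
#   python3 reproduce_experiments.py --aws    # additionally run the distributed AWS experiments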
import urllib.request
import zipfile
import shutil
import os
import sys
import gzip
import subprocess
from pathlib import Path
from timeit import default_timer as timer
import datetime
import argparse
from argparse import RawTextHelpFormatter
def print_to_std_and_file(str_to_log):
print(str_to_log)
with open(log_file, 'a') as f:
test_timestamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
f.write(test_timestamp + ": " + str_to_log + "\n")
def verify_requirements():
    """
    Verifies the requirements for running the script: Python >= 3.8 and an installed Docker engine
    :return:
    """
if not (sys.version_info.major == 3 and sys.version_info.minor >= 8):
print_to_std_and_file("Running this script requires Python >= 3.8, current version is " + str(sys.version_info.major) + "." + str(sys.version_info.minor))
sys.exit(1)
    result = subprocess.run('docker version', shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # A missing docker binary yields a "not found" error on Linux shells
    if 'not found' in result.stderr.decode():
        print_to_std_and_file("Running this script requires the Docker engine. See https://docs.docker.com/engine/install.")
        sys.exit(1)
def download_repository():
"""
Checks if the script is part of AutoMon's cloned project or a standalone.
If the script is part of a cloned project there is no need to download the source code.
Otherwise, downloads AutoMon's code from GitHub.
:return: the location of the source code (the project_root)
"""
script_abs_dir = os.path.abspath(__file__).replace("reproduce_experiments.py", "")
    if os.path.isfile(script_abs_dir + "requirements.txt"):
print_to_std_and_file("The reproduce_experiments.py script is part of a cloned AutoMon project. No need to download AutoMon's source code.")
project_root = '..'
return project_root
print_to_std_and_file("The reproduce_experiments.py script is a standalone. Downloading AutoMon's source code.")
zipped_project = 'automon-main.zip'
project_root = script_abs_dir + '/' + zipped_project.replace(".zip", "")
urllib.request.urlretrieve('https://github.com/hsivan/automon/archive/refs/heads/main.zip', zipped_project)
with zipfile.ZipFile(zipped_project, 'r') as zip_ref:
zip_ref.extractall('automon-main-temp')
os.remove(zipped_project)
if os.path.isdir(project_root):
# Source code already exists. In order to get updates from the repository merge the downloaded repository to existing automon-main folder
shutil.copytree('automon-main-temp/automon-main', 'automon-main', dirs_exist_ok=True)
print_to_std_and_file("Found existing local project " + project_root + ". Merged repository updates to it.")
else:
shutil.move('automon-main-temp/automon-main', 'automon-main')
print_to_std_and_file("Downloaded the project to " + project_root)
shutil.rmtree('automon-main-temp')
return project_root
def download_air_quality_dataset(project_root):
"""
Downloads the Air Quality dataset and copies it to the dataset's folder in the project folder
:return:
"""
# Check if the dataset already exists
if os.path.isdir(project_root + '/datasets/air_quality/') and len([f for f in os.listdir(project_root + '/datasets/air_quality/') if "PRSA_Data" in f]) == 12:
return
zipped_dataset = 'PRSA2017_Data_20130301-20170228.zip'
inner_dataset_folder = "PRSA_Data_20130301-20170228"
# Download the file zipped_dataset, save it locally, extract it and copy the csv files to dataset_root
urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/00501/' + zipped_dataset, zipped_dataset)
with zipfile.ZipFile(zipped_dataset, 'r') as zip_ref:
zip_ref.extractall()
for f in os.listdir(inner_dataset_folder):
shutil.copyfile(inner_dataset_folder + "/" + f, project_root + '/datasets/air_quality/' + f)
os.remove(zipped_dataset)
shutil.rmtree(inner_dataset_folder)
def download_intrusion_detection_dataset(project_root):
"""
Downloads the Intrusion Detection dataset and copies it to the dataset's folder in the project folder
:return:
"""
gz_files = ['kddcup.data_10_percent.gz', 'corrected.gz']
target_file_names = ['kddcup.data_10_percent_corrected', 'corrected']
for i, gz_file in enumerate(gz_files):
# Check if the file already exists
if os.path.isfile(project_root + '/datasets/intrusion_detection/' + gz_file.replace(".gz", "")):
continue
urllib.request.urlretrieve('http://kdd.ics.uci.edu/databases/kddcup99/' + gz_file, gz_file)
with gzip.open(gz_file, 'rb') as f_in:
with open(project_root + '/datasets/intrusion_detection/' + target_file_names[i], 'wb') as f_out:
shutil.copyfileobj(f_in, f_out)
os.remove(gz_file)
def download_external_datasets(project_root):
"""
Downloads the Air Quality and the Intrusion Detection datasets
:param project_root:
:return:
"""
download_air_quality_dataset(project_root)
download_intrusion_detection_dataset(project_root)
def execute_shell_command(cmd, stdout_verification=None, b_stderr_verification=False):
    result = subprocess.run(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print_to_std_and_file("Executed command: " + cmd)
    print("STDOUT:")
    print(result.stdout.decode())
    print("STDERR:")
    print(result.stderr.decode())
    if stdout_verification is not None:
        if stdout_verification not in result.stdout.decode():
            print_to_std_and_file("Verification string " + stdout_verification + " not in stdout")
            raise Exception
    if b_stderr_verification:
        if result.stderr != b'':
            print_to_std_and_file("stderr is not empty: " + result.stderr.decode())
            raise Exception
def execute_shell_command_with_live_output(cmd):
"""
    Executes a shell command with live output of stdout and stderr, so long experiments are not silent :)
:param cmd: the command to execute
:return:
"""
process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
print_to_std_and_file("Executed command: " + cmd)
    for raw_line in iter(process.stdout.readline, b''):
        # Stream each output line as it arrives; the sentinel b'' is returned at EOF,
        # so no trailing output is lost when the process exits
        print(raw_line.strip().decode('utf-8'))
    rc = process.wait()
    if rc != 0:
        print_to_std_and_file("RC not 0 (RC=" + str(rc) + ") for command: " + cmd)
        raise Exception
    return rc
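# The live-output loop above can be sketched standalone. _demo_live_output is a hypothetical
# illustration (nothing in this script calls it): it streams a trivial shell command's output
# line by line and returns the collected lines together with the return code.
def _demo_live_output():
    import subprocess
    process = subprocess.Popen('echo first && echo second', shell=True,
                               stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    lines = []
    # readline returns b'' at EOF, which terminates the iterator
    for raw_line in iter(process.stdout.readline, b''):
        lines.append(raw_line.strip().decode('utf-8'))
    rc = process.wait()
    return lines, rc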
def edit_dockerignore():
"""
    Edits the .dockerignore file so it ignores only experiments/test_results instead of the whole experiments folder.
:return:
"""
with open('./.dockerignore', 'r') as f:
content = f.read()
content = content.replace("experiments", "experiments/test_results")
with open('./.dockerignore', 'w') as f:
f.write(content)
def revert_dockerignore_changes():
"""
Reverts changes made to the .dockerignore file by the edit_dockerignore() function.
:return:
"""
with open('./.dockerignore', 'r') as f:
content = f.read()
content = content.replace("experiments/test_results", "experiments")
with open('./.dockerignore', 'w') as f:
f.write(content)
def get_latest_test_folder(result_dir, filter_str):
"""
    Returns the most recently modified test folder in result_dir whose name contains filter_str.
    If no such folder exists, returns None.
:param result_dir:
:param filter_str:
:return:
"""
paths = sorted(Path(result_dir).iterdir(), key=os.path.getmtime)
if len(paths) == 0:
return None
test_folders = [str(f) for f in paths if filter_str in str(f)]
if len(test_folders) == 0:
return None
return test_folders[-1].split('/')[-1]
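# The folder-selection logic above can be sketched as a self-contained demo. The function
# below is a hypothetical illustration only (nothing in this script calls it): it creates
# throwaway folders, stamps their modification times, and, like get_latest_test_folder,
# returns the name of the most recently modified folder whose name contains the filter string.
def _demo_get_latest_test_folder():
    import os
    import shutil
    import tempfile
    from pathlib import Path
    demo_root = tempfile.mkdtemp()
    for name, mtime in [("results_test_demo_old", 100), ("results_test_demo_new", 200), ("unrelated_folder", 300)]:
        path = os.path.join(demo_root, name)
        os.mkdir(path)
        os.utime(path, (mtime, mtime))  # force a deterministic modification time
    paths = sorted(Path(demo_root).iterdir(), key=os.path.getmtime)
    test_folders = [str(f) for f in paths if "results_test_demo" in str(f)]
    latest = test_folders[-1].split('/')[-1]
    shutil.rmtree(demo_root)
    return latest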
def run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, file_to_verify_execution, args=None, estimated_runtimes=None):
"""
Receives a list of functions and runs a given experiment for each function in the list.
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:param functions: the functions included in the experiment (e.g. inner product, kld, etc.). Running an experiment for each function
:param test_name_prefix: the prefix of the experiment script
:param result_folder_prefix: the prefix of the experiment result folder
    :param file_to_verify_execution: a file whose existence indicates that the experiment has already been executed successfully
    :param args: if not None, a list of strings, one extra command-line argument per function's experiment
    :param estimated_runtimes: if not None, a list of floats, one estimated runtime in seconds per function's experiment
:return:
"""
test_folders = []
for i, function in enumerate(functions):
test_folder = get_latest_test_folder(local_result_dir, result_folder_prefix + function)
if test_folder and os.path.isfile(local_result_dir + "/" + test_folder + "/" + file_to_verify_execution):
print_to_std_and_file("Found existing test folder for " + function + ": " + test_folder + ". Skipping.")
else:
cmd = docker_run_command_prefix + test_name_prefix + function + '.py'
if args:
cmd += ' ' + args[i]
if estimated_runtimes:
print_to_std_and_file('Experiment ' + test_name_prefix + function + '.py estimated runtime (on an Intel i9-7900X at 3.3GHz with 64GB RAM) is: ' + str(estimated_runtimes[i]) + ' seconds')
start = timer()
execute_shell_command_with_live_output(cmd)
end = timer()
print_to_std_and_file('Experiment ' + test_name_prefix + function + '.py took: ' + str(end - start) + ' seconds')
test_folder = get_latest_test_folder(local_result_dir, result_folder_prefix + function)
assert test_folder
assert os.path.isfile(local_result_dir + "/" + test_folder + "/" + file_to_verify_execution)
test_folders.append(test_folder)
return test_folders
def generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, plot_script):
"""
Generates figures from the experiment results.
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:param test_folders: the output folders of the experiment
:param plot_script: the script to run that generates the figures
:return:
"""
cmd = docker_run_command_prefix + 'visualization/' + plot_script + ' ' + docker_result_dir + " " + " ".join(test_folders)
execute_shell_command_with_live_output(cmd)
def run_error_communication_tradeoff_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix):
"""
    Runs the Error-Communication Tradeoff experiment and generates Figures 5 and 6 (Sec. 4.3 in the paper).
    The result folders and figures can be found in local_result_dir (<project_root>/test_results)
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:return:
"""
print_to_std_and_file("Executing Error-Communication Tradeoff experiment")
test_name_prefix = "test_max_error_vs_communication_"
result_folder_prefix = "results_" + test_name_prefix
functions = ["inner_product", "quadratic", "dnn_intrusion_detection", "kld_air_quality"]
test_folders = run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "max_error_vs_communication.pdf", estimated_runtimes=[60, 40, 67000, 17400])
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_error_communication_tradeoff.py')
print_to_std_and_file("Successfully executed Error-Communication Tradeoff experiment")
def run_scalability_to_dimensionality_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix):
"""
    Runs the Scalability to Dimensionality experiment and generates Figure 7 (a) and additional figures (Sec. 4.4 in the paper).
    The result folders and figures can be found in local_result_dir (<project_root>/test_results)
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:return:
"""
print_to_std_and_file("Executing Scalability to Dimensionality experiment")
test_name_prefix = "test_dimension_impact_"
result_folder_prefix = "results_" + test_name_prefix
functions = ["inner_product", "kld_air_quality", "mlp"]
test_folders = run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "dimension_200/results.txt", estimated_runtimes=[60, 8300, 20600])
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_dimensions_stats.py')
print_to_std_and_file("Successfully executed Scalability to Dimensionality experiment")
def run_scalability_to_number_of_nodes_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix):
"""
    Runs the Scalability to Number of Nodes experiment and generates Figure 7 (b) (Sec. 4.4 in the paper).
    The result folders and figures can be found in local_result_dir (<project_root>/test_results)
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:return:
"""
print_to_std_and_file("Executing Scalability to Number of Nodes experiment")
test_name_prefix = "test_num_nodes_impact_"
result_folder_prefix = "results_" + test_name_prefix
functions = ["inner_product", "mlp_40"]
test_folders = run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "num_nodes_vs_communication.pdf", estimated_runtimes=[250, 8700])
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_num_nodes_impact.py')
print_to_std_and_file("Successfully executed Scalability to Number of Nodes experiment")
def run_neighborhood_size_tuning_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix):
"""
    Runs the Neighborhood Size Tuning experiment and generates Figures 3, 8, and 9 (Sec. 3.6 and 4.5 in the paper).
    The result folders and figures can be found in local_result_dir (<project_root>/test_results)
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:return:
"""
print_to_std_and_file("Executing Neighborhood Size Tuning experiment")
test_name_prefix = "test_optimal_and_tuned_neighborhood_"
result_folder_prefix = "results_optimal_and_tuned_neighborhood_"
functions = ["rozenbrock", "mlp_2"]
test_folders = run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "neighborhood_size_error_bound_connection_avg.pdf", estimated_runtimes=[15100, 24600])
test_name_prefix = "test_neighborhood_impact_on_communication_"
result_folder_prefix = "results_comm_neighborhood_"
args = [docker_result_dir + "/" + f for f in test_folders]
test_folders += run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "neighborhood_impact_on_communication_error_bound_connection.pdf", args, estimated_runtimes=[1900, 11300])
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_neighborhood_impact.py')
print_to_std_and_file("Successfully executed Neighborhood Size Tuning experiment")
def run_ablation_study_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix):
"""
    Runs the Ablation Study experiment and generates Figure 10 (Sec. 4.6 in the paper).
    The result folders and figures can be found in local_result_dir (<project_root>/test_results)
:param local_result_dir: path of the folder where the experiment's output folder is mapped to on local computer
:param docker_result_dir: path of the folder where the experiment's output folder is created on the docker
:param docker_run_command_prefix: docker run command prefix, which includes mapping of result folder and docker image name
:return:
"""
print_to_std_and_file("Executing Ablation Study experiment")
test_name_prefix = "test_ablation_study_"
result_folder_prefix = "results_ablation_study_"
functions = ["quadratic_inverse", "mlp_2"]
test_folders = run_experiment(local_result_dir, docker_run_command_prefix, functions, test_name_prefix, result_folder_prefix, "results.txt", estimated_runtimes=[20, 200])
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_monitoring_stats_ablation_study.py')
print_to_std_and_file("Successfully executed Ablation Study experiment")
def run_aws_experiment(node_type, coordinator_aws_instance_type, local_result_dir, b_centralized, estimated_runtime):
if b_centralized:
node_name = 'centralization_' + node_type
b_centralization = ' --centralized'
else:
node_name = node_type
b_centralization = ''
test_folder = get_latest_test_folder(local_result_dir, "max_error_vs_comm_" + node_name + "_aws")
if test_folder:
print_to_std_and_file("Found existing local AWS test folder for " + node_name + ": " + test_folder + ". Skipping.")
return
print_to_std_and_file('Distributed experiment ' + node_name + ' estimated runtime is: ' + str(estimated_runtime) + ' seconds')
start = timer()
    # Run the AWS deploy script inside a docker container to avoid the need to install boto3, etc. The --block flag makes the container wait until it finds the results in AWS S3.
execute_shell_command_with_live_output('sudo docker run --rm automon_aws_experiment python /app/aws_experiments/deploy_aws_experiment.py --node_type ' + node_type + ' --coordinator_aws_instance_type ' + coordinator_aws_instance_type + ' --block' + b_centralization)
# Collect the results from S3
test_timestamp = datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
test_folder = os.path.join(local_result_dir, "max_error_vs_comm_" + node_name + "_aws_" + test_timestamp)
os.makedirs(test_folder)
# This command requires AWS cli installed and configured
execute_shell_command_with_live_output('aws s3 cp s3://automon-experiment-results/max_error_vs_comm_' + node_name + '_aws ' + test_folder + ' --recursive')
end = timer()
print_to_std_and_file('Distributed experiment ' + node_name + ' took: ' + str(end - start) + ' seconds')
def generate_aws_figures(local_result_dir, docker_result_dir, docker_run_command_prefix):
# Plot figures
test_folders = [get_latest_test_folder(local_result_dir, "results_test_max_error_vs_communication_inner_product"),
get_latest_test_folder(local_result_dir, "results_test_max_error_vs_communication_quadratic"),
get_latest_test_folder(local_result_dir, "results_test_max_error_vs_communication_kld_air_quality"),
get_latest_test_folder(local_result_dir, "results_test_max_error_vs_communication_dnn_intrusion_detection"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_inner_product_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_quadratic_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_kld_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_dnn_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_centralization_inner_product_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_centralization_quadratic_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_centralization_kld_aws"),
get_latest_test_folder(local_result_dir, "max_error_vs_comm_centralization_dnn_aws")]
for test_folder in test_folders:
# Make sure that all test folders were found
assert test_folder
generate_figures(docker_result_dir, docker_run_command_prefix, test_folders, 'plot_aws_stats.py')
def install_aws_cli():
    try:
        execute_shell_command('aws --version', stdout_verification="aws-cli")
    except Exception:
        pass
    else:
        # The AWS CLI is already installed
        return
execute_shell_command_with_live_output('sudo apt-get -y install unzip')
execute_shell_command_with_live_output('curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"')
execute_shell_command_with_live_output('unzip awscliv2.zip')
execute_shell_command_with_live_output('sudo ./aws/install')
execute_shell_command('aws --version', stdout_verification="aws-cli")
def configure_aws_cli(region):
# Get the access key and secret access key from the new_user_credentials.csv file, without using pandas, boto3, etc.
with open('./aws_experiments/new_user_credentials.csv', 'r') as f:
credentials = f.read()
access_key_id = credentials.split('link')[1].split(',')[2]
secret_access_key = credentials.split('link')[1].split(',')[3]
execute_shell_command_with_live_output('aws configure set aws_access_key_id ' + access_key_id)
execute_shell_command_with_live_output('aws configure set aws_secret_access_key ' + secret_access_key)
execute_shell_command_with_live_output('aws configure set region ' + region)
execute_shell_command_with_live_output('aws configure set output json')
execute_shell_command('aws configure get region', stdout_verification=region) # Verify configuration worked
def build_and_push_docker_image_to_aws_ecr(region):
execute_shell_command_with_live_output('sudo docker build -f aws_experiments/awstest.Dockerfile -t automon_aws_experiment .')
print_to_std_and_file("Successfully built docker image for the AWS experiments")
# Get the AWS account id from the new_user_credentials.csv file, without using pandas, boto3, etc.
with open('./aws_experiments/new_user_credentials.csv', 'r') as f:
credentials = f.read()
account_id = credentials.split('https://')[1].split('.signin')[0]
# These two commands require AWS cli installed and configured
execute_shell_command_with_live_output('aws ecr describe-repositories --region ' + region + ' --repository-names automon || aws ecr create-repository --repository-name automon')
execute_shell_command_with_live_output('aws ecr get-login-password --region ' + region + ' | sudo docker login --username AWS --password-stdin ' + account_id + '.dkr.ecr.' + region + '.amazonaws.com/automon')
print_to_std_and_file("Successfully obtained ECR login password")
execute_shell_command_with_live_output('sudo docker tag automon_aws_experiment ' + account_id + '.dkr.ecr.' + region + '.amazonaws.com/automon')
print_to_std_and_file("Successfully tagged docker image")
execute_shell_command_with_live_output('sudo docker push ' + account_id + '.dkr.ecr.' + region + '.amazonaws.com/automon')
print_to_std_and_file("Successfully pushed docker image to ECR")
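# The credentials-file parsing used above can be illustrated on a sample row. _demo_parse_account_id
# is a hypothetical helper (nothing in this script calls it); the sign-in URL and keys in the sample
# are made up, mimicking the column layout of an AWS new_user_credentials.csv file, whose last
# header column is "Console login link".
def _demo_parse_account_id():
    credentials = ("User name,Password,Access key ID,Secret access key,Console login link\n"
                   "automon,,AKIAEXAMPLEKEYID,examplesecret,https://123456789012.signin.aws.amazon.com/console")
    # The account id is the host prefix of the console sign-in URL
    account_id = credentials.split('https://')[1].split('.signin')[0]
    # The access key is the third field on the line after the header (split on the header's last word)
    access_key_id = credentials.split('link')[1].split(',')[2]
    return account_id, access_key_id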
def run_aws_experiments(local_result_dir, docker_result_dir, docker_run_command_prefix):
nodes_region = 'us-east-2'
install_aws_cli()
configure_aws_cli(nodes_region)
build_and_push_docker_image_to_aws_ecr(nodes_region)
# Run AutoMon distributed experiments
run_aws_experiment('inner_product', 'ec2', local_result_dir, b_centralized=False, estimated_runtime=3500) # All 10 experiments (for every error bound) run in parallel and take about the same time, ~1 hour
run_aws_experiment('quadratic', 'ec2', local_result_dir, b_centralized=False, estimated_runtime=3500) # All 8 experiments (for every error bound) run in parallel and take about the same time, ~1 hour
run_aws_experiment('kld', 'ec2', local_result_dir, b_centralized=False, estimated_runtime=29900) # All 8 experiments (for every error bound) run in parallel and take about the same time, ~8.5 hours
run_aws_experiment('dnn', 'ec2', local_result_dir, b_centralized=False, estimated_runtime=150900) # All 6 experiments (for every error bound) run in parallel and the reported time is the max between them, ~1 day and 18 hours
# Run the distributed centralized experiments. Each run in a single experiment (no error-bound in centralization)
run_aws_experiment('inner_product', 'ec2', local_result_dir, b_centralized=True, estimated_runtime=600) # ~10 minutes
run_aws_experiment('quadratic', 'ec2', local_result_dir, b_centralized=True, estimated_runtime=600) # ~10 minutes
run_aws_experiment('kld', 'ec2', local_result_dir, b_centralized=True, estimated_runtime=900) # ~15 minutes
run_aws_experiment('dnn', 'ec2', local_result_dir, b_centralized=True, estimated_runtime=900) # ~15 minutes
# Plot figures
generate_aws_figures(local_result_dir, docker_result_dir, docker_run_command_prefix)
def compile_reproduced_main_pdf():
    """
    Installs TeX Live for the compilation of the paper's LaTeX source files and compiles the paper from the LaTeX
    source files with the new figures.
    :return:
    """
execute_shell_command_with_live_output('sudo apt-get install -y texlive-latex-base texlive-fonts-recommended texlive-fonts-extra texlive-latex-extra texlive-science')
    # Copy figures from local_result_dir to project_root/docs/latex_src/figures and report any missing figures
figure_list = {"Figure 3": "impact_of_neighborhood_on_violations_three_error_bounds.pdf",
"Figure 4": "function_values_and_error_bound_around_it.pdf",
"Figure 5": "max_error_vs_communication.pdf",
"Figure 6": "percent_error_kld_and_dnn.pdf",
"Figure 7 (a)": "dimension_communication.pdf",
"Figure 7 (b)": "num_nodes_vs_communication.pdf",
"Figure 8": "neighborhood_impact_on_communication_error_bound_connection.pdf",
"Figure 9 (a)": "monitoring_stats_quadratic_inverse.pdf",
"Figure 9 (b)": "monitoring_stats_barchart_mlp_2.pdf",
"Figure 10 (top)": "max_error_vs_transfer_volume.pdf",
"Figure 10 (bottom)": "communication_automon_vs_network.pdf"}
for figure_name, figure_file in figure_list.items():
if not os.path.isfile(local_result_dir + "/" + figure_file):
print_to_std_and_file("Note: " + figure_name + " (" + figure_file + ") wasn't found in " + local_result_dir + ". Using the original figure from the paper.")
else:
shutil.copyfile(local_result_dir + "/" + figure_file, project_root + "/docs/latex_src/figures/" + figure_file)
print_to_std_and_file("Replaced " + figure_name + " (" + figure_file + ") with the reproduced one.")
os.chdir(project_root + "/docs/latex_src")
execute_shell_command('pdflatex --interaction=nonstopmode main.tex')
execute_shell_command('bibtex main.aux')
execute_shell_command('pdflatex --interaction=nonstopmode main.tex')
execute_shell_command('pdflatex --interaction=nonstopmode main.tex', stdout_verification="main.pdf")
shutil.copyfile("main.pdf", reproduced_main_pfd_file)
print_to_std_and_file("The reproduced paper, containing the new figures based on these experiments, is in " + reproduced_main_pfd_file)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Reproduce AutoMon's experiments. The script downloads the code from https://github.com/hsivan/automon, downloads the external datasets, and runs the experiments.\n"
"Run this script on Linux (Ubuntu 18.04 or later).\n\n"
"Requirements:\n"
"(1) Python >= 3.8\n"
"(2) Docker engine (see https://docs.docker.com/engine/install/ubuntu, and make sure running 'sudo docker run hello-world' works as expected)\n\n"
                                                 "The script uses only libraries from the Python standard library, so no external packages need to be installed.", formatter_class=RawTextHelpFormatter)
parser.add_argument("--dd", dest="b_download_dataset", help="if --dd is specified, the script only downloads the repository and external datasets and exits (without running any experiments)", action='store_true')
parser.add_argument("--aws", dest="b_aws_experiments", help="if --aws is specified, also run AWS experiments (in addition to the simulation experiments).\n"
"Note: in case the script is called with the --aws options it installs AWS cli and configures it. After running\n"
"the simulations, it runs the distributed experiments on AWS.\n"
"It requires the user to have an AWS account; after opening the account, the user must create AWS IAM user with\n"
"AdministratorAccess permissions. The full steps to create the IAM user:\n"
"\t 1. In https://console.aws.amazon.com/iam/ select 'Users' on the left and press 'Add users'. \n"
"\t 2. Provide user name, e.g. automon, and mark the 'Access key' checkbox and press 'Next: Permissions'. \n"
"\t 3. Expand the 'Set permissions boundary' section and select 'Use a permissions boundary to control the maximum user permissions'. \n"
"\t Use the Filter to search for AdministratorAccess and then select AdministratorAccess from the list and press 'Next: Tags'. \n"
"\t 4. No need to add tags so press 'Next: Review'. \n"
"\t 5. In the next page press 'Create user'. \n"
"\t 6. Press 'Download.csv' to download the new_user_credentials.csv file. \n"
"\t 7. Download the project (if you haven't already) using the --dd flag, and place the new_user_credentials.csv file in \n"
"\t <project_root>/aws_experiments/new_user_credentials.csv. \n"
"After completing these steps, re-run this script.\n"
"Running the AWS experiments would cost a few hundred dollars!\n"
"After completion of the experiments and verification of the results, the user should clean AWS resources allocated by\n"
"this script by running the cleanup script <project_root>/aws_experiments/aws_cleanup.py.", action='store_true')
args = parser.parse_args()
log_file = os.path.abspath(__file__).replace("reproduce_experiments.py", "reproduce_experiments.log")
reproduced_main_pfd_file = os.path.abspath(__file__).replace("reproduce_experiments.py", "main.pdf")
    print_to_std_and_file("======================== Reproduce AutoMon's Experiments ========================")
    print("The script log is at", log_file)
    verify_requirements()
    project_root = download_repository()
    download_external_datasets(project_root)
    print_to_std_and_file("Downloaded external datasets")
    if args.b_download_dataset:
        sys.exit()
    os.chdir(project_root)
    if args.b_aws_experiments:
        # These checks can run only here, after the source code has been downloaded.
        # Verify that new_user_credentials.csv exists.
        if not os.path.isfile('aws_experiments/new_user_credentials.csv'):
            print_to_std_and_file("To run the AWS experiments, you must have an AWS account. After opening the account, create an AWS IAM user with "
                                  "AdministratorAccess permissions and download the csv file new_user_credentials.csv that contains the key ID and the secret key. "
                                  "Download the project (if you haven't already) using the --dd flag, place the new_user_credentials.csv file "
                                  "in " + project_root + "/aws_experiments/new_user_credentials.csv, and re-run this script.\n"
                                  "Note: the AWS CLI will be installed and configured on your computer!")
            sys.exit(1)
    try:
        # Build the docker image with a .dockerignore that does not ignore the experiments folder
        edit_dockerignore()
        execute_shell_command_with_live_output('sudo docker build -f experiments/experiments.Dockerfile -t automon_experiment .')
        print_to_std_and_file("Successfully built the docker image for the experiments")
        # Run the experiments
        docker_result_dir = '/app/experiments/test_results'
        local_result_dir = os.getcwd() + "/test_results"
        try:
            os.makedirs(local_result_dir)
        except FileExistsError:
            pass
        docker_run_command_prefix = 'sudo docker run -v ' + local_result_dir + ':' + docker_result_dir + ' --rm automon_experiment python /app/experiments/'
        print_to_std_and_file("Experiment results are written to: " + local_result_dir)
        # Plot Figure 4
        generate_figures(docker_result_dir, docker_run_command_prefix, [], 'plot_function_values_and_error_bound.py')
        # Estimated total runtime of the simulation experiments: ~2 days and 3 hours
        run_error_communication_tradeoff_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix)  # Estimated runtime: ~24 hours
        run_scalability_to_dimensionality_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix)  # Estimated runtime: ~8 hours
        run_scalability_to_number_of_nodes_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix)  # Estimated runtime: ~2.5 hours
        run_neighborhood_size_tuning_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix)  # Estimated runtime: ~15 hours
        run_ablation_study_experiment(local_result_dir, docker_result_dir, docker_run_command_prefix)  # Estimated runtime: ~4 minutes
        # Estimated total runtime of the AWS experiments: ~2 days and 5 hours
        if args.b_aws_experiments:
            run_aws_experiments(local_result_dir, docker_result_dir, docker_run_command_prefix)  # See runtime estimation per experiment in run_aws_experiments()
        # Build the paper from the LaTeX source files with the new figures
        compile_reproduced_main_pdf()
    finally:
        os.chdir(project_root)
        revert_dockerignore_changes()
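
# Example invocations of this script (a sketch, assuming it is run on Ubuntu 18.04 or later
# with Python >= 3.8 and the Docker engine installed, as the help text above requires):
#
#   python3 reproduce_experiments.py --dd     # only download the repository and external datasets, then exit
#   python3 reproduce_experiments.py          # run the simulation experiments (roughly 2 days)
#   python3 reproduce_experiments.py --aws    # also run the distributed AWS experiments (costs real money)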