(About 55 min)
(About 30 min)
If you are provided with an AWS IAM account and pre-built binaries:
- If you just want to review figures & raw experimental data, see cluster-config-access-results-only.
- If you also want to reproduce all results from the beginning, see cluster-config-with-ami for setting up a cluster.
If you are not provided with an AWS account or you want to build everything from scratch, see cluster-config.
(About 15 min)
After logging in to the configured cluster, change to this directory in the hoplite repo.
In the current directory, run:
./parameter-server/run_async_ps_tests.sh
After the script completes, results are saved under ps-log.
To visualize the results, run:
python plot_async_ps_results.py
This generates two PDF files: async_training_8.pdf corresponds to Figure 9(a), and async_training_16.pdf corresponds to Figure 9(b).
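Before downloading anything, it can help to confirm that both figures were actually written. The following is a minimal sketch, not part of the hoplite repo: the helper name `check_pdfs` is hypothetical, and it assumes the PDFs are written to the directory from which the plotting script was run. It only checks that each file exists and starts with the standard `%PDF-` header; it does not validate the plot contents.

```python
from pathlib import Path

# File-to-figure mapping taken from the instructions above.
EXPECTED_PDFS = {
    "async_training_8.pdf": "Figure 9(a)",
    "async_training_16.pdf": "Figure 9(b)",
}


def check_pdfs(directory="."):
    """Return {filename: "ok" | "missing" | "not a PDF"} for each expected file.

    Hypothetical helper: the output directory is an assumption.
    """
    status = {}
    for name in EXPECTED_PDFS:
        path = Path(directory) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        # Every PDF file begins with the bytes "%PDF-".
        with open(path, "rb") as f:
            header = f.read(5)
        status[name] = "ok" if header == b"%PDF-" else "not a PDF"
    return status


if __name__ == "__main__":
    for name, state in check_pdfs().items():
        print(f"{name} ({EXPECTED_PDFS[name]}): {state}")
```

If either file is reported as missing, rerun the plotting script and check its output for errors before downloading.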
You can download PDF files to your local machine using Ray cluster utils, for example:
ray rsync-down cluster.yaml /home/ubuntu/efs/hoplite/app/parameter-server/async_training_8.pdf .
(About 10 min)
After logging in to the configured cluster, change to this directory in the hoplite repo.
In the current directory, run:
./run_async_ps_fault_tolerance.sh
The script generates ray_asgd_fault_tolerance.json and hoplite_asgd_fault_tolerance.json after running.
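A quick sanity check on the two result files can catch an interrupted run before the analysis step. This is a minimal sketch under stated assumptions: the helper name `load_results` is hypothetical, the files are assumed to be in the current directory, and nothing is assumed about their internal structure beyond being valid JSON.

```python
import json
from pathlib import Path

# File names taken from the instructions above.
RESULT_FILES = [
    "ray_asgd_fault_tolerance.json",
    "hoplite_asgd_fault_tolerance.json",
]


def load_results(directory="."):
    """Parse each result file and return {filename: parsed JSON}.

    Hypothetical helper. Raises OSError if a file is missing and
    json.JSONDecodeError if a file is not valid JSON.
    """
    results = {}
    for name in RESULT_FILES:
        with open(Path(directory) / name) as f:
            results[name] = json.load(f)
    return results


if __name__ == "__main__":
    try:
        for name, data in load_results().items():
            print(f"{name}: parsed OK ({type(data).__name__})")
    except (OSError, json.JSONDecodeError) as err:
        print(f"Result check failed: {err}")
```

If the check fails, rerunning the fault tolerance script is likely the simplest fix.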
Run python analyze_fault_tolerance.py to compare the failure detection latency (see Section 5.5 of the paper).
The first run is very slow on AWS (about 5 min), in part because Python generates cache files. This is expected.