# Artifact for SC '21


This repository contains the artifact for the SC '21 paper "*Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters*". It includes the following four parts:

+ `enviornment`: The experimental environment described in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `data`: Helios traces downloaded from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData).

+ `analysis`: Scripts for analyzing the traces.

+ `framework`: `QSSF Service` and `CES Service` scripts.



> **Note that only the `Venus` trace is publicly available now. The other traces are still under review, and we will release them as soon as possible.**

## Detailed Introduction

### `enviornment`
Provides details on the experimental environment, as described in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `collect_environment.sh`: Gathers execution environment information for the GPU compute node and the analysis platform (see the sketch after the summary table).

+ `env_analysis_platform`: Execution environment information for the trace analysis platform.

+ `env_datacenter_node`: Execution environment information for a GPU compute node in our datacenter (from the Volta Cluster).

+ ***Summary***

  |         | Analysis Platform   | Datacenter Node          |
  | ------- | ------------------- | ------------------------ |
  | System  | Ubuntu 20.04 LTS    | CentOS 7.4               |
  | CPU     | Intel Core i9-10900 | 2 x Intel Xeon Gold 6146 |
  | Memory  | 32GB DDR4           | 376GB DDR4               |
  | GPU     | GeForce RTX 2080 Ti | 8 x Tesla V100-SXM2      |
  | Network | Ethernet            | InfiniBand EDR           |

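The commands below are an illustrative sketch of the kind of information such a collection script typically gathers; they are not the contents of the actual `collect_environment.sh`.

```bash
# Illustrative sketch only -- not the actual collect_environment.sh.
uname -a                # kernel and OS
cat /etc/os-release     # distribution and version
lscpu                   # CPU model and core count
free -h                 # memory capacity
nvidia-smi              # GPU model, driver, and CUDA version
pip list                # installed Python packages
```
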
### `data`
Initially, this folder does ***NOT*** exist. You need to download and unzip the dataset from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). After that, the folder structure should be:


```
📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
   ┣ 📜cluster_gpu_number.csv
   ┗ 📜cluster_log.csv
```

> **Note that only the `Venus` trace is publicly available now.**
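Once the dataset is unpacked, a quick sanity check is to load one of the trace files with pandas. This is a minimal sketch that only assumes the file layout shown above; it does not rely on any particular column names.

```python
import pandas as pd

# Load the publicly available Venus trace.
log = pd.read_csv("data/Venus/cluster_log.csv")
gpu_number = pd.read_csv("data/Venus/cluster_gpu_number.csv")

print(log.shape)                # number of job records and columns
print(log.columns.tolist())     # inspect the schema before analysis
print(gpu_number.head())
```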

### `analysis`
Contains parsing and plotting code to analyze the traces.

+ **compare with Philly trace**: Figure 1: Comparisons of job characteristics between Helios and Philly.

+ **cluster characterization**: Figure 2: Daily pattern of the cluster usage in Helios.

  Figure 3: Monthly trends of cluster activities in Helios.

  Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size).

+ **job characterization**: Figure 5: CDFs of GPU (a) and CPU (b) job duration (see the sketch after this list).

  Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b).

  Figure 7: Distribution of jobs by their final statuses.

+ **user characterization**: Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU time and (b) CPU time.

  Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios.

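As a reference for the style of analysis in these notebooks, the sketch below computes an empirical CDF of job duration in the spirit of Figure 5. The column name `duration` is a hypothetical placeholder; check the schema of the parsed trace produced by `trace_parser.py` for the actual field names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/Venus/cluster_log.csv")

# Hypothetical column name; replace with the actual duration field in the parsed trace.
durations = df["duration"].dropna().sort_values().to_numpy()
cdf = np.arange(1, len(durations) + 1) / len(durations)

plt.plot(durations, cdf)
plt.xscale("log")
plt.xlabel("Job duration (s)")
plt.ylabel("CDF")
plt.savefig("duration_cdf.png")
```
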
### `framework`
A prediction-based GPU resource management framework.

This folder contains the `QSSF Service` and `CES Service` scripts and related data.



## Quick Start
These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the ***step-by-step*** instructions for the artifact.
### Preparing

1. Download the Helios artifact and data repositories.
    ```bash
    git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
    cd HeliosArtifact

    git clone git@github.com:S-Lab-System-Group/HeliosData.git
    mv ./HeliosData/data ./
    ```

2. Check the software dependencies:

    For the `analysis` part, JupyterLab / Jupyter Notebook is needed.

    The other Python libraries used in this project are listed in `requirements.txt`.

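    One typical way to set them up (a sketch assuming `pip`; a conda environment works just as well):

    ```bash
    pip install -r requirements.txt
    pip install jupyterlab   # only if JupyterLab is not already installed
    ```
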

### Reproducing `analysis`

3. Prepare and parse the trace files for analysis.
    ```bash
    cd analysis
    python ./trace_parser.py --cluster-list 'Venus'
    ```
4. After generating all required data, you can analyze the traces through the `.ipynb` files within the 4 sub-folders of `analysis`: **1_compare with Philly trace**, **2_cluster characterization**, **3_job characterization**, **4_user characterization**.

    These Jupyter Notebook scripts generate the figures for the trace analysis part of the paper.
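    For example, one way to open the notebooks (assuming JupyterLab is installed):

    ```bash
    # From the repository root:
    cd analysis
    jupyter lab   # then open the .ipynb files in the four sub-folders
    ```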

> **Note that only the `Venus` trace is publicly available now. Thus, some generated figures are incomplete compared with the paper versions.**

### Reproducing `framework`


#### `QSSF Service`

5. Before executing the QSSF service simulation, data preparation is needed.

    It generates the VC configurations and job traces for each cluster.

    ```bash
    cd framework/QSSF\ Service/data
    bash prepare_data.sh
    ```


6. Then, you can run all scheduling policies on the **Philly** trace in `sweep` mode, as shown below:

    ```bash
    cd ..
    python simulator.py -e='Philly' -t='./data/Philly' --sweep
    ```

    See `run.sh` for more usage examples on **Helios**. Note that since we do not release job name information, the `estimator` and `qssf` policy are not available for **Helios**.

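    An analogous invocation for a Helios cluster might look like the sketch below; the experiment name and trace path are assumptions that mirror the Philly example, so verify them against `run.sh` before running.

    ```bash
    # Assumed invocation pattern -- verify against run.sh before use.
    python simulator.py -e='Venus' -t='./data/Venus' --sweep
    ```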


7. After the program finishes, you can check the results in the `log` folder. The job log and the time sequence of each VC are provided separately.

8. Besides, we provide simulation analysis and plotting scripts in `plot`.

    You can generate Figure 13 of the paper with these scripts.

#### `CES Service`

9. Run the CES simulation on **Helios**:

    ```bash
    cd framework/CES\ Service
    python CES_Helios.py
    ```

    You can specify a different cluster in the script and adjust the CES service configurations through the `hyperparameter` function.


10. Similarly, run the CES simulation on **Philly**:

    ```bash
    python CES_Philly.py
    ```

11. From the code output and the generated figures `helios_ces` (Figure 14) & `philly_ces` (Figure 15), we can analyze the CES service performance in detail.