Commit 6345d9f

initial commit

78 files changed: +3048770 −0 lines changed
LICENSE.txt

Lines changed: 21 additions & 0 deletions

MIT License

Copyright (c) 2021-present NTU S-Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 179 additions & 0 deletions
# Artifact for SC '21

This repository contains the artifact for the SC '21 paper "*Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters*". It includes the following four parts:

+ `enviornment`: The experimental environment described in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `data`: Helios traces downloaded from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData).

+ `analysis`: Scripts for analyzing the traces.

+ `framework`: The `QSSF Service` and `CES Service` scripts.

> **Note that only the `Venus` trace is publicly available now. The other traces are still being reviewed for release; we will release them as soon as possible.**
## Detailed Introduction

### `enviornment`

Provides details on the experimental environment, as shown in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `collect_environment.sh`: Gathers execution environment information for a GPU compute node and the analysis platform.

+ `env_analysis_platform`: Execution environment information for the trace analysis platform.

+ `env_datacenter_node`: Execution environment information for a GPU compute node in our datacenter (from the Volta cluster).

+ ***Summary***

|         | Analysis Platform   | Datacenter Node          |
| ------- | ------------------- | ------------------------ |
| System  | Ubuntu 20.04 LTS    | CentOS 7.4               |
| CPU     | Intel Core i9-10900 | 2 x Intel Xeon Gold 6146 |
| Memory  | 32GB DDR4           | 376GB DDR4               |
| GPU     | GeForce RTX 2080 Ti | 8 x Tesla V100-SXM2      |
| Network | Ethernet            | InfiniBand EDR           |
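The kind of information `collect_environment.sh` records can also be gathered from Python; the sketch below is an illustration of the idea, not the artifact's actual script (its exact output format is an assumption):

```python
import platform
import shutil
import subprocess

def environment_summary():
    """Collect basic software/hardware info, similar in spirit to
    collect_environment.sh (illustrative sketch, not the artifact's script)."""
    info = {
        "system": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    # nvidia-smi is only present on GPU nodes; record GPUs when available.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        info["gpus"] = out.stdout.strip().splitlines()
    return info
```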
### `data`

Initially, this folder does ***NOT*** exist. You need to download and unzip the dataset from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). After that, the folder structure should be:

```
📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
```

> **Note that only the `Venus` trace is publicly available now.**
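The layout above can be verified programmatically before running the analysis; a minimal sketch (the helper name `check_data_layout` is ours, not part of the artifact):

```python
from pathlib import Path

# Expected layout after unzipping HeliosData. Only Venus is public now,
# so the other cluster folders may legitimately be absent.
CLUSTERS = ["Earth", "Saturn", "Uranus", "Venus"]
FILES = ["cluster_gpu_number.csv", "cluster_log.csv"]

def check_data_layout(root="data"):
    """Return the clusters whose trace files are all present under root."""
    found = []
    for cluster in CLUSTERS:
        if all((Path(root) / cluster / f).is_file() for f in FILES):
            found.append(cluster)
    return found
```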
### `analysis`

Contains the parsing and plotting code used to analyze the traces.

+ **compare with Philly trace**: Figure 1: Comparisons of job characteristics between Helios and Philly.

+ **cluster characterization**: Figure 2: Daily pattern of the cluster usage in Helios.

  Figure 3: Monthly trends of cluster activities in Helios.

  Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size).

+ **job characterization**: Figure 5: CDF of GPU (a) and CPU (b) job duration.

  Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b).

  Figure 7: Distribution of jobs by their final statuses.

+ **user characterization**: Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU Time and (b) CPU Time.

  Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios.
### `framework`

A prediction-based GPU resource management framework.

This folder contains the `QSSF Service` and `CES Service` scripts and related data.
## Quick Start

These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the ***step-by-step*** instructions for the artifact.

### Preparing

1. Download the Helios artifact and data repositories.

   ```bash
   git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
   cd HeliosArtifact

   git clone git@github.com:S-Lab-System-Group/HeliosData.git
   mv ./HeliosData/data ./
   ```

2. Check software dependencies:

   For the `analysis` part, JupyterLab / Jupyter Notebook is needed.

   For the other Python libraries used in this project, see `requirements.txt`.
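One quick way to check dependencies is to probe whether the libraries are importable; a minimal sketch (the library list here is an assumption based on a typical analysis stack — `requirements.txt` is the authoritative list):

```python
import importlib.util

# Assumed core libraries for the analysis notebooks; consult
# requirements.txt in the repo for the authoritative list.
LIBS = ["pandas", "numpy", "matplotlib"]

def missing_libraries(libs=LIBS):
    """Return the subset of libs that cannot be imported."""
    return [m for m in libs if importlib.util.find_spec(m) is None]
```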
### Reproducing `analysis`

3. Prepare and parse the trace files for analysis.

   ```bash
   cd analysis
   python ./trace_parser.py --cluster-list 'Venus'
   ```

4. After generating all required data, you can analyze the traces through the `.ipynb` files within the 4 sub-folders of `analysis`: **1_compare with Philly trace**, **2_cluster characterization**, **3_job characterization**, **4_user characterization**.

   These Jupyter Notebook scripts generate the figures for the trace analysis part of the paper.

   > **Note that only the `Venus` trace is publicly available now. Thus, some generated figures are incomplete compared with the paper version.**
### Reproducing `framework`

#### `QSSF Service`

5. Before executing the simulation of the QSSF service, data preparation is needed. It generates the VC configuration and job trace for each cluster.

   ```bash
   cd framework/QSSF\ Service/data
   bash prepare_data.sh
   ```

6. Then, you can run all scheduling policies on the **Philly** trace in `sweep` mode, as below:

   ```bash
   cd ..
   python simulator.py -e='Philly' -t='./data/Philly' --sweep
   ```

   See `run.sh` for more usage examples on **Helios**. Note that since we do not release job name information, the `estimator` and `qssf` policy are not available for **Helios**.

7. After the program finishes, you can check the results in the `log` folder. The job log and time sequence of each VC are provided separately.

8. Besides, we provide a simulation analysis and plot script in `plot`. You can generate Figure 13 in the paper with this script.
#### `CES Service`

9. Run the CES simulation on **Helios**:

   ```bash
   cd framework/CES\ Service
   python CES_Helios.py
   ```

   You can specify a different cluster in the script and adjust the configurations of the CES service via the `hyperparameter` function.

10. Similarly, run the CES simulation on **Philly**:

    ```bash
    python CES_Philly.py
    ```

11. From the code output and the generated figures `helios_ces` (Figure 14) & `philly_ces` (Figure 15), we can analyze the CES service performance in detail.
Lines changed: 11 additions & 0 deletions
+ `philly_trace.csv`

  It is used for comparison with our datacenter workloads.

  We convert the original Philly trace file into `.csv` format and select the same period of job logs as described in ["Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads"](https://www.usenix.org/system/files/atc19-jeon.pdf) (ATC '19).

  The official public data can be downloaded from [philly-traces](https://github.com/msr-fiddle/philly-traces).

+ `philly_trace_B.csv`

  In Philly, failed jobs are retried a fixed number of times. If we process the Philly trace by treating each attempt as an individual job, we obtain `philly_trace_B.csv`.
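The difference between the two files can be illustrated with a toy example (hypothetical records, not the real Philly schema): treating each retry attempt as its own job inflates the job count.

```python
# Toy attempt records: (job_id, attempt_index, status). Hypothetical data —
# the real Philly trace has a much richer schema.
attempts = [
    ("j1", 0, "Failed"), ("j1", 1, "Failed"), ("j1", 2, "Pass"),
    ("j2", 0, "Pass"),
]

# philly_trace.csv style: one row per job (attempts collapsed).
jobs = {jid for jid, _, _ in attempts}

# philly_trace_B.csv style: one row per attempt.
attempt_jobs = len(attempts)

assert len(jobs) == 2 and attempt_jobs == 4
```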

analysis/1_compare with Philly trace/compare_with_Philly_trace.ipynb

Lines changed: 227 additions & 0 deletions
Lines changed: 6 additions & 0 deletions
```
id,gpu_num,job_num,cpu_job_num,gpu_job_num,avg_run_time_cpu,avg_run_time_gpu,avg_que_time_cpu,avg_que_time_gpu,avg_gpu_num,med_run_time_cpu,med_run_time_gpu,med_que_time_cpu,med_que_time_gpu,med_gpu_num,complete_rate,cancel_rate,fail_rate,complete_rate_cpu,cancel_rate_cpu,fail_rate_cpu,complete_rate_gpu,cancel_rate_gpu,fail_rate_gpu,complete_gpu_time,cancel_gpu_time,fail_gpu_time
Venus,1022,246708,121405,125303,1649.859,13040.598,773.009,1253.288,6.736,21.0,204.0,0.0,0.0,1.0,0.69,0.186,0.124,0.86,0.097,0.043,0.526,0.272,0.202,6373315670.0,4920266598.0,1052030340.0
Earth,997,872886,445738,427148,162.73,5130.609,3.281,319.483,2.101,1.0,234.0,0.0,0.0,1.0,0.812,0.069,0.119,0.885,0.005,0.11,0.735,0.136,0.129,5713038873.0,4793853520.0,932310362.0
Saturn,2080,1753078,1054182,698896,619.062,5252.927,16.255,611.561,4.01,2.0,124.0,0.0,0.0,1.0,0.799,0.12,0.08,0.943,0.019,0.037,0.582,0.272,0.145,14375709962.0,10733792235.0,2542859066.0
Uranus,2119,490309,161192,329117,1211.799,9163.725,119.704,1949.182,4.038,32.0,280.0,0.0,0.0,1.0,0.668,0.173,0.159,0.791,0.116,0.092,0.607,0.201,0.192,13532231815.0,10267196772.0,2761134934.0
total,6220,3362981,1782517,1580464,628.759,6651.681,73.907,862.047,3.716,2.0,206.0,0.0,0.0,1.0,0.775,0.119,0.105,0.909,0.03,0.061,0.624,0.221,0.155,39994296320.0,30715109125.0,7288334702.0
```
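This per-cluster summary (one row per cluster plus a `total` row) can be loaded programmatically; a minimal stdlib-only sketch, with a truncated copy of the data embedded so it is self-contained (in the repo, open the actual CSV file instead):

```python
import csv
import io

# Truncated copy of the per-cluster summary above (a few columns only);
# in the repo, read the real CSV file rather than this embedded string.
SUMMARY_CSV = """id,gpu_num,job_num,avg_run_time_gpu
Venus,1022,246708,13040.598
Earth,997,872886,5130.609
Saturn,2080,1753078,5252.927
Uranus,2119,490309,9163.725
"""

def load_summary(text=SUMMARY_CSV):
    """Index the summary rows by cluster id, converting numeric fields."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        rows[row["id"]] = {
            "gpu_num": int(row["gpu_num"]),
            "job_num": int(row["job_num"]),
            "avg_run_time_gpu": float(row["avg_run_time_gpu"]),
        }
    return rows

summary = load_summary()
```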
