Commit 6345d9f

initial commit

78 files changed: +3048770 −0 lines changed
LICENSE.txt

Lines changed: 21 additions & 0 deletions

MIT License

Copyright (c) 2021-present NTU S-Lab

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md

Lines changed: 179 additions & 0 deletions
# Artifact for SC '21

This repository contains the artifact for the SC '21 paper "*Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters*". It includes the following four parts:

+ `enviornment`: The experimental environment described in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `data`: Helios traces downloaded from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData).

+ `analysis`: Scripts for analyzing the traces.

+ `framework`: The `QSSF Service` and `CES Service` scripts.

> **Note that only the `Venus` trace is publicly available now. The other traces are still being reviewed for release; we will release them as soon as possible.**
## Detailed Introduction

### `enviornment`

Provides details on the experimental environment, as shown in ***Appendix: Artifact Description/Artifact Evaluation***.

+ `collect_environment.sh`: Gathers execution environment information for a GPU compute node and the analysis platform.

+ `env_analysis_platform`: Execution environment information for the trace analysis platform.

+ `env_datacenter_node`: Execution environment information for a GPU compute node in our datacenter (from the Volta cluster).

+ ***Summary***

|         | Analysis Platform   | Datacenter Node          |
| ------- | ------------------- | ------------------------ |
| System  | Ubuntu 20.04 LTS    | CentOS 7.4               |
| CPU     | Intel Core i9-10900 | 2 x Intel Xeon Gold 6146 |
| Memory  | 32GB DDR4           | 376GB DDR4               |
| GPU     | GeForce RTX 2080 Ti | 8 x Tesla V100-SXM2      |
| Network | Ethernet            | InfiniBand EDR           |
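The kind of information `collect_environment.sh` records can also be gathered from Python; the sketch below is an illustration of the idea, not the artifact's actual script (its exact output format is an assumption):

```python
import platform
import shutil
import subprocess

def environment_summary():
    """Collect basic software/hardware info, similar in spirit to
    collect_environment.sh (illustrative sketch, not the artifact's script)."""
    info = {
        "system": f"{platform.system()} {platform.release()}",
        "python": platform.python_version(),
        "machine": platform.machine(),
    }
    # nvidia-smi is only present on GPU nodes; record GPUs when available.
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        info["gpus"] = out.stdout.strip().splitlines()
    return info
```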
### `data`

Initially, this folder does ***NOT*** exist. You need to download and unzip the dataset from [HeliosData](https://github.com/S-Lab-System-Group/HeliosData). After that, the folder structure should be:

```
📦data
 ┣ 📂Earth
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Saturn
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┣ 📂Uranus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
 ┗ 📂Venus
 ┃ ┣ 📜cluster_gpu_number.csv
 ┃ ┗ 📜cluster_log.csv
```

> **Note that only the `Venus` trace is publicly available now.**
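The layout above can be verified programmatically before running the analysis; a minimal sketch (the helper name `check_data_layout` is ours, not part of the artifact):

```python
from pathlib import Path

# Expected layout after unzipping HeliosData. Only Venus is public now,
# so the other cluster folders may legitimately be absent.
CLUSTERS = ["Earth", "Saturn", "Uranus", "Venus"]
FILES = ["cluster_gpu_number.csv", "cluster_log.csv"]

def check_data_layout(root="data"):
    """Return the clusters whose trace files are all present under root."""
    found = []
    for cluster in CLUSTERS:
        if all((Path(root) / cluster / f).is_file() for f in FILES):
            found.append(cluster)
    return found
```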
### `analysis`

Contains the parsing and plotting code used to analyze the traces.

+ **compare with Philly trace**: Figure 1: Comparisons of job characteristics between Helios and Philly.

+ **cluster characterization**: Figure 2: Daily pattern of the cluster usage in Helios.

  Figure 3: Monthly trends of cluster activities in Helios.

  Figure 4: The boxplot of utilization distributions for the top 10 largest VCs of Earth in May (sorted by size).

+ **job characterization**: Figure 5: CDF of GPU (a) and CPU (b) job duration.

  Figure 6: The CDFs of job sizes (in GPU number) with the number of jobs (a) and GPU time (b).

  Figure 7: Distribution of jobs by their final statuses.

+ **user characterization**: Figure 8: The CDFs of users that consume the cluster resources in terms of (a) GPU Time and (b) CPU Time.

  Figure 9: (a) CDFs of users w.r.t. GPU job queuing delay. (b) Distributions of user GPU job completion ratios.
### `framework`

A prediction-based GPU resource management framework.

This folder contains the `QSSF Service` and `CES Service` scripts and related data.
## Quick Start

These scripts have been tested on Ubuntu 20.04 with Python 3.8 (on the analysis platform).

Here are the ***step-by-step*** instructions for the artifact.

### Preparing

1. Download the Helios artifact and data repositories.

   ```bash
   git clone git@github.com:S-Lab-System-Group/HeliosArtifact.git
   cd HeliosArtifact

   git clone git@github.com:S-Lab-System-Group/HeliosData.git
   mv ./HeliosData/data ./
   ```

2. Check software dependencies:

   For the `analysis` part, JupyterLab / Jupyter Notebook is needed.

   For the other Python libraries used in this project, see `requirements.txt`.
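One quick way to check dependencies is to probe whether the libraries are importable; a minimal sketch (the library list here is an assumption based on a typical analysis stack — `requirements.txt` is the authoritative list):

```python
import importlib.util

# Assumed core libraries for the analysis notebooks; consult
# requirements.txt in the repo for the authoritative list.
LIBS = ["pandas", "numpy", "matplotlib"]

def missing_libraries(libs=LIBS):
    """Return the subset of libs that cannot be imported."""
    return [m for m in libs if importlib.util.find_spec(m) is None]
```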
### Reproducing `analysis`

3. Prepare and parse the trace files for analysis.

   ```bash
   cd analysis
   python ./trace_parser.py --cluster-list 'Venus'
   ```

4. After generating all required data, you can analyze the traces through the `.ipynb` files within the 4 sub-folders of `analysis`: **1_compare with Philly trace**, **2_cluster characterization**, **3_job characterization**, **4_user characterization**.

   These Jupyter Notebook scripts generate the figures for the trace analysis part of the paper.

   > **Note that only the `Venus` trace is publicly available now. Thus, some generated figures are incomplete compared with the paper version.**
### Reproducing `framework`

#### `QSSF Service`

5. Before executing the simulation of the QSSF service, data preparation is needed. It generates the VC configuration and job trace for each cluster.

   ```bash
   cd framework/QSSF\ Service/data
   bash prepare_data.sh
   ```

6. Then, you can run all scheduling policies on the **Philly** trace in `sweep` mode, as below:

   ```bash
   cd ..
   python simulator.py -e='Philly' -t='./data/Philly' --sweep
   ```

   See `run.sh` for more usage examples on **Helios**. Note that since we do not release job name information, the `estimator` and `qssf` policy are not available for **Helios**.

7. After the program finishes, you can check the results in the `log` folder. The job log and time sequence of each VC are provided separately.

8. Besides, we provide a simulation analysis and plot script in `plot`. You can generate Figure 13 in the paper with this script.
#### `CES Service`

9. Run the CES simulation on **Helios**:

   ```bash
   cd framework/CES\ Service
   python CES_Helios.py
   ```

   You can specify a different cluster in the script and adjust the configurations of the CES service via the `hyperparameter` function.

10. Similarly, run the CES simulation on **Philly**:

    ```bash
    python CES_Philly.py
    ```

11. From the code output and the generated figures `helios_ces` (Figure 14) & `philly_ces` (Figure 15), we can analyze the CES service performance in detail.
Lines changed: 11 additions & 0 deletions
+ `philly_trace.csv`

  It is used for comparison with our datacenter workloads.

  We convert the original Philly trace file into `.csv` format and select the same period of job logs as described in ["Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads"](https://www.usenix.org/system/files/atc19-jeon.pdf) (ATC '19).

  The official public data can be downloaded from [philly-traces](https://github.com/msr-fiddle/philly-traces).

+ `philly_trace_B.csv`

  In Philly, failed jobs are retried a fixed number of times. If we process the Philly trace by treating each attempt as an individual job, we obtain `philly_trace_B.csv`.
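The difference between the two files can be illustrated with a toy example (hypothetical records, not the real Philly schema): treating each retry attempt as its own job inflates the job count.

```python
# Toy attempt records: (job_id, attempt_index, status). Hypothetical data —
# the real Philly trace has a much richer schema.
attempts = [
    ("j1", 0, "Failed"), ("j1", 1, "Failed"), ("j1", 2, "Pass"),
    ("j2", 0, "Pass"),
]

# philly_trace.csv style: one row per job (attempts collapsed).
jobs = {jid for jid, _, _ in attempts}

# philly_trace_B.csv style: one row per attempt.
attempt_jobs = len(attempts)

assert len(jobs) == 2 and attempt_jobs == 4
```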

analysis/1_compare with Philly trace/compare_with_Philly_trace.ipynb

Lines changed: 227 additions & 0 deletions
Lines changed: 6 additions & 0 deletions
```
id,gpu_num,job_num,cpu_job_num,gpu_job_num,avg_run_time_cpu,avg_run_time_gpu,avg_que_time_cpu,avg_que_time_gpu,avg_gpu_num,med_run_time_cpu,med_run_time_gpu,med_que_time_cpu,med_que_time_gpu,med_gpu_num,complete_rate,cancel_rate,fail_rate,complete_rate_cpu,cancel_rate_cpu,fail_rate_cpu,complete_rate_gpu,cancel_rate_gpu,fail_rate_gpu,complete_gpu_time,cancel_gpu_time,fail_gpu_time
Venus,1022,246708,121405,125303,1649.859,13040.598,773.009,1253.288,6.736,21.0,204.0,0.0,0.0,1.0,0.69,0.186,0.124,0.86,0.097,0.043,0.526,0.272,0.202,6373315670.0,4920266598.0,1052030340.0
Earth,997,872886,445738,427148,162.73,5130.609,3.281,319.483,2.101,1.0,234.0,0.0,0.0,1.0,0.812,0.069,0.119,0.885,0.005,0.11,0.735,0.136,0.129,5713038873.0,4793853520.0,932310362.0
Saturn,2080,1753078,1054182,698896,619.062,5252.927,16.255,611.561,4.01,2.0,124.0,0.0,0.0,1.0,0.799,0.12,0.08,0.943,0.019,0.037,0.582,0.272,0.145,14375709962.0,10733792235.0,2542859066.0
Uranus,2119,490309,161192,329117,1211.799,9163.725,119.704,1949.182,4.038,32.0,280.0,0.0,0.0,1.0,0.668,0.173,0.159,0.791,0.116,0.092,0.607,0.201,0.192,13532231815.0,10267196772.0,2761134934.0
total,6220,3362981,1782517,1580464,628.759,6651.681,73.907,862.047,3.716,2.0,206.0,0.0,0.0,1.0,0.775,0.119,0.105,0.909,0.03,0.061,0.624,0.221,0.155,39994296320.0,30715109125.0,7288334702.0
```
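This per-cluster summary (one row per cluster plus a `total` row) can be loaded programmatically; a minimal stdlib-only sketch, with a truncated copy of the data embedded so it is self-contained (in the repo, open the actual CSV file instead):

```python
import csv
import io

# Truncated copy of the per-cluster summary above (a few columns only);
# in the repo, read the real CSV file rather than this embedded string.
SUMMARY_CSV = """id,gpu_num,job_num,avg_run_time_gpu
Venus,1022,246708,13040.598
Earth,997,872886,5130.609
Saturn,2080,1753078,5252.927
Uranus,2119,490309,9163.725
"""

def load_summary(text=SUMMARY_CSV):
    """Index the summary rows by cluster id, converting numeric fields."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        rows[row["id"]] = {
            "gpu_num": int(row["gpu_num"]),
            "job_num": int(row["job_num"]),
            "avg_run_time_gpu": float(row["avg_run_time_gpu"]),
        }
    return rows

summary = load_summary()
```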
