This repository contains the code behind the paper "Power- and Fragmentation-aware Online Scheduling for GPU Datacenters" by Francesco Lettich (CNR-ISTI), Emanuele Carlini (CNR-ISTI), Franco Maria Nardini (CNR-ISTI), Raffaele Perego (CNR-ISTI), and Salvatore Trani (CNR-ISTI).
The article was accepted at, and presented at, the 25th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2025). The slides of the presentation given at the conference can be found here.
This repository started as a fork of the repository behind the seminal 2023 USENIX paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent" by Weng, Qizhen, et al.; it now includes the customizations and code behind our paper's contributions, which focus on minimizing power consumption alongside GPU fragmentation.
More precisely, in this repository you will find:
- our power-aware online scheduling policy, PWR, in the form of a Kubernetes scoring plugin. The core of the PWR plugin can be found in the Go source file pwr_score.go;
- the power consumption telemetry feature added to Alibaba's open-simulator (this required modifying some of the simulator's source files);
- Python and Bash scripts used for the paper's experimental evaluation, to ensure reproducibility.
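To give a feel for what a power-aware scoring plugin does, here is a minimal sketch written in Python for brevity (the actual plugin is written in Go). It is not the PWR policy from the paper: the `Node` fields, the per-GPU idle/busy wattages, and the whole-GPU power model are simplifying assumptions made for illustration only. The one detail taken from Kubernetes itself is that scoring plugins return node scores in the range 0 to 100.

```python
from dataclasses import dataclass

MAX_NODE_SCORE = 100  # Kubernetes scoring plugins return scores in [0, 100]


@dataclass
class Node:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    gpus_total: int
    gpus_busy: int
    idle_gpu_watts: float  # assumed power drawn by one idle GPU
    busy_gpu_watts: float  # assumed power drawn by one GPU under load


def power_increase(node, gpus_requested):
    """Estimated extra watts if the pod lands on `node`; None if it does not fit."""
    free = node.gpus_total - node.gpus_busy
    if gpus_requested > free:
        return None
    # Each newly occupied GPU moves from its idle draw to its busy draw.
    return gpus_requested * (node.busy_gpu_watts - node.idle_gpu_watts)


def score_nodes(nodes, gpus_requested):
    """Map each feasible node to a score; the smallest power increase scores highest."""
    deltas = {n.name: power_increase(n, gpus_requested) for n in nodes}
    feasible = {name: d for name, d in deltas.items() if d is not None}
    if not feasible:
        return {}
    lo, hi = min(feasible.values()), max(feasible.values())
    scores = {}
    for name, d in feasible.items():
        if hi == lo:
            scores[name] = MAX_NODE_SCORE  # all candidates are equally good
        else:
            scores[name] = round(MAX_NODE_SCORE * (hi - d) / (hi - lo))
    return scores


if __name__ == "__main__":
    cluster = [
        Node("node-a", gpus_total=8, gpus_busy=4, idle_gpu_watts=50, busy_gpu_watts=300),
        Node("node-b", gpus_total=8, gpus_busy=2, idle_gpu_watts=60, busy_gpu_watts=400),
        Node("node-c", gpus_total=4, gpus_busy=3, idle_gpu_watts=30, busy_gpu_watts=250),
    ]
    print(score_nodes(cluster, gpus_requested=2))
```

In a real Kubernetes scheduler, infeasible nodes are removed by filter plugins before scoring; the feasibility check is folded into the sketch only to keep it self-contained.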
The code can, in principle, be compiled on any platform. First ensure that Go is installed. Then:
go mod vendor installs the dependencies required to compile the code.
$ go mod vendor
make generates the compiled binary files in the bin directory.
$ make
The Python dependencies required to run the Python scripts behind our experimental evaluation are listed in the file requirements.txt. They can be installed by executing:
$ pip install -r requirements.txt
Then, to reproduce the experimental pipeline used in our paper, please follow these steps:
- translate the production traces from CSV to YAML; this is required to run the experiments with the simulator. To this end, read the README under the data directory for more information.
- execute the simulations conducted in the paper. To this end, read Section 1 of the README under the experiments directory for more information. Please be aware that the simulations can take a long time, depending on the resources at your disposal.
- extract and plot the simulations' results. To this end, read Section 2 of the README under the experiments directory for more information.
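As a rough illustration of what the first step involves, the sketch below turns CSV rows into a multi-document YAML stream of minimal Pod manifests. It is not the repository's conversion script: the column names (`name`, `cpu_milli`, `memory_mib`), the container name, and the placeholder image are assumptions, and GPU requests are left out because their encoding is specific to the simulator. The README under the data directory remains the reference for the actual procedure.

```python
import csv
import io


def pod_to_yaml(row):
    """Render one trace row as a minimal Pod manifest (hand-formatted YAML)."""
    return "\n".join([
        "apiVersion: v1",
        "kind: Pod",
        "metadata:",
        f"  name: {row['name']}",           # assumed column: pod name
        "spec:",
        "  containers:",
        "  - name: main",                   # hypothetical container name
        "    image: example/placeholder",   # hypothetical image
        "    resources:",
        "      requests:",
        f"        cpu: {row['cpu_milli']}m",         # assumed column: CPU in millicores
        f"        memory: {row['memory_mib']}Mi",    # assumed column: memory in MiB
    ])


def csv_to_yaml(csv_text):
    """Convert CSV text into a YAML stream with one document per row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return "\n---\n".join(pod_to_yaml(r) for r in rows) + "\n"


if __name__ == "__main__":
    sample = "name,cpu_milli,memory_mib\npod-0,4000,8192\npod-1,500,1024\n"
    print(csv_to_yaml(sample), end="")
```

The YAML is formatted by hand so that the sketch has no third-party dependencies; a real converter would more likely rely on a YAML library to handle quoting and escaping.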
Please cite our CCGrid 2025 article if you have found our contributions useful or have used them in your work.
@inproceedings{lettich2025pwr,
author={Lettich, Francesco and Carlini, Emanuele and Nardini, Franco Maria and Perego, Raffaele and Trani, Salvatore},
booktitle={2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)},
title={Power- and Fragmentation-Aware Online Scheduling for {GPU} Datacenters},
year={2025},
volume={},
number={},
pages={43-52},
doi={10.1109/CCGRID64434.2025.00015},
publisher={IEEE Computer Society}}