Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

This repository contains the code behind the paper "Power- and Fragmentation-aware Online Scheduling for GPU Datacenters" by Francesco Lettich (CNR-ISTI), Emanuele Carlini (CNR-ISTI), Franco Maria Nardini (CNR-ISTI), Raffaele Perego (CNR-ISTI), and Salvatore Trani (CNR-ISTI).

The article was accepted and presented at the 25th IEEE International Symposium on Cluster, Cloud, and Internet Computing (CCGrid 2025). The slides of the presentation given at the conference can be found here.

This repository started as a fork of the repository behind the seminal 2023 USENIX paper "Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent" by Qizhen Weng et al. It now also includes the customizations and code behind our paper's contributions, which focus on minimizing power consumption alongside GPU fragmentation.

More precisely, in this repository you will find:

our power-aware online scheduling policy, PWR, in the form of a Kubernetes scoring plugin. The core of the PWR plugin can be found in the Go source file pwr_score.go;
the power consumption telemetry feature added to Alibaba's open-simulator (this required modifying some of the simulator's source files);
Python and Bash scripts used for the paper's experimental evaluation, to ensure reproducibility.
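
To give a rough idea of what a power-aware scoring step can look like, here is a minimal, self-contained Go sketch. It is not the code in pwr_score.go: the gpuNode type, the linear idle-to-peak power model, and every name below are assumptions made purely for illustration. The only detail borrowed from Kubernetes is that scoring plugins return node scores between 0 and 100. The sketch skips nodes that cannot host the task, estimates the extra power each remaining node would draw, and gives the highest score to the node with the smallest estimated increase.

```go
package main

import "fmt"

// maxNodeScore mirrors the upper bound of Kubernetes node scores (100).
const maxNodeScore = 100

// gpuNode is a hypothetical, simplified view of a node (not the repository's type).
type gpuNode struct {
	Name      string
	FreeGPU   []float64 // free fraction of each GPU, in [0, 1]
	IdleWatts float64   // assumed per-GPU power draw when idle
	PeakWatts float64   // assumed per-GPU power draw at full load
}

// fits reports whether at least one GPU on n has `demand` free.
func fits(n gpuNode, demand float64) bool {
	for _, free := range n.FreeGPU {
		if free >= demand {
			return true
		}
	}
	return false
}

// powerIncrease estimates the extra watts drawn when a task using `demand`
// of one GPU lands on n, under an assumed linear power model.
func powerIncrease(n gpuNode, demand float64) float64 {
	return demand * (n.PeakWatts - n.IdleWatts)
}

// scoreNodes returns a score in [0, maxNodeScore] for every feasible node.
// The node with the lowest estimated power increase gets the highest score.
func scoreNodes(nodes []gpuNode, demand float64) map[string]int64 {
	deltas := make(map[string]float64)
	first := true
	var lo, hi float64
	for _, n := range nodes {
		if !fits(n, demand) {
			continue // infeasible nodes are skipped, not scored
		}
		d := powerIncrease(n, demand)
		deltas[n.Name] = d
		if first || d < lo {
			lo = d
		}
		if first || d > hi {
			hi = d
		}
		first = false
	}
	scores := make(map[string]int64, len(deltas))
	for name, d := range deltas {
		if hi == lo {
			scores[name] = maxNodeScore // all candidates are equivalent
			continue
		}
		scores[name] = int64(maxNodeScore * (hi - d) / (hi - lo))
	}
	return scores
}

func main() {
	nodes := []gpuNode{
		{Name: "node-a", FreeGPU: []float64{1.0, 0.5}, IdleWatts: 50, PeakWatts: 250},
		{Name: "node-b", FreeGPU: []float64{0.25}, IdleWatts: 30, PeakWatts: 150},
		{Name: "node-c", FreeGPU: []float64{1.0}, IdleWatts: 60, PeakWatts: 400},
	}
	for name, s := range scoreNodes(nodes, 0.5) {
		fmt.Printf("%s: %d\n", name, s)
	}
}
```

In this toy setup node-b is skipped because none of its GPUs has enough free capacity, and the two remaining nodes are ranked by their estimated power increase. For the policy actually evaluated in the paper, refer to pwr_score.go.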

How to compile the code

In principle, the code can be compiled on any platform. First, ensure that Go is installed. Then:

go mod vendor installs the dependencies required to compile the code.

$ go mod vendor

make generates the compiled binary files in the bin directory.

$ make

How to reproduce our experimental evaluation's pipeline and results

The Python dependencies required to run the Python scripts behind our experimental evaluation are listed in the file requirements.txt. They can be installed by executing:

$ pip install -r requirements.txt

Then, to reproduce the experimental pipeline used in our paper, please follow these steps:

Translate the production traces from CSV to YAML; this is required to run the experiments with the simulator. See the README under the data directory for details.
Execute the simulations conducted in the paper. See Section 1 of the README under the experiments directory for details. Please be aware that the simulations can take a long time, depending on the resources at your disposal.
Extract and plot the simulations' results. See Section 2 of the README under the experiments directory for details.

Cite us

Please cite our CCGrid 2025 article if you found our contributions useful or used them in your work.

@inproceedings{lettich2025pwr,
  author={Lettich, Francesco and Carlini, Emanuele and Nardini, Franco Maria and Perego, Raffaele and Trani, Salvatore},
  booktitle={2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)}, 
  title={Power- and Fragmentation-Aware Online Scheduling for {GPU} Datacenters}, 
  year={2025},
  volume={},
  number={},
  pages={43-52},
  doi={10.1109/CCGRID64434.2025.00015},
  publisher={IEEE Computer Society}}

