Releases: ml-energy/zeus
Zeus v0.15.0 & Zeus Daemon v0.4.0
Zeus daemon and corresponding Zeus client improvements
Zeus daemon is now more generic with three API groups that can be selectively enabled at Zeus daemon startup time:
- gpu-read: GPU power and energy measurements, no privilege needed
- gpu-control: GPU frequency, power limit, etc. control, privilege needed
- cpu-read: CPU power and energy measurements, privilege needed
Plus, Zeus daemon now supports JWT-based auth. You can generate a JWT token scoped to a subset of API groups and set ZEUSD_TOKEN to call APIs in the allowed API groups.
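To make the idea of a token "scoped to a subset of API groups" concrete, here is a minimal sketch of an HS256-signed JWT built with only the Python standard library. The claim name `groups`, the shared secret, and both helper functions are hypothetical; these notes do not describe the claims or signing scheme Zeus daemon actually uses, so treat this as an illustration of the concept rather than its implementation.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    # JWTs use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that was stripped at encoding time.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def mint_token(secret: bytes, groups: list[str]) -> str:
    """Create an HS256 JWT whose (hypothetical) `groups` claim lists allowed API groups."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"groups": groups}).encode())
    signing_input = f"{header}.{payload}".encode("ascii")
    signature = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


def allowed_groups(secret: bytes, token: str) -> set[str]:
    """Verify the signature and return the API groups the token is scoped to."""
    header, payload, signature = token.split(".")
    signing_input = f"{header}.{payload}".encode("ascii")
    expected = _b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, signature):
        raise ValueError("invalid token signature")
    return set(json.loads(_b64url_decode(payload))["groups"])


token = mint_token(b"demo-secret", ["gpu-read"])
print(allowed_groups(b"demo-secret", token))
```

A server following this pattern would reject a request to a gpu-control route when the verified set contains only gpu-read.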
What's Changed
- [Docs] Improve measurement section and upgrade doc deps by @jaywonchung in #213
- [Zeusd] Monitor-only mode by @jaywonchung in #214
- [Zeusd] Handle potential server clock skews by @jaywonchung in #215
- [Zeusd] Selective API group enabling by @jaywonchung in #216
- [Docs] Tweaks in research overview and Zeusd README.md by @jaywonchung in #217
- [Zeusd] JWT token auth and ZeusdClient by @jaywonchung in #218
Full Changelog: zeus-v0.14.0...zeus-v0.15.0
Zeus v0.14.0 & Zeus Daemon v0.3.0
Distributed Power Streaming
ZeusMonitor and PowerMonitor (well, every monitor we have) are local to a single machine. However, as ML workloads scale out, we frequently need multi-node power & energy monitoring and measurement.
We extended our Zeus daemon to stream real-time power measurements to subscribed clients via SSE (Server-Sent Events). The PowerStreamingClient in the Zeus Python library can subscribe to multiple Zeus daemons across multiple nodes and aggregate power samples into a single stream. You don't have to use PowerStreamingClient; something as simple as curl -N http://node:port/gpu/stream_power works too.
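Because the stream is plain SSE, any client that can read lines over HTTP can consume it. Below is a minimal sketch of an SSE line parser using only the standard library. The function name `iter_sse_data` is hypothetical, and since the payload format of each event is not described here, the sketch treats it as an opaque string.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete SSE event from an iterable of text lines."""
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield "\n".join(data_parts)
                data_parts = []
        elif line.startswith(":"):
            # Comment lines are commonly used as keep-alives; ignore them.
            continue
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            if value.startswith(" "):
                value = value[1:]
            data_parts.append(value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


# Hypothetical usage against a daemon (host and port are placeholders):
#
#   import urllib.request
#   with urllib.request.urlopen("http://node:port/gpu/stream_power") as resp:
#       for payload in iter_sse_data(line.decode("utf-8") for line in resp):
#           print(payload)
```

Aggregating several nodes then amounts to running one such reader per daemon and merging the yielded payloads, which is the job PowerStreamingClient does for you.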
What's Changed
- [Device] Skip calling APIs when there will be no change, AMDSMI unit fixes by @jaywonchung in #204
- [Docs] Add Troubleshooting section w/ multiprocessing pitfall by @jaywonchung in #205
- [Monitor] Make PowerMonitor fail when monitoring process fails to spawn by @jaywonchung in #206
- [Feat] Global variable warning for monitors by @jaywonchung in #208
- Add newsletter embeds by @jaywonchung in #209
- [Feat] Zeusd power server and client by @jaywonchung in #210
- [Feat] Zeus daemon major refactor, bump to v0.3.0 by @jaywonchung in #211
- [CI] Switch from pyright to ty by @jaywonchung in #212
Full Changelog: zeus-v0.13.1...zeus-v0.14.0
Zeus v0.13.1
Better AMD GPU support
Got access to AMD MI210, MI250X, and MI300X, so I smoothed out edge cases. Quick follow-up release from v0.13.0 for AMD GPU users.
What's Changed
- [AMD] Harden AMD GPU support by @jaywonchung in #203
Full Changelog: zeus-v0.13.0...zeus-v0.13.1
Zeus v0.13.0
Breaking Changes
The low-level device APIs are now all snake_case instead of camelCase. It had to be done. The camelCase names were an old mistake that came from mirroring how pynvml methods are named.
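If you have call sites to migrate, the rename is mechanical for most names. The helper below is a hypothetical sketch, not part of Zeus: it converts a camelCase identifier to snake_case. The snake_case name get_power_management_limit appears in this changelog, while its camelCase counterpart in the example is assumed for illustration. Names containing acronyms may not convert cleanly, so check the results against the actual device API.

```python
import re


def camel_to_snake(name: str) -> str:
    """Insert an underscore before each interior uppercase letter, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Assumed camelCase spelling, used only to demonstrate the conversion.
print(camel_to_snake("getPowerManagementLimit"))  # get_power_management_limit
```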
What's New
Various monitor usability improvements. Zeus now also follows logging best practices.
What's Changed
- ZeusMonitor and PowerMonitor usability improvements by @jaywonchung in #196
- Fix stop and GC for PowerMonitor and TemperatureMonitor by @jaywonchung in #197
- [Device] Use snake case for device methods by @jaywonchung in #198
- [Chore] Switch formatter to ruff format by @jaywonchung in #199
- [show_env] Show device physical to application mapping by @jaywonchung in #200
- Follow logging best practices by @jaywonchung in #201
- [Device] Implement get_power_management_limit by @jaywonchung in #202
Full Changelog: zeus-v0.12.3...zeus-v0.13.0
Zeus v0.12.3
New Features
CuPy synchronization support
Deep learning is not the only workload our users measure energy for. There are other CUDA-based applications (e.g., those using cuDF) that run through Python bindings of CUDA. ZeusMonitor now accepts cupy as another mechanism for CPU-GPU synchronization at the boundary of measurement windows.
Temperature monitor
Temperature is a metric that also has a lot to do with power. It's a nice-to-have addition.
What's Changed
- Add CuPy sync support by @jaywonchung in #194
- [Feature] GPU Temperature Monitor by @jaywonchung in #195
Full Changelog: zeus-v0.12.2...zeus-v0.12.3
Zeus v0.12.2
This is a maintenance release focused on security.
What's Changed
- Add SECURITY.md by @jaywonchung in #188
- Add scripts/check_licenses.sh by @jaywonchung in #190
- Add scripts/generate_sbom.sh by @jaywonchung in #191
- Add permissions to GitHub Action workflows by @jaywonchung in #192
Full Changelog: zeus-v0.12.1...zeus-v0.12.2
Zeus v0.12.1
Change Highlights
New PowerMonitor
Power measurement over time was not a first-class feature, but now it is. The new PowerMonitor allows you to measure (1) GPU 1s windowed average power, (2) GPU instantaneous power, and (3) GPU memory windowed average power -- if supported by your GPU model -- over time, and export deduplicated power samples into a list of timestamps and power measurements.
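Polling a power counter faster than it updates produces runs of identical readings, which is why deduplication matters when exporting a timeline. The sketch below shows one way to do it, assuming that deduplication means dropping consecutive samples with an unchanged power value. The function name and the `(timestamp, power)` tuple layout are hypothetical and are not PowerMonitor's actual interface.

```python
def deduplicate_samples(
    samples: list[tuple[float, float]],
) -> tuple[list[float], list[float]]:
    """Drop consecutive samples whose power reading did not change.

    `samples` holds (timestamp_seconds, power_watts) pairs in time order.
    Returns parallel lists of timestamps and power values.
    """
    timestamps: list[float] = []
    powers: list[float] = []
    for ts, power in samples:
        if powers and powers[-1] == power:
            # Same reading as the last kept sample: the counter has not updated yet.
            continue
        timestamps.append(ts)
        powers.append(power)
    return timestamps, powers


raw = [(0.0, 100.0), (0.1, 100.0), (0.2, 120.0), (0.3, 120.0), (0.4, 100.0)]
print(deduplicate_samples(raw))  # ([0.0, 0.2, 0.4], [100.0, 120.0, 100.0])
```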
Grace Hopper support
Zeus now supports measurements on Grace Hopper platforms. When you use the same Zeus APIs, it'll give you back the whole module's power and energy consumption (i.e., including the Grace CPU and the Hopper GPU). Support is still early stage, so please let us know if you bump into any rough edges.
uv
We're using uv in CI and local dev flow, and now uv.lock is in our codebase as well. Notably, uv has cut our CI time to literally half of what it used to be!
What's Changed
- [Feature] Grace Hopper support by @jaywonchung in #172
- ci: skip ci tests when only markdown files are changed in a push by @kitsiosk in #174
- Revert "ci: skip ci tests when only markdown files are changed in a push" by @jaywonchung in #175
- [UX] Improve python -m zeus.show_env by @jaywonchung in #176
- [CI] Remove unnecessary tests that just raise warnings by @jaywonchung in #177
- [CI] Use uv by @jaywonchung in #178
- [UX] Catch base errors in python -m zeus.show_env by @jaywonchung in #179
- [CI] Use latest uv to avoid GitHub API call by @jaywonchung in #180
- [UX] Improve python -m show_env for CPU/RAPL by @jaywonchung in #182
- [Zeusd] Check libnvidia-ml.so.1 if libnvidia-ml.so is not available by @jaywonchung in #183
- New PowerMonitor by @jaywonchung in #184
New Contributors
Full Changelog: zeus-v0.12.0...zeus-v0.12.1
Zeus v0.12.0
Change Highlights
New SoC device measurement support!
We have a new device abstraction in zeus.device.soc. Measurements can be accessed from the soc field in ZeusMonitor measurement objects.
Apple Silicon
Zeus now provides energy measurement on Apple Silicon chips with component breakdowns like CPU, GPU, DRAM, and ANE (specifics depend on the underlying chip). This is done via a new child project called zeus-apple-silicon. Check out details in our documentation.
NVIDIA Jetson Platform
NVIDIA Jetson is an embedded platform for AI workloads. Zeus now supports energy measurement on Jetson platforms by reading off of its on-board power monitor. Check out details in our documentation.
Electricity price tracking
Via integration with the OpenEI API, Zeus now allows electricity price tracking with the EnergyCostMonitor class. Its API is essentially the same as ZeusMonitor (i.e., measurement windows).
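The arithmetic behind price tracking is simple: convert each window's energy from joules to kWh and multiply by the electricity price in effect during that window. The sketch below illustrates this with a hypothetical function and made-up prices; it does not use EnergyCostMonitor or the OpenEI API.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3,600,000 J


def energy_cost(windows: list[tuple[float, float]]) -> float:
    """Total cost for (energy_joules, price_per_kwh) pairs, one pair per window."""
    return sum(energy_j / JOULES_PER_KWH * price for energy_j, price in windows)


# Two windows with illustrative prices: 1 kWh at 0.10/kWh and 2 kWh at 0.20/kWh.
total = energy_cost([(3.6e6, 0.10), (7.2e6, 0.20)])
print(f"{total:.2f}")
```

Handling each window separately matters because the price can change over time, so multiplying total energy by a single price can give the wrong answer.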
What's Changed
- [Misc] Redirect stdout to None on import amdsmi by @jaywonchung in #154
- [CI] Upgrade QEMU image version to fix segfault in CI by @jaywonchung in #155
- [Feat] SoC Device Abstraction by @michahn01 in #160
- [Feat] Integrating Instruction Profiler in PFO.server.scheduler by @DdIiVvYyAaMm in #158
- [Fix] Update amdsmi exception handling by @michahn01 in #165
- [Docs] Code Block Fix by @DdIiVvYyAaMm in #166
- [CI] Fix Pyright private import errors, upgrade actions by @jaywonchung in #169
- [Feat] Update SoC device common by @michahn01 in #168
- [Feat] Created price.py to incorporate OpenEI API integration. by @vishwa-11 in #162
- [Feat] Apple Silicon Integration by @michahn01 in #170
- [Feat] Jetson platform measurement support by @jxunn in #167
- [Misc] Update project news by @jaywonchung in #171
New Contributors
- @DdIiVvYyAaMm made their first contribution in #158
- @vishwa-11 made their first contribution in #162
- @jxunn made their first contribution in #167
Full Changelog: zeus-v0.11.0...zeus-v0.12.0
Zeus Daemon v0.2.0
Change Highlights
CPU and DRAM energy measurements
Zeus daemon now also supports CPU and DRAM energy measurements with RAPL, which requires root privileges even just for measurement. Zeus daemon has also been integrated into the Zeus Python library, so as long as you have the daemon deployed and you set the ZEUSD_SOCK_PATH environment variable, you'll be all set!
What's Changed
- [Feat] Implement CPU and DRAM monitoring for zeusd by @wbjin in #137
- Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor by @michahn01 in #150
- Trace GPU ID in Zeusd GPU routes by @jaywonchung in #152
Zeus v0.11.0
Change Highlights
Renamed to zeus!
Until now we used zeus-ml because the name zeus was taken on PyPI, but now we're finally able to move to zeus:
pip install zeus
Prometheus Metrics
Zeus power and energy measurements can now be exported as Prometheus metrics! We currently support three metrics:
- Energy consumption of a fixed code range (Histogram)
- Power draw over time (Gauge)
- Cumulative energy consumption over time (Counter)
We wrote up a detailed metric monitoring guide and integration examples.
AMD GPU enhancements
We created ROCm AMDSMI Python bindings (GitHub, PyPI) and integrated them with Zeus. Before this, users had to cd into their ROCm installation's AMDSMI distribution directory and run pip install, which isn't very convenient.
Our bindings are unofficial & community-maintained. But AMDSMI maintainers did take a look (ROCm/amdsmi#8).
Carbon Emission Estimations
The new zeus.monitor.carbon.CarbonEmissionMonitor takes in a carbon intensity provider (e.g., from ElectricityMaps) and provides an estimate for operational carbon emissions. The window-based API is essentially the same as ZeusMonitor.
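The estimate itself comes down to a unit conversion: energy in kWh multiplied by the grid's carbon intensity in gCO2eq/kWh. The sketch below shows that calculation with a hypothetical function and an illustrative intensity value; it does not call CarbonEmissionMonitor or ElectricityMaps.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3,600,000 J


def operational_emissions_g(energy_joules: float, intensity_g_per_kwh: float) -> float:
    """Operational emissions in gCO2eq for a measured energy and carbon intensity."""
    return energy_joules / JOULES_PER_KWH * intensity_g_per_kwh


# 2 kWh of measured energy on a grid emitting 400 gCO2eq/kWh (illustrative number).
print(operational_emissions_g(7.2e6, 400.0))  # 800.0
```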
Full Changelog
- [Misc] Reorganize Zeus NSDI 23 paper artifacts by @jaywonchung in #126
- [Docs] Add BUILD_SOCIAL_CARD env, skip social card build by default by @jaywonchung in #130
- [Feat] CarbonIntensityProvider and ElectricityMaps implementation by @danielhou0515 in #129
- [Misc] Fix link in PLO example README by @jaywonchung in #136
- Fix typo in profiler script by @dkopczyk in #138
- [Feat] amdsmi bindings integration by @parthraut in #132
- Make sure to assign EmptyCPUs to cpus if there is a permission error by @wbjin in #139
- [Feat] Implement CPU and DRAM monitoring for zeusd by @wbjin in #137
- [Fix] Fix tests failing due to deprecated app argument in httpx client by @jaywonchung in #140
- Out of Bounds Power Limit in GlobalPowerLimitOptimizer by @parthraut in #143
- [CI] Upgrade actions/cache to V4 by @jaywonchung in #144
- [Misc] Update Perseus paper link by @jaywonchung in #145
- [feat] CarbonEmissionMonitor by @danielhou0515 in #148
- Update zeusd dependencies following dependabot suggestions by @jaywonchung in #149
- [Feat] Prometheus metric export by @sharonsyh in #134
- Pytorch Fully Sharded Data Parallel (FSDP) Integration by @parthraut in #147
- Rename package from zeus-ml to zeus by @jaywonchung in #151
- Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor by @michahn01 in #150
- Trace GPU ID in Zeusd GPU routes by @jaywonchung in #152
New Contributors
- @dkopczyk made their first contribution in #138
- @michahn01 made their first contribution in #150