0.19.1
Metrics
With this update, we've added more metrics that you can export to Prometheus. The new metrics allow tracking job CPU and system memory utilization, user and project usage stats, success/error rate, and more.
Runs
| Name | Type | Description | Examples |
|---|---|---|---|
| `dstack_run_count_total` | counter | The total number of runs | 537 |
| `dstack_run_count_terminated_total` | counter | The number of terminated runs | 118 |
| `dstack_run_count_failed_total` | counter | The number of failed runs | 27 |
| `dstack_run_count_done_total` | counter | The number of successful runs | 218 |
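
As a rough illustration of how the run counters can be combined into an error rate, the sketch below parses a small piece of Prometheus text-format output and divides failed runs by total runs. The sample text is hypothetical: it reuses the example values from the table and omits any labels that real samples may carry. The helper names `parse_samples` and `failure_ratio` are made up for this example and are not part of dstack.

```python
import re

# Hypothetical scrape output built from the example values in the table.
# Real samples may carry labels, which this sketch sums over.
SAMPLE = """\
# TYPE dstack_run_count_total counter
dstack_run_count_total 537
dstack_run_count_terminated_total 118
dstack_run_count_failed_total 27
dstack_run_count_done_total 218
"""

# Metric name, optional {labels}, then the sample value.
# Simplification: label values containing "}" are not handled.
LINE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def parse_samples(text):
    """Sum sample values per metric name, skipping comments and blank lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


def failure_ratio(totals):
    """Return failed runs divided by total runs, or 0.0 when there are no runs."""
    total = totals.get("dstack_run_count_total", 0.0)
    if total == 0:
        return 0.0
    return totals.get("dstack_run_count_failed_total", 0.0) / total


totals = parse_samples(SAMPLE)
ratio = failure_ratio(totals)
print(f"failure ratio: {ratio:.3f}")
```

In a Prometheus setup the same ratio would more likely be computed in a query over a time window; the script only shows the arithmetic on a single scrape.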
Run jobs
| Name | Type | Description | Examples |
|---|---|---|---|
| `dstack_job_cpu_count` | gauge | Job CPU count | 32.0 |
| `dstack_job_cpu_time_seconds_total` | counter | Total CPU time consumed by the job, seconds | 11.727975 |
| `dstack_job_memory_total_bytes` | gauge | Total memory allocated for the job, bytes | 4009754624.0 |
| `dstack_job_memory_usage_bytes` | gauge | Memory used by the job (including cache), bytes | 339017728.0 |
| `dstack_job_memory_working_set_bytes` | gauge | Memory used by the job (not including cache), bytes | 147251200.0 |
For more details on metrics, see the Metrics documentation.
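
The job metrics can be turned into utilization figures with simple arithmetic. The sketch below is one possible way to do it, not an official recipe: memory utilization is the working set divided by the allocated total, and CPU utilization is the increase of the CPU time counter between two scrapes divided by the elapsed time multiplied by the CPU count. The function names, the second CPU time sample, and the 15-second scrape interval are assumptions made for illustration.

```python
def memory_utilization(working_set_bytes, total_bytes):
    """Fraction of allocated memory in use, based on the working set (no cache)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return working_set_bytes / total_bytes


def cpu_utilization(cpu_seconds_prev, cpu_seconds_now, elapsed_seconds, cpu_count):
    """Average fraction of available CPU capacity used between two scrapes."""
    if elapsed_seconds <= 0 or cpu_count <= 0:
        raise ValueError("elapsed_seconds and cpu_count must be positive")
    delta = cpu_seconds_now - cpu_seconds_prev
    if delta < 0:
        # The counter went backwards, so assume it restarted from zero.
        delta = cpu_seconds_now
    return delta / (elapsed_seconds * cpu_count)


# Memory values taken from the example column of the table.
mem = memory_utilization(147251200.0, 4009754624.0)

# The first CPU sample is from the table. The second sample and the
# 15-second interval are hypothetical.
cpu = cpu_utilization(11.727975, 41.727975, elapsed_seconds=15.0, cpu_count=32.0)

print(f"memory utilization: {mem:.1%}")
print(f"cpu utilization: {cpu:.1%}")
```

Using the working set rather than `dstack_job_memory_usage_bytes` keeps reclaimable cache out of the figure, which is usually closer to what matters when sizing a job.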
Major bugfixes
Fixed a bug introduced in 0.19.0 where the working directory in the container was incorrectly set by default to `/` instead of `/workflow`.
What's changed
- Fix trying fleet instance offers by @jvstme in #2443
- Add job system metrics, run metrics by @un-def in #2445
- Fix default working dir in containers by @jvstme in #2449
- [Examples] Update nccl-tests by @un-def in #2451
Full changelog: 0.19.0...0.19.1