Deep Learning Training Service v1.6.0

Anbang-Hu released this 27 May 21:42

7556f34

Job Manager

Use username to run inference worker command
Support preempting inference jobs
Support preempting running preemptible jobs

RESTful API

Override GPU type in job submission to avoid incorrect resource accounting by GPU type

Monitoring

NVSM health metrics for DGX-2
Add Prometheus aggregate rules for federation scrape
Expose health and performance metrics in Lustre
Remove data retirement in job-exporter metrics collection to avoid missing data
Expose InfiniBand metrics
Distinguish metrics from preemptible jobs
GPU hours at cluster, VC, user, and job level
NFS storage usage by user
Monitor job pod phase
Add a centralized email sender
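
As an illustration of the "GPU hours at cluster, VC, user, and job level" item, the sketch below shows one way such a roll-up could be computed. It is not the service's implementation: the function `gpu_hours`, the sample record layout, and the field names `gpus` and `seconds` are all assumptions made for this example.

```python
from collections import defaultdict


def gpu_hours(samples, levels=("cluster", "vc", "user", "job")):
    """Sum GPU hours per label value at each aggregation level.

    Hypothetical record layout: each sample is a dict holding one value
    per level, plus "gpus" (GPU count) and "seconds" (time those GPUs
    were held).
    """
    totals = {level: defaultdict(float) for level in levels}
    for sample in samples:
        # GPU hours = number of GPUs * wall-clock hours they were held.
        hours = sample["gpus"] * sample["seconds"] / 3600.0
        for level in levels:
            totals[level][sample[level]] += hours
    return {level: dict(by_key) for level, by_key in totals.items()}


# Made-up usage records, for illustration only.
samples = [
    {"cluster": "c1", "vc": "vision", "user": "alice", "job": "j1",
     "gpus": 4, "seconds": 1800},
    {"cluster": "c1", "vc": "vision", "user": "bob", "job": "j2",
     "gpus": 1, "seconds": 7200},
    {"cluster": "c1", "vc": "nlp", "user": "alice", "job": "j3",
     "gpus": 8, "seconds": 3600},
]
report = gpu_hours(samples)
```

Here `report["vc"]` maps each virtual cluster to its total GPU hours, and the same lookup works for the other three levels. Computing every level in one pass keeps the four totals consistent with each other.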

Dashboard

Storage tab in cluster status
End-to-end test in browser

Deployment

Lustre integration in cloud init deployment pipeline
Map old configs to cloud init format

Insight

Provide insights for running GPU jobs in the backend