Deep Learning Training Service v1.6.0
Job Manager
Use username to run inference worker command
Support preempting inference jobs
Support preempting running preemptible jobs
RESTful API
Override GPU type in job submission to avoid incorrect resource accounting by GPU type
Monitoring
NVSM health metrics for DGX-2
Add Prometheus aggregate rules for federation scrape
Expose health and performance metrics in Lustre
Remove data retirement in job-exporter metrics collection to avoid missing data
Expose InfiniBand metrics
Distinguish metrics from preemptible jobs
GPU hours at cluster, VC, user, and job level
NFS storage usage by user
Monitor job pod phase
Add a centralized email sender
Dashboard
Storage tab in cluster status
End-to-end test in browser
Deployment
Lustre integration in cloud init deployment pipeline
Map old configs to cloud init format
Insight
Provide insights for running GPU jobs in the backend