Deep Learning Training Service v1.6.0

@Anbang-Hu Anbang-Hu released this 27 May 21:42
· 2 commits to v1.6 since this release
7556f34

Job Manager

  • Use username to run inference worker command
  • Support preempting inference jobs
  • Support preempting running preemptible jobs

RESTful API

  • Override GPU type in job submission to avoid incorrect resource accounting by GPU type

Monitoring

  • NVSM health metrics for DGX-2
  • Add Prometheus aggregate rules for federation scrape
  • Expose health and performance metrics for Lustre
  • Remove data retirement from job-exporter metrics collection to avoid missing data
  • Expose InfiniBand metrics
  • Distinguish metrics from preemptible jobs
  • GPU hours at cluster, VC, user, and job level
  • NFS storage usage by user
  • Monitor job pod phase
  • Add a centralized email sender
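
For readers unfamiliar with the GPU-hours metric listed above, the following is a minimal sketch of how GPU hours can be rolled up at the cluster, VC, user, and job levels. It is not the service's implementation: the `JobUsage` record, its field names, and the `aggregate_gpu_hours` helper are hypothetical, and the sketch assumes GPU hours are simply the number of GPUs multiplied by the runtime in hours.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobUsage:
    # Hypothetical record; these fields are assumptions, not the service's schema.
    job_id: str
    user: str
    vc: str
    gpus: int
    runtime_seconds: float


def gpu_hours(job):
    # Assumed definition: GPUs held by the job times its runtime in hours.
    return job.gpus * job.runtime_seconds / 3600.0


def aggregate_gpu_hours(jobs):
    # Accumulate one total per level named in the release note.
    totals = {
        "cluster": 0.0,
        "vc": defaultdict(float),
        "user": defaultdict(float),
        "job": {},
    }
    for job in jobs:
        hours = gpu_hours(job)
        totals["cluster"] += hours
        totals["vc"][job.vc] += hours
        totals["user"][job.user] += hours
        totals["job"][job.job_id] = hours
    return totals


# Illustrative data only; not taken from any real cluster.
jobs = [
    JobUsage("j1", "alice", "vc1", gpus=4, runtime_seconds=3600),
    JobUsage("j2", "bob", "vc1", gpus=2, runtime_seconds=1800),
    JobUsage("j3", "alice", "vc2", gpus=1, runtime_seconds=7200),
]
totals = aggregate_gpu_hours(jobs)
```

Keeping a separate total per level means each dashboard view reads one pre-summed value instead of re-scanning every job.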

Dashboard

  • Storage tab in cluster status
  • End-to-end test in browser

Deployment

  • Lustre integration in cloud init deployment pipeline
  • Map old configs to cloud init format

Insight

  • Provide insights for running GPU jobs in the backend