Elastic Training Support for ReInvent Keynote 3

@mollyheamazon released this 03 Dec 18:07
· 6 commits to main since this release
c64811d
  • Adds new command-line arguments to the HyperPodTrainingOperator to support elastic training capabilities:
    • --elastic-replica-increment-step, --max-node-count, --elastic-graceful-shutdown-timeout-in-seconds, --elastic-scaling-timeout-in-seconds, --elastic-scale-up-snooze-time-in-seconds, --elastic-replica-discrete-values
  • Enables dynamic scaling of compute resources during training operations
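
For readers who want to see the flag names in a concrete form, here is a minimal Python sketch of a parser that accepts the arguments listed above. It is not the operator's own code: the value types (integers, plus a comma-separated integer list for --elastic-replica-discrete-values), the absence of defaults, and the helper names build_parser and parse_discrete_values are all assumptions made for illustration. Only the flag names come from this release.

```python
import argparse


def parse_discrete_values(text):
    """Parse a comma-separated list such as "2,4,8" into sorted unique ints.

    ASSUMPTION: the real flag's value format is not documented in this
    release note; a comma-separated integer list is a guess.
    """
    values = sorted({int(part) for part in text.split(",") if part.strip()})
    if not values or values[0] <= 0:
        raise argparse.ArgumentTypeError("expected positive integers, e.g. 2,4,8")
    return values


def build_parser():
    """Build a hypothetical parser mirroring the flag names from the release."""
    parser = argparse.ArgumentParser(
        description="Illustrative mirror of the elastic training flags"
    )
    # ASSUMPTION: every flag below takes an integer; the release does not say.
    for flag in (
        "--elastic-replica-increment-step",
        "--max-node-count",
        "--elastic-graceful-shutdown-timeout-in-seconds",
        "--elastic-scaling-timeout-in-seconds",
        "--elastic-scale-up-snooze-time-in-seconds",
    ):
        parser.add_argument(flag, type=int)
    parser.add_argument(
        "--elastic-replica-discrete-values", type=parse_discrete_values
    )
    return parser


# Example invocation with made-up values.
example_args = build_parser().parse_args(
    ["--max-node-count", "8", "--elastic-replica-discrete-values", "2,8,4"]
)
```

argparse turns the dashes in each flag into underscores, so the parsed values are available as attributes such as example_args.max_node_count. Flags that are not passed are left as None in this sketch.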