Commit 79343ef
support deepspeed elastic (#1672)
* support elastic speed
* add doc for deepspeed elastic
* design doc
* feat: add support for additional arguments in DeepSpeed checkpoint save method
Add **kwargs parameter to the save method in AsyncCheckpointAgent to allow passing additional arguments, such as storage options, when saving checkpoints. This enhances flexibility for future extensions without breaking existing functionality.
* feat: remove model state skip logic in checkpoint saver
The condition to skip saving model states when `ckpt_config.write_model` is false has been removed. This change ensures that all specified states in `ckpt_config.paths` are saved regardless of the `write_model` flag, aligning the behavior with the configuration paths and preventing unintended omissions during checkpoint operations.
Additionally, code formatting and style improvements were applied across multiple files, including:
- Added missing newlines between class definitions in `comm.py`
- Standardized spacing around operators and commas in `master_client.py` and `ckpt_saver.py`
- Updated Chinese comments to English in `ckpt_saver.py`
- Enhanced test coverage for graceful worker exit scenarios in `torch_agent_test.py`
* feat: improve universal checkpoint logic and code formatting
* feat: reduce checkpoint waiting timeout from 300 to 60 seconds
* feat: fix shared memory handling and add rank parameter to checkpoint saver
- Set shared_memory to None after closing to prevent reuse of closed memory
- Add check for shared_memory.buf in SharedMemoryHandler.get() to avoid errors
- Add rank parameter to TempDirCheckpointSaver.__init__ for proper initialization
- Fix test formatting and remove unused variable in checkpoint saver tests
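The first two fixes above follow a common pattern with Python's `multiprocessing.shared_memory`: drop the reference once the segment is closed, and check `buf` before reading. The sketch below only illustrates that pattern; `SketchSharedMemoryHandler` and its methods are hypothetical and do not reproduce DLRover's `SharedMemoryHandler`.

```python
from multiprocessing import shared_memory
from typing import Optional


class SketchSharedMemoryHandler:
    """Hypothetical handler; only illustrates the close-and-reset pattern."""

    def __init__(self, size: int = 16) -> None:
        self.shared_memory: Optional[shared_memory.SharedMemory] = (
            shared_memory.SharedMemory(create=True, size=size)
        )

    def get(self) -> Optional[bytes]:
        # Guard against a closed or missing segment before touching buf.
        shm = self.shared_memory
        if shm is None or shm.buf is None:
            return None
        return bytes(shm.buf[:4])

    def close(self) -> None:
        if self.shared_memory is not None:
            self.shared_memory.close()
            self.shared_memory.unlink()
            # Drop the reference so later calls cannot reuse closed memory.
            self.shared_memory = None


handler = SketchSharedMemoryHandler()
handler.shared_memory.buf[:4] = b"ckpt"
first = handler.get()   # reads the four bytes written above
handler.close()
second = handler.get()  # the guard returns None instead of raising
```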
* feat: fix lint and ci
* feat(ckpt_saver): skip persistence when checkpoint config is missing
Add a guard clause in `persist_to_storage` to skip the persistence operation if the checkpoint config is `None` or has no paths. This prevents potential errors when the checkpoint configuration is incomplete or unavailable, ensuring the saver handles missing configurations gracefully.
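A guard clause of this kind might look like the sketch below. The `SketchCheckpointConfig` dataclass, the function signature, and the returned list are assumptions made for illustration; only the name `persist_to_storage` and the "`None` or no paths" condition come from the message above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchCheckpointConfig:
    """Hypothetical config; `paths` maps a state name to its target path."""

    step: int = 0
    paths: Dict[str, str] = field(default_factory=dict)


def persist_to_storage(config: Optional[SketchCheckpointConfig]) -> List[str]:
    # Guard clause: nothing to persist without a config or without paths.
    if config is None or not config.paths:
        return []
    # A real saver would write each state here; this sketch only reports
    # which targets would be written.
    return sorted(config.paths.values())
```

With this shape, callers do not need their own `None` checks: `persist_to_storage(None)` and `persist_to_storage(SketchCheckpointConfig())` both return an empty list.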
* refactor: rename UCP-related classes and methods for clarity
- Rename `UCPReady` to `PreviousRoundCompleted` to better reflect its purpose of indicating previous rendezvous round completion
- Rename `UCPReadyRequest` to `PreviousRoundCompletedRequest` for consistency
- Update all related method names (`get_ucp_ready`, `set_ucp_ready`) to `get_previous_round_completed` and `set_previous_round_completed`
- Rename instance variable `ucp_ready` to `previous_round_completed` in `RendezvousManager`
- Improve documentation strings to clarify the purpose of tracking previous round completion status
* feat: replace DLROVER_UCP_RESTART with DLROVER_TRAINING_ELASTIC_MODE
- Update environment variable from DLROVER_UCP_RESTART to DLROVER_TRAINING_ELASTIC_MODE for better clarity and extensibility
- Change condition checks from `enable_ucp == "true"` to `elastic_mode == "ucp"` to support multiple elastic training modes
- Remove previous_round_completed logic from base RendezvousManager to simplify state management
- Introduce create_training_rdzv_manager factory function for flexible rendezvous manager creation
- Centralize elastic mode configuration through environment variable for consistent behavior across components
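The mode check and factory described in this change could be sketched as follows. The two manager classes and the factory's parameter are hypothetical; the commit message names `create_training_rdzv_manager` and the `"ucp"` mode but does not show the signature.

```python
class SketchBaseRdzvManager:
    """Hypothetical default rendezvous manager."""

    name = "base"


class SketchUcpRdzvManager(SketchBaseRdzvManager):
    """Hypothetical manager used when the elastic mode is "ucp"."""

    name = "ucp"


def create_training_rdzv_manager(elastic_mode: str = "base") -> SketchBaseRdzvManager:
    # Compare against the mode string rather than a "true"/"false" flag,
    # so further modes can be added without another boolean switch.
    if elastic_mode == "ucp":
        return SketchUcpRdzvManager()
    return SketchBaseRdzvManager()
```

In this sketch any unrecognized mode falls back to the base manager; whether the real factory does the same is not stated in the message.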
* refactor: replace previous round completion with rendezvous blocking
- Rename `PreviousRoundCompleted` message to `RdzvBlocked` with `blocked` boolean and `reason` string fields
- Remove `PreviousRoundCompletedRequest` message as it is no longer used
- Update `MasterClient` methods to use `set_rdzv_blocked` instead of `set_previous_round_completed` and remove `get_previous_round_completed`
- Modify `ElasticTrainingAgent` to call `set_rdzv_blocked` when UCP elastic mode is active
- Add `_rdzv_blocked` and `_rdzv_block_reason` state to `RendezvousManager` with corresponding setter and getter methods
- Update `_pre_rdzv_check_hook` to return the rendezvous blocked state and reason
This change simplifies the rendezvous state tracking by consolidating completion status into a blocking mechanism with an optional reason, improving clarity and flexibility for UCP elastic training scenarios.
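The blocking state described above can be pictured with the sketch below. The field names `blocked` and `reason`, the attributes `_rdzv_blocked` and `_rdzv_block_reason`, and the method names come from the message; the class `SketchRendezvousManager`, the signatures, and the choice to clear the reason on unblock are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RdzvBlocked:
    """Message shape as described: a flag plus an optional reason."""

    blocked: bool = False
    reason: str = ""


class SketchRendezvousManager:
    """Hypothetical manager holding the blocked state."""

    def __init__(self) -> None:
        self._rdzv_blocked = False
        self._rdzv_block_reason = ""

    def set_rdzv_blocked(self, blocked: bool, reason: str = "") -> None:
        self._rdzv_blocked = blocked
        # Assumption: the reason is cleared once rendezvous is unblocked.
        self._rdzv_block_reason = reason if blocked else ""

    def _pre_rdzv_check_hook(self) -> Tuple[bool, str]:
        # Report the blocked state together with its reason.
        return self._rdzv_blocked, self._rdzv_block_reason


manager = SketchRendezvousManager()
msg = RdzvBlocked(blocked=True, reason="ucp conversion in progress")
manager.set_rdzv_blocked(msg.blocked, msg.reason)
blocked_state = manager._pre_rdzv_check_hook()
manager.set_rdzv_blocked(False)
unblocked_state = manager._pre_rdzv_check_hook()
```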
* refactor: remove unnecessary blank lines and unused imports
- Remove extra blank lines in `comm.py`, `rdzv_manager.py`, and test files to improve code readability
- Remove unused import of `ElasticTrainingRendezvousManager` in `test_servicer.py` to clean up dependencies
* lint fix
* refactor seen_new_saving -> need_new_saving
* feat: add training elastic mode configuration for rendezvous manager
- Add DLROVER_TRAINING_ELASTIC_MODE environment variable constant
- Introduce training_elastic_mode default value and context attribute
- Add --training_elastic_mode argument to master CLI with default "base"
- Update rendezvous manager factory to use context instead of environment variable
- Pass training_elastic_mode from CLI args to job context in master initialization
The change centralizes configuration of the training elastic mode (base/ucp) through the master's command-line interface and global context, replacing the previous environment variable approach for better consistency and configurability.
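A flag like this can be declared with the standard `argparse` module, as in the sketch below. The flag name and its `"base"` default come from the message; the `build_parser` helper, the `choices` restriction, and the help text are assumptions and may differ from the master's actual parser.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of a master CLI")
    parser.add_argument(
        "--training_elastic_mode",
        type=str,
        default="base",
        choices=["base", "ucp"],  # assumption: only these two modes
        help="Elastic training mode used to pick the rendezvous manager.",
    )
    return parser


default_args = build_parser().parse_args([])
ucp_args = build_parser().parse_args(["--training_elastic_mode", "ucp"])
```

The parsed value would then be copied into the job context so that components read one shared setting instead of each consulting an environment variable.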
* fix ci & unit test
* feat: add training elastic mode argument to job master
Add a new constant `trainingElasticModeArg` to the master controller and include it in the list of master arguments. This allows the job master to accept a `--training_elastic_mode` flag, enabling support for different training elasticity modes such as 'ucp'. The test has been updated to verify that the new argument is correctly passed to the master pod command.
---------
Co-authored-by: 01191421 <lijialin1014@cmbchina.com>
Co-authored-by: Tianyi Chen <chentianyi.cty@antfin.com>
1 parent 8abc858 commit 79343ef
File tree
32 files changed, +1679 -56 lines changed
- dlrover
- python
- common
- elastic_agent
- torch
- master
- elastic_training
- scaler
- scheduler
- tests
- trainer/torch
- flash_checkpoint
- docs
- deployment
- design
- figures
- tutorial
- examples/pytorch/mnist
- go/elasticjob/pkg/controllers