Commit 637afae
fix(torchrun): Omit empty arguments and correct nproc_per_node type (#661)
* fix(torchrun): Omit empty arguments and correct nproc_per_node type
The command generation logic is updated to dynamically
build the torchrun command, excluding arguments that
are empty or None. This prevents them from overriding
environment variables, ensuring that torchrun can
correctly inherit its configuration. An exception is
made for integer arguments where 0 is a valid value.
Additionally, the nproc_per_node argument type has been
changed from int to str to support special values
accepted by PyTorch, such as 'auto', 'gpu', and 'cpu'.
Reference: https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py#L77-L88
Signed-off-by: Saad Zaher <szaher@redhat.com>
* only dynamically add torchrun args & change rdzv_id type to str
Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix smoke tests
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Enable both str and int dtypes for nproc_per_node and rdzv_id
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Use Python 3.11 style for pydantic model
Signed-off-by: Saad Zaher <szaher@redhat.com>
* add all torchrun args and validate them
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Remove non-required dependencies
Signed-off-by: Saad Zaher <szaher@redhat.com>
* update datatypes only
Signed-off-by: Saad Zaher <szaher@redhat.com>
* replace _ with - when passing torchrun args
Signed-off-by: Saad Zaher <szaher@redhat.com>
* make nproc_per_node accept only gpu or an int
Signed-off-by: Saad Zaher <szaher@redhat.com>
* add master_{addr, port} args and validate them
Signed-off-by: Saad Zaher <szaher@redhat.com>
* check for an unset or empty rdzv endpoint
Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix formatting error
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update src/instructlab/training/config.py
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update tests/smoke/test_train.py
Signed-off-by: Saad Zaher <szaher@redhat.com>
* Update src/instructlab/training/main_ds.py
Signed-off-by: Saad Zaher <szaher@redhat.com>
* fix indentation
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
* formatting
* add standalone as the fallback when neither master_addr nor rdzv_endpoint are provided
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
* clarify rdzv-backend arg
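The bullets above describe how the final version assembles the torchrun command. The following is a minimal Python sketch of that behaviour, not the code from this commit: the function names (`build_torchrun_command`, `validate_nproc_per_node`, `rendezvous_flags`) and their keyword arguments are hypothetical, the precedence of rdzv_endpoint over master_addr is an assumption, and treating digit-only strings such as "8" as integers is also an assumption. The flag spellings (`--nproc-per-node`, `--node-rank`, `--rdzv-endpoint`, `--master-addr`, `--standalone`) are the dashed forms torchrun accepts.

```python
from __future__ import annotations


def _is_set(value: object) -> bool:
    """Return True if a value should be forwarded to torchrun.

    None and empty strings are dropped so torchrun can fall back to its
    environment variables; integers, including 0, are always kept.
    """
    return value is not None and value != ""


def _flag(name: str, value: object) -> str:
    # Replace "_" with "-" in the flag name, e.g. node_rank -> --node-rank.
    return f"--{name.replace('_', '-')}={value}"


def validate_nproc_per_node(value: int | str) -> int | str:
    """Accept an int or the string "gpu" (hypothetical helper)."""
    if isinstance(value, int):
        return value
    if value == "gpu":
        return value
    if value.isdigit():  # assumption: numeric strings such as "8" are accepted
        return int(value)
    raise ValueError(f"nproc_per_node must be an int or 'gpu', got {value!r}")


def rendezvous_flags(
    rdzv_endpoint: str | None,
    rdzv_backend: str | None,
    master_addr: str | None,
    master_port: int | None,
) -> list[str]:
    """Pick rendezvous flags: endpoint first, then master address, else standalone."""
    if _is_set(rdzv_endpoint):
        flags = [_flag("rdzv_endpoint", rdzv_endpoint)]
        if _is_set(rdzv_backend):
            flags.append(_flag("rdzv_backend", rdzv_backend))
        return flags
    if _is_set(master_addr):
        flags = [_flag("master_addr", master_addr)]
        if _is_set(master_port):
            flags.append(_flag("master_port", master_port))
        return flags
    # Neither an endpoint nor a master address was given: fall back to standalone.
    return ["--standalone"]


def build_torchrun_command(
    script: str,
    *,
    nnodes: int | None = None,
    node_rank: int | None = None,
    nproc_per_node: int | str | None = None,
    rdzv_id: int | str | None = None,
    rdzv_endpoint: str | None = None,
    rdzv_backend: str | None = None,
    master_addr: str | None = None,
    master_port: int | None = None,
) -> list[str]:
    """Build a torchrun argv list, omitting unset arguments (hypothetical helper)."""
    if _is_set(nproc_per_node):
        nproc_per_node = validate_nproc_per_node(nproc_per_node)
    cmd = ["torchrun"]
    # Forward only the arguments that are actually set; 0 counts as set.
    for name, value in {
        "nnodes": nnodes,
        "node_rank": node_rank,
        "nproc_per_node": nproc_per_node,
        "rdzv_id": rdzv_id,
    }.items():
        if _is_set(value):
            cmd.append(_flag(name, value))
    cmd.extend(rendezvous_flags(rdzv_endpoint, rdzv_backend, master_addr, master_port))
    cmd.append(script)
    return cmd
```

For example, `build_torchrun_command("train.py", nnodes=1, node_rank=0, nproc_per_node="gpu", rdzv_id="")` returns `["torchrun", "--nnodes=1", "--node-rank=0", "--nproc-per-node=gpu", "--standalone", "train.py"]`. `node_rank=0` survives because 0 is a valid integer, the empty `rdzv_id` is omitted, and with neither an endpoint nor a master address set the sketch falls back to `--standalone`.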
---------
Signed-off-by: Saad Zaher <szaher@redhat.com>
Signed-off-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
Co-authored-by: Oleg Silkin <97077423+RobotSail@users.noreply.github.com>
1 parent 2c8d676
2 files changed (+67 -23 lines)