Skip to content

Commit cd393b2

Browse files
authored
acrolinx
1 parent e1b1bd4 commit cd393b2

File tree

1 file changed

+2
-2
lines changed

1 file changed

+2
-2
lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -254,7 +254,7 @@ def init():
254254
```
255255

256256
## How to handle log in new processes?
257-
You can spawn new processes in you entry script with [`subprocess`](https://docs.python.org/3/library/subprocess.html) module, connect to their input/output/error pipes and obtain their return codes.
257+
You can spawn new processes in your entry script with [`subprocess`](https://docs.python.org/3/library/subprocess.html) module, connect to their input/output/error pipes and obtain their return codes.
258258

259259
The recommended approach is to use the [`run()`](https://docs.python.org/3/library/subprocess.html#subprocess.run) function with `capture_output=True`. Errors will show up in `logs/user/error/<node_id>/<process_name>.txt`.
260260

@@ -381,7 +381,7 @@ You can follow the lead in `~logs/job_result.txt` to find the cause and detailed
381381
Not if there are other available nodes in the designated compute cluster. The orchestrator will start a new node as replacement, and ParallelRunStep is resilient to such operation.
382382

383383
### What happens if `init` function in entry script fails?
384-
ParallelRunStep has mechanism to retry for a certain times to give chance for recovery from transient issues without delaying the job failure for too long, the mechanism is as follows:
384+
ParallelRunStep has mechanism to retry for a certain time to give chance for recovery from transient issues without delaying the job failure for too long, the mechanism is as follows:
385385
1. If after a node starts, `init` on all agents keeps failing, we will stop trying after `3 * process_count_per_node` failures.
386386
2. If after job starts, `init` on all agents of all nodes keeps failing, we will stop trying if job runs more than 2 minutes and there're `2 * node_count * process_count_per_node` failures.
387387
3. If all agents are stuck on `init` for more than `3 * run_invocation_timeout + 30` seconds, the job would fail because of no progress for too long.

0 commit comments

Comments
 (0)