acrolinx

v-dirichards · web-flow · commit cd393b218763 · 2023-06-15T10:39:47.000-05:00
diff --git a/articles/machine-learning/v1/how-to-debug-parallel-run-step.md b/articles/machine-learning/v1/how-to-debug-parallel-run-step.md
@@ -254,7 +254,7 @@ def init():
 ```
 
 ## How to handle log in new processes?
-You can spawn new processes in you entry script with [`subprocess`](https://docs.python.org/3/library/subprocess.html) module, connect to their input/output/error pipes and obtain their return codes.
+You can spawn new processes in your entry script with the [`subprocess`](https://docs.python.org/3/library/subprocess.html) module, connect to their input/output/error pipes, and obtain their return codes.
 
 The recommended approach is to use the [`run()`](https://docs.python.org/3/library/subprocess.html#subprocess.run) function with `capture_output=True`. Errors will show up in `logs/user/error/<node_id>/<process_name>.txt`.
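
As a minimal sketch of the recommended `run()` approach, the helper below starts a child process with `capture_output=True` and forwards what it captured to the parent's own streams. The name `run_and_forward` and the forwarding via `print` are assumptions made for illustration; they are not part of ParallelRunStep or of the article being changed.

```python
import subprocess
import sys


def run_and_forward(args):
    """Run a child process, forward its captured output, return its exit code.

    Hypothetical helper for illustration only.
    """
    # capture_output=True collects the child's stdout and stderr;
    # text=True decodes them to str instead of bytes.
    result = subprocess.run(args, capture_output=True, text=True)

    # Re-emit the captured streams from the parent process.
    if result.stdout:
        print(result.stdout, end="")
    if result.stderr:
        print(result.stderr, end="", file=sys.stderr)

    return result.returncode


if __name__ == "__main__":
    code = run_and_forward([sys.executable, "-c", "print('hello from child')"])
    print(f"child exited with {code}")
```

Returning the exit code instead of passing `check=True` leaves the decision about a failed child to the caller.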
 
@@ -381,7 +381,7 @@ You can follow the lead in `~logs/job_result.txt` to find the cause and detailed
 Not if there are other available nodes in the designated compute cluster. The orchestrator will start a new node as replacement, and ParallelRunStep is resilient to such operation.
 
 ### What happens if `init` function in entry script fails?
-ParallelRunStep has mechanism to retry for a certain times to give chance for recovery from transient issues without delaying the job failure for too long, the mechanism is as follows:
+ParallelRunStep has a mechanism that retries a limited number of times, giving transient issues a chance to recover without delaying job failure for too long. The mechanism is as follows:
 1. If after a node starts, `init` on all agents keeps failing, we will stop trying after `3 * process_count_per_node` failures.
 2. If after job starts, `init` on all agents of all nodes keeps failing, we will stop trying if job runs more than 2 minutes and there're `2 * node_count * process_count_per_node` failures.
 3. If all agents are stuck on `init` for more than `3 * run_invocation_timeout + 30` seconds, the job would fail because of no progress for too long.
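
The three limits listed above are plain arithmetic on the job settings, so they can be worked out ahead of time. The sketch below only evaluates the formulas from the list. The function name `init_failure_limits` and the dictionary keys are hypothetical, and the code does not call or reproduce any ParallelRunStep internals.

```python
def init_failure_limits(node_count, process_count_per_node, run_invocation_timeout):
    """Evaluate the `init` retry limits described in the list above.

    Hypothetical helper: it restates the documented formulas and nothing more.
    """
    return {
        # Rule 1: per node, stop after this many init failures.
        "per_node_failures": 3 * process_count_per_node,
        # Rule 2: across the job, stop after this many failures,
        # once the job has run for more than 2 minutes.
        "job_failures": 2 * node_count * process_count_per_node,
        "job_min_runtime_seconds": 2 * 60,
        # Rule 3: fail if all agents are stuck on init longer than this.
        "stuck_timeout_seconds": 3 * run_invocation_timeout + 30,
    }


if __name__ == "__main__":
    # Example values chosen for illustration: 2 nodes, 4 processes per node,
    # and a run_invocation_timeout of 60 seconds.
    print(init_failure_limits(2, 4, 60))
```

With those example values the limits come to 12 failures per node, 16 failures across the job after 2 minutes, and 210 seconds of no progress.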