Commit ca54275

fix grammer
1 parent ff41c69 commit ca54275

1 file changed

+5
-5
lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 5 additions & 5 deletions
@@ -208,20 +208,20 @@ You can also view the results of periodical checks of the resource usage for eac
208208
## Common job failure reasons
209209

210210
### SystemExit: 42
211-
Exit 41 and 42 are PRS designed exit code. Worker nodes exit with 41 to notify compute manager that it terminated independently. A leader node may exit with 0 or 41 which indicates the job result. Exit 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow previous section to debug your job.
211+
Exit codes 41 and 42 are defined by PRS. A worker node exits with 41 to notify the compute manager that it terminated independently; this is expected. A leader node may exit with 0 or 42, which indicates the job result. Exit code 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.
212212

213213
### Data Permission
214214
This error indicates that the compute cannot access the input data. If identity-based authentication is used for your compute cluster and storage, refer to [Identity-based data authentication](../how-to-administrate-data-authentication.md).
215215

216-
### Processes terminated unexpectly
216+
### Processes terminated unexpectedly
217217
Processes may crash due to unexpected or unhandled exceptions, or the system may kill processes due to Out of Memory exceptions. In the PRS system logs at `~/logs/sys/node/<node-id>/_main.txt`, errors like the one below can be found.
218218

219219
```
220220
<process-name> exits with returncode -9.
221221
```
222222

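When reading a log line like the one above, it can help to know that Python's `subprocess` convention reports a negative return code `-N` when a process was terminated by signal `N`. On Linux, `-9` therefore corresponds to `SIGKILL`, which is the signal the kernel's out-of-memory killer sends. The sketch below decodes such a value. The helper name `describe_returncode` is hypothetical and not part of PRS, and the sketch assumes PRS reports return codes using this same convention.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a process return code using the subprocess convention.

    Hypothetical helper for illustration; not part of PRS.
    """
    if returncode < 0:
        # A negative value means "terminated by signal -returncode".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    if returncode == 0:
        return "exited normally"
    return f"exited with error code {returncode}"


print(describe_returncode(-9))
```

A `SIGKILL` cannot be caught by the process, which is consistent with the user script leaving no stack trace of its own in these cases.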
223223
#### Out of Memory
224-
`~/logs/perf` logs computation resource comsuptions of each processes. The memory usage of each task processor can be found. You can estimate the total memory usage on the node.
224+
`~/logs/perf` logs the computation resource consumption of processes. The memory usage of each task processor can be found there, so you can estimate the total memory usage on the node.
225225

226226
The Out of Memory error can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.
227227

@@ -233,9 +233,9 @@ In some cases, the python processes cannot catch the failing stack. You can add
233233
### Minibatch Timeout
234234
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the run() functions take more time than expected, here are some tips.
235235

236-
- Check the elapsed time and process time of the minibatch. The process time measures CPU time of the process. When process time is significantly shorter than elapsed, you can check if there are some heavy IO operations or network requests in the tasks. Long latency of those operations are the common reason of minibatch timeout.
236+
- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is significantly shorter than the elapsed time, check whether the tasks contain heavy IO operations or network requests. Long latency in those operations is a common reason for minibatch timeouts.
237237

238-
- Some specific minibatches take longer time than others. You can either updated the configuration, or try work with input data to balance the minibatch processing time.
238+
- Some specific minibatches take longer than others. You can either update the configuration or try to work with the input data to balance the minibatch processing time.
239239

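The difference between elapsed time and process time in the first tip can be reproduced locally with the standard library. The following is a minimal sketch, not PRS code: `timed_call` and `fake_io_bound_task` are hypothetical names, and `time.sleep` stands in for a slow IO operation or network request inside a run() function.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # CPU time used by this process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    return result, elapsed_seconds, cpu_seconds


def fake_io_bound_task():
    # Sleeping stands in for waiting on IO or a network request;
    # it adds to elapsed time but consumes almost no CPU time.
    time.sleep(0.2)
    return "done"


result, elapsed, cpu = timed_call(fake_io_bound_task)
print(f"elapsed={elapsed:.3f}s process={cpu:.3f}s")
```

If a minibatch shows a large gap like this, the time is being spent waiting rather than computing, so raising the timeout or reducing the waiting is more likely to help than adding CPU.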
240240
## How do I log from my user script from a remote context?
241241
