articles/machine-learning/v1/how-to-debug-parallel-run-step.md (5 additions & 5 deletions)
@@ -208,20 +208,20 @@ You can also view the results of periodical checks of the resource usage for eac
## Common job failure reasons

### SystemExit: 42
-Exit 41 and 42 are PRS designed exit code. Worker nodes exit with 41 to notify compute manager that it terminated independently. A leader node may exit with 0 or 41 which indicates the job result. Exit 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow previous section to debug your job.
+Exit codes 41 and 42 are designed by PRS. Worker nodes exit with 41 to notify the compute manager that they terminated independently; this is expected. A leader node may exit with 0 or 42, which indicates the job result. Exit 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.
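The exit codes described above can be summarized as a small lookup. This is only a sketch: `describe_exit` and its `role` argument are hypothetical names used for illustration and are not part of PRS, and treating a leader exit of 0 as success is an inference from the text rather than a documented guarantee.

```python
def describe_exit(role: str, code: int) -> str:
    """Map a node role and exit code to the meaning described in the text.

    `role` is a hypothetical label ("worker" or "leader") used only here.
    """
    if role == "worker" and code == 41:
        # Expected: the worker tells the compute manager it terminated on its own.
        return "expected: worker terminated independently"
    if role == "leader" and code == 0:
        # Assumption: a leader exit of 0 means the job did not fail.
        return "job succeeded"
    if code == 42:
        return "job failed: see ~/logs/job_result.txt"
    return "no PRS-specific meaning described here: check the node logs"


print(describe_exit("worker", 41))
print(describe_exit("leader", 42))
```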

### Data Permission
This error indicates that the compute cannot access the input data. If identity-based access is used for your compute cluster and storage, refer to [Identity-based data authentication](../how-to-administrate-data-authentication.md).

-### Processes terminated unexpectly
+### Processes terminated unexpectedly
Processes may crash due to unexpected or unhandled exceptions, or the system may kill processes due to Out of Memory conditions. In the PRS system logs at `~/logs/sys/node/<node-id>/_main.txt`, errors like the one below can be found.

```
<process-name> exits with returncode -9.
```
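To find such lines quickly, you could scan the log text for the message shown above. The following is a minimal sketch; `explain_exit_lines` is a hypothetical helper, and it assumes the return code follows the convention of Python's `subprocess` module, where a negative value `-N` means the process was terminated by signal `N` (so `-9` would correspond to `SIGKILL` on POSIX systems).

```python
import re
import signal

# Matches the message format shown above; the process name is assumed
# to contain no spaces.
EXIT_LINE = re.compile(r"(?P<name>\S+) exits with returncode (?P<code>-?\d+)")


def explain_exit_lines(log_text: str) -> list[tuple[str, int, str]]:
    """Return (process_name, returncode, explanation) for each matching line."""
    results = []
    for line in log_text.splitlines():
        match = EXIT_LINE.search(line)
        if not match:
            continue
        code = int(match.group("code"))
        if code < 0:
            # Assumption: negative codes follow the subprocess convention.
            try:
                reason = f"killed by {signal.Signals(-code).name}"
            except ValueError:
                reason = f"killed by signal {-code}"
        elif code == 0:
            reason = "exited normally"
        else:
            reason = "exited with an error"
        results.append((match.group("name"), code, reason))
    return results


sample = "task_processor exits with returncode -9.\nsome unrelated line"
for name, code, reason in explain_exit_lines(sample):
    print(name, code, reason)
```

In practice you would pass in the contents of `_main.txt` instead of the sample string.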

#### Out of Memory
-`~/logs/perf` logs computation resource comsuptions of each processes. The memory usage of each task processor can be found. You can estimate the total memory usage on the node.
+`~/logs/perf` logs the computation resource consumption of processes. The memory usage of each task processor can be found there, so you can estimate the total memory usage on the node.

An Out of Memory error can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.

@@ -233,9 +233,9 @@ In some cases, the python processes cannot catch the failing stack. You can add
### Minibatch Timeout
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the run() functions take more time than expected, here are some tips.

-- Check the elapsed time and process time of the minibatch. The process time measures CPU time of the process. When process time is significantly shorter than elapsed, you can check if there are some heavy IO operations or network requests in the tasks. Long latency of those operations are the common reason of minibatch timeout.
+- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is significantly shorter than the elapsed time, check whether the tasks perform heavy IO operations or network requests. Long latency of those operations is a common reason for minibatch timeout.

-- Some specific minibatches take longer time than others. You can either updated the configuration, or try work with input data to balance the minibatch processing time.
+- Some specific minibatches take longer than others. You can either update the configuration or try to work with the input data to balance the minibatch processing time.
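To compare elapsed time with process time inside your own script, you could wrap the work in a small timer. This is a sketch under assumptions: `timed_call` and `fake_minibatch` are hypothetical names, and it assumes that elapsed time and process time correspond to wall-clock time and CPU time as reported by Python's `time.perf_counter` and `time.process_time`.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of the current process
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    return result, elapsed_seconds, cpu_seconds


def fake_minibatch(items):
    """Hypothetical stand-in for run(): mostly waits, as an IO-bound task would."""
    time.sleep(0.2)  # simulates waiting on IO or a network request
    return [item.upper() for item in items]


result, elapsed, cpu = timed_call(fake_minibatch, ["a", "b"])
print(f"elapsed={elapsed:.3f}s cpu={cpu:.3f}s")
```

When the CPU time is much smaller than the elapsed time, as in this simulated case, the task is mostly waiting rather than computing, which points toward IO or network latency.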

## How do I log from my user script from a remote context?