articles/machine-learning/v1/how-to-debug-parallel-run-step.md
9 additions & 12 deletions
@@ -214,31 +214,28 @@ This is PRS designed exit code. The failure reason can be found in `~/logs/job_r
This job error indicates that the compute cannot access the input data. TODO: links to access data with identity doc

### Processes terminated unexpectedly
-Processes may crash because of unexpected or unhandled exceptions, or the system may kill them when the node runs out of memory.
+Processes may crash because of unexpected or unhandled exceptions, or the system may kill them when the node runs out of memory. In the PRS system logs at `~/logs/sys/node/<node-id>/_main.txt`, you can find errors like the following.
+
+```
+<process-name> exits with returncode -9.
+```
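To make such a log line easier to interpret, here is a minimal sketch that parses it and describes the return code. It assumes PRS reports return codes using the same convention as Python's `subprocess` module, where a negative value `-N` means the process was terminated by signal `N`; under that assumption, `-9` corresponds to `SIGKILL`, which is the signal the Linux out-of-memory killer sends. The helper names `parse_exit_line` and `describe_returncode` are hypothetical and not part of PRS.

```python
import re
import signal

# Matches lines shaped like "<process-name> exits with returncode -9."
_EXIT_LINE = re.compile(r"^(?P<name>.+?) exits with returncode (?P<code>-?\d+)\.?\s*$")


def parse_exit_line(line):
    """Return (process_name, returncode), or None if the line does not match."""
    match = _EXIT_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("name"), int(match.group("code"))


def describe_returncode(code):
    """Describe a return code, assuming the subprocess convention for negative values."""
    if code >= 0:
        return f"exited with status {code}"
    try:
        # A negative code -N is read as "terminated by signal N".
        name = signal.Signals(-code).name
    except ValueError:
        # Unknown signal number on this platform.
        name = f"signal {-code}"
    return f"terminated by {name}"
```

You could run `parse_exit_line` over each line of `_main.txt` and pass the resulting code to `describe_returncode` to see at a glance which processes were killed by a signal.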

#### Out of Memory
`~/logs/perf` logs the computing resource consumption of each process. You can find the memory usage of each task processor there and use it to estimate the total memory usage on the node.

You can find Out of Memory errors in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.

-We suggest reducing the number of processes per node if compute resource usage is close to the limits.
+We suggest reducing the number of processes per node or upgrading the VM size if compute resource usage is close to the limits.
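As a rough way to reason about this trade-off, the sketch below estimates how many processes fit on a node from the per-process memory figures you read in the perf logs. The function name, its parameters, and the default 2 GB reserved for the system are all assumptions for illustration; they are not PRS settings.

```python
def max_processes_per_node(node_memory_gb, peak_memory_per_process_gb, reserved_gb=2.0):
    """Estimate how many worker processes fit in a node's memory.

    All arguments are hypothetical inputs: take node_memory_gb from your
    VM size and peak_memory_per_process_gb from the perf logs.
    """
    if peak_memory_per_process_gb <= 0:
        raise ValueError("peak_memory_per_process_gb must be positive")
    # Keep some memory back for the operating system and the PRS agents.
    usable_gb = node_memory_gb - reserved_gb
    if usable_gb <= 0:
        return 0
    return int(usable_gb // peak_memory_per_process_gb)
```

If the result is lower than the number of processes per node you currently configure, either lower that setting or choose a VM size with more memory.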

#### Unhandled Exceptions
-In some cases, the Python process cannot capture the stack trace of the failure. In the PRS system logs at `~/logs/sys/node/<node-id>/_main.txt`, you can find errors like the following.
-
-```
-<process-name> exits with returncode -9.
-```
-
-You can add the environment variable `env["PYTHONFAULTHANDLER"]="true"` to enable the built-in faulthandler.
+In some cases, the Python process cannot capture the stack trace of the failure. You can add the environment variable `env["PYTHONFAULTHANDLER"]="true"` to enable Python's built-in faulthandler.
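Setting `PYTHONFAULTHANDLER` to a non-empty value makes Python call `faulthandler.enable()` at startup. If you prefer to enable it from the script itself, for example early in your entry script, a minimal sketch could look like the following. The helper name `enable_fault_handler` and the `log_path` argument are hypothetical, and writing to a dedicated file rather than stderr is a choice made here for illustration.

```python
import faulthandler


def enable_fault_handler(log_path):
    """Enable faulthandler and write fatal-error tracebacks to log_path.

    Returns the open file object. Keep a reference to it for the lifetime
    of the process, because faulthandler writes to its file descriptor.
    """
    log_file = open(log_path, "a")
    # Dump the tracebacks of all threads on fatal signals such as SIGSEGV.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Note that faulthandler reports fatal signals such as `SIGSEGV` or `SIGABRT`; a process killed with `SIGKILL` cannot write a traceback.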

### Minibatch Timeout
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the `run()` function takes much more time than expected, here are some tips.

-- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks perform heavy IO or network operations.
-
-- If the timeout happens only on a few specific minibatches, check your task input to see whether those minibatches need more time to process, and try to balance your input data partitions to prevent this issue.
+- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks perform heavy IO operations or network requests. Long latency in those operations is a common cause of minibatch timeouts.

+- Some specific minibatches take longer than others. You can either update the configuration or rework the input data to balance the minibatch processing time.
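To compare elapsed time with process time inside your own script, a small sketch using only the standard library is shown below. The wrapper name `timed_call` is hypothetical; `time.perf_counter()` measures wall-clock time and `time.process_time()` measures the CPU time of the current process.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    return result, elapsed_seconds, cpu_seconds
```

Wrapping the body of `run()` with such a helper and logging both numbers shows whether a slow minibatch is busy computing or mostly waiting on IO or network calls.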

## How do I log from my user script from a remote context?