articles/machine-learning/v1/how-to-debug-parallel-run-step.md
+27 -4 (27 additions & 4 deletions)
@@ -205,16 +205,39 @@ You can also view the results of periodical checks of the resource usage for eac
205
205
-`node_resource_usage.csv`: Resource usage overview of the node.
206
206
-`processes_resource_usage.csv`: Resource usage overview of each process.
207
207
208
-
## Common job failure reason
208
+
## Common job failure reasons
209
209
210
210
### SystemExit: 42
211
211
This exit code is designed by PRS. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.
212
212
213
213
### Data Permission
214
-
Error of job indicates the compute cannot access input data. TODO: links to access data
214
+
The job error indicates that the compute cannot access the input data. TODO: links to access data with identity doc
215
215
216
-
### Out of Memory
217
-
`~logs/perf` logs usage of computation resources. Job monitor tab has chart to show the status of compute nodes. We suggest to reduce the number of processes per node if the compute reso
216
+
### Processes terminated unexpectedly
217
+
Processes may crash due to unexpected or unhandled exceptions, or they may be killed by the system because of Out of Memory errors.
218
+
219
+
#### Out of Memory
220
+
`~/logs/perf` logs the computation resource consumption of each process. It includes the memory usage of each task processor, from which you can estimate the total memory usage on the node.
221
+
222
+
An Out of Memory error can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.
223
+
224
+
We suggest reducing the number of processes per node if the compute resource usage is close to the limits.
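As a rough illustration of that suggestion, the sketch below estimates how many processes fit on a node from a per-process peak memory figure read out of the perf logs. The helper name `max_processes_per_node`, its parameters, and the 20% headroom default are hypothetical and chosen only for this example; they are not part of PRS.

```python
import math


def max_processes_per_node(node_memory_gb, peak_memory_per_process_gb,
                           headroom_fraction=0.2):
    """Estimate how many worker processes fit in a node's memory.

    Hypothetical helper: all inputs are numbers you read from the perf
    logs yourself. headroom_fraction reserves memory for the OS and
    other services (the 0.2 default is an assumption, not a PRS value).
    """
    if peak_memory_per_process_gb <= 0:
        raise ValueError("peak_memory_per_process_gb must be positive")
    if not 0 <= headroom_fraction < 1:
        raise ValueError("headroom_fraction must be in [0, 1)")
    usable_gb = node_memory_gb * (1.0 - headroom_fraction)
    # Always allow at least one process, even if it may not fit comfortably.
    return max(1, math.floor(usable_gb / peak_memory_per_process_gb))


if __name__ == "__main__":
    # Example: a 16 GB node where each process peaks at about 3 GB.
    print(max_processes_per_node(16, 3))
```

If the estimate is lower than the number of processes currently configured per node, lowering that setting is a reasonable first step before moving to a larger VM size.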
225
+
226
+
#### Unhandled Exceptions
227
+
In some cases, the Python processes cannot catch the failing stack. In the PRS system logs `~/logs/sys/node/<node-id>/_main.txt`, errors like the one below can be found.
228
+
229
+
```
230
+
<process-name> exits with returncode -9.
231
+
```
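To help read such a return code: in Python's `subprocess` convention on POSIX, a negative return code `-N` means the child was terminated by signal `N`. Assuming the PRS log follows that convention (an assumption, not confirmed here), a small sketch for decoding the value could look like this. The helper name `describe_returncode` is hypothetical.

```python
import signal


def describe_returncode(returncode):
    """Turn a subprocess-style return code into a readable description.

    Hypothetical helper; assumes the POSIX subprocess convention where
    a negative value -N means "terminated by signal N".
    """
    if returncode >= 0:
        return f"exited with status {returncode}"
    signum = -returncode
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not known on this platform.
        name = "unknown signal"
    return f"terminated by signal {signum} ({name})"


if __name__ == "__main__":
    print(describe_returncode(-9))
```

On Linux, signal 9 is `SIGKILL`, which cannot be caught by the process. This is consistent with a process being killed from outside, for example by the system under memory pressure.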
232
+
233
+
You can add an environment variable `env["PYTHONFAULTHANDLER"]="true"` to enable the built-in faulthandler.
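To check locally what this variable does, the following sketch starts a child Python interpreter with `PYTHONFAULTHANDLER` set and asks it whether `faulthandler` is enabled. It uses only the standard library. The helper name `faulthandler_enabled_in_child` is hypothetical, and how `env` is attached to your job depends on your own setup and is not shown here.

```python
import os
import subprocess
import sys


def faulthandler_enabled_in_child(extra_env):
    """Start a child interpreter with extra environment variables and
    report whether faulthandler is enabled inside it."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c",
         "import faulthandler; print(faulthandler.is_enabled())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


if __name__ == "__main__":
    # Any non-empty value enables faulthandler at interpreter startup.
    print(faulthandler_enabled_in_child({"PYTHONFAULTHANDLER": "true"}))
```

With faulthandler enabled, Python dumps the traceback on fatal signals such as `SIGSEGV`, which can reveal the failing stack. A `SIGKILL` cannot be intercepted, so a process killed that way still leaves no traceback.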
234
+
235
+
### Minibatch Timeout
236
+
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the run() functions take much more time than expected, here are some tips.
237
+
238
+
- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks contain heavy IO or network operations.
239
+
240
+
- If timeouts happen only on several specific minibatches, check your task input to see whether those minibatches need more time to process. Also try to balance your input data partitions to prevent this issue.
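The difference between elapsed time and process time can also be reproduced outside PRS. The sketch below is a minimal illustration using the standard library. The wrapper name `timed_call` is hypothetical, and `time.sleep` stands in for a task that waits on IO.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds).

    elapsed_seconds is wall-clock time; cpu_seconds is the CPU time
    consumed by the current process (all threads) during the call.
    """
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - start_cpu
    elapsed_seconds = time.perf_counter() - start_wall
    return result, elapsed_seconds, cpu_seconds


if __name__ == "__main__":
    # Sleeping consumes wall-clock time but almost no CPU time,
    # much like a task blocked on disk or network.
    _, elapsed, cpu = timed_call(time.sleep, 0.2)
    print(f"elapsed={elapsed:.3f}s cpu={cpu:.3f}s")
```

Wrapping the body of your own `run()` function in a similar measurement can show whether a slow minibatch is CPU-bound or mostly waiting.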
218
241
219
242
220
243
## How do I log from my user script from a remote context?