Commit 3d73470

add common failure reason
1 parent 155ff4e commit 3d73470

File tree

1 file changed: +27 -4 lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 27 additions & 4 deletions
@@ -205,16 +205,39 @@ You can also view the results of periodical checks of the resource usage for eac
205205
- `node_resource_usage.csv`: Resource usage overview of the node.
206206
- `processes_resource_usage.csv`: Resource usage overview of each process.
207207

208-
## Common job failure reason
208+
## Common job failure reasons
209209

210210
### SystemExit: 42
211211
This is an exit code designed by PRS. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.
212212

213213
### Data Permission
214-
Error of job indicates the compute cannot access input data. TODO: links to access data
214+
This job error indicates that the compute cannot access the input data. TODO: links to access data with identity doc
215215

216-
### Out of Memory
217-
`~logs/perf` logs usage of computation resources. Job monitor tab has chart to show the status of compute nodes. We suggest to reduce the number of processes per node if the compute reso
216+
### Processes terminated unexpectedly
217+
Processes may crash due to unexpected or unhandled exceptions, or they may be killed by the system when the node runs out of memory.
218+
219+
#### Out of Memory
220+
`~/logs/perf` logs the compute resource consumption of each process. It includes the memory usage of each task processor, from which you can estimate the total memory usage on the node.
221+
222+
Out of Memory errors can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.
223+
224+
We suggest reducing the number of processes per node if the compute resource usage is close to the limits.
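As a rough illustration of that estimate, the sketch below computes how many processes fit on a node for a given peak memory per process. The function name, the headroom fraction, and the sample numbers are hypothetical and are not part of PRS; the per-process peak would come from your own reading of the files under `~/logs/perf`.

```python
import math


def max_processes_per_node(node_memory_gb: float,
                           per_process_peak_gb: float,
                           headroom: float = 0.25) -> int:
    """Estimate how many processes fit on one node.

    headroom is the fraction of node memory kept free for the OS and
    other services (a hypothetical default, tune it for your compute).
    A result of 0 means even a single process does not fit.
    """
    if per_process_peak_gb <= 0:
        raise ValueError("per_process_peak_gb must be positive")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    usable_gb = node_memory_gb * (1 - headroom)
    return math.floor(usable_gb / per_process_peak_gb)


# Example: a 16 GB node where each process peaks at about 4 GB.
print(max_processes_per_node(16, 4))
```

If the number of processes currently configured is larger than this estimate, the node is likely to hit the memory limit under load.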
225+
226+
#### Unhandled Exceptions
227+
In some cases, the Python process cannot capture the failing stack. In the PRS system log `~/logs/sys/node/<node-id>/_main.txt`, errors like the following can be found.
228+
229+
```
230+
<process-name> exits with returncode -9.
231+
```
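A negative return code can be decoded with a small helper. This is a minimal sketch that assumes the logged value follows the POSIX convention used by Python's `subprocess` module, where `-N` means the process was terminated by signal `N`; whether PRS reports return codes this way is an assumption, and `describe_returncode` is a hypothetical helper name.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if returncode < 0:
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"terminated by {name}"
    if returncode == 0:
        return "exited normally"
    return f"exited with status {returncode}"


print(describe_returncode(-9))
```

Under that assumption, `-9` corresponds to `SIGKILL`, which is the signal the Linux kernel typically sends when it kills a process for running out of memory.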
232+
233+
You can set an environment variable `env["PYTHONFAULTHANDLER"]="true"` to enable the built-in `faulthandler` module.
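The same handler can also be enabled from code, for example at the top of an entry script. The sketch below uses only the standard library; the log file location is a hypothetical choice for illustration. Note that `faulthandler` dumps tracebacks on signals such as `SIGSEGV` or `SIGABRT`, but `SIGKILL` cannot be caught by any process, so a process killed that way leaves no traceback.

```python
import faulthandler
import os
import tempfile

# Hypothetical location for the crash traceback file.
log_path = os.path.join(tempfile.gettempdir(), "faulthandler_demo.log")

# The file object must stay open for as long as faulthandler is enabled.
fault_log = open(log_path, "w")

# Dump the tracebacks of all threads to the file on a fatal signal.
faulthandler.enable(file=fault_log, all_threads=True)

print(faulthandler.is_enabled())
```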
234+
235+
### Minibatch Timeout
236+
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the `run()` function takes much more time than expected, here are some tips.
237+
238+
- Check the elapsed time and the process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks perform heavy IO or network operations.
239+
240+
- If the timeout happens only on a few specific minibatches, check your task input to see whether those minibatches need more time to process. Try to balance your input data partitions to prevent this issue.
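To see the difference between elapsed time and process time locally, you can wrap a call with the standard-library timers. This is a minimal sketch, not how PRS measures these values; `timed_call` and `io_bound_task` are hypothetical names, and the sleep stands in for a slow IO or network operation.

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds)."""
    start_wall = time.perf_counter()   # wall-clock timer
    start_cpu = time.process_time()    # CPU time of the current process
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start_wall
    cpu = time.process_time() - start_cpu
    return result, elapsed, cpu


def io_bound_task():
    # Sleeping consumes wall-clock time but almost no CPU time,
    # similar to waiting on a slow disk or network call.
    time.sleep(0.2)
    return "done"


result, elapsed, cpu = timed_call(io_bound_task)
print(f"elapsed={elapsed:.3f}s process_time={cpu:.3f}s")
```

A large gap between the two values, as in this example, suggests the task is waiting rather than computing.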
218241

219242

220243
## How do I log from my user script from a remote context?
