Commit 1444271

committed
refined process termination
1 parent 3d73470 commit 1444271

File tree

1 file changed

+9
-12
lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 9 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -214,31 +214,28 @@ This is PRS designed exit code. The failure reason can be found in `~/logs/job_r
214214
This job error indicates that the compute cannot access the input data. TODO: links to access data with identity doc
215215

216216
### Processes terminated unexpectedly
217-
Processes may crash due to unexpected or unhandled exceptions, or they may be killed by the system due to Out of Memory errors.
217+
Processes may crash due to unexpected or unhandled exceptions, or they may be killed by the system due to Out of Memory errors. In the PRS system logs `~/logs/sys/node/<node-id>/_main.txt`, errors like the one below can be found.
218+
219+
```
220+
<process-name> exits with returncode -9.
221+
```
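As a rough aid for reading that log line: if the return code follows the Python `subprocess` convention on POSIX (an assumption, since the diff does not say how PRS produces it), a negative value `-N` means the process was terminated by signal `N`. Signal 9 is SIGKILL on Linux, which is also the signal the kernel's out-of-memory killer uses. The helper name `describe_returncode` below is hypothetical.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a return code using the subprocess convention on POSIX."""
    if returncode == 0:
        return "exited normally"
    if returncode > 0:
        return f"exited with status {returncode}"
    # A negative value -N means the process was terminated by signal N.
    try:
        name = signal.Signals(-returncode).name
    except ValueError:
        # The signal number is not defined on this platform.
        name = f"signal {-returncode}"
    return f"killed by {name}"


print(describe_returncode(-9))
```

A SIGKILL cannot be caught by the process, so no Python traceback is written; the cause has to be found in the system logs instead.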
218222

219223
#### Out of Memory
220224
`~/logs/perf` logs the compute resource consumption of each process. The memory usage of each task processor can be found there, so you can estimate the total memory usage on the node.
221225

222226
Out of Memory errors can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.
223227

224-
We suggest reducing the number of processes per node if the compute resources are close to the limits.
228+
We suggest reducing the number of processes per node or upgrading the VM size if the compute resources are close to the limits.
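One way to turn the per-process memory figures into a process count is a simple budget calculation. This is only a sketch: the function name, the reserved headroom, and the example numbers are hypothetical, and the peak per-process usage would have to be read manually from the perf logs, whose format is not shown here.

```python
def max_processes_per_node(node_memory_gb: float,
                           per_process_peak_gb: float,
                           reserved_gb: float = 2.0) -> int:
    """Estimate how many task processes fit in a node's memory.

    reserved_gb is an assumed headroom for the OS and agent processes.
    """
    if per_process_peak_gb <= 0:
        raise ValueError("per_process_peak_gb must be positive")
    usable_gb = node_memory_gb - reserved_gb
    # Floor division: a partially fitting process would still exceed the budget.
    return max(int(usable_gb // per_process_peak_gb), 0)


# Hypothetical numbers: a 28 GB node where each process peaks at 3.5 GB.
print(max_processes_per_node(28, 3.5))
```

If the estimate is lower than the configured number of processes per node, either lower that setting or pick a VM size with more memory.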
225229

226230
#### Unhandled Exceptions
227-
In some cases the Python processes cannot capture the stack trace of the failure. In the PRS system logs `~/logs/sys/node/<node-id>/_main.txt`, errors like the one below can be found.
228-
229-
```
230-
<process-name> exits with returncode -9.
231-
```
232-
233-
You can set the environment variable ```env["PYTHONFAULTHANDLER"]="true"``` to enable the built-in faulthandler.
231+
In some cases the Python processes cannot capture the stack trace of the failure. You can set the environment variable ```env["PYTHONFAULTHANDLER"]="true"``` to enable Python's built-in faulthandler.
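The effect of that variable can be checked locally. The sketch below starts a child interpreter with `PYTHONFAULTHANDLER` set and asks it whether `faulthandler` is active. Python enables the handler at startup when the variable is any non-empty string. The helper name `child_faulthandler_enabled` is hypothetical, and how the `env` object is passed to PRS is not shown in this diff.

```python
import os
import subprocess
import sys


def child_faulthandler_enabled(extra_env: dict) -> bool:
    """Start a child interpreter with extra_env and report whether
    faulthandler is enabled inside it."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, "-c",
         "import faulthandler; print(faulthandler.is_enabled())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip() == "True"


print(child_faulthandler_enabled({"PYTHONFAULTHANDLER": "true"}))
```

Calling `faulthandler.enable()` at the top of the entry script is an alternative when the environment cannot be changed. With the handler on, Python dumps the tracebacks to stderr on fatal signals such as SIGSEGV. It cannot help with SIGKILL, which a process cannot intercept.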
234232

235233
### Minibatch Timeout
236234
You can adjust the `run_invocation_timeout` argument according to your minibatch tasks. If the `run()` function takes much more time than expected, here are some tips.
237235

238-
- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks contain heavy IO or network operations.
239-
240-
- If the timeout happens only on a few specific minibatches, check your task input to see whether those minibatches need more time to process, and try to balance your input data partitions to prevent this issue.
236+
- Check the elapsed time and process time of the minibatch. The process time measures the CPU time of the process. When the process time is much shorter than the elapsed time, check whether the tasks contain heavy IO operations or network requests. Long latency in those operations is a common cause of minibatch timeouts.
241237

238+
- Some specific minibatches take longer than others. You can either update the configuration or rework the input data to balance the minibatch processing time.
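The comparison between elapsed time and process time in the first tip can be reproduced inside a user script with the standard `time` module. This is a minimal sketch, not the way PRS measures these values: `timed` and `fake_io_bound_run` are hypothetical names, and the `sleep` call stands in for a slow network or disk operation.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time of this process, excludes sleep
    result = func(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    elapsed_seconds = time.perf_counter() - wall_start
    return result, elapsed_seconds, cpu_seconds


def fake_io_bound_run(mini_batch):
    time.sleep(0.2)  # stands in for a slow network or disk call
    return [len(item) for item in mini_batch]


result, elapsed, cpu = timed(fake_io_bound_run, ["a.csv", "bb.csv"])
print(f"elapsed={elapsed:.2f}s cpu={cpu:.2f}s")
```

A large gap between the two numbers suggests the minibatch is waiting on IO rather than computing, so raising `run_invocation_timeout` alone may only hide the latency.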
242239

243240
## How do I log from my user script from a remote context?
244241
