Commit 2fb87d6

Merge pull request #2393 from lusu-msft/lusu/prs-debug
Update PRS debug guide
2 parents fba25f4 + 0e5a501

2 files changed: +44, -10 lines

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 44 additions & 10 deletions
@@ -160,14 +160,18 @@ parallelrun_step = ParallelRunStep(
The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md). The information in that section also applies to a ParallelRunStep.

Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, a few consolidated files are created that provide high-level information (see the download sketch after this list):

- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job fails, it shows the error message and where to start troubleshooting.
- `~/logs/job_result.txt`: This file shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

- `~/logs/job_error.txt`: This file summarizes the errors in your script.

- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job, including task creation, progress monitoring, and the run result.

- `~/logs/sys/job_report/processed_mini-batches.csv`: A table of all mini-batches that were processed. It shows the result of each mini-batch run, the executing agent node ID and process name, the elapsed time, and any error messages. The logs for each mini-batch run can be found by following the node ID and process name.

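If you prefer to inspect these consolidated files outside the portal, they can be downloaded from the step run with the SDK v1 `Run` API. The following is a minimal sketch only: the experiment name, run ID, step name, and the assumption that the artifact paths mirror the portal's `logs/` folder are all placeholders to adapt to your own pipeline.

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import PipelineRun

# Hypothetical experiment name, run ID, and step name -- replace with your own.
ws = Workspace.from_config()
pipeline_run = PipelineRun(Experiment(ws, "batch-scoring"), "<pipeline-run-id>")
step_run = pipeline_run.find_step_run("parallelrunstep")[0]

# Download the consolidated files listed above; the artifact paths are assumed
# to mirror the ~/logs folder shown in the portal.
for name in ["logs/job_progress_overview.txt", "logs/job_result.txt", "logs/job_error.txt"]:
    if name in step_run.get_file_names():
        step_run.download_file(name, output_file_path="./prs_logs/" + name.split("/")[-1])
```
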
Logs generated from the entry script using the EntryScript helper and print statements can be found in the following files:

- `~/logs/user/entry_script_log/<node_id>/<process_name>.log.txt`: These files are the logs written from the entry script using the EntryScript helper.
@@ -176,15 +180,13 @@ Logs generated from entry script using EntryScript helper and print statements c
- `~/logs/user/stderr/<node_id>/<process_name>.stderr.txt`: These files are the logs from the stderr of the entry script.

For example, the following screenshot shows that mini-batch 0 failed on node 0, process001. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/0/process001.log.txt`, `~/logs/user/stdout/0/process001.stdout.txt`, and `~/logs/user/stderr/0/process001.stderr.txt`.

![Screenshot of a sample processed_mini-batches.csv file.](media/how-to-debug-parallel-run-step/processed-mini-batches-csv-screenshot.png)
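To do that lookup without opening the CSV by hand, a short script can filter the failed rows and print the matching log paths. This is a sketch only: the column names (`minibatch_id`, `result`, `node_id`, `process_name`) are assumptions based on the description above, so check them against the actual header of your `processed_mini-batches.csv`.

```python
import pandas as pd

# Assumed column names -- verify against the real CSV header before use.
df = pd.read_csv("processed_mini-batches.csv")
failed = df[df["result"].str.lower() != "succeeded"]

for _, row in failed.iterrows():
    node, proc = row["node_id"], row["process_name"]
    print(f"mini-batch {row['minibatch_id']} failed on node {node}, {proc}")
    print(f"  entry script log: ~/logs/user/entry_script_log/{node}/{proc}.log.txt")
    print(f"  stdout:           ~/logs/user/stdout/{node}/{proc}.stdout.txt")
    print(f"  stderr:           ~/logs/user/stderr/{node}/{proc}.stderr.txt")
```
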

When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `~/logs/sys/node` folder, grouped by worker nodes:

- `~/logs/sys/node/<node_id>/<process_name>.txt`: This file provides detailed info about each mini-batch that was picked up or completed by a worker. For each mini-batch, this file includes:

@@ -203,6 +205,38 @@ You can also view the results of periodical checks of the resource usage for eac
- `node_resource_usage.csv`: Resource usage overview of the node.
- `processes_resource_usage.csv`: Resource usage overview of each process.

## Common job failure reasons

### SystemExit: 42

Exit codes 41 and 42 are PRS-designed exit codes. Worker nodes exit with 41 to notify the compute manager that they terminated independently; this is expected. A leader node may exit with 0 or 42, which indicates the job result. Exit 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.

### Data Permission

This job error indicates that the compute cannot access the input data. If identity-based access is used for your compute cluster and storage, you can refer to [Identity-based data authentication](../how-to-administrate-data-authentication.md).

### Processes terminated unexpectedly

Processes may crash because of unexpected or unhandled exceptions, or the system may kill processes because of out-of-memory conditions. In the PRS system logs `~/logs/sys/node/<node-id>/_main.txt`, you can find errors like the following:

```
<process-name> exits with returncode -9.
```

#### Out of Memory

`~/logs/perf` records the compute resource consumption of the processes. The memory usage of each task processor can be found there, so you can estimate the total memory usage on the node.

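As a rough way to do that estimate offline, you can aggregate the per-process samples. The file location and the column names (`node_id`, `timestamp`, `memory_mb`) below are assumptions rather than a documented schema; check the actual header of `processes_resource_usage.csv` under `~/logs/perf` and adjust.

```python
import pandas as pd

# Assumed columns: node_id, timestamp, memory_mb -- rename to match the real file.
usage = pd.read_csv("processes_resource_usage.csv")

# Sum memory across processes at each sampling time, then take the peak per node.
total_per_sample = usage.groupby(["node_id", "timestamp"])["memory_mb"].sum()
print("Estimated peak total memory per node (MB):")
print(total_per_sample.groupby(level="node_id").max())
```
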
The out-of-memory error can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.

We suggest reducing the number of processes per node or upgrading the VM size if the compute resources are close to the limits.

#### Unhandled Exceptions

In some cases, the Python processes cannot catch the failing stack. You can add an environment variable, `PYTHONFAULTHANDLER="true"`, to enable the Python built-in fault handler.

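One way to set it, assuming you define the step's environment with the SDK v1 `Environment` class (the environment name here is a placeholder):

```python
from azureml.core import Environment

# Placeholder name; use the environment your ParallelRunConfig already references.
env = Environment(name="prs-batch-env")

# Enable Python's built-in fault handler so a crashing process dumps a
# traceback to stderr before it exits.
env.environment_variables["PYTHONFAULTHANDLER"] = "true"
```
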
### Minibatch Timeout

You can adjust the `run_invocation_timeout` argument according to your mini-batch tasks (a configuration sketch follows this list). When you see the `run()` function take more time than expected, here are some tips:

- Check the elapsed time and process time of the mini-batch. The process time measures the CPU time of the process. When the process time is significantly shorter than the elapsed time, check whether there are heavy I/O operations or network requests in the tasks. Long latency of those operations is the most common reason for a mini-batch timeout.

- Some specific mini-batches take longer than others. You can either update the configuration or try to rework the input data to balance the mini-batch processing time.

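For reference, a minimal sketch of raising the timeout on the step configuration. Everything except `run_invocation_timeout` (which is in seconds) is a placeholder, and `env` and `compute_target` are assumed to be defined elsewhere in your pipeline code.

```python
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",          # placeholder folder containing score.py
    entry_script="score.py",             # placeholder entry script
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=env,                     # environment defined elsewhere
    compute_target=compute_target,       # compute target defined elsewhere
    node_count=2,
    process_count_per_node=2,
    run_invocation_timeout=600,          # allow up to 10 minutes per run() call
)
```
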
## How do I log from my user script from a remote context?

ParallelRunStep may run multiple processes on one node based on `process_count_per_node`. To organize the logs from each process on a node and combine print and log statements, we recommend using the ParallelRunStep logger as shown below. You get a logger from EntryScript and make the logs show up in the **logs/user** folder in the portal.

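The code the paragraph refers to isn't part of this diff, so here is a minimal sketch of the pattern, assuming the `EntryScript` helper from `azureml_user.parallel_run` that is available inside a ParallelRunStep run:

```python
from azureml_user.parallel_run import EntryScript


def init():
    """Runs once per worker process; get the per-process logger here."""
    global logger
    logger = EntryScript().logger
    logger.info("init() finished")


def run(mini_batch):
    """Called per mini-batch; these entries show up under logs/user in the portal."""
    logger.info("run() received %d items", len(mini_batch))
    # ... score the mini-batch here ...
    return mini_batch
```
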
@@ -367,11 +401,11 @@ This section is about how to check the progress of a ParallelRunStep job and che
Besides looking at the overall status of the StepRun, the count of scheduled/processed mini-batches and the progress of generating output can be viewed in `~/logs/job_progress_overview.<timestamp>.txt`. The file rotates on a daily basis. You can check the one with the largest timestamp for the latest information.

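As a quick way to pick the newest file from a locally downloaded `logs` folder, here is a sketch that assumes the timestamps in the file names sort lexicographically:

```python
from glob import glob

overview_files = sorted(glob("logs/job_progress_overview*.txt"))
latest = overview_files[-1]  # assumes lexicographic order matches chronological order
print(latest)
with open(latest) as f:
    print(f.read())
```
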

### What should I check if there is no progress for a while?

You can go into `~/logs/sys/error` to see if there's any exception. If there is none, it's likely that your entry script is taking a long time. You can print progress information in your code to locate the time-consuming part, or add `"--profiling_module", "cProfile"` to the `arguments` of `ParallelRunStep` to generate a profile file named `<process_name>.profile` under the `~/logs/sys/node/<node_id>` folder, as in the sketch below.

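A sketch of passing that flag; the step name, inputs, output, and `parallel_run_config` are placeholders from your own pipeline definition.

```python
from azureml.pipeline.steps import ParallelRunStep

parallelrun_step = ParallelRunStep(
    name="batch-scoring",                               # placeholder step name
    parallel_run_config=parallel_run_config,            # config defined elsewhere
    inputs=[input_dataset.as_named_input("input_ds")],  # dataset defined elsewhere
    output=output_dir,                                  # output defined elsewhere
    arguments=["--profiling_module", "cProfile"],       # writes <process_name>.profile
    allow_reuse=False,
)
```
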

### When will a job stop?

If not canceled, the job may stop with one of these statuses:

- Completed. All mini-batches are processed successfully and the output is generated for `append_row` mode.
- Failed. If `error_threshold` in [`Parameters for ParallelRunConfig`](#parameters-for-parallelrunconfig) is exceeded, or a system error occurs during the job.

### Where to find the root cause of failure?
@@ -381,7 +415,7 @@ You can follow the lead in `~/logs/job_result.txt` to find the cause and detaile
Not if there are other available nodes in the designated compute cluster. ParallelRunStep can run independently on each node. A single node failure doesn't fail the whole job.

### What happens if `init` function in entry script fails?

ParallelRunStep has a mechanism to retry for a certain time to give a chance for recovery from transient issues without delaying the job failure for too long. The mechanism is as follows (a numeric illustration follows the list):

1. If, after a node starts, `init` on all agents keeps failing, we stop trying after `3 * process_count_per_node` failures.
2. If, after the job starts, `init` on all agents of all nodes keeps failing, we stop trying if the job runs for more than 2 minutes and there are `2 * node_count * process_count_per_node` failures.
3. If all agents are stuck on `init` for more than `3 * run_invocation_timeout + 30` seconds, the job fails because there has been no progress for too long.
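
To make those thresholds concrete, here is a small numeric illustration with hypothetical configuration values (they aren't defaults from this article):

```python
# Hypothetical values, for illustration only.
node_count = 2
process_count_per_node = 4
run_invocation_timeout = 60  # seconds

per_node_init_failure_limit = 3 * process_count_per_node           # 12 failures on one node
job_init_failure_limit = 2 * node_count * process_count_per_node   # 16 failures across the job
init_stuck_limit_seconds = 3 * run_invocation_timeout + 30         # 210 seconds with no progress
print(per_node_init_failure_limit, job_init_failure_limit, init_stuck_limit_seconds)
```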
