Commit 36a14bc

committed
updating prs debug guide
1 parent 6b12588 commit 36a14bc

2 files changed

+12
-7
lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 12 additions & 7 deletions
@@ -160,14 +160,18 @@ parallelrun_step = ParallelRunStep(
160160

161161
The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md). The information in that section also applies to a ParallelRunStep.
162162

163-
For example, the log file `70_driver_log.txt` contains information from the controller that launches the ParallelRunStep code.
164-
165163
Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:
166164

167165
- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.
168166

167+
- `~/logs/job_result.txt`: This file shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.
168+
169+
- `~/logs/job_error.txt`: This file will try to summarize the errors in your script.
170+
169171
- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. It includes task creation, progress monitoring, and the run result.
170172

173+
- `~/logs/sys/job_report/processed_mini-batches.csv`: A table of all mini-batches that have been processed. It shows the result of each mini-batch run, along with its execution agent node ID and process name. The elapsed time and error messages are also included. To find the logs for each mini-batch run, follow the node ID and process name.
174+
171175
Logs generated from the entry script, by using the EntryScript helper and print statements, are found in the following files:
172176

173177
- `~/logs/user/entry_script_log/<node_id>/<process_name>.log.txt`: These files are the logs written from entry_script using EntryScript helper.
@@ -176,15 +180,13 @@ Logs generated from entry script using EntryScript helper and print statements w
176180

177181
- `~/logs/user/stderr/<node_id>/<process_name>.stderr.txt`: These files are the logs from stderr of entry_script.
178182

179-
For a concise understanding of errors in your script there is:
180183

181-
- `~/logs/user/error.txt`: This file will try to summarize the errors in your script.
184+
For example, the screenshot shows that mini-batch 0 failed on node 1, process000. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/1/process000.log.txt`, `~/logs/user/stdout/1/process000.log.txt`, and `~/logs/user/stderr/1/process000.stderr.txt`.
182185

183-
For more information on errors in your script, there is:
186+
![Sample processed_mini-batches.csv file](media/how-to-debug-parallel-run-step/processed_mini_batches_csv_screenshot.png)
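
The lookup from a node ID and process name to the matching log files can be sketched in Python. This is a minimal sketch, not part of ParallelRunStep: the helper name `entry_script_log_paths` and the `logs_root` argument are hypothetical. Only the file-name patterns come from the entry-script log and stderr entries listed earlier in this guide.

```python
from pathlib import PurePosixPath


def entry_script_log_paths(logs_root, node_id, process_name):
    """Build the per-process user log paths for one node/process pair.

    Hypothetical helper: only the path patterns come from the guide.
    """
    user = PurePosixPath(logs_root) / "user"
    node = str(node_id)
    return {
        # Logs written through the EntryScript helper.
        "entry_script_log": str(
            user / "entry_script_log" / node / f"{process_name}.log.txt"
        ),
        # Output that the entry script wrote to stderr.
        "stderr": str(user / "stderr" / node / f"{process_name}.stderr.txt"),
    }


# Node 1 and process000 are the values used in the example above.
paths = entry_script_log_paths("~/logs", 1, "process000")
for kind, path in paths.items():
    print(f"{kind}: {path}")
```

The node ID and process name would normally be read from the corresponding row of `processed_mini-batches.csv`.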
184187

185-
- `~/logs/user/error/`: Contains full stack traces of exceptions thrown while loading and running entry script.
186188

187-
When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/node` folder, grouped by worker nodes:
189+
When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `~/logs/sys/node` folder, grouped by worker nodes:
188190

189191
- `~/logs/sys/node/<node_id>/<process_name>.txt`: This file provides detailed info about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:
190192

@@ -203,6 +205,9 @@ You can also view the results of periodical checks of the resource usage for eac
203205
- `node_resource_usage.csv`: Resource usage overview of the node.
204206
- `processes_resource_usage.csv`: Resource usage overview of each process.
205207

208+
## My job failed with SystemExit: 42. What does it mean?
209+
This exit code is defined by ParallelRunStep (PRS). The reason for the failure can be found in `~/logs/job_result.txt`. Follow the previous section to debug your job.
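
As a first triage step, the job summary files can be printed together. The following is a minimal sketch, assuming the job's `logs` folder has been downloaded to a local directory. The function name `summarize_failure` and the `logs_root` parameter are hypothetical; only the file names `job_result.txt` and `job_error.txt` come from this guide.

```python
from pathlib import Path


def summarize_failure(logs_root):
    """Return the text of the job summary files that exist under logs_root."""
    summary = {}
    for name in ("job_result.txt", "job_error.txt"):
        path = Path(logs_root) / name
        # Skip files that were not produced or not downloaded.
        if path.is_file():
            summary[name] = path.read_text(encoding="utf-8", errors="replace")
    return summary


if __name__ == "__main__":
    # "logs" is an assumed local download location, not a PRS default.
    for name, text in summarize_failure("logs").items():
        print(f"--- {name} ---")
        print(text)
```

Reading `job_result.txt` first and `job_error.txt` second goes from the overall job result to the script-level error summary.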
210+
206211
## How do I log from my user script from a remote context?
207212

208213
ParallelRunStep may run multiple processes on one node, depending on `process_count_per_node`. To organize the logs from each process on a node and to combine print and log statements, we recommend using the ParallelRunStep logger as shown below. You get a logger from EntryScript, and the logs show up in the **logs/user** folder in the portal.