articles/machine-learning/v1/how-to-debug-parallel-run-step.md
Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources.
- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job fails, it shows the error message and where to start troubleshooting.
- `~/logs/job_result.txt`: This file shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.
- `~/logs/job_error.txt`: This file summarizes the errors in your script.
- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. It includes task creation, progress monitoring, and the run result.
- `~/logs/sys/job_report/processed_mini-batches.csv`: A table of all mini-batches that were processed. It shows the result of each mini-batch run, its execution agent node ID and process name, the elapsed time, and any error messages. Logs for each mini-batch run can be found by following the node ID and process name.
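A quick way to triage from this report is to filter for rows that carry an error message. A minimal sketch, assuming illustrative column names; check the header row of your own `processed_mini-batches.csv`, which may differ between SDK versions:

```python
import csv
import io

def failed_mini_batches(report_text):
    """Return report rows with a non-empty error message.

    The column names ("minibatch_id", "node_id", "process_name",
    "elapsed_seconds", "error") are assumptions for illustration.
    """
    rows = csv.DictReader(io.StringIO(report_text))
    return [r for r in rows if (r.get("error") or "").strip()]

# Hypothetical report content:
sample = """minibatch_id,node_id,process_name,elapsed_seconds,error
0,node-0,worker-1,1.2,
1,node-1,worker-3,0.9,ValueError: bad record
"""

for row in failed_mini_batches(sample):
    print(row["minibatch_id"], row["node_id"], row["process_name"], row["error"])
    # → 1 node-1 worker-3 ValueError: bad record
```

The node ID and process name in each failing row point to the per-process log files described below.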
Logs generated from the entry script using the EntryScript helper and print statements can be found in the following files:
- `~/logs/user/entry_script_log/<node_id>/<process_name>.log.txt`: These files are the logs written from the entry script using the EntryScript helper.
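Because these logs are split per node and per process, sweeping all of them at once can speed up triage. A minimal sketch, not part of the PRS SDK; `find_log_errors` and the plain "ERROR" substring match are illustrative assumptions:

```python
import pathlib

def find_log_errors(logs_root):
    """Collect (node_id, process_name, line) tuples for lines containing
    "ERROR" across all <node_id>/<process_name>.log.txt files."""
    hits = []
    for log_file in sorted(pathlib.Path(logs_root).glob("*/*.log.txt")):
        node_id = log_file.parent.name
        process_name = log_file.name[: -len(".log.txt")]
        for line in log_file.read_text(errors="replace").splitlines():
            if "ERROR" in line:
                hits.append((node_id, process_name, line))
    return hits

# Usage against the layout described above, for example:
# find_log_errors(pathlib.Path.home() / "logs/user/entry_script_log")
```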
You can also view the results of periodical checks of the resource usage for each node.
## Common job failure reasons
### SystemExit: 42
Exit codes 41 and 42 are PRS-designed exit codes. Worker nodes exit with 41 to notify the compute manager that they terminated independently. A leader node may exit with 0 or 41, which indicates the job result. Exit 42 means the job failed. The failure reason can be found in `~/logs/job_result.txt`. You can follow the previous section to debug your job.
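For triage scripts, the mapping above can be captured in a small helper. `describe_prs_exit` is a hypothetical name, not part of the PRS SDK; it only covers the codes the documentation calls out:

```python
def describe_prs_exit(code):
    """Map the PRS-designed exit codes described above to short explanations.

    Any other code is not PRS-specific.
    """
    meanings = {
        0: "leader node: job succeeded",
        41: "worker terminated independently (expected); on a leader, part of the job result",
        42: "job failed; see ~/logs/job_result.txt for the reason",
    }
    return meanings.get(code, "not a PRS-designed exit code")

print(describe_prs_exit(42))  # → job failed; see ~/logs/job_result.txt for the reason
```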
### Data Permission
This error indicates that the compute cannot access the input data. If identity-based authentication is used for your compute cluster and storage, refer to [Identity-based data authentication](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-administrate-data-authentication?view=azureml-api-2#identity-based-data-authentication).
### Processes terminated unexpectedly
Processes may crash because of unexpected or unhandled exceptions, or they may be killed by the system because of Out of Memory errors. In the PRS system logs `~/logs/sys/node/<node-id>/_main.txt`, errors like the following can be found.
Out of Memory errors can be found in `~/system_logs/lifecycler/<node-id>/execution-wrapper.txt`.
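When hunting for these across nodes, a grep-style sketch can save time. This helper is not part of PRS, and the marker strings are assumptions about typical OOM wording; match them against what your own `execution-wrapper.txt` actually contains:

```python
import pathlib

# Illustrative patterns only; adjust to the wording in your logs.
OOM_MARKERS = ("out of memory", "outofmemory", "oom-kill")

def find_oom_lines(log_path):
    """Return the lines of an execution-wrapper log that look like OOM kills."""
    text = pathlib.Path(log_path).read_text(errors="replace")
    return [
        line for line in text.splitlines()
        if any(marker in line.lower() for marker in OOM_MARKERS)
    ]
```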
We suggest reducing the number of processes per node or upgrading the VM size if compute resource usage is close to the limits.
#### Unhandled Exceptions
In some cases, the Python process cannot capture the failing stack. You can add an environment variable ```env["PYTHONFAULTHANDLER"]="true"``` to enable the Python built-in faulthandler.
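`PYTHONFAULTHANDLER` enables the standard-library `faulthandler` module, which dumps every thread's Python traceback on a hard crash (segfault, fatal signal). A local sketch that triggers a dump manually, just to show the output format you would find in the logs:

```python
import faulthandler
import os
import tempfile

# Write a traceback dump to a temporary file, the same format faulthandler
# would emit on a crash when PYTHONFAULTHANDLER is set.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "w") as f:
        faulthandler.dump_traceback(file=f)
    with open(path) as f:
        report = f.read()
finally:
    os.remove(path)

print(report.splitlines()[0])  # e.g. "Current thread 0x... (most recent call first):"
```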