articles/machine-learning/v1/how-to-debug-parallel-run-step.md
Lines changed: 13 additions & 4 deletions
@@ -164,9 +164,9 @@ Because of the distributed nature of ParallelRunStep jobs, there are logs from s

 - `~/logs/job_progress_overview.txt`: This file provides a high-level info about the number of mini-batches (also known as tasks) created so far and number of mini-batches processed so far. At this end, it shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.

-- `~/logs/job_result.txt`: Tt shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.
+- `~/logs/job_result.txt`: It shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.

-- `~/logs/job_error.txt`: This file will try to summarize the errors in your script.
+- `~/logs/job_error.txt`: A summarization of the errors in your script.

 - `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. Includes task creation, progress monitoring, the run result.

@@ -181,7 +181,7 @@ Logs generated from entry script using EntryScript helper and print statements w
 - `~/logs/user/stderr/<node_id>/<process_name>.stderr.txt`: These files are the logs from stderr of entry_script.


-For example, as the screenshot shows minibatch 0 failed on node 1 process000. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/1/process000.log.txt`, `~/logs/user/stdout/1/process000.log.txt` and `~/logs/user/stderr/1/process000.log.txt`
+For example, the screenshot below shows minibatch 0 failed on node 1 process000. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/1/process000.log.txt`, `~/logs/user/stdout/1/process000.log.txt` and `~/logs/user/stderr/1/process000.log.txt`
@@ -205,9 +205,18 @@ You can also view the results of periodical checks of the resource usage for eac
 - `node_resource_usage.csv`: Resource usage overview of the node.
 - `processes_resource_usage.csv`: Resource usage overview of each process.

-## My job failed with SystemExit: 42. What does it mean?
+## Common job failure reason
+
+### SystemExit: 42
 This is PRS designed exit code. The failure reason can be found in `~/logs/job_result.txt`. You can follow previous section to debug your job.

+### Data Permission
+Error of job indicates the compute cannot access input data. TODO: links to access data
+
+### Out of Memory
+`~logs/perf` logs usage of computation resources. Job monitor tab has chart to show the status of compute nodes. We suggest to reduce the number of processes per node if the compute reso
+
+
 ## How do I log from my user script from a remote context?

 ParallelRunStep may run multiple processes on one node based on process_count_per_node. In order to organize logs from each process on node and combine print and log statement, we recommend using ParallelRunStep logger as shown below. You get a logger from EntryScript and make the logs show up in **logs/user** folder in the portal.
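
The diff ends before the example that the last context line refers to. The following is a minimal sketch of that logging pattern, not the article's own snippet. It assumes the `EntryScript` helper is imported from `azureml_user.parallel_run`, a module that is expected to exist only inside a running ParallelRunStep job. The `get_logger` helper, the `entry_script_local` logger name, and the fallback to the standard `logging` module are hypothetical additions so the script can be dry-run locally. It also assumes file-based input, where `mini_batch` is a list of file paths.

```python
import logging

try:
    # Assumption: this import path is available only inside a ParallelRunStep job.
    from azureml_user.parallel_run import EntryScript
except ImportError:
    # Hypothetical fallback so the script can be dry-run outside Azure ML.
    EntryScript = None


def get_logger():
    """Return the ParallelRunStep logger if available, else a stdlib logger."""
    if EntryScript is not None:
        # Inside a job, logs written here are collected under logs/user.
        return EntryScript().logger
    return logging.getLogger("entry_script_local")


def init():
    """Called once in each worker process before any mini-batch is handled."""
    get_logger().info("init() called in worker process")


def run(mini_batch):
    """Called once per mini-batch; returns one result per input item."""
    logger = get_logger()
    logger.info("processing mini-batch with %d items", len(mini_batch))
    # Assumption: file-based input, so each item is a path we simply echo back.
    return list(mini_batch)
```

For a local dry run, calling `init()` followed by `run(["a.txt", "b.txt"])` exercises the fallback logger and returns the two inputs unchanged.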