Commit 155ff4e

adding common job failure reason
1 parent 36a14bc commit 155ff4e

1 file changed, +13 -4 lines changed

articles/machine-learning/v1/how-to-debug-parallel-run-step.md

Lines changed: 13 additions & 4 deletions
@@ -164,9 +164,9 @@ Because of the distributed nature of ParallelRunStep jobs, there are logs from s
 
 - `~/logs/job_progress_overview.txt`: This file provides a high-level info about the number of mini-batches (also known as tasks) created so far and number of mini-batches processed so far. At this end, it shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.
 
-- `~/logs/job_result.txt`: Tt shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.
+- `~/logs/job_result.txt`: It shows the result of the job. If the job failed, it will show the error message and where to start the troubleshooting.
 
-- `~/logs/job_error.txt`: This file will try to summarize the errors in your script.
+- `~/logs/job_error.txt`: A summarization of the errors in your script.
 
 - `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. Includes task creation, progress monitoring, the run result.
 
@@ -181,7 +181,7 @@ Logs generated from entry script using EntryScript helper and print statements w
 - `~/logs/user/stderr/<node_id>/<process_name>.stderr.txt`: These files are the logs from stderr of entry_script.
 
 
-For example, as the screenshot shows minibatch 0 failed on node 1 process000. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/1/process000.log.txt`, `~/logs/user/stdout/1/process000.log.txt` and `~/logs/user/stderr/1/process000.log.txt`
+For example, the screenshot below shows minibatch 0 failed on node 1 process000. The corresponding logs for your entry script can be found in `~/logs/user/entry_script_log/1/process000.log.txt`, `~/logs/user/stdout/1/process000.log.txt` and `~/logs/user/stderr/1/process000.log.txt`
 
 ![Sample processed_mini-batches.csv file](media/how-to-debug-parallel-run-step/processed_mini_batches_csv_screenshot.png)
 
@@ -205,9 +205,18 @@ You can also view the results of periodical checks of the resource usage for eac
 - `node_resource_usage.csv`: Resource usage overview of the node.
 - `processes_resource_usage.csv`: Resource usage overview of each process.
 
-## My job failed with SystemExit: 42. What does it mean?
+## Common job failure reason
+
+### SystemExit: 42
 This is PRS designed exit code. The failure reason can be found in `~/logs/job_result.txt`. You can follow previous section to debug your job.
 
+### Data Permission
+Error of job indicates the compute cannot access input data. TODO: links to access data
+
+### Out of Memory
+`~logs/perf` logs usage of computation resources. Job monitor tab has chart to show the status of compute nodes. We suggest to reduce the number of processes per node if the compute reso
+
+
 ## How do I log from my user script from a remote context?
 
 ParallelRunStep may run multiple processes on one node based on process_count_per_node. In order to organize logs from each process on node and combine print and log statement, we recommend using ParallelRunStep logger as shown below. You get a logger from EntryScript and make the logs show up in **logs/user** folder in the portal.
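
The last paragraph of the diff recommends taking a logger from EntryScript so that per-process output is collected under **logs/user**. A minimal sketch of that pattern follows. The import path `azureml_user.parallel_run` is an assumption here (it is not shown in this diff and the module is only expected to be available inside the remote job), and `get_logger` with its standard-library fallback is a hypothetical helper added so the script can also be exercised locally.

```python
import logging


def get_logger():
    """Return the ParallelRunStep logger when available, else a plain local logger."""
    try:
        # Assumption: this module is only importable inside a ParallelRunStep job.
        from azureml_user.parallel_run import EntryScript
        return EntryScript().logger
    except ImportError:
        # Hypothetical local fallback; these messages do not reach logs/user.
        return logging.getLogger("entry_script_local")


def init():
    """Called once per worker process before any mini-batch is handled."""
    logger = get_logger()
    logger.info("Worker process initialised.")


def run(mini_batch):
    """Handle one mini-batch and return one result per input item."""
    logger = get_logger()
    logger.info("Processing %d items.", len(mini_batch))
    return [str(item) for item in mini_batch]
```

With several processes per node (`process_count_per_node`), each process fetches its own logger, which is what allows the portal to separate the output by node and process.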

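The second hunk maps a failed mini-batch (node 1, process000) to its entry-script log files. The sketch below builds those paths from a node ID and process name. It follows the `<process_name>.stderr.txt` pattern listed at line 181 of the file rather than the `.log.txt` suffix used in the example sentence for stdout and stderr, and it assumes by analogy that stdout uses `.stdout.txt`, since that pattern is not visible in this diff. `entry_script_log_paths` is a hypothetical helper name, not part of any SDK.

```python
from pathlib import PurePosixPath


def entry_script_log_paths(node_id, process_name, logs_root="~/logs"):
    """Build the per-process user log paths for one node and process."""
    user = PurePosixPath(logs_root) / "user"
    node = str(node_id)
    return {
        # Suffix taken from the example sentence in the diff.
        "entry_script_log": str(user / "entry_script_log" / node / f"{process_name}.log.txt"),
        # Assumed by analogy with the stderr pattern below.
        "stdout": str(user / "stdout" / node / f"{process_name}.stdout.txt"),
        # Suffix taken from the pattern listed at line 181 of the file.
        "stderr": str(user / "stderr" / node / f"{process_name}.stderr.txt"),
    }


paths = entry_script_log_paths(1, "process000")
for name, path in paths.items():
    print(f"{name}: {path}")
```

If the files in a real job turn out to use different suffixes, only the three format strings need to change.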