Commit acb606c

Merge pull request #115584 from tmccrmck/trmccorm/update_prs_debug
Updates ParallelRunStep docs to new logging format
2 parents: 681bf98 + 6c85c58

File tree: 1 file changed (+36, -18 lines)


articles/machine-learning/how-to-debug-parallel-run-step.md

Lines changed: 36 additions & 18 deletions
````diff
@@ -23,32 +23,37 @@ See the [Testing scripts locally section](how-to-debug-pipelines.md#testing-scri

 ## Debugging scripts from remote context

-The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a parallel step run.
+The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a ParallelRunStep.

-For example, the log file `70_driver_log.txt` contains information from the controller that launches the parallel run step code.
+For example, the log file `70_driver_log.txt` contains information from the controller that launches the ParallelRunStep code.

-Because of the distributed nature of parallel run jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:
+Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:

 - `~/logs/overview.txt`: This file provides high-level info about the number of mini-batches (also known as tasks) created so far and the number processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

 - `~/logs/sys/master.txt`: This file provides the master node (also known as the orchestrator) view of the running job. It covers task creation, progress monitoring, and the run result.

-Logs generated from the entry script using EntryScript.logger and print statements are found in the following files:
+Logs generated from the entry script using the EntryScript helper and print statements are found in the following files:

-- `~/logs/user/<ip_address>/Process-*.txt`: This file contains the logs written from entry_script using EntryScript.logger. It also contains print statements (stdout) from entry_script.
+- `~/logs/user/<node_name>.log.txt`: These are the logs written from entry_script using the EntryScript helper. They also contain print statements (stdout) from entry_script.

-When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/worker` folder, grouped by worker nodes:
+For a concise summary of errors in your script, there is:

-- `~/logs/sys/worker/<ip_address>/Process-*.txt`: This file provides detailed info about each mini-batch as it is picked up or completed by a worker. For each mini-batch, this file includes:
+- `~/logs/user/error.txt`: This file tries to summarize the errors in your script.
+
+For more information on errors in your script, there is:
+
+- `~/logs/user/error/`: Contains all errors thrown and full stack traces, organized by node.
+
+When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/node` folder, grouped by worker nodes:
+
+- `~/logs/sys/node/<node_name>.txt`: This file provides detailed info about each mini-batch as it is picked up or completed by a worker. For each mini-batch, this file includes:

     - The IP address and the PID of the worker process.
     - The total number of items, the count of successfully processed items, and the count of failed items.
     - The start time, duration, process time, and run method time.

-You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at `~/logs/sys/perf/<ip_address>/`. For a single node, job files are available under `~/logs/sys/perf`. For example, when checking for resource utilization, look at the following files:
-
-- `Process-*.csv`: Per-worker-process resource usage.
-- `sys.csv`: Per-node log.
+You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at `~/logs/sys/perf/overview.csv`. Per-process information is available under `~/logs/sys/processes.csv`.

 ### How do I log from my user script from a remote context?
 You can get a logger from EntryScript as shown in the sample code below to make the logs show up in the **logs/user** folder in the portal.
````
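The sample code that last line refers to sits outside this hunk. As a minimal sketch, assuming the `EntryScript` helper exposed by the ParallelRunStep runtime (the exact import path has varied across SDK releases), an entry script that routes its logging and `print` output to `logs/user` can look like this:

```python
# Minimal sketch of an entry script that logs through the EntryScript helper.
# The import path below is the one used by later SDK releases and may differ
# in the azureml-contrib-pipeline-steps era; treat it as an assumption.
from azureml_user.parallel_run import EntryScript


def init():
    """Runs once per worker process before any mini-batches are handed out."""
    logger = EntryScript().logger
    logger.info("init() finished; this line lands under logs/user in the portal.")


def run(mini_batch):
    """Runs once per mini-batch; return one result per processed item."""
    logger = EntryScript().logger
    logger.info(f"Processing a mini-batch of {len(mini_batch)} items.")
    print("print() output (stdout) is captured in the same per-node user log.")
    return mini_batch
```

With the logging format described in the hunk above, this output ends up in `~/logs/user/<node_name>.log.txt`.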
````diff
@@ -77,19 +82,32 @@ def run(mini_batch):

 ### How can I pass a side input, such as a file or files containing a lookup table, to all my workers?

-Construct a [Dataset](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) object containing the side input and register it with your workspace. After that, you can access it in your inference script (for example, in your init() method) as follows:
+Construct a [Dataset](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) containing the side input and register it with your workspace. Pass it to the `side_inputs` parameter of your `ParallelRunStep`. Additionally, you can add its path to the `arguments` list to easily access its mounted path:
+
+```python
+label_config = label_ds.as_named_input("labels_input")
+batch_score_step = ParallelRunStep(
+    name=parallel_step_name,
+    inputs=[input_images.as_named_input("input_images")],
+    output=output_dir,
+    arguments=["--labels_dir", label_config],
+    side_inputs=[label_config],
+    parallel_run_config=parallel_run_config,
+)
+```
+
+After that, you can access it in your inference script (for example, in your init() method) as follows:

 ```python
-from azureml.core.run import Run
-from azureml.core.dataset import Dataset
+parser = argparse.ArgumentParser()
+parser.add_argument('--labels_dir', dest="labels_dir", required=True)
+args, _ = parser.parse_known_args()

-ws = Run.get_context().experiment.workspace
-lookup_ds = Dataset.get_by_name(ws, "<registered-name>")
-lookup_ds.download(target_path='.', overwrite=True)
+labels_path = args.labels_dir
 ```

 ## Next steps

 * See the SDK reference for help with the [azureml-contrib-pipeline-step](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps?view=azure-ml-py) package and the [documentation](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) for the ParallelRunStep class.

-* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines with parallel run step.
+* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines with ParallelRunStep, including an example of passing another file as a side input.
````
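To round out the side-input pattern in the hunk above, a rough sketch of an `init()` that turns the mounted `--labels_dir` path into an in-memory lookup table might look like the following. The folder layout (text files with one `key,value` entry per line) and the `labels` dictionary are illustrative assumptions rather than part of the documented example:

```python
import argparse
import os

# Illustrative module-level lookup table, populated once per worker process.
labels = {}


def init():
    """Parse the mounted side-input path and load the lookup table from it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--labels_dir", dest="labels_dir", required=True)
    args, _ = parser.parse_known_args()

    # Assumed layout: a folder of text files, each line holding "key,value".
    for file_name in os.listdir(args.labels_dir):
        with open(os.path.join(args.labels_dir, file_name)) as lookup_file:
            for line in lookup_file:
                if not line.strip():
                    continue
                key, value = line.strip().split(",", 1)
                labels[key] = value


def run(mini_batch):
    """Look each incoming file path up in the table; return one result per item."""
    return [labels.get(os.path.basename(item), "unknown") for item in mini_batch]
```

The list returned from `run()` provides one result per input item, mirroring the `run(mini_batch)` signature shown in the hunk header above.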
