Commit acb606c

Merge pull request #115584 from tmccrmck/trmccorm/update_prs_debug
Updates ParallelRunStep docs to new logging format
2 parents: 681bf98 + 6c85c58

File tree: 1 file changed (+36, -18 lines)


articles/machine-learning/how-to-debug-parallel-run-step.md

Lines changed: 36 additions & 18 deletions
````diff
@@ -23,32 +23,37 @@ See the [Testing scripts locally section](how-to-debug-pipelines.md#testing-scri

 ## Debugging scripts from remote context

-The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a parallel step run.
+The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a ParallelRunStep.

-For example, the log file `70_driver_log.txt` contains information from the controller that launches the parallel run step code.
+For example, the log file `70_driver_log.txt` contains information from the controller that launches the ParallelRunStep code.

-Because of the distributed nature of parallel run jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:
+Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:

 - `~/logs/overview.txt`: This file provides high-level info about the number of mini-batches (also known as tasks) created so far and the number processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

 - `~/logs/sys/master.txt`: This file provides the master node (also known as the orchestrator) view of the running job. It covers task creation, progress monitoring, and the run result.

-Logs generated from the entry script using EntryScript.logger and print statements are found in the following files:
+Logs generated from the entry script using the EntryScript helper and print statements are found in the following files:

-- `~/logs/user/<ip_address>/Process-*.txt`: This file contains the logs written from entry_script using EntryScript.logger. It also contains print statements (stdout) from entry_script.
+- `~/logs/user/<node_name>.log.txt`: These are the logs written from entry_script using the EntryScript helper. They also contain print statements (stdout) from entry_script.

-When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/worker` folder, grouped by worker nodes:
+For a concise summary of errors in your script, there is:

-- `~/logs/sys/worker/<ip_address>/Process-*.txt`: This file provides detailed info about each mini-batch as it is picked up or completed by a worker. For each mini-batch, this file includes:
+- `~/logs/user/error.txt`: This file tries to summarize the errors in your script.
+
+For more information on errors in your script, there is:
+
+- `~/logs/user/error/`: Contains all errors thrown and full stack traces, organized by node.
+
+When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/node` folder, grouped by worker nodes:
+
+- `~/logs/sys/node/<node_name>.txt`: This file provides detailed info about each mini-batch as it is picked up or completed by a worker. For each mini-batch, this file includes:

     - The IP address and the PID of the worker process.
     - The total number of items, the count of successfully processed items, and the count of failed items.
     - The start time, duration, process time, and run method time.

-You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at `~/logs/sys/perf/<ip_address>/`. For a single node, job files are available under `~/logs/sys/perf`. For example, when checking for resource utilization, look at the following files:
-
-- `Process-*.csv`: Per-worker-process resource usage.
-- `sys.csv`: Per-node log.
+You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at `~/logs/sys/perf/overview.csv`. Per-process information is available under `~/logs/sys/processes.csv`.

 ### How do I log from my user script from a remote context?
 You can get a logger from EntryScript as shown in the sample code below to make the logs show up in the **logs/user** folder in the portal.
````
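The sample code that last line refers to sits outside this hunk. As a minimal sketch, assuming the `EntryScript` helper exposed by the ParallelRunStep runtime (the exact import path has varied across SDK releases), an entry script that routes its logging and `print` output to `logs/user` can look like this:

```python
# Minimal sketch of an entry script that logs through the EntryScript helper.
# The import path below is the one used by later SDK releases and may differ
# in the azureml-contrib-pipeline-steps era; treat it as an assumption.
from azureml_user.parallel_run import EntryScript


def init():
    """Runs once per worker process before any mini-batches are handed out."""
    logger = EntryScript().logger
    logger.info("init() finished; this line lands under logs/user in the portal.")


def run(mini_batch):
    """Runs once per mini-batch; return one result per processed item."""
    logger = EntryScript().logger
    logger.info(f"Processing a mini-batch of {len(mini_batch)} items.")
    print("print() output (stdout) is captured in the same per-node user log.")
    return mini_batch
```

With the logging format described in the hunk above, this output ends up in `~/logs/user/<node_name>.log.txt`.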
````diff
@@ -77,19 +82,32 @@ def run(mini_batch):

 ### How can I pass a side input, such as a file or files containing a lookup table, to all my workers?

-Construct a [Dataset](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) object containing the side input and register it with your workspace. After that, you can access it in your inference script (for example, in your init() method) as follows:
+Construct a [Dataset](https://docs.microsoft.com/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) containing the side input and register it with your workspace. Pass it to the `side_inputs` parameter of your `ParallelRunStep`. Additionally, you can add its path to the `arguments` list to easily access its mounted path:
+
+```python
+label_config = label_ds.as_named_input("labels_input")
+batch_score_step = ParallelRunStep(
+    name=parallel_step_name,
+    inputs=[input_images.as_named_input("input_images")],
+    output=output_dir,
+    arguments=["--labels_dir", label_config],
+    side_inputs=[label_config],
+    parallel_run_config=parallel_run_config,
+)
+```
+
+After that, you can access it in your inference script (for example, in your init() method) as follows:

 ```python
-from azureml.core.run import Run
-from azureml.core.dataset import Dataset
+parser = argparse.ArgumentParser()
+parser.add_argument('--labels_dir', dest="labels_dir", required=True)
+args, _ = parser.parse_known_args()

-ws = Run.get_context().experiment.workspace
-lookup_ds = Dataset.get_by_name(ws, "<registered-name>")
-lookup_ds.download(target_path='.', overwrite=True)
+labels_path = args.labels_dir
 ```

 ## Next steps

 * See the SDK reference for help with the [azureml-contrib-pipeline-step](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps?view=azure-ml-py) package and the [documentation](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) for the ParallelRunStep class.

-* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines with parallel run step.
+* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines with ParallelRunStep, including an example of passing another file as a side input.
````
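To round out the side-input pattern in the hunk above, a rough sketch of an `init()` that turns the mounted `--labels_dir` path into an in-memory lookup table might look like the following. The folder layout (text files with one `key,value` entry per line) and the `labels` dictionary are illustrative assumptions rather than part of the documented example:

```python
import argparse
import os

# Illustrative module-level lookup table, populated once per worker process.
labels = {}


def init():
    """Parse the mounted side-input path and load the lookup table from it."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--labels_dir", dest="labels_dir", required=True)
    args, _ = parser.parse_known_args()

    # Assumed layout: a folder of text files, each line holding "key,value".
    for file_name in os.listdir(args.labels_dir):
        with open(os.path.join(args.labels_dir, file_name)) as lookup_file:
            for line in lookup_file:
                if not line.strip():
                    continue
                key, value = line.strip().split(",", 1)
                labels[key] = value


def run(mini_batch):
    """Look each incoming file path up in the table; return one result per item."""
    return [labels.get(os.path.basename(item), "unknown") for item in mini_batch]
```

The list returned from `run()` provides one result per input item, mirroring the `run(mini_batch)` signature shown in the hunk header above.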
