
Commit 8a2defd

Merge pull request #97797 from Blackmist/debug-batch
Debug batch
2 parents 16eba2e + ca271d9 commit 8a2defd

File tree

3 files changed: +207 additions, -6 deletions
articles/machine-learning/service/how-to-debug-batch-predictions.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
---
title: Debug and troubleshoot ParallelRunStep
titleSuffix: Azure Machine Learning
description: Debug and troubleshoot machine learning pipelines in the Azure Machine Learning SDK for Python. Learn common pitfalls for developing with pipelines, and tips to help you debug your scripts before and during remote execution.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.reviewer: trbye, jmartens, larryfr, vaidyas
ms.author: trmccorm
author: tmccrmck
ms.date: 11/21/2019
---
# Debug and troubleshoot using ParallelRunStep

[!INCLUDE [applies-to-skus](../../../includes/aml-applies-to-basic-enterprise-sku.md)]

In this article, you learn how to debug and troubleshoot the [ParallelRunStep](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) class from the [Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml/intro?view=azure-ml-py).
## Testing scripts locally

See the [Testing scripts locally section](how-to-debug-pipelines.md#testing-scripts-locally) for machine learning pipelines. Your ParallelRunStep runs as a step in an ML pipeline, so the same guidance applies to both.
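For a quick check before you submit the pipeline, you can also call the entry script's `init()` and `run()` functions directly with a small sample of local data. The following is a minimal sketch; `score.py` and the sample file paths are placeholders for your own entry script and data, not part of the SDK.

```python
# Minimal local smoke test for a ParallelRunStep entry script (a sketch only;
# score.py and the sample paths are placeholders for your own script and data).
import score  # your entry script, which defines init() and run(mini_batch)

score.init()

# For a FileDataset input, run() receives a list of file paths.
results = score.run(["./data/sample_1.csv", "./data/sample_2.csv"])
print(results)
```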
## Debugging scripts from remote context

The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a batch inference run.

For example, the log file `70_driver_log.txt` also contains:

* All printed statements during your script's execution.
* The stack trace of the script.

Because of the distributed nature of batch inference jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:

- `~/logs/overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

- `~/logs/master.txt`: This file provides the master node (also known as the orchestrator) view of the running job. It includes task creation, progress monitoring, and the run's result.

When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs are in the `worker` folder, grouped by worker node:

- `~/logs/worker/<ip_address>/Process-*.txt`: This file provides detailed info about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:

    - The IP address and the PID of the worker process.
    - The total number of items and the number of successfully processed items.
    - The start and end time in wall-clock time (`start1` and `end1`).
    - The start and end time in processor time spent (`start2` and `end2`).

You can also find information on the resource usage of the processes for each worker. This information is in CSV format and is located at `~/logs/performance/<ip_address>/`. For example, when checking for resource utilization, look at the following files:

- `process_resource_monitor_<ip>_<pid>.csv`: Per-worker-process resource usage.
- `sys_resource_monitor_<ip>.csv`: Per-node log.
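Because these files are uploaded to the run record, you can also download them for offline inspection instead of browsing them in the portal. The following is a minimal sketch using the azureml-core SDK; the experiment name and run ID are placeholders, and the exact prefixes of the log files in your run may vary.

```python
# A sketch of downloading the consolidated logs from a finished ParallelRunStep run.
# The experiment name and run ID are placeholders; the "logs/" prefix is an
# assumption based on the layout described above and may differ in your run.
from azureml.core import Workspace, Experiment, Run

ws = Workspace.from_config()                        # reads config.json for your workspace
experiment = Experiment(ws, "batch-inference")      # placeholder experiment name
step_run = Run(experiment, "<parallel_run_step_run_id>")

print(step_run.get_file_names())                    # list everything the run uploaded
step_run.download_files(prefix="logs/", output_directory="debug_logs")
```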
### How do I log from my user script from a remote context?

You can set up a logger with the following steps to make the logs show up in the **logs/user** folder in the portal:

1. Save the first code section below into the file *entry_script_helper.py*, and put the file in the same folder as your entry script. The `EntryScriptHelper` class gets the log path inside AmlCompute. For local testing, you can change `get_working_dir()` to return a local folder.
2. Configure a logger in your `init()` method, and then use it. The second code section below is an example.

**entry_script_helper.py:**
```python
"""
This module provides helper features for the entry script.

This file should be on the Python search path or in the same folder as the entry script.
"""
import os
import socket
import logging
import time
from multiprocessing import current_process
from azureml.core import Run


class EntryScriptHelper:
    """A helper to provide common features for the entry script."""

    LOG_CONFIGED = False

    def get_logger(self, name="EntryScript"):
        """Return a logger.

        The logger writes to the 'user' folder and shows up in the Azure portal.
        """
        return logging.getLogger(name)

    def config(self, name="EntryScript", level="INFO"):
        """Configure a logger. Call this in init() in the score module.

        The logger is configured only once; later calls return the already-configured logger.
        The logger writes to the 'user' folder and shows up in the Azure portal.
        """
        logger = logging.getLogger(name)
        if EntryScriptHelper.LOG_CONFIGED:
            return logger

        formatter = logging.Formatter(
            "%(asctime)s|%(name)s|%(levelname)s|%(process)d|%(thread)d|%(funcName)s()|%(message)s"
        )
        formatter.converter = time.gmtime

        logger.setLevel(level)

        handler = logging.FileHandler(self.get_log_file_path())
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)

        EntryScriptHelper.LOG_CONFIGED = True
        return logger

    def get_log_file_path(self):
        """Get the log file path for users.

        Each process has its own log file, so there is no race issue among multiple processes.
        """
        ip_address = socket.gethostbyname(socket.gethostname())
        log_dir = os.path.join(self.get_log_dir(), "user", ip_address)
        os.makedirs(log_dir, exist_ok=True)
        return os.path.join(log_dir, f"{current_process().name}.txt")

    def get_log_dir(self):
        """Return the folder for logs.

        Files and folders in it will be uploaded and show up on the run detail page in the Azure portal.
        """
        log_dir = os.path.join(self.get_working_dir(), "logs")
        os.makedirs(log_dir, exist_ok=True)
        return log_dir

    def get_working_dir(self):
        """Return the working directory."""
        return os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), self.get_run().id)

    def get_temp_dir(self):
        """Return the local temp directory."""
        local_temp_dir = os.path.join(
            os.environ.get("AZ_BATCHAI_JOB_TEMP", ""), "azureml-bi", str(os.getpid())
        )
        os.makedirs(local_temp_dir, exist_ok=True)
        return local_temp_dir

    def get_run(self):
        """Return the Run from the context."""
        return Run.get_context(allow_offline=False)
```

**A sample entry script using the logger:**
```python
"""
This is a sample scoring module.

This module provides a sample that passes the input back without any change.
"""
import os
import logging
from entry_script_helper import EntryScriptHelper

LOG_NAME = "score_file_list"


def init():
    """Init."""
    EntryScriptHelper().config(LOG_NAME)
    logger = logging.getLogger(LOG_NAME)
    output_folder = os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), "temp/output")
    logger.info(f"{__file__}.output_folder:{output_folder}")
    logger.info("init()")
    os.makedirs(output_folder, exist_ok=True)


def run(mini_batch):
    """Copy each input file to the output folder and return the list back."""
    logger = logging.getLogger(LOG_NAME)
    logger.info(f"{__file__}, run({mini_batch})")

    output_folder = os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), "temp/output")
    for file_name in mini_batch:
        with open(file_name, "r") as file:
            lines = file.readlines()

        base_name = os.path.basename(file_name)
        name = os.path.join(output_folder, base_name)
        logger.info(f"{__file__}: {name}")
        with open(name, "w") as file:
            file.write(f"output file {name} from {__file__}:\n")
            for line in lines:
                file.write(line)

    return mini_batch
```

## Next steps

* See the SDK reference for help with the [azureml-contrib-pipeline-steps](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps?view=azure-ml-py) package and the [documentation](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) for the ParallelRunStep class.

* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines for batch scoring.

articles/machine-learning/service/how-to-run-batch-predictions.md

Lines changed: 16 additions & 5 deletions
@@ -17,11 +17,11 @@ ms.custom: Ignite2019
 # Run batch inference on large amounts of data by using Azure Machine Learning
 [!INCLUDE [applies-to-skus](../../../includes/aml-applies-to-basic-enterprise-sku.md)]

-In this how-to, you learn how to get inferences on large amounts of data asynchronously and in parallel by using Azure Machine Learning. The batch inference capability described here is in public preview. It's a high-performance and high-throughput way to generate inferences and processing data. It provides asynchronous capabilities out of the box.
+Learn how to get inferences on large amounts of data asynchronously and in parallel by using Azure Machine Learning. The batch inference capability described here is in public preview. It's a high-performance and high-throughput way to generate inferences and process data. It provides asynchronous capabilities out of the box.

 With batch inference, it's straightforward to scale offline inferences to large clusters of machines on terabytes of production data, resulting in improved productivity and optimized cost.

-In this how-to, you learn the following tasks:
+In this article, you learn the following tasks:

 > * Create a remote compute resource.
 > * Write a custom inference script.
@@ -185,7 +185,7 @@ model = Model.register(model_path="models/",
 The script *must contain* two functions:
 - `init()`: Use this function for any costly or common preparation for later inference. For example, use it to load the model into a global object.
 - `run(mini_batch)`: The function will run for each `mini_batch` instance.
-  - `mini_batch`: Batch inference will invoke run method and pass either a list or Pandas DataFrame as an argument to the method. Each entry in min_batch will be - a filepath if input is a FileDataset, a Pandas DataFrame if input is a TabularDataset.
+  - `mini_batch`: Batch inference invokes the run() method and passes either a list or a Pandas DataFrame as an argument. Each entry in `mini_batch` is a file path if the input is a `FileDataset`, or a Pandas DataFrame if the input is a `TabularDataset`.
   - `response`: The run() method should return a Pandas DataFrame or an array. For the append_row output_action, these returned elements are appended into the common output file. For summary_only, the contents of the elements are ignored. For all output actions, each returned output element indicates one successful inference of an input element in the input mini-batch. Make sure that enough data is included in the inference result to map the input to the inference output. Inference output is written to the output file and isn't guaranteed to be in order; use a key in the output to map it to the input.

 ```python
@@ -233,6 +233,15 @@ def run(mini_batch):
     return resultList
 ```

+### How to access other files in `init()` or `run()` functions
+
+If you have another file or folder in the same directory as your inference script, you can reference it by finding the directory that contains the script:
+
+```python
+script_dir = os.path.realpath(os.path.join(__file__, '..'))
+file_path = os.path.join(script_dir, "<file_name>")
+```
+
 ## Build and run the batch inference pipeline

 Now you have everything you need to build the pipeline.
257266

258267
### Specify the parameters for your batch inference pipeline step
259268

260-
`ParallelRunConfig` is the major configuration for the newly introduced batch inference `ParallelRunStep` instance within the Azure Machine Learning pipeline. You use it to wrap your script and configure necessary parameters, including all of the following:
269+
`ParallelRunConfig` is the major configuration for the newly introduced batch inference `ParallelRunStep` instance within the Azure Machine Learning pipeline. You use it to wrap your script and configure necessary parameters, including all of the following parameters:
261270
- `entry_script`: A user script as a local file path that will be run in parallel on multiple nodes. If `source_directly` is present, use a relative path. Otherwise, use any path that's accessible on the machine.
262271
- `mini_batch_size`: The size of the mini-batch passed to a single `run()` call. (Optional; the default value is `1`.)
263272
- For `FileDataset`, it's the number of files with a minimum value of `1`. You can combine multiple files into one mini-batch.
264-
- For `TabularDataset`, it's the size of data. Example values are `1024`, `1024KB`, `10MB`, and `1GB`. The recommended value is `1MB`. Note that the mini-batch from `TabularDataset` will never cross file boundaries. For example, if you have .csv files with various sizes, the smallest file is 100 KB and the largest is 10 MB. If you set `mini_batch_size = 1MB`, then files with a size smaller than 1 MB will be treated as one mini-batch. Files with a size larger than 1 MB will be split into multiple mini-batches.
273+
- For `TabularDataset`, it's the size of data. Example values are `1024`, `1024KB`, `10MB`, and `1GB`. The recommended value is `1MB`. The mini-batch from `TabularDataset` will never cross file boundaries. For example, if you have .csv files with various sizes, the smallest file is 100 KB and the largest is 10 MB. If you set `mini_batch_size = 1MB`, then files with a size smaller than 1 MB will be treated as one mini-batch. Files with a size larger than 1 MB will be split into multiple mini-batches.
265274
- `error_threshold`: The number of record failures for `TabularDataset` and file failures for `FileDataset` that should be ignored during processing. If the error count for the entire input goes above this value, the job will be stopped. The error threshold is for the entire input and not for individual mini-batches sent to the `run()` method. The range is `[-1, int.max]`. The `-1` part indicates ignoring all failures during processing.
266275
- `output_action`: One of the following values indicates how the output will be organized:
267276
- `summary_only`: The user script will store the output. `ParallelRunStep` will use the output only for the error threshold calculation.
@@ -345,6 +354,8 @@ pipeline_run.wait_for_completion(show_output=True)
345354

346355
To see this process working end to end, try the [batch inference notebook](https://aka.ms/batch-inference-notebooks).
347356

357+
For debugging and troubleshooting guidance for ParallelRunStep, see the [how-to guide](how-to-debug-batch-predictions.md).
358+
348359
For debugging and troubleshooting guidance for pipelines, see the [how-to guide](how-to-debug-pipelines.md).
349360

350361
[!INCLUDE [aml-clone-in-azure-notebook](../../../includes/aml-clone-for-examples.md)]

articles/machine-learning/service/toc.yml

Lines changed: 4 additions & 1 deletion
@@ -301,11 +301,14 @@
   displayName: create client consume request response synchronous
   href: how-to-consume-web-service.md
 - name: Run batch predictions
-  displayName: score scoring asynchronous consume pipeline parallelrunstep inference
+  displayName: score scoring batch consume pipeline parallelrunstep inference
   href: how-to-run-batch-predictions.md
 - name: Designer batch predictions
   displayName: score scoring asynchronous consume pipeline parallelrunstep inference designer
   href: how-to-run-batch-predictions-designer.md
+- name: Debug & troubleshoot batch predictions
+  displayName: debug_batch consume pipeline parallelrunstep inference
+  href: how-to-debug-batch-predictions.md
 - name: Monitor models
   items:
   - name: Collect & evaluate model data
