---
title: Debug and troubleshoot ParallelRunStep
titleSuffix: Azure Machine Learning
description: Debug and troubleshoot machine learning pipelines in the Azure Machine Learning SDK for Python. Learn common pitfalls for developing with pipelines, and tips to help you debug your scripts before and during remote execution.
services: machine-learning
ms.service: machine-learning
ms.subservice: core
ms.topic: conceptual
ms.reviewer: trbye, jmartens, larryfr, vaidyas
ms.author: trmccorm
author: tmccrmck
ms.date: 11/21/2019
---

# Debug and troubleshoot ParallelRunStep
[!INCLUDE [applies-to-skus](../../../includes/aml-applies-to-basic-enterprise-sku.md)]

In this article, you learn how to debug and troubleshoot the [ParallelRunStep](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) class from the [Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml/intro?view=azure-ml-py).

## Testing scripts locally

See the [Testing scripts locally section](how-to-debug-pipelines.md#testing-scripts-locally) for machine learning pipelines. Your ParallelRunStep runs as a step in an ML pipeline, so the same guidance applies to both.

## Debugging scripts from remote context

The transition from debugging a scoring script locally to debugging a scoring script in an actual pipeline can be a difficult leap. For information on finding your logs in the portal, see the [machine learning pipelines section on debugging scripts from a remote context](how-to-debug-pipelines.md#debugging-scripts-from-remote-context). The information in that section also applies to a batch inference run.

For example, the log file `70_driver_log.txt` also contains:

* All printed statements during your script's execution.
* The stack trace of the script.

Because of the distributed nature of batch inference jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:

- `~/logs/overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job failed, it shows the error message and where to start troubleshooting.

- `~/logs/master.txt`: This file provides the master node (also known as the orchestrator) view of the running job. It includes task creation, progress monitoring, and the run's result.
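
If you download the run's logs folder locally, a quick way to check the outcome is to print the tail of `overview.txt`. The following is a minimal sketch, not part of the SDK; the `logs` path here is a hypothetical local copy of the run's `~/logs` folder:

```python
import os


def tail(path, n=20):
    """Return the last n lines of a text file, or a notice if the file is missing."""
    if not os.path.exists(path):
        return [f"{path} not found\n"]
    with open(path) as f:
        return f.readlines()[-n:]


# "logs" is a hypothetical local copy of the run's ~/logs folder.
for line in tail(os.path.join("logs", "overview.txt")):
    print(line, end="")
```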

When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `worker` folder, grouped by worker nodes:

- `~/logs/worker/<ip_address>/Process-*.txt`: This file provides detailed info about each mini-batch as it's picked up or completed by a worker. For each mini-batch, this file includes:

  - The IP address and the PID of the worker process.
  - The total number of items and the number of successfully processed items.
  - The start and end time in wall-clock time (`start1` and `end1`).
  - The start and end time in processor time spent (`start2` and `end2`).

You can also find information on the resource usage of the processes for each worker. This information is in CSV format, and is located at `~/logs/performance/<ip_address>/`. For example, when checking for resource utilization, look at the following files:

- `process_resource_monitor_<ip>_<pid>.csv`: Per-process resource usage for each worker process.
- `sys_resource_monitor_<ip>.csv`: Per-node resource usage.
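
Because these monitors are plain CSV files, you can inspect them with the standard library once the logs are downloaded. The sketch below is illustrative and assumes only the directory layout described above; the column names are whatever the CSV header contains:

```python
import csv
import glob
import os


def summarize_resource_csvs(perf_dir):
    """Map each resource-monitor CSV under perf_dir to (column names, number of sample rows)."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(perf_dir, "*", "*.csv"))):
        with open(path, newline="") as f:
            rows = list(csv.reader(f))
        if rows:
            summary[path] = (rows[0], len(rows) - 1)
    return summary


# Usage with a hypothetical local copy of ~/logs/performance:
for path, (columns, samples) in summarize_resource_csvs(os.path.join("logs", "performance")).items():
    print(f"{path}: {samples} samples of {columns}")
```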

### How do I log from my user script from a remote context?

You can set up a logger with the following steps to make your logs show up in the **logs/user** folder in the portal:

1. Save the first code section below into the file `entry_script_helper.py`, and put the file in the same folder as your entry script. The `EntryScriptHelper` class gets the log path inside AmlCompute. For local testing, you can change `get_working_dir()` to return a local folder.
2. Configure a logger in your `init()` method and then use it. The second code section below is an example.

**entry_script_helper.py:**
```python
"""
This module provides helper features for the entry script.
This file should be on the Python search path or in the same folder as the entry script.
"""
import os
import socket
import logging
import time
from multiprocessing import current_process
from azureml.core import Run


class EntryScriptHelper:
    """ A helper to provide common features for the entry script."""

    LOG_CONFIGURED = False

    def get_logger(self, name="EntryScript"):
        """ Return a logger.
        The logger writes to the 'user' folder and shows up in the Azure portal.
        """
        return logging.getLogger(name)

    def config(self, name="EntryScript", level="INFO"):
        """ Configure a logger. Call this in init() in your scoring module.
        The logger is configured only once per process, so repeated calls don't add duplicate handlers.
        The logger writes to the 'user' folder and shows up in the Azure portal.
        """
        logger = logging.getLogger(name)

        if EntryScriptHelper.LOG_CONFIGURED:
            return logger

        formatter = logging.Formatter(
            "%(asctime)s|%(name)s|%(levelname)s|%(process)d|%(thread)d|%(funcName)s()|%(message)s"
        )
        formatter.converter = time.gmtime

        logger.setLevel(level)

        handler = logging.FileHandler(self.get_log_file_path())
        handler.setLevel(level)
        handler.setFormatter(formatter)
        logger.addHandler(handler)

        EntryScriptHelper.LOG_CONFIGURED = True
        return logger

    def get_log_file_path(self):
        """ Get the log file path for user logs.
        Each process has its own log file, so there are no race conditions among multiple processes.
        """
        ip_address = socket.gethostbyname(socket.gethostname())
        log_dir = os.path.join(self.get_log_dir(), "user", ip_address)
        os.makedirs(log_dir, exist_ok=True)
        return os.path.join(log_dir, f"{current_process().name}.txt")

    def get_log_dir(self):
        """ Return the folder for logs.
        Files and folders in it are uploaded and show up on the run detail page in the Azure portal.
        """
        log_dir = os.path.join(self.get_working_dir(), "logs")
        os.makedirs(log_dir, exist_ok=True)
        return log_dir

    def get_working_dir(self):
        """ Return the working directory."""
        return os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), self.get_run().id)

    def get_temp_dir(self):
        """ Return the local temp directory."""
        local_temp_dir = os.path.join(
            os.environ.get("AZ_BATCHAI_JOB_TEMP", ""), "azureml-bi", str(os.getpid())
        )
        os.makedirs(local_temp_dir, exist_ok=True)
        return local_temp_dir

    def get_run(self):
        """ Return the Run from the context."""
        return Run.get_context(allow_offline=False)
```

**A sample entry script using the logger:**
```python
"""
This is a sample scoring module.

This module provides a sample which passes the input back without any change.
"""
import os
import logging
from entry_script_helper import EntryScriptHelper

LOG_NAME = "score_file_list"


def init():
    """ Init."""
    EntryScriptHelper().config(LOG_NAME)
    logger = logging.getLogger(LOG_NAME)
    output_folder = os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), "temp/output")
    logger.info(f"{__file__}.output_folder:{output_folder}")
    logger.info("init()")
    os.makedirs(output_folder, exist_ok=True)


def run(mini_batch):
    """ Accept a mini-batch, copy each file to the output folder, and return the list."""
    logger = logging.getLogger(LOG_NAME)
    logger.info(f"{__file__}, run({mini_batch})")

    output_folder = os.path.join(os.environ.get("AZ_BATCHAI_INPUT_AZUREML", ""), "temp/output")
    for file_name in mini_batch:
        with open(file_name, "r") as file:
            lines = file.readlines()
        base_name = os.path.basename(file_name)
        name = os.path.join(output_folder, base_name)
        logger.info(f"{__file__}: {name}")
        with open(name, "w") as file:
            file.write(f"output file {name} from {__file__}:\n")
            for line in lines:
                file.write(line)

    return mini_batch
```
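
To exercise the same copy logic locally before submitting, you can drive it with plain files and no SDK dependency, in the spirit of the local-testing note above. Everything here is illustrative; `run_local` is a stand-in for the sample's `run()`:

```python
import os
import tempfile


def run_local(mini_batch, output_folder):
    """Mimic the sample run(): copy each input file into output_folder, then return the batch."""
    os.makedirs(output_folder, exist_ok=True)
    for file_name in mini_batch:
        with open(file_name) as f:
            lines = f.readlines()
        out_path = os.path.join(output_folder, os.path.basename(file_name))
        with open(out_path, "w") as f:
            f.write(f"output file {out_path}:\n")
            f.writelines(lines)
    return mini_batch


# Usage: process a throwaway file as a one-item mini-batch.
work_dir = tempfile.mkdtemp()
src = os.path.join(work_dir, "sample.txt")
with open(src, "w") as f:
    f.write("hello\n")
result = run_local([src], os.path.join(work_dir, "output"))
```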

## Next steps

* See the SDK reference for help with the [azureml-contrib-pipeline-steps](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps?view=azure-ml-py) package and the [documentation](https://docs.microsoft.com/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) for the ParallelRunStep class.

* Follow the [advanced tutorial](tutorial-pipeline-batch-scoring-classification.md) on using pipelines for batch scoring.