articles/machine-learning/v1/how-to-debug-parallel-run-step.md
- `entry_script`: A user script as a local file path to be run in parallel on multiple nodes. If `source_directory` is present, use a relative path. Otherwise, use any path that's accessible on the machine.

- `mini_batch_size`: The size of the mini-batch passed to a single `run()` call. (optional; the default value is `10` files for `FileDataset` and `1MB` for `TabularDataset`.)

  - For `FileDataset`, it's the number of files with a minimum value of `1`. You can combine multiple files into one mini-batch.

  - For `TabularDataset`, it's the size of data. Example values are `1024`, `1024KB`, `10MB`, and `1GB`. The recommended value is `1MB`. The mini-batch from `TabularDataset` never crosses file boundaries. For example, suppose there are multiple .csv files of various sizes, the smallest being 100 KB and the largest 10 MB. If `mini_batch_size = 1MB` is set, the files smaller than 1 MB are each treated as one mini-batch, and the files larger than 1 MB are split into multiple mini-batches.
  > [!NOTE]
  > TabularDatasets backed by SQL cannot be partitioned.
  > TabularDatasets from a single parquet file and single row group cannot be partitioned.

- `error_threshold`: The number of record failures for `TabularDataset` and file failures for `FileDataset` that should be ignored during processing. Once the error count for the entire input goes above this value, the job is aborted. The error threshold is for the entire input, not for individual mini-batches sent to the `run()` method. The range is `[-1, int.max]`. `-1` indicates ignoring all failures during processing.
- `output_action`: One of the following values indicates how the output is organized:

  - `summary_only`: The user script needs to store the output files. The outputs of `run()` are used only for the error threshold calculation.

  - `append_row`: For all inputs, `ParallelRunStep` creates a single file in the output folder and appends all outputs, separated by line.
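Putting these parameters together, here's a minimal sketch of a `ParallelRunConfig`. The names `batch_env` (an `Environment`), `compute_target` (an AmlCompute target), and the script paths are placeholder assumptions, not values from this article:

```python
from azureml.pipeline.steps import ParallelRunConfig

# Minimal sketch; `batch_env` and `compute_target` are assumed to be
# created earlier in your workspace setup.
parallel_run_config = ParallelRunConfig(
    source_directory="scripts",        # folder that contains the entry script
    entry_script="batch_scoring.py",   # relative path under source_directory
    mini_batch_size="1MB",             # data size for a TabularDataset input
    error_threshold=10,                # abort after more than 10 failed records/files
    output_action="append_row",        # collect all run() outputs into one file
    environment=batch_env,
    compute_target=compute_target,
    node_count=2,
    process_count_per_node=2,
)
```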
You can specify `mini_batch_size`, `node_count`, `process_count_per_node`, `logging_level`, `run_invocation_timeout`, and `run_max_try` as `PipelineParameter`, so that when you resubmit a pipeline run, you can fine-tune the parameter values.
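For example, a sketch of exposing `mini_batch_size` as a `PipelineParameter`:

```python
from azureml.pipeline.core import PipelineParameter

# Expose mini_batch_size so that its value can be overridden when the
# pipeline run is resubmitted, without rebuilding the pipeline.
batch_size_param = PipelineParameter(name="mini_batch_size", default_value="1MB")

# Pass the parameter object instead of a literal value:
# parallel_run_config = ParallelRunConfig(..., mini_batch_size=batch_size_param, ...)
```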
#### CUDA devices visibility
For compute targets equipped with GPUs, the environment variable `CUDA_VISIBLE_DEVICES` is set in worker processes. In AmlCompute, you can find the total number of GPU devices in the environment variable `AZ_BATCHAI_GPU_COUNT_FOUND`, which is set automatically. If you would like each worker process to have a dedicated GPU, set `process_count_per_node` equal to the number of GPU devices on a machine. Each worker process is then assigned a unique index in `CUDA_VISIBLE_DEVICES`. When a worker process stops for any reason, the next started worker process adopts the released GPU index.
When the total number of GPU devices is less than `process_count_per_node`, the worker processes with smaller indexes are assigned GPU indexes until all GPUs are occupied.
As an example, suppose there are two GPU devices in total and `process_count_per_node = 4`. Processes 0 and 1 get indexes 0 and 1. Processes 2 and 3 don't have the environment variable; for a library using this environment variable for GPU assignment, processes 2 and 3 don't have GPUs and don't try to acquire GPU devices. When process 0 stops, it releases GPU index 0. The next process, if applicable (process 4), gets GPU index 0 assigned.
For more information, see [CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES](https://developer.nvidia.com/blog/cuda-pro-tip-control-gpu-visibility-cuda_visible_devices/).
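As an illustration only (not part of the ParallelRunStep API), a worker process could inspect its GPU assignment through these environment variables:

```python
import os

# Inside a worker process started by ParallelRunStep on a GPU compute target.
gpu_index = os.environ.get("CUDA_VISIBLE_DEVICES")  # e.g. "0"; None if no GPU was assigned
total_gpus = int(os.environ.get("AZ_BATCHAI_GPU_COUNT_FOUND", "0"))

if gpu_index is None:
    print(f"No dedicated GPU assigned ({total_gpus} GPU device(s) found on this node).")
else:
    print(f"Assigned GPU index {gpu_index} of {total_gpus} device(s).")
```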
Create the ParallelRunStep by using the script, environment configuration, and the following parameters (a sketch follows the list):
- `parallel_run_config`: A `ParallelRunConfig` object, as defined earlier.

- `inputs`: One or more single-typed Azure Machine Learning datasets to be partitioned for parallel processing.

- `side_inputs`: One or more reference data or datasets used as side inputs, which don't need to be partitioned.

- `output`: An `OutputFileDatasetConfig` object that represents the directory path at which the output data should be stored.

- `arguments`: A list of arguments passed to the user script. Use `unknown_args` to retrieve them in your entry script (optional).

- `allow_reuse`: Whether the step should reuse previous results when run with the same settings/inputs. If this parameter is `False`, a new run is always generated for this step during pipeline execution. (optional; the default value is `True`.)
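A minimal sketch putting these parameters together; `input_ds`, `output_dir`, and `parallel_run_config` are assumed to be defined earlier, and the argument names are placeholders:

```python
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="batch-scoring",
    parallel_run_config=parallel_run_config,       # the ParallelRunConfig from earlier
    inputs=[input_ds.as_named_input("input_ds")],  # dataset to partition
    output=output_dir,                             # an OutputFileDatasetConfig
    arguments=["--model_name", "my_model"],        # placeholder script arguments
    allow_reuse=True,
)
```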
Because of the distributed nature of ParallelRunStep jobs, there are logs from several different sources. However, two consolidated files are created that provide high-level information:
- `~/logs/job_progress_overview.txt`: This file provides high-level information about the number of mini-batches (also known as tasks) created so far and the number of mini-batches processed so far. At the end, it shows the result of the job. If the job fails, it shows the error message and where to start the troubleshooting.
- `~/logs/sys/master_role.txt`: This file provides the principal node (also known as the orchestrator) view of the running job. It includes task creation, progress monitoring, and the run result.
When you need a full understanding of how each node executed the score script, look at the individual process logs for each node. The process logs can be found in the `sys/node` folder, grouped by worker nodes:
- `~/logs/sys/node/<node_id>/<process_name>.txt`: This file provides detailed info about each mini-batch that's picked up or completed by a worker. For each mini-batch, this file includes:
- The IP address and the PID of the worker process.
  - The total number of items, the count of successfully processed items, and the count of failed items.
You can also view the results of periodical checks of the resource usage for each node.
## How do I log from my user script from a remote context?
ParallelRunStep may run multiple processes on one node based on `process_count_per_node`. To organize the logs from each process on a node and combine the print and log statements, we recommend using the ParallelRunStep logger, as shown below. You get a logger from `EntryScript` so that the logs show up in the **logs/user** folder in the portal.
**A sample entry script using the logger:**
```python
# A sketch of the documented pattern: get the logger from EntryScript so
# messages are organized per process under logs/user in the portal.
from azureml_user.parallel_run import EntryScript


def init():
    """Runs once in each worker process."""
    entry_script = EntryScript()
    logger = entry_script.logger
    logger.info("init() is called.")


def run(mini_batch):
    """Runs once per mini-batch."""
    # EntryScript is a singleton; this returns the same instance as in init().
    entry_script = EntryScript()
    logger = entry_script.logger
    logger.info(f"run() processes {len(mini_batch)} items.")
    return mini_batch
```

ParallelRunStep sets a handler on the root logger, which sinks the message to files under `logs/user`.
`logging` defaults to `INFO` level. By default, levels below `INFO` won't show up, such as `DEBUG`.
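If you do want `DEBUG` messages, one possible approach (an illustration; the effective output still depends on the handlers ParallelRunStep installs) is to lower the level on the root logger:

```python
import logging

# Lower the threshold so that DEBUG messages are emitted too.
logging.getLogger().setLevel(logging.DEBUG)
logging.debug("This message now shows up as well.")
```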
## How could I write a file to show up in the portal?
Files written to the `/logs` folder are uploaded and show up in the portal.
You can get the folder `logs/user/entry_script_log/<node_id>` as shown below and compose your file path to write:
```python
# A sketch; `log_dir` on EntryScript points at the
# logs/user/entry_script_log/<node_id> folder described above.
from pathlib import Path

from azureml_user.parallel_run import EntryScript


def init():
    """Write a file that shows up under logs/user/entry_script_log/<node_id>."""
    entry_script = EntryScript()
    log_dir = Path(entry_script.log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    file_path = log_dir / "my_notes.txt"        # hypothetical file name
    file_path.write_text("This file shows up in the portal.")
```
## How do I handle logs in new processes?
You can spawn new processes in your entry script with the [`subprocess`](https://docs.python.org/3/library/subprocess.html) module, connect to their input/output/error pipes, and obtain their return codes.
The recommended approach is to use the [`run()`](https://docs.python.org/3/library/subprocess.html#subprocess.run) function with `capture_output=True`. Errors show up in `logs/user/error/<node_id>/<process_name>.txt`.
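For instance, a small sketch (the command is a placeholder):

```python
import subprocess

# Run a child process and capture its output; inspect the return code
# ourselves instead of raising on failure.
result = subprocess.run(
    ["echo", "hello from a child process"],  # placeholder command
    capture_output=True,
    text=True,
    check=False,
)
print(result.returncode, result.stdout)
```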
If you would like to use `Popen()`, stdout/stderr should be redirected to files, like:
```python
# A sketch; the command is a placeholder, and agent_name/log_dir are the
# EntryScript attributes used in the logging sections above.
from pathlib import Path
from subprocess import Popen

from azureml_user.parallel_run import EntryScript


def init():
    """Redirect a subprocess's stdout/stderr to files under logs/user."""
    entry_script = EntryScript()
    proc_name = entry_script.agent_name         # worker process name, "processNNN"
    log_dir = Path(entry_script.log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    stdout_file = str(log_dir / f"{proc_name}_stdout.txt")
    stderr_file = str(log_dir / f"{proc_name}_stderr.txt")
    proc = Popen(
        ["echo", "hello"],                      # placeholder command
        stdout=open(stdout_file, "w"),
        stderr=open(stderr_file, "w"),
    )
    proc.wait()
```
> [!NOTE]
> A worker process runs "system" code and the entry script code in the same process.
>
> If no `stdout` or `stderr` is specified, a subprocess created with `Popen()` in your entry script inherits the setting of the worker process.
>
> `stdout` writes to `~/logs/sys/node/<node_id>/processNNN.stdout.txt` and `stderr` to `~/logs/sys/node/<node_id>/processNNN.stderr.txt`.
```python
def run(mini_batch):
    ...
    (Path(output_dir) / res2).write...
```
## How can I pass a side input, such as a file or file(s) containing a lookup table, to all my workers?
Users can pass reference data to the script by using the `side_inputs` parameter of `ParallelRunStep`. All datasets provided as side inputs are mounted on each worker node. Users can get the location of the mount by passing an argument.
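A sketch of the pattern; `lookup_ds`, `input_ds`, `output_dir`, and `parallel_run_config` are assumed to exist, and the names are placeholders:

```python
from azureml.pipeline.steps import ParallelRunStep

# Mount the lookup-table dataset on every worker node and pass its mount
# point to the entry script as an argument.
lookup_config = lookup_ds.as_named_input("lookup").as_mount()

step = ParallelRunStep(
    name="batch-scoring-with-side-input",
    inputs=[input_ds.as_named_input("input_ds")],
    output=output_dir,
    arguments=["--lookup_dir", lookup_config],  # the entry script reads this argument
    side_inputs=[lookup_config],
    parallel_run_config=parallel_run_config,
)
```

In the entry script, parse `--lookup_dir` (for example, with `argparse`) to get the path of the mounted folder.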
You can go into `~/logs/sys/error` to see if there's any exception. If there's none, it's likely that your entry script is taking a long time. You can print out progress information in your code to locate the time-consuming part, or add `"--profiling_module", "cProfile"` to the `arguments` of `ParallelRunStep` to generate a profile file named `<process_name>.profile` under the `~/logs/sys/node/<node_id>` folder.
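For example, a fragment showing the extra arguments (the existing argument list is a placeholder):

```python
# Enable cProfile-based profiling of the entry script; the profile file
# appears as <process_name>.profile under ~/logs/sys/node/<node_id>.
arguments = ["--model_name", "my_model"]         # existing script arguments (placeholder)
arguments += ["--profiling_module", "cProfile"]  # turn on profiling
```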
### When will a job stop?
If not canceled, the job may stop with status:
- Completed. If all mini-batches have been processed and output has been generated for `append_row` mode.

- Failed. If `error_threshold` in [`Parameters for ParallelRunConfig`](#parameters-for-parallelrunconfig) is exceeded, or a system error occurs during the job.