articles/machine-learning/batch-inference/how-to-batch-scoring-script.md

Batch deployments distribute work at the file level, which means that a folder containing 100 files with a mini-batch size of 10 files generates 10 mini-batches of 10 files each. Notice that this happens regardless of the size of the files involved. If your files are too big to be processed in large mini-batches, we suggest either splitting the files into smaller files to achieve a higher level of parallelism or decreasing the number of files per mini-batch. At this moment, batch deployments can't account for skews in the file size distribution.

### Running inference at the mini-batch, file, or row level
Batch endpoints call the `run()` function in your scoring script once per mini-batch. However, you decide whether to run inference over the entire mini-batch, over one file at a time, or over one row at a time (if your data is tabular).
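
The following minimal sketch shows the scoring script structure these options build on. It's illustrative only: the `joblib` loader and the `model.pkl` file name are assumptions about how the model was persisted, and the `run()` body is a placeholder you would replace with one of the strategies below.

```python
import os

import joblib  # assumption: the model was persisted with joblib

model = None


def init():
    # Runs once per worker when the deployment starts; load the model here.
    # AZUREML_MODEL_DIR points to the folder of the registered model.
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)


def run(mini_batch):
    # Runs once per mini-batch; `mini_batch` is a list of paths to the files
    # assigned to this call. Return one entry per successfully processed item.
    return [os.path.basename(file_path) for file_path in mini_batch]
```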
#### Mini-batch level
You typically want to run inference over the mini-batch all at once when you need to achieve high throughput in your batch scoring process. This is the case, for instance, when you run inference over a GPU and want to achieve saturation of the inference device. You may also rely on a data loader that can handle the batching itself if the data doesn't fit in memory, like the `TensorFlow` or `PyTorch` data loaders. In those cases, consider running inference over the entire mini-batch.

> [!WARNING]
> Running inference at the mini-batch level may require close control over the input data size to correctly account for the memory requirements and avoid out-of-memory exceptions. Whether you can load the entire mini-batch in memory depends on the size of the mini-batch, the size of the instances in the cluster, and the number of workers on each node.

For an example of how to achieve this, see [High throughput deployments](how-to-image-processing-batch.md#high-throughput-deployments).
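
As an illustration only, here's a minimal sketch of a `run()` function that scores an entire mini-batch in a single call. It assumes tabular CSV inputs, a scikit-learn-style `model` object loaded in `init()`, and that the whole mini-batch fits in the worker's memory:

```python
import pandas as pd


def run(mini_batch):
    # Concatenate every file in the mini-batch into a single DataFrame so the
    # model scores all rows in one call. Assumes the mini-batch fits in memory.
    data = pd.concat(
        [pd.read_csv(file_path) for file_path in mini_batch], ignore_index=True
    )
    predictions = model.predict(data)
    return pd.DataFrame({"prediction": predictions})
```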
#### File level
One of the easiest ways to perform inference is to iterate over all the files in the mini-batch and run your model over each of them. In some cases, like image processing, this may be a good approach. If your data is tabular, make a good estimate of the number of rows in each file to determine whether your model can handle not only loading the entire file into memory but also performing inference over it. Remember that some models (especially those based on recurrent neural networks) unfold and present a memory footprint that may not be linear with the number of rows. If your model is expensive in terms of memory, consider running inference at the row level.

> [!TIP]
> If file sizes are too big to be read all at once, consider breaking the files down into multiple smaller files to achieve a better level of parallelization.

For an example of how to achieve this, see [Image processing with batch deployments](how-to-image-processing-batch.md).
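
The following sketch illustrates the file-level approach. It isn't taken from the linked example; it assumes tabular CSV inputs and a `model` object loaded in `init()`, so only one file has to fit in memory at a time:

```python
import os

import pandas as pd


def run(mini_batch):
    # Score one file at a time; only a single file needs to fit in memory.
    results = []
    for file_path in mini_batch:
        data = pd.read_csv(file_path)  # assumption: tabular CSV input
        predictions = model.predict(data)
        results.append(
            {"file": os.path.basename(file_path), "predictions": predictions.tolist()}
        )
    return pd.DataFrame(results)
```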
#### Row level (tabular)
For models that present challenges with the size of their inputs, you may want to run inference at the row level. Your batch deployment still provides your scoring script with a mini-batch of files; however, you read one file at a time, one row at a time. This may look inefficient, but for some deep learning models it may be the only way to perform inference without scaling up your hardware requirements.

For an example of how to achieve this, see [Text processing with batch deployments](how-to-nlp-processing-batch.md).
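
As a final illustration, this sketch reads each file one row at a time and scores rows individually. It assumes CSV files whose columns are all numeric features and a `model` object loaded in `init()`; it trades throughput for a minimal memory footprint:

```python
import csv


def run(mini_batch):
    results = []
    for file_path in mini_batch:
        with open(file_path, newline="") as csv_file:
            for row in csv.DictReader(csv_file):
                # Assumption: every column is a numeric feature.
                features = [float(value) for value in row.values()]
                results.append(model.predict([features])[0])
    return results
```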
### Relationship between the degree of parallelism and the scoring script
Your deployment configuration controls the size of each mini-batch and the number of workers on each node. Take them into account when deciding whether to read the entire mini-batch to perform inference. When running multiple workers on the same instance, remember that memory is shared across all the workers; usually, increasing the number of workers per node should be accompanied by a decrease in the mini-batch size or by a change in the scoring strategy (if the data size remains the same). For example, doubling the number of workers per node roughly halves the memory available to each worker, so each worker should load correspondingly smaller mini-batches.