
Commit 94aa832

Merge pull request #6570 from s-polly/main
Endpoints for inference - freshness
2 parents 3f0dfd3 + c664485 commit 94aa832

File tree

1 file changed: +24 -24 lines changed

articles/machine-learning/concept-endpoints.md

Lines changed: 24 additions & 24 deletions
@@ -13,60 +13,60 @@ ms.custom:
- devplatv2
- ignite-2023
- build-2024
-ms.date: 06/21/2024
+ms.date: 08/13/2025
#Customer intent: As an MLOps administrator, I want to understand what a managed endpoint is and why I need it.
---

# Endpoints for inference in production

[!INCLUDE [dev v2](includes/machine-learning-dev-v2.md)]

-Once you've trained machine learning models or pipelines, or you've found models from the model catalog that suit your needs, you need to deploy them to production so that others can use them for _inference_. Inference is the process of applying new input data to a machine learning model or pipeline to generate outputs. While these outputs are typically referred to as "predictions," inferencing can be used to generate outputs for other machine learning tasks, such as classification and clustering. In Azure Machine Learning, you perform inferencing by using __endpoints__.
+After you train machine learning models or pipelines, or find suitable models from the model catalog, you need to deploy them to production for others to use for _inference_. Inference is the process of applying new input data to a machine learning model or pipeline to generate outputs. While these outputs are typically called "predictions," inference can generate outputs for other machine learning tasks, such as classification and clustering. In Azure Machine Learning, you perform inference by using __endpoints__.

## Endpoints and deployments

-An **endpoint** is a stable and durable URL that can be used to request or invoke a model. You provide the required inputs to the endpoint and get the outputs back. Azure Machine Learning allows you to implement standard deployments, online endpoints, and batch endpoints. An endpoint provides:
+An **endpoint** is a stable and durable URL that can be used to request or invoke a model. You provide the required inputs to the endpoint and receive the outputs. Azure Machine Learning supports standard deployments, online endpoints, and batch endpoints. An endpoint provides:

-- a stable and durable URL (like _endpoint-name.region.inference.ml.azure.com_),
-- an authentication mechanism, and
-- an authorization mechanism.
+- A stable and durable URL (such as _endpoint-name.region.inference.ml.azure.com_)
+- An authentication mechanism
+- An authorization mechanism
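To make these properties concrete, here's a minimal sketch of calling an online endpoint over plain HTTPS. The endpoint URL, key, and payload shape are hypothetical placeholders, not values from this article:

```python
# Minimal sketch: invoking an online endpoint as a plain HTTPS request.
# The URL, key, and payload shape are hypothetical placeholders.
import json
import urllib.request

scoring_url = "https://car-endpoint.eastus2.inference.ml.azure.com/score"  # the stable, durable URL
api_key = "<endpoint-key>"  # authentication: a key issued for the endpoint

payload = json.dumps({"data": ["<base64-encoded-car-photo>"]}).encode("utf-8")
request = urllib.request.Request(
    scoring_url,
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # the endpoint authorizes the caller before routing
    },
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))  # for example: {"type": "sedan", "color": "red"}
```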

-A **deployment** is a set of resources and computes required for hosting the model or component that does the actual inferencing. An endpoint contains a deployment, and for online and batch endpoints, one endpoint can contain several deployments. The deployments can host independent assets and consume different resources, based on the needs of the assets. Furthermore, an endpoint has a routing mechanism that can direct requests to any of its deployments.
+A **deployment** is a set of resources and compute required to host the model or component that performs the actual inference. An endpoint contains a deployment. For online and batch endpoints, one endpoint can contain several deployments. The deployments can host independent assets and consume different resources based on the needs of the assets. An endpoint also has a routing mechanism that can direct requests to any of its deployments.
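As a sketch of how the two concepts pair up in the Python SDK (v2), the following creates one endpoint and one deployment under it. The workspace details, names, model reference, and VM size are assumptions for illustration, and the exact entity parameters can vary by SDK version:

```python
# Sketch: one endpoint plus one deployment under it, using the azure-ai-ml SDK (v2).
# Workspace details, names, model version, and VM size are hypothetical.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

# The endpoint: a stable URL plus an authentication mode.
endpoint = ManagedOnlineEndpoint(name="car-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# The deployment: the model plus the compute that hosts it.
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="car-endpoint",
    model="azureml:car-classifier:1",  # a registered model, referenced as name:version
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```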

-On one hand, some types of endpoints in Azure Machine Learning consume dedicated resources on their deployments. For these endpoints to run, you must have compute quota on your Azure subscription. On the other hand, certain models support a serverless deployment—allowing them to consume no quota from your subscription. For serverless deployment, you're billed based on usage.
+Some types of endpoints in Azure Machine Learning consume dedicated resources on their deployments. For these endpoints to run, you must have compute quota on your Azure subscription. However, certain models support a serverless deployment, which allows them to consume no quota from your subscription. For serverless deployments, you're billed based on usage.

### Intuition

-Suppose you're working on an application that predicts the type and color of a car, given its photo. For this application, a user with certain credentials makes an HTTP request to a URL and provides a picture of a car as part of the request. In return, the user gets a response that includes the type and color of the car as string values. In this scenario, the URL serves as an __endpoint__.
+Suppose you're working on an application that predicts the type and color of a car from a photo. For this application, a user with certain credentials makes an HTTP request to a URL and provides a picture of a car as part of the request. In return, the user receives a response that includes the type and color of the car as string values. In this scenario, the URL serves as an __endpoint__.

:::image type="content" source="media/concept-endpoints/concept-endpoint.png" alt-text="A diagram showing the concept of an endpoint." border="false":::

-Furthermore, say that a data scientist, Alice, is working on implementing the application. Alice knows a lot about TensorFlow and decides to implement the model using a Keras sequential classifier with a RestNet architecture from the TensorFlow Hub. After testing the model, Alice is happy with its results and decides to use the model to solve the car prediction problem. The model is large in size and requires 8 GB of memory with 4 cores to run. In this scenario, Alice's model and the resources, such as the code and the compute, that are required to run the model make up a __deployment under the endpoint__.
+Now suppose that a data scientist, Alice, is implementing the application. Alice has extensive TensorFlow experience and decides to implement the model using a Keras sequential classifier with a ResNet architecture from the TensorFlow Hub. After testing the model, Alice is satisfied with its results and decides to use the model to solve the car prediction problem. The model is large and requires 8 GB of memory with 4 cores to run. In this scenario, Alice's model and the resources, such as the code and the compute, that are required to run the model make up a __deployment under the endpoint__.

:::image type="content" source="media/concept-endpoints/concept-deployment.png" alt-text="A diagram showing the concept of a deployment." border="false":::

-Let's imagine that after a couple of months, the organization discovers that the application performs poorly on images with less than ideal illumination conditions. Bob, another data scientist, knows a lot about data augmentation techniques that help a model build robustness on that factor. However, Bob feels more comfortable using Torch to implement the model and trains a new model with Torch. Bob wants to try this model in production gradually until the organization is ready to retire the old model. The new model also shows better performance when deployed to GPU, so the deployment needs to include a GPU. In this scenario, Bob's model and the resources, such as the code and the compute, that are required to run the model make up __another deployment under the same endpoint__.
+After a few months, the organization discovers that the application performs poorly on images with poor lighting conditions. Bob, another data scientist, has expertise in data augmentation techniques that help models build robustness for this factor. However, Bob prefers using PyTorch to implement the model and trains a new model with PyTorch. Bob wants to test this model in production gradually until the organization is ready to retire the old model. The new model also performs better when deployed to GPU, so the deployment needs to include a GPU. In this scenario, Bob's model and the resources, such as the code and the compute, that are required to run the model make up __another deployment under the same endpoint__.
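A sketch of Bob's scenario with the same SDK: a second deployment under the same endpoint, with the endpoint's routing mechanism sending only a small share of traffic to it. Names, the model reference, and the GPU size are again hypothetical:

```python
# Sketch: a second deployment under the same endpoint, taking a small share of
# traffic. Names, model reference, and GPU SKU are hypothetical.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

green = ManagedOnlineDeployment(
    name="green",
    endpoint_name="car-endpoint",
    model="azureml:car-classifier-torch:1",  # Bob's PyTorch model
    instance_type="Standard_NC4as_T4_v3",  # a GPU SKU, since the new model benefits from GPU
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(green).result()

# Route 90% of requests to Alice's deployment and 10% to Bob's.
endpoint = ml_client.online_endpoints.get("car-endpoint")
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```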

:::image type="content" source="media/concept-endpoints/concept-deployment-routing.png" alt-text="A diagram showing the concept of an endpoint with multiple deployments." border="false":::

## Endpoints: standard deployment, online, and batch

-Azure Machine Learning allows you to implement [standard deployments](how-to-deploy-models-serverless.md), [online endpoints](concept-endpoints-online.md), and [batch endpoints](concept-endpoints-batch.md).
+Azure Machine Learning supports [standard deployments](how-to-deploy-models-serverless.md), [online endpoints](concept-endpoints-online.md), and [batch endpoints](concept-endpoints-batch.md).

-_standard deployment_ and _online endpoints_ are designed for real-time inference. Whenever you invoke the endpoint, the results are returned in the endpoint's response. Standard deployments don't consume quota from your subscription; rather, they're billed with Standard billing.
+_Standard deployments_ and _online endpoints_ are designed for real-time inference. When you invoke the endpoint, the results are returned in the endpoint's response. Standard deployments don't consume quota from your subscription; instead, they're billed with standard billing.

-_Batch endpoints_ are designed for long-running batch inference. Whenever you invoke a batch endpoint, you generate a batch job that performs the actual work.
+_Batch endpoints_ are designed for long-running batch inference. When you invoke a batch endpoint, you generate a batch job that performs the actual work.
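As a sketch with the same SDK, invoking a batch endpoint returns a job handle rather than predictions. The endpoint name and input path are hypothetical, and the invoke parameter shape can vary by SDK version:

```python
# Sketch: invoking a batch endpoint creates a batch job instead of returning
# predictions in the response. Endpoint name and input path are hypothetical.
from azure.ai.ml import Input, MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>"
)

input_data = Input(
    type="uri_folder",
    path="azureml://datastores/workspaceblobstore/paths/car-images/",
)
job = ml_client.batch_endpoints.invoke(
    endpoint_name="car-batch-endpoint",
    input=input_data,
)

ml_client.jobs.stream(job.name)  # follow the batch scoring job until it finishes
```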

### When to use standard deployment, online, and batch endpoints

-__standard deployment__:
+__Standard deployment__:

Use [standard deployments](how-to-deploy-models-serverless.md) to consume large foundational models for real-time inferencing off-the-shelf or for fine-tuning such models. Not all models are available for deployment to standard deployments. We recommend using this deployment mode when:

> [!div class="checklist"]
> * Your model is a foundational model or a fine-tuned version of a foundational model that is available for standard deployments.
> * You can benefit from a quota-less deployment.
-> * You don't need to customize the inferencing stack used to run the model.
+> * You don't need to customize the inference stack used to run the model.

__Online endpoints__:

@@ -76,7 +76,7 @@ Use [online endpoints](concept-endpoints-online.md) to operationalize models for
> * Your model is a foundational model or a fine-tuned version of a foundational model, but it's not supported in standard deployment.
> * You have low-latency requirements.
> * Your model can answer the request in a relatively short amount of time.
-> * Your model's inputs fit on the HTTP payload of the request.
+> * Your model's inputs fit in the HTTP payload of the request.
> * You need to scale up in terms of number of requests.

__Batch endpoints__:
@@ -93,11 +93,11 @@ Use [batch endpoints](concept-endpoints-batch.md) to operationalize models or pi

### Comparison of standard deployment, online, and batch endpoints

-All standard deployment, online, and batch endpoints are based on the idea of endpoints, therefore, you can transition easily from one to the other. Online and batch endpoints are also capable of managing multiple deployments for the same endpoint.
+All standard deployments, online endpoints, and batch endpoints are based on the idea of endpoints; therefore, you can transition easily from one to the other. Online and batch endpoints are also capable of managing multiple deployments for the same endpoint.

#### Endpoints

-The following table shows a summary of the different features available to standard deployment, online, and batch endpoints at the endpoint level.
+The following table shows a summary of the different features available to standard deployments, online endpoints, and batch endpoints at the endpoint level.

| Feature | [Standard deployments](how-to-deploy-models-serverless.md) | [Online endpoints](concept-endpoints-online.md) | [Batch endpoints](concept-endpoints-batch.md) |
|---------------------------------------|--------------------------------------------------|-------------------------------------------------|-----------------------------------------------|
@@ -117,9 +117,9 @@ The following table shows a summary of the different features available to stand

#### Deployments

-The following table shows a summary of the different features available to standard deployment, online, and batch endpoints at the deployment level. These concepts apply to each deployment under the endpoint (for online and batch endpoints), and apply to standard deployment (where the concept of deployment is built into the endpoint).
+The following table shows a summary of the different features available to standard deployments, online endpoints, and batch endpoints at the deployment level. These concepts apply to each deployment under the endpoint (for online and batch endpoints), and apply to standard deployments (where the concept of deployment is built into the endpoint).

-| Feature | [Standard deployment](how-to-deploy-models-serverless.md) | [Online endpoints](concept-endpoints-online.md) | [Batch endpoints](concept-endpoints-batch.md) |
+| Feature | [Standard deployments](how-to-deploy-models-serverless.md) | [Online endpoints](concept-endpoints-online.md) | [Batch endpoints](concept-endpoints-batch.md) |
|-------------------------------|-------------------------------------------------|-------------------------------------------------|-----------------------------------------------|
| Deployment types | Models | Models | Models and Pipeline components |
| MLflow model deployment | No, only specific models in the catalog | Yes | Yes |
@@ -131,23 +131,23 @@ The following table shows a summary of the different features available to stand
| Scaling compute to zero | Built-in | No | Yes |
| Autoscaling compute<sup>4</sup> | Built-in | Yes, based on resource use | Yes, based on job count |
| Overcapacity management | Throttling | Throttling | Queuing |
-| Cost basis<sup>5</sup> | Per token | Per deployment: compute instances running | Per job: compute instanced consumed in the job (capped to the maximum number of instances of the cluster) |
+| Cost basis<sup>5</sup> | Per token | Per deployment: compute instances running | Per job: compute instances consumed in the job (capped to the maximum number of instances of the cluster) |
| Local testing of deployments | No | Yes | No |


<sup>2</sup> *Inference server* refers to the serving technology that takes requests, processes them, and creates responses. The inference server also dictates the format of the input and the expected outputs.

<sup>3</sup> *Autoscaling* is the ability to dynamically scale up or scale down the deployment's allocated resources based on its load. Online and batch deployments use different strategies for autoscaling. While online deployments scale up and down based on the resource utilization (like CPU, memory, requests, etc.), batch endpoints scale up or down based on the number of jobs created.

-<sup>4</sup> Both online and batch deployments charge by the resources consumed. In online deployments, resources are provisioned at deployment time. In batch deployment, resources aren't consumed at deployment time but at the time that the job runs. Hence, there's no cost associated with the batch deployment itself. Likewise, queued jobs don't consume resources either.
+<sup>4</sup> Both online and batch deployments charge by the resources consumed. In online deployments, resources are provisioned at deployment time. In batch deployments, resources aren't consumed at deployment time but at the time that the job runs. Hence, there's no cost associated with the batch deployment itself. Likewise, queued jobs don't consume resources either.

## Developer interfaces

Endpoints are designed to help organizations operationalize production-level workloads in Azure Machine Learning. Endpoints are robust and scalable resources, and they provide the best capabilities to implement MLOps workflows.

You can create and manage batch and online endpoints with several developer tools:

-- The Azure CLI and the Python SDK
+- Azure CLI and Python SDK
- Azure Resource Manager/REST API
- Azure Machine Learning studio web portal
- Azure portal (IT/Admin)
