
Commit 7ab2854

Merge pull request #279052 from msakande/freshness-concept-endpoints
freshness review - endpoints concept
2 parents 16a9d5c + 9229cbc commit 7ab2854

1 file changed: +32 additions, −32 deletions

articles/machine-learning/concept-endpoints.md

@@ -14,56 +14,58 @@ ms.custom:
   - devplatv2
   - ignite-2023
   - build-2024
-ms.date: 07/12/2023
+ms.date: 06/21/2024
 #Customer intent: As an MLOps administrator, I want to understand what a managed endpoint is and why I need it.
 ---

 # Endpoints for inference in production

 [!INCLUDE [dev v2](includes/machine-learning-dev-v2.md)]

-After you train machine learning models or pipelines, or you found models from our model catalog that suit your needs, you need to deploy them to production so that others can use them for _inference_. Inference is the process of applying new input data to the machine learning model or pipeline to generate outputs. While these outputs are typically referred to as "predictions," inferencing can be used to generate outputs for other machine learning tasks, such as classification and clustering. In Azure Machine Learning, you perform inferencing by using __endpoints__.
+Once you've trained machine learning models or pipelines, or you've found models from the model catalog that suit your needs, you need to deploy them to production so that others can use them for _inference_. Inference is the process of applying new input data to a machine learning model or pipeline to generate outputs. While these outputs are typically referred to as "predictions," inferencing can be used to generate outputs for other machine learning tasks, such as classification and clustering. In Azure Machine Learning, you perform inferencing by using __endpoints__.

 ## Endpoints and deployments

-An **endpoint** is a stable and durable URL that can be used to request or invoke a model. You provide the required inputs to the endpoint and get the outputs back. An endpoint provides:
+An **endpoint** is a stable and durable URL that can be used to request or invoke a model. You provide the required inputs to the endpoint and get the outputs back. Azure Machine Learning allows you to implement serverless API endpoints, online endpoints, and batch endpoints. An endpoint provides:

 - a stable and durable URL (like _endpoint-name.region.inference.ml.azure.com_),
 - an authentication mechanism, and
 - an authorization mechanism.

-A **deployment** is a set of resources and computes required for hosting the model or component that does the actual inferencing. A single endpoint can contain one or several deployments (except for [serverless API](#endpoints-serverless-api-online-and-batch) endpoints). The deployments can host independent assets and consume different resources based on the needs of the assets. Furthermore, endpoints have a routing mechanism that can direct requests to specific deployments in the endpoint.
+A **deployment** is a set of resources and computes required for hosting the model or component that does the actual inferencing. An endpoint contains a deployment, and for online and batch endpoints, one endpoint can contain several deployments. The deployments can host independent assets and consume different resources, based on the needs of the assets. Furthermore, an endpoint has a routing mechanism that can direct requests to any of its deployments.

-Some types of endpoints in Azure Machine Learning consume dedicated resources on their deployments. For these endpoints to run, you must have compute quota on your subscription. However, certain models support a serverless deployment—consuming no quota from your subscription—instead, you're billed based on usage.
+On one hand, some types of endpoints in Azure Machine Learning consume dedicated resources on their deployments. For these endpoints to run, you must have compute quota on your Azure subscription. On the other hand, certain models support a serverless deployment—allowing them to consume no quota from your subscription. For serverless deployment, you're billed based on usage.

 ### Intuition

 Suppose you're working on an application that predicts the type and color of a car, given its photo. For this application, a user with certain credentials makes an HTTP request to a URL and provides a picture of a car as part of the request. In return, the user gets a response that includes the type and color of the car as string values. In this scenario, the URL serves as an __endpoint__.

-:::image type="content" source="media/concept-endpoints/concept-endpoint.png" alt-text="A diagram showing the concept of an endpoint.":::
+:::image type="content" source="media/concept-endpoints/concept-endpoint.png" alt-text="A diagram showing the concept of an endpoint." border="false":::

 Furthermore, say that a data scientist, Alice, is working on implementing the application. Alice knows a lot about TensorFlow and decides to implement the model using a Keras sequential classifier with a RestNet architecture from the TensorFlow Hub. After testing the model, Alice is happy with its results and decides to use the model to solve the car prediction problem. The model is large in size and requires 8 GB of memory with 4 cores to run. In this scenario, Alice's model and the resources, such as the code and the compute, that are required to run the model make up a __deployment under the endpoint__.

-:::image type="content" source="media/concept-endpoints/concept-deployment.png" alt-text="A diagram showing the concept of a deployment.":::
+:::image type="content" source="media/concept-endpoints/concept-deployment.png" alt-text="A diagram showing the concept of a deployment." border="false":::

 Let's imagine that after a couple of months, the organization discovers that the application performs poorly on images with less than ideal illumination conditions. Bob, another data scientist, knows a lot about data augmentation techniques that help a model build robustness on that factor. However, Bob feels more comfortable using Torch to implement the model and trains a new model with Torch. Bob wants to try this model in production gradually until the organization is ready to retire the old model. The new model also shows better performance when deployed to GPU, so the deployment needs to include a GPU. In this scenario, Bob's model and the resources, such as the code and the compute, that are required to run the model make up __another deployment under the same endpoint__.

-:::image type="content" source="media/concept-endpoints/concept-deployment-routing.png" alt-text="A diagram showing the concept of an endpoint with multiple deployments.":::
+:::image type="content" source="media/concept-endpoints/concept-deployment-routing.png" alt-text="A diagram showing the concept of an endpoint with multiple deployments." border="false":::

 ## Endpoints: serverless API, online, and batch

-Azure Machine Learning allows you to implement [serverless API endpoints](how-to-deploy-models-serverless.md), [online endpoints](concept-endpoints-online.md), and [batch endpoints](concept-endpoints-batch.md). Serverless API endpoints and online endpoints are designed for real-time inference—when you invoke the endpoint, the results are returned in the endpoint's response. Serverless API endpoints don't consume quota from your subscription; rather they're billed with pay-as-you-go billing.
+Azure Machine Learning allows you to implement [serverless API endpoints](how-to-deploy-models-serverless.md), [online endpoints](concept-endpoints-online.md), and [batch endpoints](concept-endpoints-batch.md).

-Batch endpoints, on the other hand, are designed for long-running batch inference. Each time you invoke a batch endpoint you generate a batch job that performs the actual work.
+_Serverless API endpoints_ and _online endpoints_ are designed for real-time inference. Whenever you invoke the endpoint, the results are returned in the endpoint's response. Serverless API endpoints don't consume quota from your subscription; rather, they're billed with pay-as-you-go billing.
+
+_Batch endpoints_ are designed for long-running batch inference. Whenever you invoke a batch endpoint, you generate a batch job that performs the actual work.

 ### When to use serverless API, online, and batch endpoints

 __Serverless API endpoints__:

-Use [serverless API endpoints](how-to-deploy-models-serverless.md) to consume large foundational models for real-time inferencing off-the-shelf or for fine-tuning it such models. Not all models are available for deployment to serverless API endpoints. We recommend using this deployment mode when:
+Use [serverless API endpoints](how-to-deploy-models-serverless.md) to consume large foundational models for real-time inferencing off-the-shelf or for fine-tuning such models. Not all models are available for deployment to serverless API endpoints. We recommend using this deployment mode when:

 > [!div class="checklist"]
-> * Your model is a foundational model, or a fine-tuned version of it is available for serverless API deployments.
+> * Your model is a foundational model or a fine-tuned version of a foundational model that is available for serverless API deployments.
 > * You can benefit from a quota-less deployment.
 > * You don't need to customize the inferencing stack used to run the model.
@@ -72,7 +74,7 @@ __Online endpoints__:
 Use [online endpoints](concept-endpoints-online.md) to operationalize models for real-time inference in synchronous low-latency requests. We recommend using them when:

 > [!div class="checklist"]
-> * Your model is a foundational model or a fine-tuned version of it, but it's not supported in serverless API endpoints.
+> * Your model is a foundational model or a fine-tuned version of a foundational model, but it's not supported in serverless API endpoints.
 > * You have low-latency requirements.
 > * Your model can answer the request in a relatively short amount of time.
 > * Your model's inputs fit on the HTTP payload of the request.
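The routing mechanism mentioned earlier lets an online endpoint split traffic across its deployments, as in the Alice/Bob scenario where a new model is rolled out gradually. A toy sketch of weighted routing; the deployment names and 90/10 split are illustrative, and this models the concept, not Azure's implementation:

```python
# Toy model of an online endpoint routing requests across two deployments
# by traffic weight (e.g. 90% to Alice's model, 10% to Bob's new model).
import random


def route(traffic: dict, rng: random.Random) -> str:
    """Pick a deployment with probability proportional to its traffic weight."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


traffic = {"alice-tensorflow": 90, "bob-torch": 10}  # illustrative names
rng = random.Random(0)  # seeded for reproducibility
picks = [route(traffic, rng) for _ in range(1000)]
# Roughly 90% of the 1,000 simulated requests land on "alice-tensorflow".
```

Shifting the weights (say, 50/50, then 0/100) is how a blue-green rollout retires the old deployment without changing the endpoint URL.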
@@ -92,11 +94,11 @@ Use [batch endpoints](concept-endpoints-batch.md) to operationalize models or pi

 ### Comparison of serverless API, online, and batch endpoints

-All serverless API, online, and batch endpoints are based on the idea of endpoints, which help you transition easily from one to the other. Online and batch endpoints also introduce the capability of managing multiple deployments for the same endpoint. The following section explains the different features of each deployment option:
+All serverless API, online, and batch endpoints are based on the idea of endpoints; therefore, you can transition easily from one to the other. Online and batch endpoints are also capable of managing multiple deployments for the same endpoint.

 #### Endpoints

-The following table shows a summary of the different features available to serverless API, online, and batch endpoints.
+The following table shows a summary of the different features available to serverless API, online, and batch endpoints at the endpoint level.

 | Feature | [Serverless API endpoints](how-to-deploy-models-serverless.md) | [Online endpoints](concept-endpoints-online.md) | [Batch endpoints](concept-endpoints-batch.md) |
 |---------------------------------------|--------------------------------------------------|-------------------------------------------------|-----------------------------------------------|
@@ -111,7 +113,7 @@ The following table shows a summary of the different features available to serve
 | Customer-managed keys | NA | Yes | Yes |
 | Cost basis | Per endpoint, per minute<sup>1</sup> | None | None |

-<sup>1</sup>An small fraction is charged for serverless API endpoints per minute. See the [deployments](#deployments) section for the charges related to consumption, which are billed per token.
+<sup>1</sup>A small fraction is charged for serverless API endpoints per minute. See the [deployments](#deployments) section for the charges related to consumption, which are billed per token.


 #### Deployments
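Batch endpoints, compared in these tables, turn each invocation into a batch job rather than an immediate response, and they queue jobs when the cluster is saturated (the "Queuing" row below); queued jobs consume no compute. A toy model of that behavior, with an illustrative class and names, not the Azure SDK:

```python
# Toy model of batch-endpoint semantics: invoke -> job; excess jobs queue,
# and queued jobs consume no compute until an instance frees up.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BatchEndpoint:
    name: str
    max_instances: int                         # cluster cap (cost is capped here too)
    running: list = field(default_factory=list)
    queue: deque = field(default_factory=deque)
    _next_id: int = 1

    def invoke(self, input_path: str) -> str:
        """Invoking a batch endpoint generates a job instead of a live response."""
        job_id = f"batchjob-{self._next_id}"
        self._next_id += 1
        if len(self.running) < self.max_instances:
            self.running.append(job_id)        # consumes compute while running
        else:
            self.queue.append(job_id)          # queued: no resources consumed yet
        return job_id


endpoint = BatchEndpoint(name="car-color-batch", max_instances=2)  # illustrative
jobs = [endpoint.invoke(f"inputs/part-{i}.csv") for i in range(3)]
# Two jobs start immediately; the third waits in the queue.
```

This mirrors the cost model in footnote 5 of the deployments table: charges accrue only while a job runs, not for the deployment itself or for queued jobs.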
@@ -123,30 +125,30 @@ The following table shows a summary of the different features available to serve
 | Deployment types | Models | Models | Models and Pipeline components |
 | MLflow model deployment | No, only specific models in the catalog | Yes | Yes |
 | Custom model deployment | No, only specific models in the catalog | Yes, with scoring script | Yes, with scoring script |
-| Model package deployment <sup>1</sup> | Built-in | Yes (preview) | No |
-| Inference server <sup>2</sup> | Azure AI Model Inference API | - Azure Machine Learning Inferencing Server<br /> - Triton<br /> - Custom (using BYOC) | Batch Inference |
+| Model package deployment <sup>2</sup> | Built-in | Yes (preview) | No |
+| Inference server <sup>3</sup> | Azure AI Model Inference API | - Azure Machine Learning Inferencing Server<br /> - Triton<br /> - Custom (using BYOC) | Batch Inference |
 | Compute resource consumed | None (serverless) | Instances or granular resources | Cluster instances |
 | Compute type | None (serverless) | Managed compute and Kubernetes | Managed compute and Kubernetes |
 | Low-priority compute | NA | No | Yes |
 | Scaling compute to zero | Built-in | No | Yes |
-| Autoscaling compute<sup>3</sup> | Built-in | Yes, based on resources' load | Yes, based on job count |
+| Autoscaling compute<sup>4</sup> | Built-in | Yes, based on resource use | Yes, based on job count |
 | Overcapacity management | Throttling | Throttling | Queuing |
-| Cost basis<sup>4</sup> | Per tokens | Per deployment: compute instances running | Per job: compute instanced consumed in the job (capped to the maximum number of instances of the cluster). |
+| Cost basis<sup>5</sup> | Per token | Per deployment: compute instances running | Per job: compute instances consumed in the job (capped to the maximum number of instances of the cluster) |
 | Local testing of deployments | No | Yes | No |

-<sup>1</sup> Deploying MLflow models to endpoints without outbound internet connectivity or private networks requires [packaging the model](concept-package-models.md) first.
+<sup>2</sup> Deploying MLflow models to endpoints without outbound internet connectivity or private networks requires [packaging the model](concept-package-models.md) first.

-<sup>2</sup> *Inference server* refers to the serving technology that takes requests, processes them, and creates responses. The inference server also dictates the format of the input and the expected outputs.
+<sup>3</sup> *Inference server* refers to the serving technology that takes requests, processes them, and creates responses. The inference server also dictates the format of the input and the expected outputs.

-<sup>3</sup> *Autoscaling* is the ability to dynamically scale up or scale down the deployment's allocated resources based on its load. Online and batch deployments use different strategies for autoscaling. While online deployments scale up and down based on the resource utilization (like CPU, memory, requests, etc.), batch endpoints scale up or down based on the number of jobs created.
+<sup>4</sup> *Autoscaling* is the ability to dynamically scale up or scale down the deployment's allocated resources based on its load. Online and batch deployments use different strategies for autoscaling. While online deployments scale up and down based on the resource utilization (like CPU, memory, requests, etc.), batch endpoints scale up or down based on the number of jobs created.

-<sup>4</sup> Both online and batch deployments charge by the resources consumed. In online deployments, resources are provisioned at deployment time. However, in batch deployment, no resources are consumed at deployment time but when the job runs. Hence, there's no cost associated with the deployment itself. Notice that queued jobs don't consume resources either.
+<sup>5</sup> Both online and batch deployments charge by the resources consumed. In online deployments, resources are provisioned at deployment time. In batch deployment, resources aren't consumed at deployment time but at the time that the job runs. Hence, there's no cost associated with the batch deployment itself. Likewise, queued jobs don't consume resources either.

 ## Developer interfaces

-Endpoints are designed to help organizations operationalize production-level workloads in Azure Machine Learning. Endpoints are robust and scalable resources and they provide the best of the capabilities to implement MLOps workflows.
+Endpoints are designed to help organizations operationalize production-level workloads in Azure Machine Learning. Endpoints are robust and scalable resources, and they provide the best capabilities to implement MLOps workflows.

-You can create and manage batch and online endpoints with multiple developer tools:
+You can create and manage batch and online endpoints with several developer tools:

 - The Azure CLI and the Python SDK
 - Azure Resource Manager/REST API
@@ -155,11 +157,9 @@ You can create and manage batch and online endpoints with multiple developer too
 - Support for CI/CD MLOps pipelines using the Azure CLI interface & REST/ARM interfaces


-## Next steps
+## Related content

-- [How to deploy online endpoints with the Azure CLI and Python SDK](how-to-deploy-online-endpoints.md)
-- [How to deploy models with batch endpoints](how-to-use-batch-model-deployments.md)
+- [Deploy and score a machine learning model by using an online endpoint](how-to-deploy-online-endpoints.md)
+- [Deploy models for scoring in batch endpoints](how-to-use-batch-model-deployments.md)
 - [How to deploy pipelines with batch endpoints](how-to-use-batch-pipeline-deployments.md)
-- [How to use online endpoints with the studio](how-to-use-managed-online-endpoint-studio.md)
-- [How to monitor managed online endpoints](how-to-monitor-online-endpoints.md)
-- [Manage and increase quotas for resources with Azure Machine Learning](how-to-manage-quotas.md#azure-machine-learning-online-endpoints-and-batch-endpoints)
+- [How to monitor managed online endpoints](how-to-monitor-online-endpoints.md)
