Commit f13e0dc

Update Azure API Guidelines for new LRO pattern.
1 parent 27aa5a2 commit f13e0dc

6 files changed

+247
-107
lines changed

azure/ConsiderationsForServiceDesign.md

Lines changed: 186 additions & 72 deletions
@@ -1,9 +1,12 @@
11
# Considerations for Service Design
22

3+
<!-- cspell:ignore autorest, etag, idempotency -->
4+
35
## History
46

57
| Date | Notes |
68
| ----------- | -------------------------------------------------------------- |
9+
| 2022-May-20 | Update guidance on long-running operations |
710
| 2022-Feb-01 | Updated error guidance |
811
| 2021-Sep-11 | Add long-running operations guidance |
912
| 2021-Aug-06 | Updated Azure REST Guidelines per Azure API Stewardship Board. |
@@ -26,7 +29,7 @@ It is critically important to design your service to avoid disrupting users as t
2629
:white_check_mark: **DO** ensure that customers are able to adopt a new version of service or SDK client library **without requiring code changes**
2730

2831
## Azure Management Plane vs Data Plane
29-
*Note: Developing a new service requires the development of at least 1 (management plane) API and potentially one or more additional (data plane) APIs. When reviewing v1 service APIs, we see common advice provided during the review.*
32+
_Note: Developing a new service requires the development of at least 1 (management plane) API and potentially one or more additional (data plane) APIs. When reviewing v1 service APIs, we see common advice provided during the review._
3033

3134
A **management plane** API is implemented through the Azure Resource Manager (ARM) and is used to provision and control the operational state of resources.
3235
A **data plane** API is used by developers to implement applications. Occasionally, some operations are useful for provisioning/control and applications. In this case, the operation can appear in both APIs.
@@ -37,7 +40,7 @@ Although, best practices and patterns described in this document apply to all HT
3740
A great API starts with a well thought out and designed service. Your service should define simple/understandable abstractions with each given a clear name that you use consistently throughout your API and documentation. There must also be an unambiguous relationship between these abstractions.
3841

3942
Follow these practices to create clear names for your abstractions:
40-
- Don't invent fancy terms or use fancy words. Try explaining the abstraction to someone that is not a domain expert and then name the abstraction using similar verbage.
43+
- Don't invent fancy terms or use fancy words. Try explaining the abstraction to someone that is not a domain expert and then name the abstraction using similar verbiage.
4144
- Don't include "throwaway" words in names, like "response", "object", "payload", etc.
4245
- Avoid generic names. Names should be specific to the abstraction and highlight how it is different from other abstractions in your service or related services.
4346
- Pick one word/term out of a set of synonyms and stick to it.
@@ -55,7 +58,7 @@ The whole purpose of a preview to address feedback by improving abstractions, na
5558
## Focus on Hero Scenarios
5659
It is important to realize that writing an API is, in many cases, the easiest part of providing a delightful developer experience. There are a large number of downstream activities for each API, e.g. testing, documentation, client libraries, examples, blog posts, videos, and supporting customers in perpetuity. In fact, implementing an API is of minuscule cost compared to all the other downstream activities.
5760

58-
*For this reason, it is **much better** to ship with fewer features and only add new features over time as required by customers.*
61+
_For this reason, it is **much better** to ship with fewer features and only add new features over time as required by customers._
5962

6063
Focusing on hero scenarios reduces development, support, and maintenance costs; enables teams to align and reach consensus faster; and accelerates the time to delivery. A telltale sign of a service that has not focused on hero scenarios is "API drift," where endpoints are inconsistent, incomplete, or juxtaposed to one another.
6164

@@ -83,7 +86,7 @@ Before releasing your API plan to invest significant design effort, get customer
8386

8487
:ballot_box_with_check: **YOU SHOULD** identify key scenarios or design decisions in your API that you want to test with customers, and ask customers for feedback and to share relevant code samples.
8588

86-
:ballot_box_with_check: **YOU SHOULD** consider doing a *code with* exercise in which you actively develop with the customer, observing and learning from their API usage.
89+
:ballot_box_with_check: **YOU SHOULD** consider doing a _code with_ exercise in which you actively develop with the customer, observing and learning from their API usage.
8790

8891
:ballot_box_with_check: **YOU SHOULD** capture what you have learned during the preview stage and share these findings with your team and with the API Stewardship Board.
8992

@@ -141,94 +144,205 @@ cannot collide with a resource path that contains user-specified resource ids.
141144
Long-running operations are an API design pattern that should be used when the processing of
142145
an operation may take a significant amount of time -- longer than a client will want to block
143146
waiting for the result.
144-
Azure allows for two forms of this design pattern: resource-based long-running operations (RELO),
145-
which is the preferred pattern, and long-running operations with a status monitor.
146147

147-
In both patterns, the processing of the operation is initiated by one API call and the client
148-
obtains the results of the operation from a subsequent API call.
149-
Here we illustrate the sequence of API calls involved in each of these patterns.
148+
The request that initiates a long-running operation returns a response that points to or embeds
149+
a _status monitor_, which is an ephemeral resource that will track the status and final result of the operation.
150+
The status monitor resource is distinct from the target resource (if any) and specific to the individual
151+
operation request.
152+
153+
A POST or DELETE operation returns a `202 Accepted` response with the status monitor in the response body.
154+
A long-running POST should not be used for resource create -- use PUT as described below.
155+
PATCH must never be used for LROs -- it should be reserved for simple resource updates.
156+
If a long-running update is required it should be implemented with POST.
157+
158+
There is a special form of long-running operation initiated with PUT that is described
159+
in [Create (PUT) with additional long-running processing](#create-put-with-additional-long-running-processing).
160+
The remainder of this section describes the pattern for long-running POST and DELETE operations.
161+
162+
This diagram illustrates how a long-running operation with a status monitor is initiated and then how the client
163+
determines it has completed and obtains its results:
164+
165+
```mermaid
166+
sequenceDiagram
167+
participant Client
168+
participant API Endpoint
169+
participant Status Monitor
170+
Client->>API Endpoint: POST/DELETE
171+
API Endpoint->>Client: HTTP/1.1 202 Accepted<br/>Retry-After: 5<br/>{ "id": "22", "status": "NotStarted" }
172+
Client->>Status Monitor: GET
173+
Status Monitor->>Client: HTTP/1.1 200 OK<br/>Retry-After: 5<br/>{ "id": "22", "status": "Running" }
174+
Client->>Status Monitor: GET
175+
Status Monitor->>Client: HTTP/1.1 200 OK<br/>{ "id": "22", "status": "Succeeded" }
176+
```
150177

151-
### Resource-based long-running operations
178+
1. The client sends the request to initiate the long-running operation.
179+
The initial request could be a PUT, POST, or DELETE method.
180+
The request may contain an `operation-id` header that the service uses as the ID of the status monitor created for the operation.
181+
182+
2. The service validates the request and initiates the operation processing.
183+
If there are any problems with the request, the service responds with a `4xx` status code and error response body.
184+
Otherwise, the service responds with a `202 Accepted` HTTP status code.
185+
The response body is the status monitor for the operation including the ID, either from the request header or generated by the service.
186+
When returning a status monitor whose status is not in a terminal state, the response must also include a `retry-after` header indicating the minimum number of seconds the client should wait
187+
before polling (GETting) the status monitor URL again for an update.
188+
For backward compatibility, the response may also include an `Operation-Location` header containing the absolute URL
189+
of the status monitor resource (without an api-version query parameter).
152190

153-
In the RELO pattern, the resource that is the target of the operation contains a `status` field
154-
that holds the status of an outstanding or last completed operation.
155-
This means that the client can use a standard "get" operation on the resource to determine the
156-
status of an operation it initiated. The flow looks like this:
191+
3. After waiting at least the amount of time specified by the previous response's `Retry-after` header,
192+
the client issues a GET request to the status monitor using the ID in the body of the initial response.
193+
The GET operation for the status monitor is documented in the REST API definition and the ID
194+
is the last URL path segment.
157195

158-
<!-- markdownlint-disable MD033 -->
159-
<p align="center">
160-
<img src="./relo.jpg" alt="The RELO flow"/>
161-
</p>
162-
<!-- markdownlint-enable MD033 -->
196+
4. The status monitor responds with information about the operation including its current status,
197+
which should be represented as one of a fixed set of string values in a field named `status`.
198+
If the operation is still being processed, the status field will contain a "non-terminal" value, like `NotStarted` or `Running`.
163199

164-
1. The client sends the initial request to the resource to initiate the long-running operation.
165-
This initial request could be a PUT, PATCH, POST, or DELETE method.
200+
5. After the operation processing completes, a GET request to the status monitor returns the status monitor with a status field set to a terminal value -- `Succeeded`, `Failed`, or `Canceled` -- that indicates the result of the operation.
201+
If the status is `Failed`, the status monitor resource contains an `error` field with a `code` and `message` that describes the failure.
202+
If the status is `Succeeded` and the LRO is an Action operation, the operation results will be returned in the `result` field of the status monitor.
203+
If the status is `Succeeded` and the LRO is an operation on a resource, the client can perform a GET on the resource
204+
to observe the result of the operation if desired.
166205

167-
2. The resource validates the request and initiates the operation processing.
168-
It sends a response to client with a `200-OK` HTTP status code (or `201-Created` if the operation
169-
is a create operation) and a representation of the resource where the `status` field is set
170-
to a value indicating that the operation processing has been started.
206+
6. There may be some cases where a long-running operation can be completed before the response to the initial request.
207+
In these cases, the operation should still return a `202 Accepted` with the `status` property set to the appropriate terminal state.
171208

172-
3. The client then issues a GET request to the resource to determine if the operation processing
173-
has completed.
209+
7. The service will auto-purge the status monitor resource some time after the operation completes, but must retain it for at least 24 hours.
210+
The service may also support DELETE on the status monitor resource to meet GDPR/privacy requirements.
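
The client side of steps 3 through 5 can be sketched as a small polling loop. This is a minimal illustration rather than a reference client: the `fetch` callable, the `poll_status_monitor` name, the `max_polls` limit, and the one-second fallback delay are all assumptions made for the example, and `Retry-After` is assumed to carry a number of seconds as described above.

```python
import time
from typing import Callable, Mapping, Tuple

# Terminal status values named in the guidance above.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Canceled"}

# Hypothetical transport: takes the status monitor URL and returns
# (response headers, parsed JSON body). A real client would wrap an HTTP library.
Fetch = Callable[[str], Tuple[Mapping[str, str], dict]]


def poll_status_monitor(
    monitor_url: str,
    fetch: Fetch,
    first_delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 100,
) -> dict:
    """GET the status monitor until its status is terminal, then return it."""
    delay = first_delay  # Retry-After value from the initial 202 response
    for _ in range(max_polls):
        sleep(delay)
        headers, monitor = fetch(monitor_url)
        if monitor["status"] in TERMINAL_STATUSES:
            return monitor
        # Header names are case-insensitive, so normalize before the lookup.
        lowered = {name.lower(): value for name, value in headers.items()}
        # Assumption: fall back to 1 second if the header is missing.
        delay = float(lowered.get("retry-after", 1))
    raise TimeoutError(f"operation did not finish after {max_polls} polls")
```

Passing `fetch` and `sleep` as parameters keeps the loop independent of any particular HTTP library and makes it easy to exercise with canned responses.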
174211

175-
4. The resource responds with a representation of the resource. While the operation is still being
176-
processed, the status field will contain a "non-terminal" value, like `Processing`.
212+
### Long-running Action Operations
177213

178-
5. After the operation processing has completed, a GET request from the client will receive a response
179-
where the status field contains a "terminal" value -- `Succeeded`, `Failed`, or `Canceled` --
180-
that indicates the result of the operation.
214+
An action operation that is also long-running combines the [Action Operations](#action-operations) pattern
215+
with the [Long Running Operations](#long-running-operations) pattern.
181216

182-
A resource may support multiple outstanding RELO operations, where the status field of the resource
183-
indicates the combined status of the outstanding operations.
184-
If a new operation request is received when there is already a long-running operation in progress for a resource,
185-
the service should reject the operation if it is inconsistent with one already in progress.
186-
However, if the new operation is redundant or not inconsistent with the one in progress,
187-
for example a "reboot" operation on a VM that is in the process of rebooting, then the service should
188-
accept the request. The status field of the resource should then report the completion status of _both_
189-
operations.
217+
The operation is initiated with a POST operation and the operation path ends in `:action`.
190218

191-
Note: The RELO pattern should not be used in cases where the completion status of individual operations
192-
may be important to users, as opposed to simply learning that an operation of the type they requested
193-
(e.g. create a resource with a specific name) has successfully completed.
219+
```text
220+
POST /<service-or-resource-url>:action
221+
Operation-Id: 22
222+
223+
{
224+
"arg1": 123,
225+
"arg2": "abc"
226+
}
227+
```
194228

195-
### Long-running operations with status monitor
229+
The response is a `202 Accepted` as described above.
230+
231+
```text
232+
HTTP/1.1 202 Accepted
233+
Operation-Location: https://<status-monitor-endpoint>/22
234+
Retry-After: 5
235+
236+
{
237+
"id": "22",
238+
"status": "NotStarted"
239+
}
240+
```
196241

197-
In the LRO with status monitor pattern, the status and results of the operation are encapsulated into
198-
a status monitor resource that is distinct from the target resource and specific to the individual
199-
operation request. Here's what the status monitor LRO pattern looks like:
242+
The client will issue a GET to the status monitor to obtain the status and result of the operation.
200243

201-
<!-- markdownlint-disable MD033 -->
202-
<p align="center">
203-
<img src="./statmon.jpg" alt="The status monitor LRO flow"/>
204-
</p>
205-
<!-- markdownlint-enable MD033 -->
244+
```text
245+
GET https://<status-monitor-endpoint>/22?api-version=2022-05-01
246+
```
206247

207-
1. The client sends the request to initiate the long-running operation.
208-
As in the RELO pattern, the initial request could be a PUT, PATCH, POST, or DELETE method.
248+
When the operation completes successfully, the result (if there is one) will be included in the `result` field of the status monitor.
209249

210-
2. The resource validates the request and initiates the operation processing.
211-
It sends a response to the client with a `202-Accepted` HTTP status code.
212-
Included in this response is an `Operation-location` response header with the absolute URL of
213-
status monitor for this specific operation.
214-
The response also includes a `Retry-after` header telling the client a minimum time to wait (in seconds)
215-
before sending a request to the status monitor URL.
250+
```text
251+
HTTP/1.1 200 OK
252+
253+
{
254+
"id": "22",
255+
"status": "Succeeded",
256+
"result": { ... }
257+
}
258+
```
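
Once the status monitor reaches a terminal state, a client typically turns it into either a return value or an error. The sketch below shows one way to do that, assuming the `result` and `error` fields described in this section; the `OperationFailed` exception and the `unwrap_result` helper are hypothetical names introduced only for illustration.

```python
from typing import Any, Optional


class OperationFailed(Exception):
    """Hypothetical exception carrying the status monitor's error details."""

    def __init__(self, code: Optional[str], message: Optional[str]) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def unwrap_result(monitor: dict) -> Any:
    """Return the action's result, or raise if the operation did not succeed."""
    status = monitor["status"]
    if status == "Succeeded":
        # The result field is optional: not every action produces one.
        return monitor.get("result")
    if status == "Failed":
        error = monitor.get("error", {})
        raise OperationFailed(error.get("code"), error.get("message"))
    if status == "Canceled":
        raise OperationFailed("Canceled", "The operation was canceled.")
    raise ValueError(f"status {status!r} is not terminal; keep polling")
```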
216259

217-
3. After waiting at least the amount of time specified by the previous response's `Retry-after` header,
218-
the client issues a GET request to the status monitor URL.
260+
### Create (PUT) with additional long-running processing
219261

220-
4. The status monitor URL responds with information about the operation including its current status,
221-
which should be represented as one of a fixed set of string values in a field named `status`.
222-
If the operation is still being processed, the status field will contain a "non-terminal" value, like `Processing`.
262+
A special case of long-running operation that occurs often is a PUT operation to create a resource
263+
that involves some additional long-running processing.
264+
One example is a resource that requires physical resources (e.g. servers) to be "provisioned" before the resource becomes functional.
265+
In this case, the request may contain an `operation-id` header that the service will use as
266+
the ID of the status monitor created for the operation.
267+
268+
```text
269+
PUT /items/FooBar?api-version=2022-05-01
270+
Operation-Id: 22
271+
272+
{
273+
"prop1": 555,
274+
"prop2": "something"
275+
}
276+
```
277+
278+
In this case the response to the initial request is a `201 Created` to indicate that the resource has been created.
279+
The response body contains a representation of the created resource, which is the standard pattern for a create operation.
280+
A status monitor is created to track the additional processing and the ID of the status monitor
281+
is returned in the `Operation-Id` header of the response.
282+
The response may also include an `Operation-Location` header for backward compatibility.
283+
If the resource supports ETags, the response may contain an `etag` header and possibly an `etag` property in the resource.
284+
285+
```text
286+
HTTP/1.1 201 Created
287+
Operation-Id: 22
288+
Operation-Location: https://items/operations/22
289+
etag: "123abc"
290+
291+
{
292+
"id": "FooBar",
293+
"etag": "123abc",
294+
"prop1": 555,
295+
"prop2": "something"
296+
}
297+
```
223298

224-
5. After the operation processing completes, a GET request to status monitor URL returns a response with a status field containing a terminal value -- `Succeeded`, `Failed`, or `Canceled` -- that indicates the result of the operation.
225-
If the status is `Failed`, the status monitor resource must contain an `error` field with a `code` and `message` that describes the failure.
226-
If the status is `Succeeded`, the response may contain additional fields as appropriate, such as results
227-
of the operation processing.
299+
The client will issue a GET to the status monitor to obtain the status of the operation performing the additional processing.
228300

229-
An important distinction between RELO and status monitor LROs is that there is a unique status monitor for each
230-
status monitor LRO, whereas the status of all RELO operations is combined into the status of the resource.
231-
So status monitor LROs are "one-to-one" with their operation status, whereas RELO-style LROs are "many-to-one".
301+
```text
302+
GET https://items/operations/22?api-version=2022-05-01
303+
```
304+
305+
When the additional processing completes, the status monitor will indicate if it succeeded or failed.
306+
307+
```text
308+
HTTP/1.1 200 OK
309+
310+
{
311+
"id": "22",
312+
"status": "Succeeded"
313+
}
314+
```
315+
316+
If the additional processing failed, the service may delete the original resource if it is not usable in this state.
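
On the service side, the create-then-provision flow might look roughly like the following in-memory sketch. Everything here is illustrative: the `ProvisioningStore` class, its method names, and the `ProvisioningFailed` error code are assumptions, and a real service would run the additional processing in a background job rather than through an explicit `finish` call.

```python
import uuid
from typing import Dict, Optional, Tuple


class ProvisioningStore:
    """Hypothetical in-memory stand-in for a service's resource and operation stores."""

    def __init__(self) -> None:
        self.resources: Dict[str, dict] = {}
        self.operations: Dict[str, dict] = {}
        self._targets: Dict[str, str] = {}  # operation ID -> resource name

    def put_item(
        self, name: str, body: dict, operation_id: Optional[str] = None
    ) -> Tuple[int, Dict[str, str], dict]:
        """Create the resource immediately and start tracking the extra processing."""
        # Use the client-supplied Operation-Id if present, else generate one.
        op_id = operation_id or str(uuid.uuid4())
        self.resources[name] = {"id": name, **body}
        self.operations[op_id] = {"id": op_id, "status": "NotStarted"}
        self._targets[op_id] = name
        # 201 Created, the status monitor ID in a header, the resource as the body.
        return 201, {"Operation-Id": op_id}, self.resources[name]

    def finish(self, op_id: str, succeeded: bool, resource_usable: bool = True) -> dict:
        """Record the outcome of the additional processing on the status monitor."""
        monitor = self.operations[op_id]
        if succeeded:
            monitor["status"] = "Succeeded"
            return monitor
        monitor["status"] = "Failed"
        # Hypothetical error code and message, for illustration only.
        monitor["error"] = {"code": "ProvisioningFailed", "message": "Provisioning did not complete."}
        if not resource_usable:
            # The service may delete a resource that is unusable after a failure.
            self.resources.pop(self._targets[op_id], None)
        return monitor
```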
317+
318+
### Long-running delete operation
319+
320+
A long-running delete operation follows the general pattern of a long-running operation --
321+
it returns a `202 Accepted` with a status monitor which the client uses to determine the outcome of the delete.
322+
323+
The resource being deleted should remain visible (returned from a GET) until the delete operation completes successfully.
324+
325+
When the delete operation completes successfully, a client must be able to create a new resource with the same name without conflicts.
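
The two visibility rules above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `DeletableStore` and its methods are hypothetical, `create_if_absent` returns `409` for an existing name purely to make a conflict observable, and `complete_delete` stands in for whatever background work actually performs the delete.

```python
from typing import Dict, Optional, Tuple


class DeletableStore:
    """Hypothetical in-memory model of a resource collection with long-running delete."""

    def __init__(self) -> None:
        self.items: Dict[str, dict] = {}
        self.operations: Dict[str, dict] = {}
        self._pending: Dict[str, str] = {}  # operation ID -> resource name

    def create_if_absent(self, name: str, body: dict) -> int:
        # Assumption: report a conflict (409) when the name is already taken.
        if name in self.items:
            return 409
        self.items[name] = {"id": name, **body}
        return 201

    def get(self, name: str) -> Tuple[int, Optional[dict]]:
        item = self.items.get(name)
        return (200, item) if item is not None else (404, None)

    def begin_delete(self, name: str, op_id: str) -> Tuple[int, Optional[dict]]:
        """Accept the delete and return 202 with a status monitor."""
        if name not in self.items:
            return 404, None
        # The resource stays in self.items, so GET keeps returning it.
        self.operations[op_id] = {"id": op_id, "status": "Running"}
        self._pending[op_id] = name
        return 202, self.operations[op_id]

    def complete_delete(self, op_id: str) -> dict:
        """Remove the resource only when the delete has actually finished."""
        name = self._pending.pop(op_id)
        del self.items[name]
        self.operations[op_id]["status"] = "Succeeded"
        return self.operations[op_id]
```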
326+
327+
### Controlling a long-running operation
328+
329+
It might be necessary to support some control action on a long-running operation, such as cancel.
330+
This is implemented as a POST on the status monitor endpoint with `:<action>` appended, for example `:cancel`.
331+
332+
```text
333+
POST /<status-monitor-url>:cancel
334+
```
335+
336+
A successful response to a control operation should be a `200 OK` with a representation of the status monitor.
337+
338+
```text
339+
HTTP/1.1 200 OK
340+
341+
{
342+
"id": "22",
343+
"status": "Canceled"
344+
}
345+
```
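
A handler for the `:cancel` action could be as small as the sketch below. The guidance above does not say what should happen when the operation has already reached a terminal state, so returning the unchanged status monitor in that case is an assumption of this example, as is the `cancel_operation` name.

```python
from typing import Tuple

TERMINAL_STATUSES = {"Succeeded", "Failed", "Canceled"}


def cancel_operation(monitor: dict) -> Tuple[int, dict]:
    """Handle POST <status-monitor-url>:cancel for an in-memory status monitor."""
    if monitor["status"] not in TERMINAL_STATUSES:
        monitor["status"] = "Canceled"
    # Assumption: an already-finished operation is returned as-is.
    return 200, monitor
```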
232346

233347
## Errors
234348
One of the most important parts of service design is also one of the most overlooked. The errors returned by your service are a critical part of your developer experience and are part of your API contract. Your service and your customer's application together form a distributed system. Errors are inevitable, but well-designed errors can help you avoid costly customer support incidents by empowering customers to self-diagnose problems.
