Merge pull request #311 from tg-msft/error-codes

markweitzel · web-flow · commit df65ecefcfc0 · 2022-02-01T18:35:09.000-05:00
Update error guidance for Azure data plane services
diff --git a/azure/ConsiderationsForServiceDesign.md b/azure/ConsiderationsForServiceDesign.md
@@ -229,6 +229,34 @@ An important distinction between RELO and status monitor LROs is that there is a
 status monitor LRO, whereas the status of all RELO operations is combined into the status of the resource.
 So status monitor LROs are "one-to-one" with their operation status, whereas RELO-style LROs are "many-to-one".
 
+## Errors
+One of the most important parts of service design is also one of the most overlooked.  The errors returned by your service are a critical part of your developer experience and are part of your API contract.  Your service and your customer's application together form a distributed system.  Errors are inevitable, but well-designed errors can help you avoid costly customer support incidents by empowering customers to self-diagnose problems.
+
+First, you should always try to design errors out of existence if possible.  You'll get a lot of this for free by following the [API Guidelines](https://aka.ms/azapi/guidelines).  Some examples include:
+- Idempotent APIs solve a whole class of network issues where customers have no idea how to proceed if they send a request to the service but never get a response.
+- Accessing resources from multiple microservices can quickly lead to complex race conditions. These can be avoided by supporting conditional requests through an [optimistic concurrency strategy](https://github.com/microsoft/api-guidelines/blob/vNext/azure/Guidelines.md#optimistic-concurrency), e.g. by leveraging `If-Match`/`If-None-Match` request headers.
+- Reframing the purpose of an API can obviate some errors.  This is most often specific to your operations, but an [example from the API Guidelines](https://github.com/microsoft/api-guidelines/blob/vNext/azure/Guidelines.md#http-return-codes) is thinking about `DELETE`s as _"ensure no resource at this location exists"_ so they can return an easier to use `204` instead of _"delete this exact resource instance"_ which would fail with a `404`.
+
+There are two types of errors returned from your service and customers handle them differently.
+- Usage errors where the customer is calling your API incorrectly.  The customer can easily make these errors go away by fixing their code.  We expect most usage errors to be found during testing.
+- Runtime errors that can't be prevented by the customer and need to be recovered from.  Some runtime errors like `429` throttling will be handled automatically by client libraries, but most will be situations like a `409` conflict requiring knowledge about the customer's application to remedy.
+
+You should use appropriate [HTTP status codes](https://developer.mozilla.org/docs/Web/HTTP/Status#client_error_responses) for customers to handle errors generically and error code strings in our common error schema and the `x-ms-error-code` header for customers to handle errors specifically.  As an example, consider what a customer would do when trying to get the properties of a Storage blob:
+- A `404` status code tells them the blob doesn't exist and the customer can report the error to their users
+- A `BlobNotFound` or `ContainerNotFound` error code will tell them why the blob doesn't exist so they can take steps to recreate it
+
+The [common error schema in the Guidelines](https://github.com/microsoft/api-guidelines/blob/vNext/azure/Guidelines.md#handling-errors) allows nested details and inner errors that have their own error codes, but the top-level error code is the most important.  The HTTP status code and the top-level error code are the only part of your error that we consider part of your API contract that follows the same compatibility requirements as the rest of your API.  Importantly, this means you **changing the HTTP status code or top-level error code for an API is a breaking change**.  You can only return new status codes and error codes in future API versions if customers make use of new features that trigger new classes of errors.  Battle tested error handling is some of the hardest code to get right and we can't break that for customers when they upgrade to the latest version.  The rest of the properties in your error like `message`, `details`, etc., are not considered part of your API contract and can change to improve the diagnosability of your service.
+
+You should also return the top-level error code as the `x-ms-error-code` response header so client libraries have the ability to automatically retry requests when possible without having to parse a JSON payload.  We recommend unique error codes like `ContainerBeingDeleted` for every distinct recoverable error that can occur, but suggest reusing common error codes like `InvalidHeaderValue` for usage errors where a descriptive error message is more important for resolving the problem.  The Storage [Common](https://docs.microsoft.com/rest/api/storageservices/common-rest-api-error-codes) and [Blob](https://docs.microsoft.com/rest/api/storageservices/blob-service-error-codes) error codes are a good starting point if you're looking for examples.  You can [define an enum in your spec](https://github.com/Azure/azure-rest-api-specs/blob/main/specification/storage/data-plane/Microsoft.BlobStorage/preview/2021-04-10/blob.json#L10419) with `"modelAsString": true` that lists all of the top-level error codes to make it [easier for your customers to handle specific error codes](https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/storage/Azure.Storage.Blobs#troubleshooting).
+
+You should not document specific error status codes in your OpenAPI/Swagger spec.  The `"default"` response is the only thing AutoRest considers an error response unless you provide other annotations.  Every unique status code turns into a separate code path in your client libraries so we do not encourage this practice.  The only reason to document specific error status codes is if they return a different error response than the default, but that is also heavily discouraged.
+
+Be as precise as possible when writing error messages.  A message with just `Invalid Argument` is almost useless to a customer who sent 100KB of JSON to your endpoint.  ``Query parameter `top` must be less than or equal to 1000`` tells a customer exactly what went wrong so they can quickly fix the problem.  Don't go overboard while writing great, understandable error messages and include any sensitive customer information or secrets though.  Many developers will blindly write any error to logs that don't have the same level of access control as Azure resources.
+
+All responses should include the `x-ms-request-id` header with a unique id for the request, but this is particularly important for error responses.  Service logs for the request should contain the `x-ms-request-id` so that support staff can use this value to diagnose specific customer reported errors.
+
+Finally, write sample code for your service's workflow and add the code you'd want customers using to gracefully recover from errors.  Is it actually graceful?  Is it something you'd be comfortable asking most customers to write?  We also highly encourage reaching out to customers during private preview and asking them for code they've written against your service.  Their error handling might match your expectations, you might find a strong need for better documentation, or you might find important opportunities to improve the errors you're returning.
+
 ## Getting Help: The Azure REST API Stewardship Board
 The Azure REST API Stewardship board is a collection of dedicated architects that are passionate about helping Azure service teams build interfaces that are intuitive, maintainable, consistent, and most importantly, delight our customers. Because APIs affect nearly all downstream decisions, you are encouraged to reach out to the Stewardship board early in the development process. These architects will work with you to apply these guidelines and identify any hidden pitfalls in your design.
 
diff --git a/azure/Guidelines.md b/azure/Guidelines.md
@@ -292,11 +292,15 @@ There are 2 kinds of errors:
 
 :white_check_mark: **DO** return an `x-ms-error-code` response header with a string error code indicating what went wrong.
 
-*NOTE: Error code values are part of your API contract (because customer code is likely to do comparisons against them) and cannot change in the future.*
+*NOTE: `x-ms-error-code` values are part of your API contract (because customer code is likely to do comparisons against them) and cannot change in the future.*
 
-:white_check_mark: **DO** carefully craft `x-ms-error-code` string values for errors that are recoverable at runtime.
+:heavy_check_mark: **YOU MAY** implement the `x-ms-error-code` values as an enum with `"modelAsString": true` because it's possible add new values over time.  In particular, it's only a breaking change if the same conditions result in a *different* top-level error code.
 
-:white_check_mark: **DO** ensure that the top-level `code` field's value is identical to the `x-ms-error-code` header's value.
+:warning: **YOU SHOULD NOT** add new top-level error codes to an existing API without bumping the service version.
+
+:white_check_mark: **DO** carefully craft unique `x-ms-error-code` string values for errors that are recoverable at runtime.  Reuse common error codes for usage errors that are not recoverable.
+
+:white_check_mark: **DO** ensure that the top-level error's `code` value is identical to the `x-ms-error-code` header's value.
 
 :white_check_mark: **DO** document the service's error code strings; they are part of the API contract.
 
@@ -306,7 +310,7 @@ There are 2 kinds of errors:
 
 Property | Type | Required | Description
 -------- | ---- | :------: | -----------
-`error` | ErrorDetail | ✔ | The error object.
+`error` | ErrorDetail | ✔ | The top-level error object whose `code` matches the `x-ms-error-code` response header
 
 **ErrorDetail** : Object
 
@@ -317,6 +321,7 @@ Property | Type | Required | Description
 `target` | String |  | The target of the error.
 `details` | ErrorDetail[] |  | An array of details about specific errors that led to this reported error.
 `innererror` | InnerError |  | An object containing more specific information than the current object about the error.
+_additional properties_ |   | | Additional properties that can be useful when debugging.
 
 **InnerError** : Object
 
@@ -344,6 +349,11 @@ Example:
 
 :heavy_check_mark: **YOU MAY** treat the other fields as you wish as they are _not_ considered part of your service's API contract and customers should not take a dependency on them or their value. They exist to help customers self-diagnose issues.
 
+:heavy_check_mark: **YOU MAY** add additional properties for any data values in your error message so customers don't resort to parsing your error message.  For example, an error with `"message": "A maximum of 16 keys are allowed per account."` might also add a `"maximumKeys": 16` property.  This is not part of your API contract and should only be used for diagnosing problems.
+
+*Note: Do not use this mechanism to provide information developers need to rely on in code (ex: the error message can give details about why you've been throttled, but the `Retry-After` should be what developers rely on to back off).*
+
+:warning: **YOU SHOULD NOT** document specific error status codes in your OpenAPI/Swagger spec unless the "default" response cannot properly describe the specific error response (e.g. body schema is different).
 
 ### JSON
 Services, and the clients that access them, may be written in multiple languages. To ensure interoperability, JSON establishes the "lowest common denominator" type system, which is always sent over the wire as UTF-8 bytes. This system is very simple and consists of three types: