|
| 1 | +# Resiliency Policy Error Code Retries |
| 2 | + |
| 3 | +* Author(s): Anton Troshin (@antontroshin), Taction (@taction) |
| 4 | +* Updated: 2024-09-18 |
| 5 | + |
| 6 | +## Overview |
| 7 | + |
| 8 | +This design proposal extends Dapr Resiliency Policy retries so that a retry policy can be enforced only for specific response error codes. |
| 9 | +It focuses only on the [`retries`](https://docs.dapr.io/operations/resiliency/policies/#retries) part of the policy. |
| 10 | + |
| 11 | +## Background |
| 12 | + |
| 13 | +In some applications, certain status codes indicate a business error, and retrying the operation may be unnecessary or undesirable. |
| 14 | +Customizing retry behavior gives users a more granular way to handle error codes, one that suits each use case. |
| 15 | +Currently, all errors are retried when the policy is applied. |
| 16 | +Some status codes are not retryable, because subsequent calls will result in the same error. Avoiding these retries reduces the overall number of requests, traffic, and errors. |
| 17 | + |
| 18 | +## Related Items |
| 19 | + |
| 20 | +https://github.com/dapr/dapr/issues/6683 |
| 21 | +https://github.com/dapr/dapr/issues/6428 |
| 22 | +https://github.com/dapr/dapr/issues/7697 |
| 23 | + |
| 24 | +PR: |
| 25 | +https://github.com/dapr/dapr/pull/7132 |
| 26 | + |
| 27 | +Docs: |
| 28 | +https://github.com/dapr/docs/issues/4254 |
| 29 | +https://github.com/dapr/docs/issues/3859 |
| 30 | + |
| 31 | +## Expectations and alternatives |
| 32 | + |
| 33 | +* What is in scope for this proposal? |
| 34 | + - HTTP and gRPC Service Invocation, direct and proxied |
| 35 | + - Bindings |
| 36 | + - Pub/Sub |
| 37 | + |
| 38 | +## Implementation Details |
| 39 | + |
| 40 | +### Design |
| 41 | + |
| 42 | +Add a new object field, `matching`, to the `retries` policy spec so that the user can specify which status codes should be retried. |
| 43 | +The object has separate fields for HTTP and gRPC status codes. Both fields are optional; when a field is omitted, the existing behavior applies, which is to retry on all errors. |
| 44 | + |
| 45 | +### Example 1: |
| 46 | +In this example, the retry policy will retry **_only_** on HTTP 500 and HTTP status code range 502-504 (inclusive) and gRPC status code range 2-4 (inclusive). |
| 47 | +The rest of the status codes will not be retried. |
| 48 | + |
| 49 | +```yaml |
| 50 | +apiVersion: dapr.io/v1alpha1 |
| 51 | +kind: Resiliency |
| 52 | +metadata: |
| 53 | + name: myresiliency |
| 54 | +scopes: |
| 55 | + - app1 |
| 56 | +spec: |
| 57 | + policies: |
| 58 | + retries: |
| 59 | + pubsubRetry: |
| 60 | + policy: constant |
| 61 | + duration: 5s |
| 62 | + maxRetries: 10 |
| 63 | + matching: |
| 64 | + httpStatusCodes: "500,502-504" |
| 65 | + gRPCStatusCodes: "2-4" |
| 66 | +``` |
| 67 | + |
| 68 | +### Example 2: |
| 69 | +In this example, the retry policy will retry **_only_** on gRPC status code range 1-15 (inclusive). |
| 70 | +Because `httpStatusCodes` is not set, the matching does not apply to HTTP status codes, and HTTP errors are retried according to the default behavior, which is to retry on all errors. |
| 71 | + |
| 72 | +```yaml |
| 73 | +apiVersion: dapr.io/v1alpha1 |
| 74 | +kind: Resiliency |
| 75 | +metadata: |
| 76 | + name: myresiliency |
| 77 | +scopes: |
| 78 | + - app1 |
| 79 | +spec: |
| 80 | + policies: |
| 81 | + retries: |
| 82 | + pubsubRetry: |
| 83 | + policy: constant |
| 84 | + duration: 5s |
| 85 | + maxRetries: 10 |
| 86 | + matching: |
| 87 | + gRPCStatusCodes: "1-15" |
| 88 | +``` |
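
The matching semantics of the two examples above can be illustrated with a minimal Python sketch. This is not the Dapr runtime's Go implementation; `should_retry` and the `EXAMPLE_*` constants are hypothetical names introduced only for illustration, and the sets are written out by hand rather than parsed from the configuration strings.

```python
from typing import FrozenSet, Optional

# Example 1: httpStatusCodes "500,502-504" and gRPCStatusCodes "2-4".
EXAMPLE_1_HTTP: Optional[FrozenSet[int]] = frozenset({500, 502, 503, 504})
EXAMPLE_1_GRPC: Optional[FrozenSet[int]] = frozenset({2, 3, 4})

# Example 2: gRPCStatusCodes "1-15"; httpStatusCodes is not set.
EXAMPLE_2_HTTP: Optional[FrozenSet[int]] = None
EXAMPLE_2_GRPC: Optional[FrozenSet[int]] = frozenset(range(1, 16))


def should_retry(error_code: int, retryable: Optional[FrozenSet[int]]) -> bool:
    """Decide whether a failed call with this status code is retried.

    `retryable` is None when the corresponding field is omitted, which
    keeps the default behavior of retrying on all errors.
    """
    if retryable is None:
        return True
    return error_code in retryable


print(should_retry(503, EXAMPLE_1_HTTP))  # True: 503 is inside 502-504
print(should_retry(404, EXAMPLE_1_HTTP))  # False: 404 is not listed
print(should_retry(404, EXAMPLE_2_HTTP))  # True: no HTTP list, default applies
print(should_retry(16, EXAMPLE_2_GRPC))   # False: 16 is outside 1-15
```

Modeling an omitted field as `None` rather than as an empty set keeps "not configured" (retry everything) distinct from "configured with no codes".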
| 89 | + |
| 90 | +### Acceptable Values |
| 91 | +The acceptable values are the same as the ones defined in the [HTTP Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) and [gRPC Status Codes](https://grpc.io/docs/guides/status-codes/) documentation. |
| 92 | + |
| 93 | +- HTTP: from 100 to 599 |
| 94 | +- gRPC: from 1 to 16 |
| 95 | + |
| 96 | +### Setting Format |
| 97 | +Both the `httpStatusCodes` and `gRPCStatusCodes` fields are optional strings. Each can be set to a comma-separated list of status codes and/or ranges of status codes. |
| 98 | +A range must be in the format `<start>-<end>` (inclusive). A range with more than one dash is not allowed. |
| 99 | + |
| 100 | +### CRD Validation |
| 101 | + |
| 102 | +Both field values should be validated using the Common Expression Language ([CEL](https://kubernetes.io/docs/reference/using-api/cel/)). |
| 103 | +In addition, see the Kubebuilder documentation for [CRD Validation](https://book.kubebuilder.io/reference/markers/crd-validation). |
| 104 | + |
| 105 | +### Parsing the configuration |
| 106 | + |
| 107 | +The configuration values will first be parsed as comma-separated lists. |
| 108 | +Each entry in the list will then be parsed as a single status code or a range of status codes. |
| 109 | +Invalid entries will be logged, and the Dapr runtime will fail to start. |
| 110 | + |
| 111 | +Example: |
| 112 | + |
| 113 | +```yaml |
| 114 | +apiVersion: dapr.io/v1alpha1 |
| 115 | +kind: Resiliency |
| 116 | +metadata: |
| 117 | + name: myresiliency |
| 118 | +scopes: |
| 119 | + - app1 |
| 120 | +spec: |
| 121 | + policies: |
| 122 | + retries: |
| 123 | + pubsubRetry: |
| 124 | + policy: constant |
| 125 | + duration: 5s |
| 126 | + maxRetries: 10 |
| 127 | + matching: |
| 128 | + httpStatusCodes: "500,502-504,15,404-405-500,-1,0," |
| 129 | +``` |
| 130 | +The steps to parse the configuration are: |
| 131 | +1. Split the `httpStatusCodes` configuration string `"500,502-504,15,404-405-500,-1,0,"` on the comma character, ignoring empty strings. This results in the following list: `["500", "502-504", "15", "404-405-500", "-1", "0"]`. |
| 132 | +2. For each entry in the list, parse it as a single status code or a range of status codes. |
| 133 | +3. If the entry is a single status code, add it to the list of status codes to retry. |
| 134 | +4. If the entry is a range of status codes, add all the status codes in the range to the list of status codes to retry. Codes are validated against the acceptable values for the field being parsed (HTTP or gRPC). |
| 135 | +- `500` is a **valid** code for HTTP |
| 136 | +- `502-504` is a **valid** range of codes for HTTP |
| 137 | +- `15` is an **invalid** code for HTTP; an error is logged and the Dapr runtime will fail to start |
| 138 | +- `404-405-500` is an **invalid** range because it contains more than one dash; an error is logged and the Dapr runtime will fail to start |
| 139 | +- `-1` is an **invalid** code for HTTP; an error is logged and the Dapr runtime will fail to start |
| 140 | +- `0` is an **invalid** code for HTTP; an error is logged and the Dapr runtime will fail to start |
| 141 | + |
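
The parsing steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Dapr runtime's Go implementation: `parse_status_codes`, `_parse_code`, `HTTP_BOUNDS`, and `GRPC_BOUNDS` are hypothetical names, and rejecting a range whose start is greater than its end is an assumption that the proposal does not spell out. The function collects error messages instead of exiting, so that a caller can log each invalid entry and then abort startup.

```python
from typing import List, Set, Tuple

# Inclusive bounds taken from the "Acceptable Values" section.
HTTP_BOUNDS = (100, 599)
GRPC_BOUNDS = (1, 16)


def _parse_code(text: str, low: int, high: int) -> int:
    """Parse one status code and check it against the inclusive bounds."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"{text!r} is not a number")
    code = int(text)
    if not low <= code <= high:
        raise ValueError(f"{code} is outside {low}-{high}")
    return code


def parse_status_codes(
    value: str, bounds: Tuple[int, int]
) -> Tuple[Set[int], List[str]]:
    """Return the retryable codes and one error message per invalid entry."""
    low, high = bounds
    codes: Set[int] = set()
    errors: List[str] = []
    for entry in value.split(","):
        if entry == "":
            continue  # step 1: empty strings are ignored
        parts = entry.split("-")
        try:
            if len(parts) > 2:
                raise ValueError("more than one dash in range")
            # A single code is treated as a range of length one.
            start = _parse_code(parts[0], low, high)
            end = _parse_code(parts[-1], low, high)
            if start > end:  # assumption: reversed ranges are invalid
                raise ValueError("range start is greater than range end")
        except ValueError as err:
            errors.append(f"invalid entry {entry!r}: {err}")
            continue
        codes.update(range(start, end + 1))  # steps 3 and 4
    return codes, errors


codes, errors = parse_status_codes("500,502-504,15,404-405-500,-1,0,", HTTP_BOUNDS)
# codes holds 500, 502, 503 and 504; errors holds one message for each of
# "15", "404-405-500", "-1" and "0", so a caller would abort startup.
for message in errors:
    print(message)
```

Because `"-1"` splits into an empty start and `1`, it is rejected by the same numeric check as any other malformed entry, with no special case for negative numbers.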
| 142 | +### Acceptance Criteria |
| 143 | + |
| 144 | +Integration and unit tests will be added to verify the new functionality. |
| 145 | + |
| 146 | +## Completion Checklist |
| 147 | + |
| 148 | +* Code changes |
| 149 | +* Tests added (e2e, unit) |
| 150 | +* Documentation |