From 04323f94819cf7952092b8b10f8742f681c1ab3e Mon Sep 17 00:00:00 2001 From: Anton Troshin Date: Wed, 18 Sep 2024 20:47:13 -0500 Subject: [PATCH 1/5] Resiliency Policy Error Code Retries Signed-off-by: Anton Troshin --- 20240917-BR-resiliency-error-code-retries.md | 145 +++++++++++++++++++ templates/proposal.md | 4 +- 2 files changed, 147 insertions(+), 2 deletions(-) create mode 100644 20240917-BR-resiliency-error-code-retries.md diff --git a/20240917-BR-resiliency-error-code-retries.md b/20240917-BR-resiliency-error-code-retries.md new file mode 100644 index 0000000..6692dfa --- /dev/null +++ b/20240917-BR-resiliency-error-code-retries.md @@ -0,0 +1,145 @@ +# Resiliency Policy Error Code Retries + +* Author(s): Anton Troshin (@antontroshin), Taction (@taction) +* Updated: 2024-09-18 + +## Overview + +This is a design proposal to provide additional functionality for Dapr Resiliency Policy Retries to be able to enforce policy only on specific response error codes. +It only focuses on the `retries` (https://docs.dapr.io/operations/resiliency/policies/#retries) part of the policy. + +## Background + +In some applications, error codes may be used to indicate the business error, and retrying the operation might not be necessary or otherwise desirable. +Customizing retry behavior will allow a more granular way to handle error codes that suit each use case. +Currently, all errors are retried when the policy is applied. +Some errors are not retryable, and subsequent calls will result in the same error, avoiding these retry calls will reduce the overall amount of requests, traffic, and errors. + +## Related Items + +https://github.com/dapr/dapr/issues/6683 +https://github.com/dapr/dapr/issues/6428 +https://github.com/dapr/dapr/issues/7697 + +PR: +https://github.com/dapr/dapr/pull/7132 + +Docs: +https://github.com/dapr/docs/issues/4254 +https://github.com/dapr/docs/issues/3859 + +## Expectations and alternatives + +* What is in scope for this proposal? + - HTTP and gRPC Service Invocation, direct and proxied + - Bindings + - Pub/Sub + +## Implementation Details + +### Design + +Add a new object field to the `retries` policy Spec to allow the user to specify the error codes that should be retried. +Separate fields for HTTP and gRPC. The new fields should be optional and will default to the existing behavior, which is to retry on all errors. + +### Example 1: +In this example, the retry policy will retry **_only_** on HTTP 500 and HTTP error range 502-504 (inclusive) and gRPC error range 2-4 (inclusive). +The rest of the errors will not be retried. + +```yaml +apiVersion: dapr.io/v1alpha1 +kind: Resiliency +metadata: + name: myresiliency +scopes: + - app1 +spec: + policies: + retries: + pubsubRetry: + policy: constant + duration: 5s + maxRetries: 10 + matching: + httpStatusCodes: "500,502-504" + gRPCStatusCodes: "2-4" +``` + +### Example 2: +In this example, the retry policy will retry **_only_** on gRPC error range 1-15 (inclusive). +However, this policy will not apply to the HTTP errors, and they will be retried according to the default behavior, which is to retry on all errors. + +```yaml +apiVersion: dapr.io/v1alpha1 +kind: Resiliency +metadata: + name: myresiliency +scopes: + - app1 +spec: + policies: + retries: + pubsubRetry: + policy: constant + duration: 5s + maxRetries: 10 + matching: + gRPCStatusCodes: "1-15" +``` + +### Acceptable Values +The acceptable values are the same as the ones defined in the [HTTP Status Codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) and [gRPC Status Codes](https://grpc.io/docs/guides/status-codes/) documentation. + +- HTTP: from 100 to 599 +- gRPC: from 1 to 16 + +### Setting Format +Both the `httpStatusCodes` and `gRPCStatusCodes` fields are of type string and optional and can be set to a comma-separated list of error codes and/or ranges of error codes. +The range must be in the format `-` (inclusive). Having more than one dash in the range is not allowed. + +### Parsing the configuration + +The configuration values will be first parsed as comma-separated lists. +Each entry in the list will be then parsed as a single error code or a range of error codes. +For invalid entries, the error will be logged when the policy is first loaded and the entry will be ignored, this will not fail the entire policy or the application start. + +Example: + +```yaml +apiVersion: dapr.io/v1alpha1 +kind: Resiliency +metadata: + name: myresiliency +scopes: + - app1 +spec: + policies: + retries: + pubsubRetry: + policy: constant + duration: 5s + maxRetries: 10 + matching: + httpStatusCodes: "500,502-504,15,404-405-500,-1,0," +``` +The steps to parse the configuration are: +1. Split the `httpStatusCodes` configuration string `"500,502-504,15,404-405-500,-1,0,"` by the comma character resulting in the following list: `["500", "502-504", "15", "404-405-500", "-1", "0"]` ignoring the empty strings. +2. For each entry in the list, parse it as a single error code or a range of error codes. +3. If the entry is a single error code, add it to the list of error codes to retry. +4. If the entry is a range of error codes (each field for the relevant HTTP or gRPC error codes), add all the error codes in the range to the list of error codes to retry. +- 500 is **valid** code for HTTP +- 502-504 **valid** range of codes for HTTP +- 15 is **invalid** code for HTTP, error logged and entry ignored +- 404-405-500 is **invalid** range contains more than one dash, error logged and entry ignored +- -1 is ignored is **invalid** code for HTTP, error logged and entry ignored +- 0 is ignored is **invalid** code for HTTP, error logged and entry ignored + +### Acceptance Criteria + +Integration and unit tests will be added to verify the new functionality. + +## Completion Checklist + +* Code changes +* Tests added (e2e, unit) +* Documentation diff --git a/templates/proposal.md b/templates/proposal.md index e5c03b7..29c2484 100644 --- a/templates/proposal.md +++ b/templates/proposal.md @@ -13,7 +13,7 @@ A brief description of the proposal; include information such as: ## Background -This section is intented to provide the community with the reasoning behind this proposal -- why is this proposal being made? What problem is it solving for users / developers / operators and how does it solve that for them? +This section is intended to provide the community with the reasoning behind this proposal -- why is this proposal being made? What problem is it solving for users / developers / operators and how does it solve that for them? ## Related Items @@ -56,7 +56,7 @@ How will this work, technically? Where applicable, include: How will success be measured? * Performance targets -* Compabitility requirements +* Compatibility requirements * Metrics ## Completion Checklist From 9581cd573ea0f091a294295738db9e784ae5ad5f Mon Sep 17 00:00:00 2001 From: Anton Troshin Date: Thu, 19 Sep 2024 12:31:13 -0500 Subject: [PATCH 2/5] Change the behavior of the policy loading on parsing error Change some wording to be more consistent Signed-off-by: Anton Troshin --- 20240917-BR-resiliency-error-code-retries.md | 34 ++++++++++---------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/20240917-BR-resiliency-error-code-retries.md b/20240917-BR-resiliency-error-code-retries.md index 6692dfa..5ff0097 100644 --- a/20240917-BR-resiliency-error-code-retries.md +++ b/20240917-BR-resiliency-error-code-retries.md @@ -10,10 +10,10 @@ It only focuses on the `retries` (https://docs.dapr.io/operations/resiliency/pol ## Background -In some applications, error codes may be used to indicate the business error, and retrying the operation might not be necessary or otherwise desirable. +In some applications, some status codes may be used to indicate the business error, and retrying the operation might not be necessary or otherwise desirable. Customizing retry behavior will allow a more granular way to handle error codes that suit each use case. Currently, all errors are retried when the policy is applied. -Some errors are not retryable, and subsequent calls will result in the same error, avoiding these retry calls will reduce the overall amount of requests, traffic, and errors. +Some status codes are not retryable, and subsequent calls will result in the same error. Avoiding these retry calls will reduce the overall number of requests, traffic, and errors. ## Related Items @@ -39,12 +39,12 @@ https://github.com/dapr/docs/issues/3859 ### Design -Add a new object field to the `retries` policy Spec to allow the user to specify the error codes that should be retried. +Add a new object field to the `retries` policy Spec to allow the user to specify the status codes that should be retried. Separate fields for HTTP and gRPC. The new fields should be optional and will default to the existing behavior, which is to retry on all errors. ### Example 1: -In this example, the retry policy will retry **_only_** on HTTP 500 and HTTP error range 502-504 (inclusive) and gRPC error range 2-4 (inclusive). -The rest of the errors will not be retried. +In this example, the retry policy will retry **_only_** on HTTP 500 and HTTP status code range 502-504 (inclusive) and gRPC status code range 2-4 (inclusive). +The rest of the status codes will not be retried. ```yaml apiVersion: dapr.io/v1alpha1 @@ -66,8 +66,8 @@ spec: ``` ### Example 2: -In this example, the retry policy will retry **_only_** on gRPC error range 1-15 (inclusive). -However, this policy will not apply to the HTTP errors, and they will be retried according to the default behavior, which is to retry on all errors. +In this example, the retry policy will retry **_only_** on gRPC status code range 1-15 (inclusive). +However, this policy will not apply to the HTTP status codes, and they will be retried according to the default behavior, which is to retry on all errors. ```yaml apiVersion: dapr.io/v1alpha1 @@ -94,14 +94,14 @@ The acceptable values are the same as the ones defined in the [HTTP Status Codes - gRPC: from 1 to 16 ### Setting Format -Both the `httpStatusCodes` and `gRPCStatusCodes` fields are of type string and optional and can be set to a comma-separated list of error codes and/or ranges of error codes. +Both the `httpStatusCodes` and `gRPCStatusCodes` fields are of type string and optional and can be set to a comma-separated list of status codes and/or ranges of status codes. The range must be in the format `-` (inclusive). Having more than one dash in the range is not allowed. ### Parsing the configuration The configuration values will be first parsed as comma-separated lists. -Each entry in the list will be then parsed as a single error code or a range of error codes. -For invalid entries, the error will be logged when the policy is first loaded and the entry will be ignored, this will not fail the entire policy or the application start. +Each entry in the list will be then parsed as a single status code or a range of status codes. +Invalid entries will be logged and the application will fail to start. Example: @@ -124,15 +124,15 @@ spec: ``` The steps to parse the configuration are: 1. Split the `httpStatusCodes` configuration string `"500,502-504,15,404-405-500,-1,0,"` by the comma character resulting in the following list: `["500", "502-504", "15", "404-405-500", "-1", "0"]` ignoring the empty strings. -2. For each entry in the list, parse it as a single error code or a range of error codes. -3. If the entry is a single error code, add it to the list of error codes to retry. -4. If the entry is a range of error codes (each field for the relevant HTTP or gRPC error codes), add all the error codes in the range to the list of error codes to retry. +2. For each entry in the list, parse it as a single status code or a range of status codes. +3. If the entry is a single status code, add it to the list of status codes to retry. +4. If the entry is a range of status codes (each field for the relevant HTTP or gRPC status codes), add all the status codes in the range to the list of status codes to retry. - 500 is **valid** code for HTTP - 502-504 **valid** range of codes for HTTP -- 15 is **invalid** code for HTTP, error logged and entry ignored -- 404-405-500 is **invalid** range contains more than one dash, error logged and entry ignored -- -1 is ignored is **invalid** code for HTTP, error logged and entry ignored -- 0 is ignored is **invalid** code for HTTP, error logged and entry ignored +- 15 is **invalid** code for HTTP, error logged and application will fail to start +- 404-405-500 is **invalid** range contains more than one dash, error logged and application will fail to start +- -1 is ignored is **invalid** code for HTTP, error logged and application will fail to start +- 0 is ignored is **invalid** code for HTTP, error logged and application will fail to start ### Acceptance Criteria From 4d2593749dbc298c60cb1d4bbab7341c9356120e Mon Sep 17 00:00:00 2001 From: Artur Souza Date: Mon, 23 Sep 2024 12:56:16 -0700 Subject: [PATCH 3/5] Update 20240917-BR-resiliency-error-code-retries.md Signed-off-by: Artur Souza --- 20240917-BR-resiliency-error-code-retries.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/20240917-BR-resiliency-error-code-retries.md b/20240917-BR-resiliency-error-code-retries.md index 5ff0097..39b485f 100644 --- a/20240917-BR-resiliency-error-code-retries.md +++ b/20240917-BR-resiliency-error-code-retries.md @@ -101,7 +101,7 @@ The range must be in the format `-` (inclusive). Having more than on The configuration values will be first parsed as comma-separated lists. Each entry in the list will be then parsed as a single status code or a range of status codes. -Invalid entries will be logged and the application will fail to start. +Invalid entries will be logged and the Dapr runtime will fail to start. Example: From cc4dbf718f1a24c6d992297b366d723ec923291b Mon Sep 17 00:00:00 2001 From: Anton Troshin Date: Tue, 24 Sep 2024 19:13:33 -0500 Subject: [PATCH 4/5] Add CRD validation section Signed-off-by: Anton Troshin --- 20240917-BR-resiliency-error-code-retries.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/20240917-BR-resiliency-error-code-retries.md b/20240917-BR-resiliency-error-code-retries.md index 5ff0097..6d246e6 100644 --- a/20240917-BR-resiliency-error-code-retries.md +++ b/20240917-BR-resiliency-error-code-retries.md @@ -97,6 +97,10 @@ The acceptable values are the same as the ones defined in the [HTTP Status Codes Both the `httpStatusCodes` and `gRPCStatusCodes` fields are of type string and optional and can be set to a comma-separated list of status codes and/or ranges of status codes. The range must be in the format `-` (inclusive). Having more than one dash in the range is not allowed. +### CRD Validation + +Both field values should be validated using Common Expression Language [CEL](https://kubernetes.io/docs/reference/using-api/cel/) + ### Parsing the configuration The configuration values will be first parsed as comma-separated lists. From 0499fed0bc9f61aa7867105d1a48baf58307e0e6 Mon Sep 17 00:00:00 2001 From: Anton Troshin Date: Thu, 26 Sep 2024 12:39:22 -0500 Subject: [PATCH 5/5] Update CRD validation and add Kubebuilder documentation link Signed-off-by: Anton Troshin --- 20240917-BR-resiliency-error-code-retries.md | 1 + 1 file changed, 1 insertion(+) diff --git a/20240917-BR-resiliency-error-code-retries.md b/20240917-BR-resiliency-error-code-retries.md index 0d8dc8d..6b35651 100644 --- a/20240917-BR-resiliency-error-code-retries.md +++ b/20240917-BR-resiliency-error-code-retries.md @@ -100,6 +100,7 @@ The range must be in the format `-` (inclusive). Having more than on ### CRD Validation Both field values should be validated using Common Expression Language [CEL](https://kubernetes.io/docs/reference/using-api/cel/) +In addition, see Kubebuilder documentation for [CRD Validation](https://book.kubebuilder.io/reference/markers/crd-validation) ### Parsing the configuration