Commit 3c8b041

Merge pull request #7732 from Particular/john/docs-review
2 parents 44d8dd8 + 317ab17 commit 3c8b041

4 files changed: +43 -53 lines changed


.devcontainer/Dockerfile

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,5 @@
-# NET 6.0 is required for the docstool
 # For more options see https://github.com/devcontainers/images/tree/main/src/dotnet
-FROM mcr.microsoft.com/devcontainers/dotnet:9.0
+FROM mcr.microsoft.com/devcontainers/dotnet:10.0-preview
 
 # Required dependency for docstool libskiasharp
 RUN sudo apt-get update && sudo apt-get install fontconfig -y

nservicebus/recoverability/index.md

Lines changed: 32 additions & 41 deletions
@@ -3,7 +3,7 @@ title: Recoverability
 summary: Explains how exceptions are handled, and actions retried, during message processing
 component: Core
 isLearningPath: true
-reviewed: 2023-09-15
+reviewed: 2025-08-13
 redirects:
 - nservicebus/how-do-i-handle-exceptions
 - nservicebus/errors
@@ -13,94 +13,88 @@ related:
 - nservicebus/operations/transactions-message-processing
 ---
 
-Sometimes processing of a message fails. This could be due to a transient problem like a deadlock in the database, in which case retrying the message a few times should solve the issue. If the problem is more protracted, like a third party web service going down or a database being unavailable, solving the issue would take longer. It is therefore useful to wait longer before retrying the message again.
+Sometimes, processing a message fails. This could be due to a transient problem, such as a deadlock in the database, in which case retrying the message a few times should solve the issue. If the problem is more prolonged, such as a third-party web service going down or a database being unavailable, resolving the issue may take longer. In these cases, it is useful to wait longer before retrying the message again.
 
-Recoverability is the built-in error handling capability. Recoverability enables to recover automatically, or in exceptional scenarios manually, from message failures. Recoverability wraps the message handling logic, including the user code with various layers of retrying logic. NServiceBus differentiates two types of retrying behaviors:
+Recoverability is the built-in error handling capability. It enables automatic, or in exceptional scenarios manual, recovery from message failures. Recoverability wraps the message handling logic, including user code, with various layers of retry logic. NServiceBus differentiates between two types of retry behaviors:
 
-* Immediate retries (previously known as First-Level-Retries)
-* Delayed retries (previously known as Second-Level-Retries)
+* Immediate retries (previously known as First-Level Retries)
+* Delayed retries (previously known as Second-Level Retries)
 
-An oversimplified mental model for Recoverability could be thought of a try / catch block surrounding the message handling infrastructure wrapped in a for loop:
+An oversimplified mental model for Recoverability is a try/catch block surrounding the message handling infrastructure, wrapped in a for loop:
 
 snippet: Recoverability-pseudo-code
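The `snippet: Recoverability-pseudo-code` placeholder resolves to the documented pseudo-code, which is not included in this diff. A minimal illustrative sketch of the loop described above might look like the following (not the actual NServiceBus implementation; the helper names and the attempt limit are placeholders):

```csharp
// Illustrative pseudo-code only; real recoverability also coordinates
// transactions, delayed delivery, and the configured recoverability policy.
// HandleMessage and MoveToErrorQueue are hypothetical placeholders.
void ProcessWithRetries(object message)
{
    const int maxAttempts = 6; // e.g. 1 initial attempt + 5 immediate retries

    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        try
        {
            HandleMessage(message); // the user's handler code
            return;                 // success: stop retrying
        }
        catch (Exception)
        {
            if (attempt == maxAttempts)
            {
                MoveToErrorQueue(message); // fault handling: give up and forward to the error queue
            }
        }
    }
}
```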
 
-The reality is more complex, depending on the transport's capabilities, the transaction mode of the endpoint, and user customizations. For example, on a transactional endpoint it will roll back the receive transaction when an exception bubbles through to the NServiceBus infrastructure. The message is then returned to the input queue, and any messages that the user code tried to send or publish won't be sent out. The very least that recoverability will ensure is that messages which failed multiple times get moved to the configured error queue. The part of recoverability which is responsible to move failed messages to the error queue is called fault handling.
+The reality is more complex, depending on the transport's capabilities, the transaction mode of the endpoint, and user customizations. For example, on a transactional endpoint, it will roll back the receive transaction when an exception bubbles through to the NServiceBus infrastructure. The message is then returned to the input queue, and any messages that the user code tried to send or publish will not be sent out. At a minimum, recoverability ensures that messages which fail multiple times are moved to the configured error queue. The part of recoverability responsible for moving failed messages to the error queue is called fault handling.
 
-To prevent sending all incoming messages to the error queue during a major system outage (e.g. when a database or a third-party service is down), the recoverability mechanism allows enabling [automatic rate-limiting](#automatic-rate-limiting). When enabled, NServiceBus detects the outage after a configured number of consecutive failures and automatically switches to rate-limiting mode. In this mode, only one message is attempted to probe if the problem persists. Once a message can be processed correctly, the system automatically switches to regular mode.
+To prevent sending all incoming messages to the error queue during a major system outage (e.g., when a database or a third-party service is down), the recoverability mechanism allows enabling [automatic rate-limiting](#automatic-rate-limiting). When enabled, NServiceBus detects the outage after a configured number of consecutive failures and automatically switches to rate-limiting mode. In this mode, only one message is attempted to probe if the problem persists. Once a message can be processed correctly, the system automatically switches back to regular mode.
 
-When a message cannot be deserialized all retry mechanisms will be bypassed and the message will be moved directly to the error queue.
+When a message cannot be deserialized, all retry mechanisms will be bypassed and the message will be moved directly to the error queue.
 
 ## Immediate retries
 
-By default up to five immediate retries are performed if the message processing results in exception being thrown. The [number of immediate retries can be configured](/nservicebus/recoverability/configure-immediate-retries.md).
+By default, up to five immediate retries are performed if message processing results in an exception being thrown. The [number of immediate retries can be configured](/nservicebus/recoverability/configure-immediate-retries.md).
 
 The configured value describes the minimum number of times a message will be retried if its processing consistently fails. Especially in environments with competing consumers on the same queue, there is an increased chance of retrying a failing message more times across different endpoint instances.
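For reference, changing the immediate retry count is typically done through the recoverability configuration API, roughly along these lines (a sketch only; the endpoint name and retry count are placeholders, and the linked configure-immediate-retries page holds the authoritative snippet):

```csharp
// assumes: using NServiceBus;
var endpointConfiguration = new EndpointConfiguration("Sample.Endpoint");

var recoverability = endpointConfiguration.Recoverability();
recoverability.Immediate(
    immediate =>
    {
        // Retry a failing message up to 3 times without delay, instead of the default 5
        immediate.NumberOfRetries(3);
    });
```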
 
-
 ### Transport transaction requirements
 
 The immediate retry mechanism is implemented by making the message available for consumption again, so that the endpoint can process it again without any delay. Immediate retries cannot be used when [transport transactions](/transports/transactions.md) are disabled.
 
-
 ## Delayed retries
 
-Delayed retries introduces another level of retry mechanism for messages that fail processing. Delayed retries schedules message delivery to the endpoint's input queue with increasing delay, by default first with 10 seconds delay, then 20, and lastly with 30 seconds delay. In each cycle, a full round of immediate retries will occur based on the configuration of the immediate retry policy. See [Total number of possible retries](#total-number-of-possible-retries) later in this document for more information on how immediate and delayed retries work together.
+Delayed retries introduce another level of retry mechanism for messages that fail processing. Delayed retries schedule message delivery to the endpoint's input queue with increasing delay, by default, first with a 10 second delay, then 20 seconds, and lastly 30 seconds. In each cycle, a full round of immediate retries will occur based on the configuration of the immediate retry policy. See [Total number of possible retries](#total-number-of-possible-retries) later in this document for more information on how immediate and delayed retries work together.
 
-Delayed retries might be useful when dealing with unreliable third-party resources - for example, if there is a call to a web service in the handler, but the service goes down for a couple of seconds once in a while. Without delayed retries, the message is retried instantly and sent to the error queue. With delayed retries, the message is instantly retried, deferred for 10 seconds, and then retried again. This way, when the web service is available the message is processed just fine.
+Delayed retries are useful when dealing with unreliable third-party resources. For example, if there is a call to a web service in the handler, but the service goes down for a few seconds occasionally. Without delayed retries, the message is retried instantly and sent to the error queue. With delayed retries, the message is instantly retried, deferred for 10 seconds, and then retried again. This way, when the web service is available, the message is processed successfully.
 
 For more information about how to configure delayed retries, refer to [configure delayed retries](configure-delayed-retries.md).
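As a rough sketch of what that configuration can look like (values are placeholders; `Delayed`, `NumberOfRetries`, and `TimeIncrease` are assumed from the standard recoverability configuration API, and the linked page has the authoritative snippet):

```csharp
var recoverability = endpointConfiguration.Recoverability();
recoverability.Delayed(
    delayed =>
    {
        // Two delayed rounds, with the delay growing by 10 seconds each round (10s, then 20s)
        delayed.NumberOfRetries(2);
        delayed.TimeIncrease(TimeSpan.FromSeconds(10));
    });
```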
 
-For more information how delayed retries work internally, refer to the [Delayed delivery - how it works](/nservicebus/messaging/delayed-delivery.md#how-it-works) section.
+For more information on how delayed retries work internally, refer to the [Delayed delivery - how it works](/nservicebus/messaging/delayed-delivery.md#how-it-works) section.
 
 > [!NOTE]
-> Retrying messages for extended periods of time would hide failures from operators, thus preventing them from taking manual action to honor their Service Level Agreements. To avoid this, NServiceBus will make sure that the time between two consecutive delayed retries is no more than 24 hours before being sent the error queue.
-
+> Retrying messages for extended periods can hide failures from operators, preventing them from taking manual action to honor their Service Level Agreements. To avoid this, NServiceBus ensures that the time between two consecutive delayed retries is no more than 24 hours before being sent to the error queue.
 
 ### Transport transaction requirements
 
-The delayed retries mechanism is implemented by rolling back the [transport transaction](/transports/transactions.md) and scheduling the message for [delayed-delivery](/nservicebus/messaging/delayed-delivery.md). Aborting the receive operation when transactions are turned off would result in a message loss. Therefore delayed retries cannot be used when transport transactions are disabled and delayed-delivery is not supported.
+The delayed retries mechanism is implemented by rolling back the [transport transaction](/transports/transactions.md) and scheduling the message for [delayed delivery](/nservicebus/messaging/delayed-delivery.md). Aborting the receive operation when transactions are turned off would result in message loss. Therefore, delayed retries cannot be used when transport transactions are disabled and delayed delivery is not supported.
 
 ## Automatic rate limiting
 
-The automatic rate limiting in response to consecutive message processing failures is designed to act as an [automatic circuit breaker](https://en.wikipedia.org/wiki/Circuit_breaker) preventing a large number of messages from being redirected to the `error` queue in the case of an outage of a resource required for processing of all messages (e.g. a database or a third-party service).
+Automatic rate limiting in response to consecutive message processing failures is designed to act as an [automatic circuit breaker](https://en.wikipedia.org/wiki/Circuit_breaker), preventing a large number of messages from being redirected to the `error` queue in the case of an outage of a resource required for processing all messages (e.g., a database or a third-party service).
 
 The following code enables the detection of consecutive failures.
 
 snippet: configure-consecutive-failures
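The placeholder above resolves to the documented configuration code, which is not shown in this diff. As a non-authoritative sketch of enabling consecutive-failure detection (the `OnConsecutiveFailures` and `RateLimitSettings` member names, and the values used, are assumptions based on recent NServiceBus versions):

```csharp
var recoverability = endpointConfiguration.Recoverability();
recoverability.OnConsecutiveFailures(
    // Treat 20 consecutive failures as a sign of a system-wide outage
    numberOfConsecutiveFailures: 20,
    settings: new RateLimitSettings(
        // While rate limited, wait this long between single-message probe attempts
        timeToWaitBetweenThrottledAttempts: TimeSpan.FromSeconds(5)));
```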
 
-When the endpoint detects a configured number of consecutive failures, it reacts by switching to a processing mode in which one message is attempted at a time. If processing fails, the endpoint waits for configured time and attempts to process the next message. The endpoint continues running in this mode until at least one message is processed successfully.
+When the endpoint detects a configured number of consecutive failures, it reacts by switching to a processing mode in which one message is attempted at a time. If processing fails, the endpoint waits for a configured time and attempts to process the next message. The endpoint continues running in this mode until at least one message is processed successfully.
 
 ### Considerations when configuring automatic rate limiting
 
-1. The number of consecutive failures must be big enough so that it doesn't trigger rate-limiting when only a few failed messages are processed by the endpoint.
-2. Endpoints that process many different message types may not be a good candidates for this feature. When rate limiting is active, it affects the entire endpoint. Endpoints that are rate limited due to a failure for one message type will slow down processing of all message types handled by the endpoint.
+1. The number of consecutive failures must be large enough so that it does not trigger rate-limiting when only a few failed messages are processed by the endpoint.
+2. Endpoints that process many different message types may not be good candidates for this feature. When rate limiting is active, it affects the entire endpoint. Endpoints that are rate limited due to a failure for one message type will slow down processing of all message types handled by the endpoint.
 
 ## Fault handling
 
-When messages continuously failed during the immediate and delayed retries mechanisms they will be moved to the [error queue](/nservicebus/recoverability/configure-error-handling.md).
-
+When messages continuously fail during the immediate and delayed retry mechanisms, they will be moved to the [error queue](/nservicebus/recoverability/configure-error-handling.md).
 
 ### Transport transaction requirements
 
-Fault handling doesn't require that the transport transaction is rolled back. A copy of the currently handled message is sent to the configured error queue and the current transaction will be marked as successfully processed. Therefore fault handling works with all supported [transport transaction modes](/transports/transactions.md).
-
+Fault handling does not require that the transport transaction is rolled back. A copy of the currently handled message is sent to the configured error queue, and the current transaction will be marked as successfully processed. Therefore, fault handling works with all supported [transport transaction modes](/transports/transactions.md).
 
 ## Recoverability policy
 
-It is possible to take full control over the whole Recoverability process using a [custom recoverability policy](/nservicebus/recoverability/custom-recoverability-policy.md).
+It is possible to take full control over the entire Recoverability process using a [custom recoverability policy](/nservicebus/recoverability/custom-recoverability-policy.md).
 
 partial: unrecoverableexceptions
 
 ## Total number of possible retries
 
-
-The total number of possible retries can be calculated with the following formula
+The total number of possible retries can be calculated with the following formula:
 
 ```txt
 Attempts = (ImmediateRetries:NumberOfRetries + 1) * (DelayedRetries:NumberOfRetries + 1)
 ```
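As a quick sanity check of the formula: with the default five immediate retries and three delayed retries, Attempts = (5 + 1) * (3 + 1) = 24, which matches the default-configuration example given later under the scale-out multiplier.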
 
-Given a variety of immediate and delayed configuration values here are the resultant possible attempts.
+Given a variety of immediate and delayed configuration values, here are the resultant possible attempts:
 
 | ImmediateRetries | DelayedRetries | Total possible attempts |
 |------------------|----------------|-------------------------|
@@ -120,25 +114,24 @@ Given a variety of immediate and delayed configuration values here are the resul
 ### Scale-out multiplier
 
 > [!NOTE]
-> Retry behavior can be interpreted as if retries result in duplicates when scaled-out. Retry behavior can result in excessive processing attempts but no duplicate messages are created. Ensure that logging uses unique identifiers for each endpoint instance.
+> Retry behavior can be interpreted as if retries result in duplicates when scaled out. Retry behavior can result in excessive processing attempts, but no duplicate messages are created. Ensure that logging uses unique identifiers for each endpoint instance.
 
-If an endpoint is scaled-out the number of processing attempts increase if instances are retrieving messages from the same queue and the transport does not have a native delivery counter.
+If an endpoint is scaled out, the number of processing attempts increases if instances are retrieving messages from the same queue and the transport does not have a native delivery counter.
 
 Affected transports:
 
 - Azure Storage Queues
 - SQL Server
 - RabbitMQ
 - Amazon SQS
-- MSMQ (only if running multiple instance on the same machine)
+- MSMQ (only if running multiple instances on the same machine)
 
 Unaffected transports:
 
 - Azure Service Bus
 - Azure Service Bus Legacy
 
-Azure Service Bus transports use a native delivery counter which is incremented after any endpoint instance fetches a message from a (shared) queue. The native delivery counter guarantees that the retry number is the same regardless if the endpoint is scaled out.
-
+Azure Service Bus transports use a native delivery counter, which is incremented after any endpoint instance fetches a message from a (shared) queue. The native delivery counter guarantees that the retry number is the same regardless of whether the endpoint is scaled out.
 
 The number of instances acts as a multiplier for the maximum number of attempts.
 
@@ -149,7 +142,7 @@ Maximum Attempts = MinimumAttempts * NumberOfInstances
 
 Example:
 
-When taking the default values for immediate and delayed retries (five and three, respectively) and 6 instances the total number of attempts will be a minimum of `(5+1)*(3+1)=24` attempts and a maximum of `24*6=144` attempts.
+When using the default values for immediate and delayed retries (five and three, respectively) and 6 instances, the total number of attempts will be a minimum of `(5+1)*(3+1)=24` attempts and a maximum of `24*6=144` attempts.
 
 ## Retry logging
 
@@ -167,21 +160,20 @@ This enables configuring alerts in a centralized logging solution. For example,
 
 #if-version [,8)
 
-Until version 8 the logger name used is **NServiceBus.RecoverabilityExecutor**
+Until version 8, the logger name used is **NServiceBus.RecoverabilityExecutor**
 
 #end-if
 
 #if-version [8,)
 
-From version 8 the logger names used are:
+From version 8, the logger names used are:
 
 * **NServiceBus.DelayedRetry** for delayed retries
 * **NServiceBus.ImmediateRetry** for immediate retries
 * **NServiceBus.MoveToError** for messages forwarded to the error queue
 
 #end-if
 
-
 ### Output example
 
 Given the following configuration:
@@ -194,7 +186,6 @@ The output in the log will be:
 
 snippet: RetryLogging
 
-
 ## Recoverability memory consumption
 
-MSMQ and SQL Server transport need to cache exceptions in memory for retries. Therefore, exceptions with a large memory footprint can cause high memory usage of the NServiceBus process. NServiceBus can cache up to 1,000 exceptions, capping the potential memory consumption to 1,000 x `<exception size>`. Refer to [this guide](/nservicebus/recoverability/lru-memory-consumption.md) to resolve problems due to excessive memory consumption.
+MSMQ and SQL Server transports need to cache exceptions in memory for retries. Therefore, exceptions with a large memory footprint can cause high memory usage of the NServiceBus process. NServiceBus can cache up to 1,000 exceptions, capping the potential memory consumption to 1,000 x `<exception size>`. Refer to [this guide](/nservicebus/recoverability/lru-memory-consumption.md) to resolve problems due to excessive memory consumption.
