|
1 | 1 | ---
|
2 |
| -title: "Monitoring NServiceBus Demo - Struggling endpoints" |
3 |
| -reviewed: 2023-11-07 |
4 |
| -summary: Use the Particular Service Platform to find hidden problems in your solution. |
| 2 | +title: "Monitoring NServiceBus Demo - Struggling Endpoints" |
| 3 | +reviewed: 2025-08-22 |
| 4 | +summary: Use the Particular Service Platform to identify and diagnose hidden problems in your solution. |
5 | 5 | suppressRelated: true
|
6 | 6 | ---
|
7 | 7 |
|
8 | 8 | _Are any of the endpoints struggling?_
|
9 | 9 |
|
10 |
| -NServiceBus endpoints are designed to tolerate several types of failure. There are some early warning signs to be aware of that indicate that an endpoint is going to have a problem. |
11 |
| - |
12 |
| -This part of the tutorial guides you through how to use monitoring data to spot hidden problems in your NServiceBus system. |
| 10 | +This tutorial demonstrates how to use monitoring data in the Particular Service Platform to detect early warning signs and hidden issues in your NServiceBus system. You will learn how to spot struggling endpoints before they become critical problems. |
13 | 11 |
|
14 | 12 | include: monitoring-demo-walkthrough-solution
|
15 | 13 |
|
| 14 | +## Key metrics |
16 | 15 |
|
17 |
| -## Metrics |
18 |
| - |
19 |
| -One of the benefits of NServiceBus is that it can [handle transient errors](https://particular.net/blog/but-all-my-errors-are-severe) for you. If a network switch is being restarted or a web server is temporarily too busy to service requests, then an NServiceBus endpoint will roll the message it is processing back to its input queue and try again later. If the problem was short-lived and has since been corrected, then the message will process successfully when it is retried. If the problem is more permanent, the endpoint will eventually forward the message to an error queue. |
20 |
| - |
21 |
| -_Scheduled retry rate_ measures how often messages are failing and are marked to be retried. |
22 |
| - |
23 |
| -_Processing time_ is the time it takes for the endpoint to process a single message. A higher processing time indicates a slower endpoint and a lower processing time indicates a faster endpoint. Processing time is only measured for messages that are successfully processed. |
| 16 | +NServiceBus is designed to handle transient errors automatically. For example, if a network switch is restarted or a web server is temporarily unavailable, the endpoint will roll back the message to its input queue and retry later. If the issue is resolved quickly, the message will process successfully on retry. If the problem persists, the message will eventually be forwarded to the error queue. |
24 | 17 |
|
| 18 | +- **Scheduled retry rate**: Measures how often messages fail and are scheduled for retry. |
| 19 | +- **Processing time**: The time taken to process a single message. Higher processing times may indicate a struggling endpoint, while lower times suggest healthy performance. Only successfully processed messages are measured. |
25 | 20 |
|
26 |
| -## Sample walkthrough |
| 21 | +## Walkthrough: Identifying struggling endpoints |
27 | 22 |
|
28 |
| -The following walkthrough uses the sample solution to simulate problems with endpoints. |
| 23 | +Follow these steps to simulate and observe endpoint issues using the sample solution: |
29 | 24 |
|
30 |
| -**Run the sample solution. Open ServicePulse to the Monitoring tab.** |
| 25 | +1. Run the sample solution. |
| 26 | +2. Open ServicePulse and navigate to the Monitoring tab. |
31 | 27 |
|
32 |
| - |
| 28 | +  |
33 | 29 |
|
34 |
| -NServiceBus endpoints frequently rely on other resources to do their work. This might take the form of a database server that holds persisted data or a web server that hosts an API that the endpoint needs to call. The endpoints themselves are designed to tolerate failure, but there are some early indicators that failure is coming. |
| 30 | +Endpoints often depend on external resources, such as databases or web APIs. While endpoints are resilient to failures, monitoring can reveal early indicators of trouble. |
35 | 31 |
|
| 32 | +### Detecting slow message processing |
36 | 33 |
|
37 |
| -### Processing messages is getting slower |
| 34 | +A common early warning sign is an increase in message processing time. This may indicate that database queries or web API calls are taking longer than usual, signaling potential issues with dependent resources. |
38 | 35 |
|
39 |
| -The first indication that an endpoint is going to run into trouble is when processing messages starts to slow down. This is indicated by an increase in processing time. This means that database queries and web API calls are taking longer to process than they were before. |
| 36 | +**Simulate resource degradation:** |
40 | 37 |
|
41 |
| -**Find the Shipping endpoint windows and toggle the resource degradation simulation.** |
| 38 | +Find the Shipping endpoint window and toggle the resource degradation simulation. |
42 | 39 |
|
43 | 40 | 
|
44 | 41 |
|
45 |
| -Watch the processing time on the shipping endpoint. As the (simulated) third-party resources slow down, processing the messages takes longer and processing time goes up. To find the root cause, you need to know which message types are causing the problem. |
| 42 | +As the (simulated) third-party resources slow down, processing time for the Shipping endpoint increases. To diagnose the root cause, it's essential to identify which message types are affected. |
46 | 43 |
|
47 |
| -**In the ServicePulse UI, click the Shipping endpoint to open a detailed view.** |
| 44 | +**Analyze processing time by message type:** |
| 45 | + |
| 46 | +In the ServicePulse UI, click the Shipping endpoint to open a detailed view. |
48 | 47 |
|
49 | 48 | 
|
50 | 49 |
|
51 |
| -This screen shows a breakdown of processing time by message type. Even though the Shipping endpoint processes two types of message, only one of them is slowing down. There is something that is slowing down the processing of `OrderPlaced` events that is not affecting the processing of `OrderBilled` events. |
| 50 | +This view breaks down processing time by message type. In this case, only the `OrderPlaced` events are experiencing increased processing times, indicating an issue specific to that message type. |
52 | 51 |
|
53 | 52 | > [!NOTE]
|
54 |
| -> This example is a simulation, and there isn't a third party resource that is failing. We're just simulating it with `Task.Delay`. |
| 53 | +> This example is a simulation: no third-party resource is actually failing. The slowdown is produced with `Task.Delay`. |
55 | 54 |
|
56 |
| -**Find the Shipping endpoint window and toggle the resource degradation simulation off. Return the ServicePulse Monitoring tab.** |
| 55 | +**Observe recovery:** |
57 | 56 |
|
58 |
| -Now look at the processing time for the Shipping endpoint again. As soon as the remote resource recovers, the processing time snaps back to where it was before. This is what it looks like when a failing resource is restarted. |
| 57 | +Find the Shipping endpoint window and toggle the resource degradation simulation off. Return to the ServicePulse Monitoring tab. |
59 | 58 |
|
| 59 | +Once the (simulated) remote resource recovers, the processing time for the Shipping endpoint returns to its previous level. This is what it looks like when a failing resource is restarted. |
60 | 60 |
|
61 |
| -### Messages are being retried |
| 61 | +### Monitoring scheduled retry rate |
62 | 62 |
|
63 |
| -The second indication that an endpoint is running into problems is that message processing starts to fail, and the endpoint starts scheduling messages to be retried. When an exception is thrown in a message handler, NServiceBus will remove the message being processed from the queue that it came from and try to handle that message again at a later time. If the exception is caused by a temporary problem, then waiting for a small period and re-processing the message will succeed. |
| 63 | +Another critical metric is the scheduled retry rate, which indicates how often messages are failing and being retried. A sudden increase in this rate may suggest that an endpoint is struggling to process messages successfully. |
64 | 64 |
|
65 |
| -If there are occasional network outages or database deadlocks, this works well. The message still gets processed successfully, and the system continues as if nothing happened. When the rate of these errors starts to increase, it might mask a broader issue. |
| 65 | +**Simulate increased failure rate:** |
66 | 66 |
|
67 |
| -**Find the Billing endpoint UI and increase the failure rate to 30%.** |
| 67 | +Find the Billing endpoint UI and increase the failure rate to 30%. |
68 | 68 |
|
69 |
| -Now look at the scheduled retry rate for the Billing endpoint in the ServicePulse monitoring tab. Notice that even though the endpoint is encountering difficulties processing roughly a third of its messages, it is still able to process every message successfully after a couple of retries. |
| 69 | +Monitor the scheduled retry rate for the Billing endpoint in the ServicePulse Monitoring tab. Even though roughly a third of processing attempts now fail, the endpoint can still process every message successfully after a few retries. |
70 | 70 |
|
71 | 71 | > [!NOTE]
|
72 |
| -> As the endpoint is wasting resources attempting to process a message that fails, the number of successfully processed messages (throughput) goes down. This has the effect of forcing messages to spend longer in the input queue which can impact queue length and critical time as well (to find out why, see [Which endpoints have the most work to do?](./walkthrough-2.md)). |
| 72 | +> A higher failure rate can lead to decreased throughput, as the endpoint spends resources retrying failed messages. This may also impact queue length and critical time, as explained in [Which endpoints have the most work to do?](./walkthrough-2.md). |
73 | 73 |
|
74 |
| -If you are concerned about the number of messages that are being retried, check the endpoint logs. When messages are scheduled to be retried, details about the message and the failure are logged at the WARN log level. |
| 74 | +If the number of retried messages concerns you, check the endpoint logs. When a message is scheduled for retry, details about the message and the failure are logged at the WARN log level. |
75 | 75 |
|
| 76 | +### Identifying failed messages |
76 | 77 |
|
77 |
| -### Messages are failing, even after being retried |
| 78 | +The final indicator of a struggling endpoint is when messages consistently fail to process, even after being retried. NServiceBus will forward these messages to ServiceControl for manual intervention. |
78 | 79 |
|
79 |
| -The final indication that an endpoint is having problems is when messages fail to process. If, after some retry attempts, NServiceBus is still not able to successfully process a message, it will send the message to ServiceControl for manual intervention in ServicePulse. |
| 80 | +**Increase the failure rate to 90%:** |
80 | 81 |
|
81 |
| -**Find the Billing endpoint UI and increase the failure rate to 90%.** |
| 82 | +Find the Billing endpoint UI and increase the failure rate to 90%. |
82 | 83 |
|
83 |
| -With such a high failure rate, it won't take long before messages begin exceeding the number of retries configured for the Billing endpoint. When this happens, these failed messages will appear in the Failed Messages tab in ServicePulse. |
| 84 | +With a high failure rate, messages will quickly exceed the configured retry attempts and appear in the Failed Messages tab in ServicePulse. |
84 | 85 |
|
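The difference between the 30% and 90% settings can be illustrated with a little arithmetic. The following is a minimal sketch, not part of the demo solution: it assumes every processing attempt fails independently with the same probability, and `ASSUMED_ATTEMPTS` is a hypothetical number of attempts, not the value configured for the Billing endpoint.

```python
def error_queue_probability(failure_rate: float, max_attempts: int) -> float:
    """Chance that one message fails every attempt and ends up as a failed message.

    Assumes each attempt fails independently with the same probability.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # The message is only given up on when all attempts fail in a row.
    return failure_rate ** max_attempts


# Hypothetical attempt count, chosen only for illustration.
ASSUMED_ATTEMPTS = 5

for rate in (0.30, 0.90):
    p = error_queue_probability(rate, ASSUMED_ATTEMPTS)
    print(f"failure rate {rate:.0%}: {p:.2%} of messages exhaust all attempts")
```

Under these assumptions, about 0.2% of messages exhaust every attempt at a 30% failure rate, compared with about 59% at a 90% failure rate. This is consistent with retries hiding the problem at 30% while the Failed Messages tab fills quickly at 90%.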
85 | 86 | 
|
86 | 87 |
|
87 |
| -When ServiceControl receives failed messages from an endpoint, it will group them according to the Exception Type and the place in the code where the exception is thrown. In ServicePulse you can open up an exception group and look at each failed individually. This includes a full stack-trace, as well as access to the message headers and the body of the message. |
| 88 | +ServiceControl groups failed messages by exception type and the location in the code where the exception occurred. In ServicePulse, you can examine each failed message individually, including the stack trace, message headers, and body. |
| 89 | + |
| 90 | +Once the underlying issue is resolved, you can retry all failed messages in bulk from ServicePulse. |
88 | 91 |
|
89 |
| -Once the conditions that led to the error are resolved, you can retry all of the messages in bulk from ServicePulse. |
| 92 | +**Retry failed messages:** |
90 | 93 |
|
91 |
| -**Find the Billing endpoint UI and decrease the failure rate back down to 0%. In the ServicePulse Failed Messages tab, click the Request retry button. Confirm that you are ready to retry the messages.** |
| 94 | +Find the Billing endpoint UI and decrease the failure rate back down to 0%. In the ServicePulse Failed Messages tab, click the Request retry button. Confirm that you are ready to retry the messages. |
92 | 95 |
|
93 |
| -ServiceControl will stage the messages to be retried and then return them to the Billing endpoint where they will be successfully processed. |
| 96 | +ServiceControl will stage the messages for retry and return them to the Billing endpoint for successful processing. |
94 | 97 |
|
95 | 98 | 
|
96 | 99 |
|
97 |
| -## Keep exploring the demo |
| 100 | +## Next steps |
| 101 | + |
| 102 | +After identifying and resolving issues with struggling endpoints, consider exploring the following: |
98 | 103 |
|
99 |
| -- **[Which message types are taking the longest to process?](./walkthrough-1.md):** take a look at individual endpoint performance and decide where to optimize. |
100 |
| -- **[Which endpoints have the most work to do?](./walkthrough-2.md):** look for peaks of traffic and decide when to scale out. |
| 104 | +- **[Which message types are taking the longest to process?](./walkthrough-1.md):** Analyze individual endpoint performance to identify optimization opportunities. |
| 105 | +- **[Which endpoints have the most work to do?](./walkthrough-2.md):** Examine traffic patterns and determine optimal scaling strategies. |
101 | 106 |
|
102 | 107 | include: monitoring-demo-next-steps
|