Commit 8fbb56d

Merge pull request #259736 from flang-msft/fxl---Update-cache-troubleshoot-timeouts.md-116387
Fxl update cache troubleshoot timeouts.md 116387
2 parents f999e13 + daf24a9 commit 8fbb56d

1 file changed

+27
-16
lines changed

articles/azure-cache-for-redis/cache-troubleshoot-timeouts.md

Lines changed: 27 additions & 16 deletions
@@ -6,7 +6,8 @@ ms.author: franlanglois
66
ms.service: cache
77
ms.topic: conceptual
88
ms.custom: devx-track-csharp
9-
ms.date: 09/29/2023
9+
ms.date: 12/02/2023
10+
1011
---
1112

1213
# Troubleshoot Azure Cache for Redis latency and timeouts
@@ -25,6 +26,7 @@ This section discusses troubleshooting for latency and timeout issues that occur
2526
- [Server-side troubleshooting](#server-side-troubleshooting)
2627
- [Server maintenance](#server-maintenance)
2728
- [High server load](#high-server-load)
29+
- [Spikes in server load](#spikes-in-server-load)
2830
- [High memory usage](#high-memory-usage)
2931
- [Long running commands](#long-running-commands)
3032
- [Network bandwidth limitation](#network-bandwidth-limitation)
@@ -36,11 +38,13 @@ This section discusses troubleshooting for latency and timeout issues that occur
3638
3739
## Client-side troubleshooting
3840

41+
This section covers client-side troubleshooting.
42+
3943
### Traffic burst and thread pool configuration
4044

4145
Bursts of traffic combined with poor `ThreadPool` settings can result in delays in processing data already sent by the Redis server but not yet consumed on the client side. Check the metric "Errors" (Type: UnresponsiveClients) to validate if your client hosts can keep up with a sudden spike in traffic.
4246

43-
Monitor how your `ThreadPool` statistics change over time using [an example `ThreadPoolLogger`](https://github.com/JonCole/SampleCode/blob/master/ThreadPoolMonitor/ThreadPoolLogger.cs). You can use `TimeoutException` messages from StackExchange.Redis like below to further investigate:
47+
Monitor how your `ThreadPool` statistics change over time using [an example `ThreadPoolLogger`](https://github.com/JonCole/SampleCode/blob/master/ThreadPoolMonitor/ThreadPoolLogger.cs). You can use `TimeoutException` messages from StackExchange.Redis to further investigate:
4448

4549
```output
4650
System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0,
@@ -50,7 +54,7 @@ Monitor how your `ThreadPool` statistics change over time using [an example `Thr
5054
In the preceding exception, there are several issues that are interesting:
5155

5256
- Notice that in the `IOCP` section and the `WORKER` section you have a `Busy` value that is greater than the `Min` value. This difference means your `ThreadPool` settings need adjusting.
53-
- You can also see `in: 64221`. This value indicates that 64,221 bytes have been received at the client's kernel socket layer but haven't been read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.
57+
- You can also see `in: 64221`. This value indicates that 64,221 bytes were received at the client's kernel socket layer but weren't read by the application. This difference typically means that your application (for example, StackExchange.Redis) isn't reading data from the network as quickly as the server is sending it to you.
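
As a rough illustration of the two checks in the list above, the following Python sketch pulls the `in:` value and the `Busy`/`Min` figures out of a timeout message. The first part of the sample string comes from the exception shown earlier; the `IOCP` and `WORKER` portion is an assumed layout with made-up numbers, and `parse_timeout_message` and `pools_over_min` are hypothetical helper names, not part of StackExchange.Redis.

```python
import re

# Sample message: the IOCP/WORKER tail is an assumed layout with
# illustrative numbers, not output captured from a real client.
SAMPLE = (
    "System.TimeoutException: Timeout performing EVAL, inst: 8, mgr: Inactive, "
    "queue: 0, qu: 0, qs: 0, qc: 0, wr: 0, wq: 0, in: 64221, ar: 0, "
    "IOCP: (Busy=6,Free=999,Min=2,Max=1000), "
    "WORKER: (Busy=7,Free=8184,Min=2,Max=8191)"
)


def parse_timeout_message(message):
    """Return the unread byte count and per-pool Busy/Min values in the message."""
    result = {"in": None, "pools": {}}
    unread = re.search(r"\bin: (\d+)", message)
    if unread:
        result["in"] = int(unread.group(1))
    for name, busy, minimum in re.findall(
        r"(IOCP|WORKER): \(Busy=(\d+),Free=\d+,Min=(\d+),Max=\d+\)", message
    ):
        result["pools"][name] = {"busy": int(busy), "min": int(minimum)}
    return result


def pools_over_min(parsed):
    """Names of thread pools whose Busy count is greater than Min."""
    return [name for name, p in parsed["pools"].items() if p["busy"] > p["min"]]


parsed = parse_timeout_message(SAMPLE)
print(parsed["in"], pools_over_min(parsed))
```

A nonzero `in` value together with a non-empty result from `pools_over_min` would point toward the thread pool settings rather than the server.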
5458

5559
You can [configure your `ThreadPool` Settings](cache-management-faq.yml#important-details-about-threadpool-growth) to make sure that your thread pool scales up quickly under burst scenarios.
5660

@@ -61,16 +65,16 @@ For information about using multiple keys and smaller values, see [Consider more
6165
You can use the `redis-cli --bigkeys` command to check for large keys in your cache. For more information, see [redis-cli, the Redis command line interface--Redis](https://redis.io/topics/rediscli).
6266

6367
- Increase the size of your VM to get higher bandwidth capabilities
64-
- More bandwidth on your client or server VM may reduce data transfer times for larger responses.
65-
- Compare your current network usage on both machines to the limits of your current VM size. More bandwidth on only the server or only on the client may not be enough.
68+
- More bandwidth on your client or server VM might reduce data transfer times for larger responses.
69+
- Compare your current network usage on both machines to the limits of your current VM size. More bandwidth on only the server or only on the client might not be enough.
6670
- Increase the number of connection objects your application uses.
6771
- Use a round-robin approach to make requests over different connection objects
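
The round-robin idea can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: `RoundRobinConnections` is a hypothetical helper, and the placeholder strings stand in for whatever connection objects your client library provides.

```python
import itertools
import threading


class RoundRobinConnections:
    """Hand out connections from a fixed set in rotating order."""

    def __init__(self, connections):
        connections = list(connections)
        if not connections:
            raise ValueError("at least one connection is required")
        self._cycle = itertools.cycle(connections)
        self._lock = threading.Lock()

    def get(self):
        # The lock keeps the rotation consistent across threads.
        with self._lock:
            return next(self._cycle)


# Placeholder strings stand in for real connection objects.
pool = RoundRobinConnections(["conn-0", "conn-1", "conn-2"])
picked = [pool.get() for _ in range(5)]
print(picked)
```

Spreading requests this way keeps one large response from blocking every other request behind it on a single connection.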
6872

6973
### High CPU on client hosts
7074

71-
High client CPU usage indicates the system can't keep up with the work it's been asked to do. Even though the cache sent the response quickly, the client may fail to process the response in a timely fashion. Our recommendation is to keep client CPU below 80%. Check the metric "Errors" (Type: `UnresponsiveClients`) to determine if your client hosts can process responses from Redis server in time.
75+
High client CPU usage indicates the system can't keep up with the work assigned to it. Even though the cache sent the response quickly, the client might fail to process the response in a timely fashion. Our recommendation is to keep client CPU below 80%. Check the metric "Errors" (Type: `UnresponsiveClients`) to determine if your client hosts can process responses from Redis server in time.
7276

73-
Monitor the client's system-wide CPU usage using metrics available in the Azure portal or through performance counters on the machine. Be careful not to monitor *process* CPU because a single process can have low CPU usage but the system-wide CPU can be high. Watch for spikes in CPU usage that correspond with timeouts. High CPU may also cause high `in: XXX` values in `TimeoutException` error messages as described in the [[Traffic burst](#traffic-burst-and-thread-pool-configuration)] section.
77+
Monitor the client's system-wide CPU usage using metrics available in the Azure portal or through performance counters on the machine. Be careful not to monitor process CPU because a single process can have low CPU usage but the system-wide CPU can be high. Watch for spikes in CPU usage that correspond with timeouts. High CPU might also cause high `in: XXX` values in `TimeoutException` error messages as described in the [Traffic burst](#traffic-burst-and-thread-pool-configuration) section.
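
One way to look for CPU spikes that line up with timeouts is sketched below in Python. It assumes you already exported system-wide CPU samples and timeout timestamps as plain numbers. The function name, the five-second window, and the sample data are all hypothetical. Only the 80% threshold comes from the recommendation in this section.

```python
CPU_LIMIT_PERCENT = 80  # recommended ceiling for client CPU


def timeouts_during_high_cpu(cpu_samples, timeout_times, window_seconds=5):
    """Return timeout timestamps with a CPU sample at or above the limit nearby.

    cpu_samples: list of (timestamp_seconds, system_wide_cpu_percent)
    timeout_times: list of timestamp_seconds
    """
    flagged = []
    for t in timeout_times:
        nearby = [cpu for ts, cpu in cpu_samples if abs(ts - t) <= window_seconds]
        if nearby and max(nearby) >= CPU_LIMIT_PERCENT:
            flagged.append(t)
    return flagged


# Illustrative data only.
samples = [(0, 35.0), (10, 92.5), (20, 40.0), (30, 85.0)]
timeouts = [11, 21, 29]
print(timeouts_during_high_cpu(samples, timeouts))
```

Timeouts that never coincide with high system-wide CPU suggest looking at the other causes in this article instead.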
7478

7579
> [!NOTE]
7680
> StackExchange.Redis 1.1.603 and later includes the `local-cpu` metric in `TimeoutException` error messages. Ensure you are using the latest version of the [StackExchange.Redis NuGet package](https://www.nuget.org/packages/StackExchange.Redis/). Bugs are regularly fixed in the code to make it more robust to timeouts. Having the latest version is important.
@@ -83,9 +87,9 @@ To mitigate a client's high CPU usage:
8387

8488
### Network bandwidth limitation on client hosts
8589

86-
Depending on the architecture of client machines, they may have limitations on how much network bandwidth they have available. If the client exceeds the available bandwidth by overloading network capacity, then data isn't processed on the client side as quickly as the server is sending it. This situation can lead to timeouts.
90+
Depending on the architecture of client machines, they might have limitations on how much network bandwidth they have available. If the client exceeds the available bandwidth by overloading network capacity, then data isn't processed on the client side as quickly as the server is sending it. This situation can lead to timeouts.
8791

88-
Monitor how your Bandwidth usage change over time using [an example `BandwidthLogger`](https://github.com/JonCole/SampleCode/blob/master/BandWidthMonitor/BandwidthLogger.cs). This code may not run successfully in some environments with restricted permissions (like Azure web sites).
92+
Monitor how your bandwidth usage changes over time using [an example `BandwidthLogger`](https://github.com/JonCole/SampleCode/blob/master/BandWidthMonitor/BandwidthLogger.cs). This code might not run successfully in some environments with restricted permissions (like Azure web sites).
8993

9094
To mitigate, reduce network bandwidth consumption or increase the client VM size to one with more network capacity. For more information, see [Large request or response size](cache-best-practices-development.md#large-request-or-response-size).
9195

@@ -95,7 +99,7 @@ Because of optimistic TCP settings in Linux, client applications hosted on Linux
9599

96100
### RedisSessionStateProvider retry timeout
97101

98-
If you're using `RedisSessionStateProvider`, ensure you have set the retry timeout correctly. The `retryTimeoutInMilliseconds` value should be higher than the `operationTimeoutInMilliseconds` value. Otherwise, no retries occur. In the following example, `retryTimeoutInMilliseconds` is set to 3000. For more information, see [ASP.NET Session State Provider for Azure Cache for Redis](cache-aspnet-session-state-provider.md) and [How to use the configuration parameters of Session State Provider and Output Cache Provider](https://github.com/Azure/aspnet-redis-providers/wiki/Configuration).
102+
If you're using `RedisSessionStateProvider`, ensure you set the retry timeout correctly. The `retryTimeoutInMilliseconds` value should be higher than the `operationTimeoutInMilliseconds` value. Otherwise, no retries occur. In the following example, `retryTimeoutInMilliseconds` is set to 3000. For more information, see [ASP.NET Session State Provider for Azure Cache for Redis](cache-aspnet-session-state-provider.md) and [How to use the configuration parameters of Session State Provider and Output Cache Provider](https://github.com/Azure/aspnet-redis-providers/wiki/Configuration).
99103

100104
```xml
101105
<add
@@ -115,17 +119,19 @@ If you're using `RedisSessionStateProvider`, ensure you have set the retry timeo
115119

116120
## Server-side troubleshooting
117121

122+
This section covers server-side troubleshooting.
123+
118124
### Server maintenance
119125

120-
Planned or unplanned maintenance can cause disruptions with client connections. The number and type of exceptions depends on the location of the request in the code path, and when the cache closes its connections. For instance, an operation that sends a request but hasn't received a response when the failover occurs might get a time-out exception. New requests on the closed connection object receive connection exceptions until the reconnection happens successfully.
126+
Planned or unplanned maintenance can cause disruptions with client connections. The number and type of exceptions depends on the location of the request in the code path, and when the cache closes its connections. For instance, an operation that sends a request but doesn't receive a response when the failover occurs might get a time-out exception. New requests on the closed connection object receive connection exceptions until the reconnection happens successfully.
121127

122128
For more information, check these other sections:
123129

124130
- [Update channel and Schedule updates](cache-administration.md#update-channel-and-schedule-updates)
125131
- [Connection resilience](cache-best-practices-connection.md#connection-resilience)
126132
- `AzureRedisEvents` [notifications](cache-failover.md#can-i-be-notified-in-advance-of-planned-maintenance)
127133

128-
To check whether your Azure Cache for Redis had a failover during when timeouts occurred, check the metric **Errors**. On the Resource menu of the Azure portal, select **Metrics**. Then create a new chart measuring the `Errors` metric, split by `ErrorType`. Once you have created this chart, you see a count for **Failover**.
134+
To check whether your Azure Cache for Redis had a failover when the timeouts occurred, check the metric **Errors**. On the Resource menu of the Azure portal, select **Metrics**. Then create a new chart measuring the `Errors` metric, split by `ErrorType`. Once you create this chart, you see a count for **Failover**.
129135

130136
For more information on failovers, see [Failover and patching for Azure Cache for Redis](cache-failover.md).
131137

@@ -137,8 +143,13 @@ High server load means the Redis server is unable to keep up with the requests,
137143

138144
There are several changes you can make to mitigate high server load:
139145

140-
- Investigate what is causing high server load such as [long-running commands](#long-running-commands), noted below because of high memory pressure.
146+
- Investigate what is causing high server load, such as [long-running commands](#long-running-commands) that result from high memory pressure. Long-running commands are described later in this article.
141147
- [Scale](cache-how-to-scale.md) out to more shards to distribute load across multiple Redis processes or scale up to a larger cache size with more CPU cores. For more information, see [Azure Cache for Redis planning FAQs](./cache-planning-faq.yml).
148+
- If your production workload on a _C1_ cache is negatively affected by extra latency from virus scanning, you can reduce the effect by paying for a higher-tier offering with multiple CPU cores, such as _C2_.
149+
150+
#### Spikes in server load
151+
152+
On _C0_ and _C1_ caches, you might see short spikes in server load a couple of times a day that aren't caused by an increase in requests. These spikes occur while virus scanning runs on the VMs, and requests see higher latency during the scan. Caches on the _C0_ and _C1_ tiers have only a single core, which must divide its time between virus scanning and serving Redis requests.
142153

143154
### High memory usage
144155

@@ -155,9 +166,9 @@ Using the [SLOWLOG GET](https://redis.io/commands/slowlog-get) command, you can
155166
Customers can use a console to run these Redis commands to investigate long running and expensive commands.
156167

157168
- [SLOWLOG](https://redis.io/commands/slowlog) is used to read and reset the Redis slow queries log. It can be used to investigate long running commands on client side.
158-
The Redis Slow Log is a system to log queries that exceeded a specified execution time. The execution time does not include I/O operations like talking with the client, sending the reply, and so forth, but just the time needed to actually execute the command. Using the SLOWLOG command, Customers can measure/log expensive commands being executed against their Redis server.
169+
The Redis Slow Log is a system to log queries that exceeded a specified execution time. The execution time doesn't include I/O operations like talking with the client, sending the reply, and so forth, but just the time needed to actually execute the command. Customers can measure/log expensive commands being executed against their Redis server using the `SLOWLOG` command.
159170
- [MONITOR](https://redis.io/commands/monitor) is a debugging command that streams back every command processed by the Redis server. It can help in understanding what is happening to the database. This command is demanding and can degrade performance.
160-
- [INFO](https://redis.io/commands/info) - command returns information and statistics about the server in a format that is simple to parse by computers and easy to read by humans. In this case, the CPU section could be useful to investigate the CPU usage. A **server_load** of 100 (maximum value) signifies that the Redis server has been busy all the time (has not been idle) processing the requests.
171+
- [INFO](https://redis.io/commands/info) - command returns information and statistics about the server in a format that is simple to parse by computers and easy to read by humans. In this case, the CPU section could be useful to investigate the CPU usage. A server load of 100 (maximum value) signifies that the Redis server was busy all the time and was never idle when processing the requests.
161172

162173
Output sample:
163174

@@ -177,7 +188,7 @@ event_no_wait_count:1
177188

178189
### Network bandwidth limitation
179190

180-
Different cache sizes have different network bandwidth capacities. If the server exceeds the available bandwidth, then data won't be sent to the client as quickly. Client requests could time out because the server can't push data to the client fast enough.
191+
Different cache sizes have different network bandwidth capacities. If the server exceeds the available bandwidth, then data isn't sent to the client as quickly. Client requests could time out because the server can't push data to the client fast enough.
181192

182193
The "Cache Read" and "Cache Write" metrics can be used to see how much server-side bandwidth is being used. You can [view these metrics](cache-how-to-monitor.md#view-cache-metrics) in the portal. [Create alerts](cache-how-to-monitor.md#create-alerts) on metrics like cache read or cache write to be notified early about potential impacts.
183194
