
Commit 30198de

committed
changes from review
1 parent de94bbe commit 30198de

1 file changed

+49
-46
lines changed

content/en/agent/guide/agent-retry.md

Lines changed: 49 additions & 46 deletions
@@ -1,5 +1,6 @@
11
---
22
title: Agent Retry and Buffering Logic
3+
description: Follow this guide to learn how the Agent addresses retry strategies and backoff behavior, buffering mechanisms and limits, and data drop conditions and loss scenarios.
34
further_reading:
45
- link: "agent/remote_config/?tab=configurationyamlfile"
56
tag: "Documentation"
@@ -13,22 +14,19 @@ further_reading:
1314
---
1415
## Overview
1516

16-
This guide describes the Datadog Agent's behavior when it fails to send HTTP requests to the **Metrics**, **Logs**, **APM**, and **Processes** intake endpoints.
17+
This guide describes the Datadog Agent's behavior when it fails to send HTTP requests to the Metrics, Logs, APM, and Processes intake endpoints.
1718

18-
Follow this guide to learn how the Agent addresses:
19-
- Retry strategies and backoff behavior
20-
- Buffering mechanisms and limits
21-
- Data drop conditions and loss scenarios
22-
23-
All retry strategies use exponential backoff with randomized jitter. See the <a href="https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go">backoff implementation</a> for details.
19+
All retry strategies use exponential backoff with randomized jitter. See the [backoff implementation][2] for details.
2420

2521
<div class="alert alert-info"> A failed HTTP request in this guide refers to any request that does not result in a <code>2xx</code> HTTP response. </div>
2622

2723

28-
## Metrics
29-
{{% collapse-content title="Metrics retry strategy" level="h4" expanded=false %}}
24+
{{< tabs >}}
25+
{{% tab "Metrics" %}}
26+
27+
### Metrics retry strategy
3028

31-
The Agent retries failed HTTP requests using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the metrics intake:
29+
The Agent retries failed HTTP requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the metrics intake:
3230
- Base backoff time: 2 seconds
3331
- Maximum backoff time: [64 seconds][3]
3432
- Maximum backoff time is reached after 6 retries
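
As a rough illustration of the defaults listed above, the following Python sketch computes a capped exponential schedule with randomized jitter. The function names, the 1-indexed retry counter, and the jitter range are assumptions made for this example; the Agent's actual backoff code is written in Go and may differ in detail.

```python
import random

BASE_BACKOFF_S = 2.0   # base backoff time from the list above
MAX_BACKOFF_S = 64.0   # maximum backoff time from the list above


def backoff_ceiling(retry: int) -> float:
    """Upper bound on the wait before the given retry (1-indexed, an assumption)."""
    return min(MAX_BACKOFF_S, BASE_BACKOFF_S * 2 ** (retry - 1))


def backoff_delay(retry: int, rng: random.Random) -> float:
    """Pick a randomized wait in [ceiling / 2, ceiling] (illustrative jitter only)."""
    ceiling = backoff_ceiling(retry)
    return rng.uniform(ceiling / 2, ceiling)


# Ceilings for the first eight retries: the cap is reached at the sixth retry.
ceilings = [backoff_ceiling(n) for n in range(1, 9)]
```

Under these assumptions the ceiling doubles from 2 seconds and stays at 64 seconds from the sixth retry onward, which matches the list above.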
@@ -42,18 +40,18 @@ The Agent retries failed requests for the following scenarios:
4240
<br>
4341
Requests that return a <code>404</code> response are retried because they often indicate a configuration or availability issue that could be resolved.
4442
</div>
45-
{{% /collapse-content %}}
4643

47-
{{% collapse-content title="Metrics buffering mechanisms and limits" level="h4" expanded=false %}}
44+
45+
### Metrics buffering mechanisms and limits
4846

4947
When the Agent fails to send a metric to the Datadog intake, it compresses and stores this metric in an in-memory retry buffer. See [Buffer configurations](#buffer-configurations) for the available settings.
5048

5149
The Agent also supports an optional [on-disk retry buffer][4]. If you enable this setting, the Agent:
5250
1. Fills the in-memory buffer until it is full
5351
1. Evicts older payloads from memory and serializes them to disk
5452
1. Retries payloads in the following order:
55-
- In-memory payloads (newest first)
56-
- On-disk payloads (newest first)
53+
1. In-memory payloads (newest first)
54+
1. On-disk payloads (newest first)
5755

5856
This prioritization helps ensure that the Agent sends recent and live metrics before it backfills older data.
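
The retry order described above can be sketched in Python as follows. The `Payload` type, its `created_at` field, and the `retry_order` function are hypothetical names used only for illustration; they are not part of the Agent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Payload:
    created_at: float  # hypothetical timestamp used to order payloads
    location: str      # "memory" or "disk"


def newest_first(payloads: List[Payload]) -> List[Payload]:
    """Return the payloads sorted from most recent to oldest."""
    return sorted(payloads, key=lambda p: p.created_at, reverse=True)


def retry_order(in_memory: List[Payload], on_disk: List[Payload]) -> List[Payload]:
    """In-memory payloads come first, then on-disk payloads, each newest first."""
    return newest_first(in_memory) + newest_first(on_disk)
```

With this ordering, the most recent in-memory payload is always retried before any payload that was evicted to disk.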
5957

@@ -75,22 +73,23 @@ During shutdown, the Agent:
7573
- Flushes in-flight requests
7674
- Does not flush payloads in retry queues (both in-memory and on-disk)
7775

78-
{{% /collapse-content %}}
7976

80-
## Logs
81-
{{% collapse-content title="Logs retry strategy" level="h4" expanded=false %}}
77+
[3]: https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go#L47
78+
[4]: /agent/configuration/network/#data-buffering
79+
{{% /tab %}}
80+
81+
{{% tab "Logs" %}}
82+
### Logs retry strategy
8283

83-
The Logs Agent retries failed HTTP requests indefinitely using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the logs intake:
84+
The Logs Agent retries failed HTTP requests indefinitely using an exponential backoff strategy. The Agent uses the following default retry configurations for the logs intake:
8485
- Base backoff time: 2 seconds
8586
- Maximum backoff time: 120 seconds
8687

8788
The Agent retries failed log payloads until the logs intake endpoint becomes available.
8889

8990
<div class="alert alert-info"> The Logs Agent <strong>does not retry</strong> requests with status codes <code>400</code>, <code>401</code>, <code>403</code>, <code>413</code>.</div>
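
A minimal Python sketch of this retry decision is shown below. It assumes that every non-`2xx` response other than the four listed status codes is retried, which is an inference from the text above rather than a documented rule. The function and constant names are hypothetical.

```python
# Status codes the Logs Agent does not retry, per the note above.
NON_RETRYABLE_LOG_STATUSES = {400, 401, 403, 413}


def is_failed(status: int) -> bool:
    """A request counts as failed when the response is not 2xx."""
    return not 200 <= status <= 299


def logs_should_retry(status: int) -> bool:
    """Retry failed requests unless the status is in the non-retryable set (assumption)."""
    return is_failed(status) and status not in NON_RETRYABLE_LOG_STATUSES
```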
90-
{{% /collapse-content %}}
9191

92-
93-
{{% collapse-content title="Logs buffering mechanisms and limits" level="h4" expanded=false %}}
92+
### Logs buffering mechanisms and limits
9493

9594
#### Backpressure and consumption
9695
The Logs Agent is designed to guarantee log delivery during transmission. When a payload fails to send, the Agent applies backpressure and stops reading from the log source. When the intake becomes available, the Agent resumes reading from the last known position.
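
The backpressure behavior can be pictured with the simplified Python sketch below. It models a log source as a list of lines and the read position as an integer offset; `ship_lines` and the `send` callback are hypothetical names, and the real Agent pipeline is considerably more involved.

```python
from typing import Callable, List


def ship_lines(lines: List[str], offset: int, send: Callable[[str], bool]) -> int:
    """Send lines starting at offset; stop at the first failure and return the new offset."""
    while offset < len(lines):
        if not send(lines[offset]):
            break  # backpressure: stop reading until the intake is available again
        offset += 1
    return offset
```

Calling `ship_lines` again with the returned offset resumes from the last known position, so the failed line is sent again rather than skipped.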
@@ -111,23 +110,24 @@ The Logs Agent is designed to guarantee log delivery during transmission. When a
111110
The Logs Agent maintains a registry that tracks log sources and current read offsets. The Agent flushes the registry to disk every second and reloads it when the Agent restarts. This process is not configurable.
112111

113112
On restart, the Agent resumes reading from the position recorded in the registry. A small number of duplicate logs may occur if the Agent sends a payload before flushing the registry.
114-
{{% /collapse-content %}}
115113

116-
{{% collapse-content title="Advanced shipping configuration" level="h4" expanded=false %}}
114+
### Advanced shipping configuration
117115

118116
#### Dual shipping
119-
When you enable dual shipping:
117+
When you enable [dual shipping][9]:
120118
- The Agent sends logs to the first available endpoint
121119
- The Agent drops payloads for any endpoint that fails
122120
- Log consumption continues as long as at least one endpoint succeeds
123121

124122
For the Agent logic when `is_reliable` is enabled, see [Logs Dual Shipping][8].
125123

126-
{{% /collapse-content %}}
124+
[8]: https://docs.datadoghq.com/agent/configuration/dual-shipping/?tab=helm#environment-variable-configuration-6
125+
[9]: /agent/configuration/dual-shipping/?tab=helm&site=us
126+
{{% /tab %}}
127127

128-
## APM
129-
{{% collapse-content title="APM retry strategy" level="h4" expanded=false %}}
130-
The Agent retries failed APM requests using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the APM intake:
128+
{{% tab "APM" %}}
129+
### APM retry strategy
130+
The Agent retries failed APM requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the APM intake:
131131
- Base backoff time: 2 seconds
132132
- Maximum backoff time: 10 seconds
133133

@@ -137,9 +137,8 @@ The Agent retries failed requests for the following scenarios:
137137
- HTTP `5xx` responses
138138

139139
<div class="alert alert-info"> You <strong>cannot configure</strong> the retry behavior and retriable status codes for APM.</div>
140-
{{% /collapse-content %}}
141140

142-
{{% collapse-content title="APM buffering mechanisms and limits" level="h4" expanded=false %}}
141+
### APM buffering mechanisms and limits
143142

144143
#### In-memory queues
145144
The Agent compresses and stores failed APM payloads in memory. The Agent drops these failed payloads when queues are full.
@@ -155,29 +154,32 @@ The Agent compresses and stores failed APM payloads in memory. The Agent drops t
155154
- Default calculation:
156155
- `int(max(1, max memory / payload size))`
157156
- Example: `int(max(1, (250 * 1024 * 1024) / 1500000)) = 174` [payloads][7]
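
The default calculation above can be checked with a few lines of Python. The function and parameter names are illustrative only; the formula and the example values (250 MiB of memory, 1,500,000-byte payloads) come from the list above.

```python
def max_queued_payloads(max_memory_bytes: int, payload_size_bytes: int) -> int:
    """Mirror the formula int(max(1, max memory / payload size))."""
    return int(max(1, max_memory_bytes / payload_size_bytes))


# 250 MiB of memory with 1,500,000-byte payloads, as in the example above.
example = max_queued_payloads(250 * 1024 * 1024, 1_500_000)
```

The `max(1, ...)` term keeps the queue size at one payload or more, even when the memory limit is smaller than a single payload.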
158-
{{% /collapse-content %}}
159157

160-
{{% collapse-content title="Advanced shipping configuration" level="h4" expanded=false %}}
158+
### Advanced shipping configuration
161159

162160
#### Dual shipping
163-
When you enable dual shipping for the APM intake, each endpoint has an independent sender and queue.
161+
When you enable [dual shipping][9] for the APM intake, each endpoint has an independent sender and queue.
162+
164163

165-
{{% /collapse-content %}}
164+
[6]: https://github.com/DataDog/datadog-agent/blob/7.43.1/pkg/trace/writer/trace.go#L107-L116
165+
[7]: https://github.com/DataDog/datadog-agent/blob/7.43.1/pkg/trace/writer/stats.go#L73-L83
166+
[9]: /agent/configuration/dual-shipping/?tab=helm&site=us
167+
{{% /tab %}}
166168

167-
## Processes
168-
{{% collapse-content title="Processes retry strategy" level="h4" expanded=false %}}
169+
{{% tab "Processes" %}}
170+
### Processes retry strategy
169171

170-
The Agent retries failed processes requests using an [exponential backoff strategy][2]. The Agent uses the same default retry configurations as the metrics intake:
172+
The Agent retries failed processes requests using an exponential backoff strategy. The Agent uses the same default retry configurations as the metrics intake:
171173
- Base backoff time: 2 seconds
172174
- Maximum backoff time: [64 seconds][3]
173175
- Maximum backoff time is reached after 6 retries
174176

175177
**Key difference from Metrics**: On-disk buffering is not supported for Processes.
176178

177-
See the [Metrics retry strategy](#metrics-retry-strategy) for complete details on retry scenarios and exceptions.
178-
{{% /collapse-content %}}
179+
See the Metrics retry strategy for complete details on retry scenarios and exceptions.
179180

180-
{{% collapse-content title="Processes buffering mechanisms and limits" level="h4" expanded=false %}}
181+
182+
### Processes buffering mechanisms and limits
181183

182184
The Process Agent uses the **metrics forwarder** for downstream delivery. Before forwarding check results, the Process Agent stores them in an in-memory queue.
183185

@@ -203,12 +205,13 @@ With checks running every 10 seconds, these settings buffer approximately 30 min
203205
- Each payload type has independent buffer limits
204206
- Approximately 40 minutes of process data can be buffered with default settings
205207

206-
{{% /collapse-content %}}
207208

208-
[2]: https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go
209209
[3]: https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go#L47
210-
[4]: https://docs.datadoghq.com/agent/configuration/network/#data-buffering
211210
[5]: https://github.com/DataDog/datadog-agent/blob/main/pkg/config/setup/process.go#L34-L36
212-
[6]: https://github.com/DataDog/datadog-agent/blob/7.43.1/pkg/trace/writer/trace.go#L107-L116
213-
[7]: https://github.com/DataDog/datadog-agent/blob/7.43.1/pkg/trace/writer/stats.go#L73-L83
214-
[8]: https://docs.datadoghq.com/agent/configuration/dual-shipping/?tab=helm#environment-variable-configuration-6
211+
{{% /tab %}}
212+
{{< /tabs >}}
213+
214+
215+
[2]: https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go
216+
217+
