content/en/agent/guide/agent-retry.md
Lines changed: 49 additions & 46 deletions
@@ -1,5 +1,6 @@
1
1
---
2
2
title: Agent Retry and Buffering Logic
3
+
description: Follow this guide to learn how the Agent addresses retry strategies and backoff behavior, buffering mechanisms and limits, data drop conditions and loss scenarios.
This guide describes the Datadog Agent's behavior when it fails to send HTTP requests to the **Metrics**, **Logs**, **APM**, and **Processes** intake endpoints.
17
+
This guide describes the Datadog Agent's behavior when it fails to send HTTP requests to the Metrics, Logs, APM, and Processes intake endpoints.
17
18
18
-
Follow this guide to learn how the Agent addresses:
19
-
- Retry strategies and backoff behavior
20
-
- Buffering mechanisms and limits
21
-
- Data drop conditions and loss scenarios
22
-
23
-
All retry strategies use exponential backoff with randomized jitter. See the <a href="https://github.com/DataDog/datadog-agent/blob/main/pkg/util/backoff/backoff.go">backoff implementation</a> for details.
19
+
All retry strategies use exponential backoff with randomized jitter. See the [backoff implementation][2] for details.
24
20
25
21
<div class="alert alert-info"> A failed HTTP request in this guide refers to any request that does not result in a <code>2xx</code> HTTP response. </div>
The Agent retries failed HTTP requests using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the metrics intake:
29
+
The Agent retries failed HTTP requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the metrics intake:
32
30
- Base backoff time: 2 seconds
33
31
- Maximum backoff time: [64 seconds][3]
34
32
- Maximum backoff time is reached after 6 retries
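
As a rough illustration of the defaults listed above, the following Python sketch computes the backoff ceiling for each retry and applies random jitter. The doubling factor and the jitter range (half the ceiling up to the ceiling) are assumptions chosen for illustration, and the function names are hypothetical; the Agent's actual logic lives in its Go backoff implementation.

```python
import random

BASE_BACKOFF_SECONDS = 2.0   # default base backoff time from this guide
MAX_BACKOFF_SECONDS = 64.0   # default maximum backoff time from this guide


def backoff_ceiling(retry: int) -> float:
    """Return the un-jittered delay before the given retry (1-based).

    Assumes the delay doubles on each retry, which is consistent with
    reaching 64 seconds on the sixth retry from a 2-second base.
    """
    if retry < 1:
        raise ValueError("retry must be >= 1")
    return min(BASE_BACKOFF_SECONDS * 2 ** (retry - 1), MAX_BACKOFF_SECONDS)


def backoff_with_jitter(retry: int, rng: random.Random) -> float:
    """Pick a random delay between half the ceiling and the ceiling.

    The jitter range is an assumption made for this sketch.
    """
    ceiling = backoff_ceiling(retry)
    return rng.uniform(ceiling / 2, ceiling)


# Ceilings for the first eight retries: the cap applies from retry 6 onward.
ceilings = [backoff_ceiling(n) for n in range(1, 9)]
```

Under these assumptions the ceiling grows as 2, 4, 8, 16, 32, and 64 seconds, then stays at 64 seconds for every later retry.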
@@ -42,18 +40,18 @@ The Agent retries failed requests for the following scenarios:
42
40
<br>
43
41
Requests that return a <code>404</code> response are retried because they often indicate a configuration or availability issue that could be resolved.
44
42
</div>
45
-
{{% /collapse-content %}}
46
43
47
-
{{% collapse-content title="Metrics buffering mechanisms and limits" level="h4" expanded=false %}}
44
+
45
+
### Metrics buffering mechanisms and limits
48
46
49
47
When the Agent fails to send a metric to the Datadog intake, it compresses and stores this metric in an in-memory retry buffer. See [Buffer configurations](#buffer-configurations) for the available settings.
50
48
51
49
The Agent also supports an optional [on-disk retry buffer][4]. If you enable this setting, the Agent:
52
50
1. Fills the in-memory buffer until it is full
53
51
1. Evicts older payloads from memory and serializes them to disk
54
52
1. Retries payloads in the following order:
55
-
- In-memory payloads (newest first)
56
-
- On-disk payloads (newest first)
53
+
1. In-memory payloads (newest first)
54
+
1. On-disk payloads (newest first)
57
55
58
56
This prioritization helps ensure that the Agent sends recent and live metrics before it backfills older data.
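
The ordering described above can be sketched in Python as a two-tier buffer. This is a simplified model, not the Agent's code: the class and method names are hypothetical, the in-memory limit is counted in payloads rather than bytes, and a plain list stands in for the on-disk store.

```python
from collections import deque


class RetryBuffer:
    """Hypothetical two-tier retry buffer: memory first, then disk."""

    def __init__(self, max_in_memory: int) -> None:
        self.max_in_memory = max_in_memory
        self.memory = deque()  # oldest on the left, newest on the right
        self.disk = []         # stand-in for serialized on-disk payloads

    def add(self, payload: str) -> None:
        """Store a failed payload, evicting the oldest to disk when full."""
        if len(self.memory) >= self.max_in_memory:
            self.disk.append(self.memory.popleft())
        self.memory.append(payload)

    def retry_order(self) -> list:
        """Return payloads in retry order: memory then disk, newest first."""
        return list(reversed(self.memory)) + list(reversed(self.disk))


buffer = RetryBuffer(max_in_memory=2)
for name in ["p1", "p2", "p3", "p4"]:
    buffer.add(name)
order = buffer.retry_order()
```

With room for two payloads in memory, `p1` and `p2` are evicted to disk, and the retry order is `p4`, `p3`, `p2`, `p1`: recent data goes out first and older data is backfilled afterward.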
59
57
@@ -75,22 +73,23 @@ During shutdown, the Agent:
75
73
- Flushes in-flight requests
76
74
- Does not flush payloads in retry queues (both in-memory and on-disk)
The Logs Agent retries failed HTTP requests indefinitely using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the logs intake:
84
+
The Logs Agent retries failed HTTP requests indefinitely using an exponential backoff strategy. The Agent uses the following default retry configurations for the logs intake:
84
85
- Base backoff time: 2 seconds
85
86
- Maximum backoff time: 120 seconds
86
87
87
88
The Agent retries failed log payloads until the logs intake endpoint becomes available.
88
89
89
90
<div class="alert alert-info"> The Logs Agent <strong>does not retry</strong> requests with status codes <code>400</code>, <code>401</code>, <code>403</code>, <code>413</code>.</div>
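
A minimal Python sketch of this rule, reading the note literally: any non-`2xx` response counts as failed, and every failed response is retried except the four listed status codes. The function and constant names are hypothetical, and treating all other non-`2xx` codes as retriable is an assumption drawn from this guide's wording rather than from the Agent's source.

```python
# Status codes the Logs Agent does not retry, per this guide.
NON_RETRIABLE_LOG_STATUS_CODES = frozenset({400, 401, 403, 413})


def is_failed(status_code: int) -> bool:
    """A request is failed when the response is not a 2xx."""
    return not 200 <= status_code < 300


def should_retry_log_request(status_code: int) -> bool:
    """Retry failed requests unless the status code is non-retriable."""
    return is_failed(status_code) and status_code not in NON_RETRIABLE_LOG_STATUS_CODES
```

For example, `should_retry_log_request(503)` is true, while `should_retry_log_request(413)` and `should_retry_log_request(200)` are false.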
90
-
{{% /collapse-content %}}
91
91
92
-
93
-
{{% collapse-content title="Logs buffering mechanisms and limits" level="h4" expanded=false %}}
92
+
### Logs buffering mechanisms and limits
94
93
95
94
#### Backpressure and consumption
96
95
The Logs Agent is designed to guarantee log delivery during transmission. When a payload fails to send, the Agent applies backpressure and stops reading from the log source. When the intake becomes available, the Agent resumes reading from the last known position.
@@ -111,23 +110,24 @@ The Logs Agent is designed to guarantee log delivery during transmission. When a
111
110
The Logs Agent maintains a registry that tracks log sources and current read offsets. The Agent flushes the registry to disk every second and reloads it when the Agent restarts. This process is not configurable.
112
111
113
112
On restart, the Agent resumes reading from the position recorded in the registry. A small number of duplicate logs may occur if the Agent sends a payload before flushing the registry.
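
The following Python sketch models how a read offset, backpressure, and a periodically flushed registry can interact, including the duplicate-on-restart case. It is a toy model under stated assumptions: the `LogTailer` class is hypothetical, offsets are counted in lines, and a plain variable stands in for the registry file on disk.

```python
class LogTailer:
    """Hypothetical tailer that resumes from a registry offset."""

    def __init__(self, lines, registry_offset=0):
        self.lines = lines
        self.offset = registry_offset  # resume from the recorded position
        self.sent = []

    def send_next(self, intake_available: bool) -> bool:
        """Send one line; under backpressure, do not advance the offset."""
        if self.offset >= len(self.lines) or not intake_available:
            return False
        self.sent.append(self.lines[self.offset])
        self.offset += 1
        return True


lines = ["a", "b", "c"]
tailer = LogTailer(lines)

tailer.send_next(intake_available=True)    # sends "a"
registry_on_disk = tailer.offset           # registry flush records offset 1
tailer.send_next(intake_available=True)    # sends "b" before the next flush
blocked = tailer.send_next(intake_available=False)  # backpressure: no read

# Simulated restart: the Agent reloads the last flushed offset.
restarted = LogTailer(lines, registry_offset=registry_on_disk)
restarted.send_next(intake_available=True)  # sends "b" again (duplicate)
```

In this model, `"b"` is sent once before the restart and once after it, because the registry was flushed before that send was recorded.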
The Agent retries failed APM requests using an [exponential backoff strategy][2]. The Agent uses the following default retry configurations for the APM intake:
128
+
{{% tab "APM" %}}
129
+
### APM retry strategy
130
+
The Agent retries failed APM requests using an exponential backoff strategy. The Agent uses the following default retry configurations for the APM intake:
131
131
- Base backoff time: 2 seconds
132
132
- Maximum backoff time: 10 seconds
133
133
@@ -137,9 +137,8 @@ The Agent retries failed requests for the following scenarios:
137
137
- HTTP `5xx` responses
138
138
139
139
<div class="alert alert-info"> You <strong>cannot configure</strong> the retry behavior and retriable status codes for APM.</div>
140
-
{{% /collapse-content %}}
141
140
142
-
{{% collapse-content title="APM buffering mechanisms and limits" level="h4" expanded=false %}}
141
+
### APM buffering mechanisms and limits
143
142
144
143
#### In-memory queues
145
144
The Agent compresses and stores failed APM payloads in memory. The Agent drops these failed payloads when queues are full.
@@ -155,29 +154,32 @@ The Agent compresses and stores failed APM payloads in memory. The Agent drops t
The Agent retries failed processes requests using an [exponential backoff strategy][2]. The Agent uses the same default retry configurations as the metrics intake:
172
+
The Agent retries failed processes requests using an exponential backoff strategy. The Agent uses the same default retry configurations as the metrics intake:
171
173
- Base backoff time: 2 seconds
172
174
- Maximum backoff time: [64 seconds][3]
173
175
- Maximum backoff time is reached after 6 retries
174
176
175
177
**Key difference from Metrics**: On-disk buffering is not supported for Processes.
176
178
177
-
See the [Metrics retry strategy](#metrics-retry-strategy) for complete details on retry scenarios and exceptions.
178
-
{{% /collapse-content %}}
179
+
See the Metrics retry strategy for complete details on retry scenarios and exceptions.
179
180
180
-
{{% collapse-content title="Processes buffering mechanisms and limits" level="h4" expanded=false %}}
181
+
182
+
### Processes buffering mechanisms and limits
181
183
182
184
The Process Agent uses the **metrics forwarder** for downstream delivery. Before forwarding check results, the Process Agent stores them in an in-memory queue.
183
185
@@ -203,12 +205,13 @@ With checks running every 10 seconds, these settings buffer approximately 30 min
203
205
- Each payload type has independent buffer limits
204
206
- Approximately 40 minutes of process data can be buffered with default settings