**Update the `agent.yaml`**: In the **Agent terminal** window, add the `file_storage` extension and name it `checkpoint`:
13
+
> [!IMPORTANT]
14
+
> **Change _ALL_ terminal windows to the `2-building-resilience` directory and run the `clear` command.**
15
+
16
+
Your directory structure will look like this:
17
+
18
+
```text { title="Updated Directory Structure" }
19
+
.
20
+
├── agent.yaml
21
+
└── gateway.yaml
22
+
```
23
+
24
+
**Update the `agent.yaml`**: In the **Agent terminal** window, add the `file_storage` extension under the existing `health_check` extension:
14
25
15
26
```yaml
16
27
file_storage/checkpoint: # Extension Type/Name
@@ -24,11 +35,9 @@ While these components do not process telemetry data directly, they provide valu
24
35
max_transaction_size: 65536 # Max. size limit before compaction occurs
25
36
```
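For orientation, here is a minimal sketch of the three places a storage extension is typically wired into a Collector configuration: it is defined under `extensions:`, referenced by the exporter's `sending_queue.storage`, and enabled under `service.extensions`. The `directory` value is an assumed path for illustration, and any keys of the workshop's own `agent.yaml` that are not visible in this diff are intentionally left out.

```yaml
extensions:
  file_storage/checkpoint:              # Extension Type/Name
    directory: "./checkpoint-dir"       # ASSUMPTION: illustrative path, not taken from this diff
    create_directory: true              # Create the directory if it does not exist

exporters:
  otlphttp:
    endpoint: "http://localhost:5318"
    retry_on_failure:
      enabled: true                     # Enable retry on failure
    sending_queue:
      enabled: true                     # Queue data while the endpoint is unreachable
      storage: file_storage/checkpoint  # Persist the queue through the extension above

service:
  extensions:
    - file_storage/checkpoint           # An extension only runs if it is listed here
```

The name after the slash (`checkpoint`) must be identical in all three places; if it is not, the Collector will not be able to resolve the storage reference.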
26
37
27
-
**Add `file_storage` to existing `otlphttp` exporter**: Modify the `otlphttp:` exporter to configure retry and queuing mechanisms, ensuring data is retained and resent if failures occur:
38
+
**Add `file_storage` to the exporter**: Modify the `otlphttp` exporter to configure retry and queuing mechanisms, ensuring data is retained and resent if failures occur. Add the following below the `endpoint: "http://localhost:5318"` line, at the same indentation level as `endpoint`:
28
39
29
40
```yaml
30
-
otlphttp:
31
-
endpoint: "http://localhost:5318"
32
41
retry_on_failure:
33
42
enabled: true # Enable retry on failure
34
43
sending_queue: #
@@ -38,7 +47,7 @@ While these components do not process telemetry data directly, they provide valu
**Update the `services` section**: Add the `file_storage/checkpoint` extension to the existing `extensions:` section. This will cause the extension to be enabled:
50
+
**Update the `service` section**: Add the `file_storage/checkpoint` extension to the existing `extensions:` section. The configuration should look like this:
42
51
43
52
```yaml
44
53
service:
@@ -47,18 +56,18 @@ service:
47
56
- file_storage/checkpoint # Enabled extensions for this collector
48
57
```
49
58
50
-
**Update the `metrics` pipeline**: For this exercise we are going to comment out the `hostmetrics` receiver from the Metric pipeline to reduce debug and log noise:
59
+
**Update the `metrics` pipeline**: For this exercise, comment out the `hostmetrics` receiver in the metrics pipeline to reduce debug and log noise. Again, the configuration should look like this:
51
60
52
61
```yaml
53
62
metrics:
54
63
receivers:
64
+
# - hostmetrics # Hostmetrics receiver (CPU only)
55
65
- otlp
56
-
# - hostmetrics # Hostmetrics Receiver
57
66
```
58
67
59
68
{{% /notice %}}
60
69
61
-
Validate the **Agent** configuration using **[otelbin.io](https://www.otelbin.io/)**. For reference, the `metrics:` section of your pipelines will look similar to this:
70
+
<!-- Validate the **Agent** configuration using **[otelbin.io](https://www.otelbin.io/)**. For reference, the `metrics:` section of your pipelines will look similar to this:
content/en/conf/1-advanced-collector/2-building-resilience/2-3-failure.md
4 additions & 13 deletions
@@ -6,18 +6,12 @@ weight: 3
6
6
7
7
To assess the **Agent's** resilience, we'll simulate a temporary **Gateway** outage and observe how the **Agent** handles it:
8
8
9
-
**Summary**:
10
-
11
-
1. **Send Traces to the Agent** – Generate traffic by sending traces to the **Agent**.
12
-
2. **Stop the Gateway** – This will trigger the **Agent** to enter retry mode.
13
-
3. **Restart the Gateway** – The **Agent** will recover traces from its persistent queue and forward them successfully. Without the persistent queue, these traces would have been lost permanently.
**Simulate a network failure**: In the **Gateway terminal** stop the **Gateway** with `Ctrl-C` and wait until the gateway console shows that it has stopped:
11
+
**Simulate a network failure**: In the **Gateway terminal**, stop the **Gateway** with `Ctrl-C` and wait until its console shows that it has stopped. The **Agent** keeps running but can no longer send data to the **Gateway**. The output in the **Gateway terminal** should look similar to this:
18
12
19
13
```text
20
-
2025-01-28T13:24:32.785+0100 info service@v0.120.0/service.go:309 Shutdown complete.
14
+
2025-07-09T10:22:37.941Z info service@v0.126.0/service.go:345 Shutdown complete. {"resource": {}}
21
15
```
22
16
23
17
**Send traces**: In the **Loadgen terminal** window, send five more traces using `loadgen`.
@@ -31,16 +25,13 @@ Notice that the agent’s retry mechanism is activated as it continuously attemp
31
25
**Stop the Agent**: In the **Agent terminal** window, use `Ctrl-C` to stop the agent. Wait until the agent’s console confirms it has stopped:
32
26
33
27
```text
34
-
2025-01-28T14:40:28.702+0100 info extensions/extensions.go:66 Stopping extensions...
35
-
2025-01-28T14:40:28.702+0100 info service@v0.120.0/service.go:309 Shutdown complete.
28
+
2025-07-09T10:25:59.344Z info service@v0.126.0/service.go:345 Shutdown complete. {"resource": {}}
Stopping the agent will halt its retry attempts and prevent any future retry activity.
33
+
Stopping the agent halts its retry attempts and prevents any future retry activity.
42
34
43
35
If the agent runs for too long without successfully delivering data, it may begin dropping traces, depending on the retry configuration, to conserve memory. Stopping the agent before that happens keeps any metrics, traces, or logs still waiting in its queue from being dropped, so they remain available for recovery.
44
36
45
37
This step is essential for clearly observing the recovery process when the agent is restarted.
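How long the agent keeps retrying, and when it starts dropping data, is governed by the exporter's `retry_on_failure` and `sending_queue` settings. The following is a minimal sketch of those settings, assuming the `otlphttp` exporter and the `file_storage/checkpoint` extension configured earlier. All numeric values are illustrative assumptions, not the workshop's settings.

```yaml
exporters:
  otlphttp:
    endpoint: "http://localhost:5318"
    retry_on_failure:
      enabled: true
      initial_interval: 5s      # ASSUMPTION: wait before the first retry
      max_interval: 30s         # ASSUMPTION: upper bound on the backoff between retries
      max_elapsed_time: 300s    # ASSUMPTION: give up on a batch after this long; 0 means retry forever
    sending_queue:
      enabled: true
      queue_size: 1000          # ASSUMPTION: upper bound on queued requests
      storage: file_storage/checkpoint  # Keep the queue on disk so it survives a restart
```

With a finite `max_elapsed_time`, a batch that cannot be delivered within that window is discarded even though the queue is persistent, which is why the agent is stopped promptly in this exercise.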