
Commit cfb3638

draft resillience
1 parent 899700e commit cfb3638

2 files changed (+116, -95 lines)
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
---
title: Test your Resilience Setup
linkTitle: 4.1 Testing the Setup
weight: 1
---

3. **Run the Collector:**
   Run the OpenTelemetry Collector using the configuration file you just created by executing the following command in your terminal:

   ```bash
   otelcol --config agent.yaml
   ```

   This will start the collector with the configuration specified in the YAML file.

### Step 4: Testing the Resilience

To test the resilience built into the system:

1. **Simulate Network Failure:**
   Temporarily stop the OTLP receiver or shut down the endpoint where the telemetry data is being sent. You should see the retry mechanism kick in, as the collector attempts to resend the data.

2. **Check the Checkpoint Folder:**
   After a few retries, inspect the `./checkpoint-folder` directory. You should see checkpoint files stored there, which contain the serialized state of the queue.

3. **Restart the Collector:**
   Stop and restart the OpenTelemetry Collector while the OTLP endpoint is still down. On restart, the collector will resume sending data from the last checkpointed state, without losing any data.

4. **Inspect Logs and Files:**
   Inspect the logs to see the retry attempts. The `debug` exporter will output detailed logs, which should show retry attempts and any failures.
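
If you want a concrete way to drive this test, here is a minimal sketch. It assumes the layout used throughout this workshop: the agent listens for OTLP/HTTP on the default port `4318`, the gateway is the `localhost:5317` endpoint configured in the exporter, and `trace.json` from the `4-resilience` folder holds an OTLP trace payload. Adjust ports and file names to your own setup:

```bash
# 1. Stop the gateway collector (Ctrl+C in its terminal) so the agent's
#    otlphttp exporter starts failing and retrying.

# 2. Send a test trace to the agent; while the gateway is unreachable the
#    data should end up in the persistent sending queue.
curl -s http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d @trace.json

# 3. Watch checkpoint/queue files appear as the retries happen.
ls -l ./checkpoint-folder

# 4. Restart the agent, bring the gateway back up, and check the gateway's
#    debug output to confirm the queued data arrived.
```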

### Step 5: Fine-Tuning the Configuration for Production

- **Timeouts and Interval Adjustments:**
  You may want to adjust the `retry_on_failure` parameters for different network environments. In high-latency environments, increasing the `max_interval` might reduce unnecessary retries.

  ```yaml
  retry_on_failure:
    enabled: true
    initial_interval: 1s
    max_interval: 5s
    max_elapsed_time: 20s
  ```

- **Compaction and Transaction Size:**
  Depending on your use case, adjust the `max_transaction_size` for checkpoint compaction. A smaller transaction size will make checkpoint files more frequent but smaller, while a larger size might reduce disk I/O but require more memory.
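
For reference, this knob lives under the `file_storage/checkpoint` extension configured in the previous section; a trimmed sketch of just the compaction part (treat the number as a starting point, not a recommendation):

```yaml
compaction:
  on_start: true
  directory: ./checkpoint-folder
  # Larger values can reduce disk I/O at the cost of memory; smaller values
  # produce more frequent but smaller checkpoint files.
  max_transaction_size: 65_536
```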

### Step 6: Monitoring and Maintenance

- **Monitoring the Collector:**
  Use Prometheus or other monitoring tools to collect metrics from the OpenTelemetry Collector. You can monitor retries, the state of the sending queue, and other performance metrics to ensure the collector is behaving as expected. A quick command-line check is shown after this list.

- **Log Rotation:**
  The `file` exporter has a built-in log rotation mechanism to ensure that logs do not fill up your disk.

  ```yaml
  exporters:
    file:
      path: ./agent.out
      rotation:
        max_megabytes: 2
        max_backups: 2
  ```

  This configuration rotates the log file when it reaches 2 MB and keeps up to two backups.
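
For a quick look without a full monitoring stack, you can also scrape the Collector's own internal telemetry endpoint, which by default serves Prometheus-format metrics on port `8888` (exact metric names vary between Collector versions, so adjust the filter as needed):

```bash
# Check exporter queue depth and failed sends via the Collector's internal
# metrics endpoint (default: http://localhost:8888/metrics).
curl -s http://localhost:8888/metrics | grep -E "otelcol_exporter_(queue|send_failed)"
```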

### Conclusion

In this section, you learned how to enhance the resilience of the OpenTelemetry Collector by configuring the `file_storage/checkpoint` extension, setting up retry mechanisms for the OTLP exporter, and using a sending queue backed by file storage for storing data during temporary failures.

By leveraging file storage for checkpointing and queue persistence, you can ensure that your telemetry pipeline can recover gracefully from failures, making it more reliable for production environments.

content/en/ninja-workshops/10-advanced-otel/40-resillience/_index.md

Lines changed: 46 additions & 95 deletions
@@ -18,11 +18,23 @@ The goal is to show how this configuration allows your OpenTelemetry Collector t

### Setup

-Create a new sub directory called `4-resilience` and copy the contents from `3-filelog` across.
-
-[Insert file tree]
+Create a new sub directory called `4-resilience`, copy the content from `3-filelog` across, and remove any `*.out` files. Your starting point for this exercise should be:
+
+```text
+WORKSHOP
+├── 1-agent
+├── 2-gateway
+├── 3-filelog
+├── 4-resilience
+│   ├── agent.yaml
+│   ├── gateway.yaml
+│   ├── log-gen.sh
+│   ├── quotes.log
+│   └── trace.json
+└── otelcol
+```

-We are going to update the agent.yaml we have by adding an `extensions:` section.
+In this exercise we are going to update `agent.yaml` by adding an `extensions:` section.
This new section in an OpenTelemetry configuration YAML is used to define optional components that enhance or modify the behavior of the OpenTelemetry Collector. These components don’t handle telemetry data directly but provide additional capabilities or services to the Collector.
The first exercise will be providing **Checkpointing** with the `file_storage` extension.
The `file_storage` extension is used to ensure that the OpenTelemetry Collector can persist checkpoints to disk. This is especially useful in cases where there are network failures or restarts. This way, the collector can recover from where it left off without losing data.
@@ -33,15 +45,17 @@ The `file_storage` extension is used to ensure that the OpenTelemetry Collector

{{% notice title="Exercise" style="green" icon="running" %}}

-- Add the `extensions:` key at the top of the `agent.yaml` file
-- Add the`file_storage` key and name is `/checkpoint:`
-- Add the `directory:`key and set it to a value of `"./checkpoint-folder"`
-- Add the `create_directory:` key and set it to a value of `true`
-- Add the `timeout:` key and set it to a value of `1s`
-- Add the `compaction:` key
-- Add the `on_start:` key and set it to a value of `true`
-- Add the `directory:` key and set it to a value of `./checkpoint-folder`
-- Add the `max_transaction_size:` key and set it to a value of `65_536`
+Let's add the extension part first:
+
+1. **Add** `extensions:` **section**: Place this at the top of the `agent.yaml` file.
+2. **Add** `file_storage` **extension**: Under the `extensions:` section, name it `file_storage/checkpoint:`.
+3. **Add** `directory:` **key**: Under the `file_storage/checkpoint:` extension, set it to a value of `"./checkpoint-folder"`.
+4. **Add** `create_directory:` **key**: Set the value to `true`.
+5. **Add** `timeout:` **key**: Set the value to `1s`.
+6. **Add** `compaction:` **key**.
+7. **Add** `on_start:` **key**: Under the `compaction:` section, set the value to `true`.
+8. **Add** `directory:` **key**: Under the `compaction:` section, set the value to `./checkpoint-folder`.
+9. **Add** `max_transaction_size:` **key**: Set it to a value of `65_536`.

{{% /notice%}}
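
Once you have worked through the steps, the finished block should look roughly like the sketch below (assembled from the values above; indentation matters). If your `agent.yaml` does not already reference the extension under the `service:` section, remember that extensions also need to be listed there before the Collector will load them:

```yaml
extensions:
  file_storage/checkpoint:
    directory: "./checkpoint-folder"
    create_directory: true
    timeout: 1s
    compaction:
      on_start: true
      directory: ./checkpoint-folder
      max_transaction_size: 65_536
```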

@@ -55,21 +69,31 @@ The `file_storage` extension is used to ensure that the OpenTelemetry Collector

The next exercise is modifying the `otlphttp:` exporter, where retries and queueing are configured.

+{{% notice title="Exercise" style="green" icon="running" %}}
+We are going to extend the existing `otlphttp` exporter:
+
```yaml
exporters:
  otlphttp:
    endpoint: "localhost:5317"
-    tls:
-      insecure: true
-    retry_on_failure:
-      enabled: true
-    sending_queue:
-      enabled: true
-      num_consumers: 10
-      queue_size: 10000
-      storage: file_storage/checkpoint
+    headers:
+      X-SF-Token: "FAKE_SPLUNK_ACCESS_TOKEN" # or your own version of a token
```
+
+**Steps:**
+
+1. **Add** `tls:` **key**: Place it at the same indent level as `headers:`.
+2. **Add** `insecure:` **key**: Under the `tls:` key, set its value to `true`.
+3. **Add** `retry_on_failure:` **key**.
+4. **Add** `enabled:` **key**: Under the `retry_on_failure:` key, set its value to `true`.
+5. **Add** `sending_queue:` **key**.
+6. **Add** `enabled:` **key**: Under the `sending_queue:` key, set its value to `true`.
+7. **Add** `num_consumers:` **key**: Set its value to `10`.
+8. **Add** `queue_size:` **key**: Set its value to `10000`.
+9. **Add** `storage:` **key**: Set its value to `file_storage/checkpoint`.
+
+{{% /notice%}}

**Explanation:**

- `retry_on_failure.enabled: true`: Enables retrying when there is a failure in sending data to the OTLP gateway.

@@ -78,76 +102,3 @@ exporters:
- `queue_size: 10000`: The maximum size of the queue.
- `storage: file_storage/checkpoint`: Specifies that the queue state will be backed up in the file system.
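
Putting the steps together, the reworked `otlphttp` exporter should end up looking something like this sketch (values taken from the exercise above; keep whatever access token you are actually using):

```yaml
exporters:
  otlphttp:
    endpoint: "localhost:5317"
    headers:
      X-SF-Token: "FAKE_SPLUNK_ACCESS_TOKEN"
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 10000
      storage: file_storage/checkpoint
```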

-### Step 3: Running the OpenTelemetry Collector with the Configuration
-
-1. **Create Checkpoint Folder:**
-   Make sure that the folder `./checkpoint-folder` exists in your working directory. The OpenTelemetry Collector will use this folder to store checkpoint and transaction files.
-
-2. **Save the Configuration:**
-   Save the YAML configuration to a file, such as `agent.yaml`.
-
*(The remaining removed lines, from "3. **Run the Collector:**" through the Conclusion, are identical to the content of the new "4.1 Testing the Setup" page shown above.)*
