Conversation

@yhabteab (Member) commented Jan 15, 2026

This PR introduces the long-awaited OTelWriter, a new Icinga 2 component that enables seamless integration with OpenTelemetry. I'm a newbie to OpenTelemetry, so bear with me if you spot any obvious mistakes ;). I would highly appreciate any feedback from OpenTelemetry experts (cc @martialblog and all the other users who reacted to the referenced issue).

First and foremost, this might surprise some of you, but this PR does not make use of the existing OpenTelemetry C++ SDK.
The reason for this is twofold:

  1. The OpenTelemetry C++ SDK is huge and complex, and none of the Icinga 2 developers (including myself) have experience with it. Nonetheless, I gave it a try, but gave up after a week of struggling to even get a simple example working. Also, when it comes to debugging Icinga 2 issues related to OpenTelemetry, it would be extremely hard, if not impossible, to help our users if they run into problems with our OpenTelemetry integration. Furthermore, the SDK ABI version on my Mac (installed via Homebrew) is 1, which lacks many newer features and improvements that are only available in version 2. I also haven't verified whether the SDK is even available on all platforms we support, but from my experience with this PR, I doubt that it is.
  2. The default HTTP OpenTelemetry protocol (OTLP) implementation is based on curl, and I found it annoying to be greeted by mysterious crashes that I had never encountered before. After some research, I found out that this is due to curl's multi-threading behavior clashing with Icinga 2's own multi-threading model. While this can surely be worked around, it's questionable whether the default OTLP client implementation using curl fits our requirements. The SDK does provide a way to inject a custom client implementation, but honestly, I simply failed to get anything done that aligns with Icinga 2's architecture, so I abandoned all hope of using the OpenTelemetry C++ SDK.

Instead, I implemented a tiny OTLP HTTP client based on Boost.Beast that only supports OpenTelemetry metrics. That's right, no traces or logs, just metrics. Of course, it still uses Protocol Buffers for serialization as required by the OTLP specification, but without pulling in the entire OpenTelemetry C++ SDK. Also, since Icinga 2 just transforms the collected performance data into OpenTelemetry metrics (which means there's no way to know ahead of time which metrics with which names/units will be sent), the implementation doesn't provide any advanced aggregation features like the OpenTelemetry SDK does. Instead, it simply creates a single metric stream without any units or aggregation temporality, then appends each produced performance data value, transformed into an OTel Gauge data point, to that stream. Here's what the OpenTelemetry collector debug printout looks like when sending some sample performance data to a local OpenTelemetry collector instance:

{
  "resourceMetrics": [
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "Icinga 2"
            }
          },
          {
            "key": "service.instance.id",
            "value": {
              "stringValue": "547bc214-5b76-484e-833d-2de90da1bb74"
            }
          },
          {
            "key": "service.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "telemetry.sdk.language",
            "value": {
              "stringValue": "cpp"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "Icinga 2 OTel Integration"
            }
          },
          {
            "key": "telemetry.sdk.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "service.namespace",
            "value": {
              "stringValue": "icinga"
            }
          },
          {
            "key": "icinga2.host.name",
            "value": {
              "stringValue": "something"
            }
          },
          {
            "key": "icinga2.command.name",
            "value": {
              "stringValue": "icinga"
            }
          }
        ],
        "entityRefs": [
          {
            "type": "host",
            "idKeys": [
              "icinga2.host.name"
            ]
          }
        ]
      },
      "scopeMetrics": [
        {
          "scope": {
            "name": "icinga2",
            "version": "v2.15.0-235-gb35b335f2"
          },
          "metrics": [
            {
              "name": "state_check.perfdata",
              "gauge": {
                "dataPoints": [
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_conn_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762502097898752",
                    "timeUnixNano": "1768762502101475072",
                    "asDouble": 0
                  },
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762502097898752",
                    "timeUnixNano": "1768762502101475072",
                    "asDouble": 0
                  }
                ]
              }
            }
          ],
          "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
        }
      ],
      "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
    },
    {
      "resource": {
        "attributes": [
          {
            "key": "service.name",
            "value": {
              "stringValue": "Icinga 2"
            }
          },
          {
            "key": "service.instance.id",
            "value": {
              "stringValue": "2d6c27cd-484d-436d-9542-b70abdaf2f76"
            }
          },
          {
            "key": "service.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "telemetry.sdk.language",
            "value": {
              "stringValue": "cpp"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "Icinga 2 OTel Integration"
            }
          },
          {
            "key": "telemetry.sdk.version",
            "value": {
              "stringValue": "v2.15.0-235-gb35b335f2"
            }
          },
          {
            "key": "service.namespace",
            "value": {
              "stringValue": "icinga"
            }
          },
          {
            "key": "icinga2.host.name",
            "value": {
              "stringValue": "something"
            }
          },
          {
            "key": "icinga2.service.name",
            "value": {
              "stringValue": "something-service"
            }
          },
          {
            "key": "icinga2.command.name",
            "value": {
              "stringValue": "icinga"
            }
          }
        ],
        "entityRefs": [
          {
            "type": "service",
            "idKeys": [
              "icinga2.host.name",
              "icinga2.service.name"
            ]
          }
        ]
      },
      "scopeMetrics": [
        {
          "scope": {
            "name": "icinga2",
            "version": "v2.15.0-235-gb35b335f2"
          },
          "metrics": [
            {
              "name": "state_check.perfdata",
              "gauge": {
                "dataPoints": [
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_conn_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762509990163200",
                    "timeUnixNano": "1768762510002787072",
                    "asDouble": 0
                  },
                  {
                    "attributes": [
                      {
                        "key": "label",
                        "value": {
                          "stringValue": "api_num_endpoints"
                        }
                      }
                    ],
                    "startTimeUnixNano": "1768762509990163200",
                    "timeUnixNano": "1768762510002787072",
                    "asDouble": 0
                  }
                ]
              }
            }
          ],
          "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
        }
      ],
      "schemaUrl": "https://opentelemetry.io/schemas/1.39.0"
    }
  ]
}
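
To make the transformation a bit more concrete in code, here's a rough sketch (not the PR's actual code; it assumes the classes generated from the opentelemetry-proto files that this PR pulls in, and it simplifies the real logic) of how a single performance data value becomes one of the Gauge data points shown above:

```cpp
#include <cstdint>
#include <string>

// Header generated from opentelemetry/proto/metrics/v1/metrics.proto.
#include "opentelemetry/proto/metrics/v1/metrics.pb.h"

namespace otlp = opentelemetry::proto::metrics::v1;

// Append one perfdata value to the single "state_check.perfdata" metric stream.
void AppendPerfdataPoint(otlp::Metric& metric, const std::string& label,
    double value, uint64_t startTimeNano, uint64_t endTimeNano)
{
    metric.set_name("state_check.perfdata");

    auto* dataPoint = metric.mutable_gauge()->add_data_points();
    dataPoint->set_start_time_unix_nano(startTimeNano);
    dataPoint->set_time_unix_nano(endTimeNano);
    dataPoint->set_as_double(value);

    // Each data point carries the perfdata label as an attribute, just like in the JSON above.
    auto* attr = dataPoint->add_attributes();
    attr->set_key("label");
    attr->mutable_value()->set_string_value(label);
}
```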

As already mentioned, this is a first implementation and everything is open for discussion, but primarily the following aspects:

  • Eliminating all trivial attributes that don't add any value (these attributes are added only when the enable_send_metadata option is set and include attributes like icinga2.check.state, icinga2.check.latency, etc.). EDIT: There are no such attributes anymore, and the enable_send_metadata option is gone as well.
  • Choosing better attribute names (currently prefixed with icinga2. to avoid collisions) and the overall metric naming (currently just icinga2.perfdata for all performance data points). EDIT: The metrics have been renamed to state_check.perfdata, state_check.threshold.warning, etc. as suggested in #10685 (comment).

The high-level class overview is shown in the following Mermaid UML diagram:

---
title: OTel Integration
---
classDiagram
    %% The two classes below are just type alias definitions for better readability.
    note for OTelAttrVal "OTelAttrVal is implemented as a type alias, not a class."
    note for OTelAttrsMap "OTelAttrsMap is implemented as a type alias, not a class."

    class OTelAttrVal {
        <<type alias>>
        +std::variant~bool, int64_t, double, String~
    }

    class OTelAttrsMap {
        <<type alias>>
        +set~pair~String-AttrValue~~
    }
    OTelAttrsMap --o OTelAttrVal : manages

    class Gauge {
        -std::unique_ptr~proto::Gauge~ ProtoGauge

        +Transform(metric: proto::Metric*) void
        +IsEmpty() bool
        +Record(value: double|int64_t, start_time: double, end_time: double, attributes: OTelAttrsMap) std::size_t
    }
    OTelAttrsMap <.. Gauge : uses

    class OTel {
        -proto::ExportMetricsServiceRequest Request
        -std::optional~StreamType~ Stream
        -asio::io_context::strand Strand
        -std::atomic_bool Exporting, Stopped

        +Start() void
        +Stop() void
        +Export(MetricsRequest& request) void
        +IsExporting() bool
        +Stopped() bool

        +ValidateName(name: string_view) bool$
        +IsRetryableExportError(status: beast::http::status) bool$
        +PopulateResourceAttrs(rm: const std::unique_ptr~opentelemetry::proto::metrics::v1::ResourceMetrics~&) void$

        -Connect(yc: boost::asio::yield_context&) void
        -ExportLoop(yc: boost::asio::yield_context&) void
        -Export(yc: boost::asio::yield_context&) void
        -ExportingSet(exporting: bool, notifyAll: bool) void
    }
    AsioProtobufOutputStream <.. OTel : serializes via
    RetryableExportError <.. OTel : uses
    Backoff <.. OTel : uses

    `google::protobuf::io::ZeroCopyOutputStream` <|.. AsioProtobufOutputStream : implements
    class AsioProtobufOutputStream {
        -int64_t Pos
        -int64_t Buffered
        -HttpRequestWriter Writer
        -asio::yield_context& Yield

        +AsioProtobufOutputStream(stream: const StreamType&, info: const OTelConnInfo&, yield: asio::yield_context&)
        +Next(data: void**, size: int**) bool
        +BackUp(count: int) void
        +ByteCount() std::size_t
        -Flush(final: bool) bool
    }

    class RetryableExportError {
        -uint64_t Throttle
        +Throttle() uint64_t
        +what() const char*
    }

    class Backoff {
        +std::chrono::milliseconds MaxBackoff$
        +std::chrono::milliseconds MinBackoff$
        +operator()() std::chrono::milliseconds
    }
    class OTLPMetricsWriter {
        -unordered_set~shared_ptr~Metric~~ Metrics
        -OTel m_Exporter
    }
    Gauge "0...*" o-- "" OTLPMetricsWriter: produces
    OTelAttrsMap <.. OTLPMetricsWriter: creates
    OTel "1" o-- "" OTLPMetricsWriter: exports via

The OTelWriter by itself is pretty straightforward and doesn't contain any complex logic. The main OTel-related logic is encapsulated in a new library called otel, which provides an HTTP client that conforms to the OTLP HTTP protocol specification. The OTel class is the one used by the OTelWriter to export metrics to the OpenTelemetry collector; internally it uses several helper classes to build the required Protocol Buffers messages as per the OpenTelemetry specification. Unlike the existing metric writers, this client doesn't create a separate HTTP connection for each metric export. Instead, it maintains a persistent connection to the OpenTelemetry collector and reuses it for subsequent exports until the connection is closed by either side. The Protobuf message is serialized directly into the HTTP connection without any intermediate buffering of the serialized message. This is only possible because the OpenTelemetry Collector supports HTTP/1.1 chunked transfer encoding, which allows sending the message in chunks without knowing the entire message size beforehand.
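
To make the zero-copy, chunked serialization a bit more concrete, here's a minimal self-contained sketch (not the PR's AsioProtobufOutputStream; the WriteChunk callback merely stands in for the Boost.Beast chunked-body writer) of a google::protobuf::io::ZeroCopyOutputStream that flushes every buffer it hands out as one HTTP chunk:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

#include <google/protobuf/io/zero_copy_stream.h>

class ChunkedOutputStream final : public google::protobuf::io::ZeroCopyOutputStream
{
public:
    // Called once per filled buffer; a real implementation would write an HTTP chunk here.
    using WriteChunk = std::function<void(const char* data, std::size_t len)>;

    explicit ChunkedOutputStream(WriteChunk write, std::size_t chunkSize = 8 * 1024)
        : m_Write(std::move(write)), m_Buffer(chunkSize) { }

    ~ChunkedOutputStream() override { Flush(); }

    // Hand Protobuf the next writable buffer, flushing the previous one to the network first.
    bool Next(void** data, int* size) override
    {
        Flush();
        *data = m_Buffer.data();
        *size = static_cast<int>(m_Buffer.size());
        m_Pending = m_Buffer.size();
        return true;
    }

    // Protobuf didn't use the last `count` bytes of the buffer it was handed.
    void BackUp(int count) override { m_Pending -= static_cast<std::size_t>(count); }

    int64_t ByteCount() const override { return m_Flushed + static_cast<int64_t>(m_Pending); }

private:
    void Flush()
    {
        if (m_Pending > 0) {
            m_Write(m_Buffer.data(), m_Pending); // one HTTP chunk per buffer
            m_Flushed += static_cast<int64_t>(m_Pending);
            m_Pending = 0;
        }
    }

    WriteChunk m_Write;
    std::vector<char> m_Buffer;
    std::size_t m_Pending = 0;
    int64_t m_Flushed = 0;
};
```

With this shape, message.SerializeToZeroCopyStream(&stream) never needs the full serialized size up front, which is exactly what chunked transfer encoding allows.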

That's it. Overall, this implementation is quite minimalistic and only implements the bare minimum required to send metrics to an OpenTelemetry collector.

Known Issues

Well, since the OpenTelemetry proto files require proto3 language syntax, it turned out that not all of our supported Distros provide a recent enough version of protoc that supports proto3. Those Distros are:

  • Amazon Linux 2 (will be EOL soon, so not a big deal)
  • Debian 11 (Bullseye) - will be EOL soon as well, but Ubuntu 22.04 LTS has the same issue, so I don't know yet how to deal with this one.
  • And finally, the big one: RHEL 8 and 9 are also affected by this issue, so we will probably end up having to provide our own Protobuf packages for these Distros, but that's a topic for another day.

Also, due to the FindProtobuf module shipped with CMake versions < 3.31.0 being completely broken, I ended up having to import that very same module from CMake 3.31.0 into our CMake third-party modules directory. This is obviously not ideal, but I didn't find any other way around this issue. Once we bump our minimum required CMake version to 3.31.0, we can remove this workaround again. So, the PR being so huge is partly due to this workaround.

Testing

Testing this PR is a non-trivial task as it requires some knowledge about OpenTelemetry/Prometheus and setting up a local
collector instance. Here's a brief guide for anyone interested in testing this PR:

First, you need to set up an OpenTelemetry Collector instance. You can use the official Docker image for this purpose. I've included two exporters in the configuration: a standard output exporter for debugging and a Prometheus exporter to scrape the metrics via Prometheus (choose whatever you're comfortable with).

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
#exporters:
#  debug:
#    verbosity: detailed
#service:
#  pipelines:
#    metrics:
#      receivers: [otlp]
#      processors: []
#      exporters: [debug]
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [prometheus]

And then start the collector using the following command:

docker network create otel
docker run --network otel -p 4318:4318 --rm -v $(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml otel/opentelemetry-collector

If you chose the debug exporter instead of Prometheus, you will see the received metrics printed in the container logs, once Icinga 2 starts sending them. Otherwise, you have to add a Prometheus instance to scrape the metrics from the collector. For this, you can just use the following config and start Prometheus via Docker as well:

global:
  scrape_interval: 1m
scrape_configs:
  - job_name: "icinga2-metrics-scraper"
    static_configs:
      - targets: ["host.docker.internal:8889"] # You might need to adjust this address depending on your OS.
        labels:
          app: "icinga2-metrics-scraper"
    metrics_path: /metrics

And start Prometheus in the background:

docker run --network otel --name prometheus -d -p 9090:9090 -v $(pwd)/config.yml:/etc/prometheus/prometheus.yml prom/prometheus

Now, the only thing left to do is to build Icinga 2 with this PR applied and configure the OTelWriter. If you're an experienced Icinga 2 user who knows how to manually build Icinga 2 Docker images, you can do your own thing and skip the following steps. For everyone else, here's a quick guide:


  • First, clone the Icinga 2 repository and checkout this PR branch:
git clone [email protected]:Icinga/icinga2.git
cd icinga2
git checkout otel
  • Next, build a local Docker image using the Containerfile provided in the just cloned repository:
docker build --tag icinga/icinga2:otel --file Containerfile .

Afterwards, you can use this image icinga/icinga2:otel to start an Icinga 2 container with the OTelWriter configured just like any other Icinga 2 component.

Having done all of the above (especially if you chose the Prometheus exporter), how do you verify that everything works as expected? Well, you have to do a few more things :). Here's what I used to render some beautiful graphs in Icinga Web 2.


First, install the icingaweb2-module-perfdatagraphs-prometheus module developed by @oxzi. In order to make it work with the data sent by the OTelWriter, you need to perform some monkey patching.

Monkey Patch
diff --git a/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php b/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
index 0208f86..ea60e2e 100644
--- a/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
+++ b/library/Perfdatagraphsprometheus/ProvidedHook/Perfdatagraphs/PerfdataSource.php
@@ -26,13 +26,13 @@ class PerfdataSource extends PerfdataSourceHook
     {
         // TODO: honor PerfdataRequest's includeMetrics, excludeMetrics
         $promQuery = '{';
-        $promQuery .= '__name__=~"icinga_check_result_perf.*"';
-        $promQuery .= ', host="' . $req->getHostname(). '"';
+        $promQuery .= '__name__="icinga2_perfdata"';
+        $promQuery .= ', icinga2_host="' . $req->getHostname(). '"';
         if ($req->isHostCheck()) {
-            $promQuery .= ', object_type="host"';
+            $promQuery .= ', icinga2_service=""';
         } else {
-            $promQuery .= ', object_type="service"';
-            $promQuery .= ', service="' . $req->getServicename() . '"';
+            //$promQuery .= ', object_type="service"';
+            $promQuery .= ', icinga2_service="' . $req->getServicename() . '"';
         }
         $promQuery .= '}';

@@ -42,7 +42,7 @@ class PerfdataSource extends PerfdataSourceHook
         $client = new Client();
         $promResponse = $client->request(
             'POST',
-            'http://localhost:9090/api/v1/query_range', // TODO: configurable
+            'http://host.docker.internal:9090/api/v1/query_range', // TODO: configurable
             [
                 'form_params' => [
                     'query' => $promQuery,
@@ -75,20 +75,31 @@ class PerfdataSource extends PerfdataSourceHook
         // ]
         $datasets = array();
         foreach ($promResponse['data']['result'] as $result) {
-            $label = $result['metric']['label'];
-            $name = $result['metric']['__name__'];
-
-            if (! array_key_exists($name, $rename)) {
-                throw new Exception('unexpected __name__ ' . $name);
-            }
-            $name = $rename[$name];
+            $metric = $result['metric'];
+            $label = $metric['icinga2_perfdata_label'];
+            $name = 'value';

             if (! array_key_exists($label, $datasets)) {
                 $datasets[$label] = [
-                    'unit' => array_key_exists('unit', $result['metric']) ? $result['metric']['unit'] : '',
+                    'unit' => '',
                     'times' => array(),
                     'vals' => array(),
                 ];
+                if (isset($metric['icinga2_perfdata_unit'])) {
+                    $datasets[$label]['unit'] = $result['metric']['icinga2_perfdata_unit'];
+                }
+                if (isset($metric['icinga2_perfdata_crit'])) {
+                    $datasets[$label]['critical'] = $metric['icinga2_perfdata_crit'];
+                }
+                if (isset($metric['icinga2_perfdata_warn'])) {
+                    $datasets[$label]['warning'] = $metric['icinga2_perfdata_warn'];
+                }
+                if (isset($metric['icinga2_perfdata_min'])) {
+                    $datasets[$label]['min'] = $metric['icinga2_perfdata_min'];
+                }
+                if (isset($metric['icinga2_perfdata_max'])) {
+                    $datasets[$label]['max'] = $metric['icinga2_perfdata_max'];
+                }

                 foreach ($result['values'] as $valuePair) {
                     $datasets[$label]['times'][] = $valuePair[0];

Next, you need to install and configure yet another module, the icingaweb2-module-perfdatagraphs module, in your Icinga Web 2 instance. Follow the instructions in its repository to get it set up, and use the module cloned above as the backend for it.

If everything is set up correctly, you should start seeing performance data metrics in Prometheus as well as beautiful
graphs in Icinga Web 2.

(Screenshots: performance data graphs rendered in Icinga Web 2.)

TODO

  • Missing documentation.

resolves #10439
resolves #9900

@cla-bot added the cla/signed label Jan 15, 2026
@martialblog (Member):

Very nice work! I'll have a look at it.

This should also resolve #9900

@martialblog (Member) commented Jan 16, 2026

Hi,

I spent some time testing the new Writer. Here's some first feedback:

  1. OTLPMetricsWriter

Maybe the name of the writer could be OTLPMetricWriter instead of OTelWriter.
It more clearly describes what it does. Also, that gives you room to maybe have an
OTLPLogsWriter in the future.

Just an idea. OTelWriter is also fine.

  2. enable_send_metadata = true

When I set enable_send_metadata = true, the daemon crashes when it wants to send data.
As we discussed, maybe enable_send_metadata is not yet required and we can remove it in the first release.

  3. Resource service.namespace

Maybe the resource attribute service.namespace should not be hard coded to "icinga".

From the OpenTelemetry Conventions:

type: service.namespace
Description: Groups related services that compose a system or application under a common namespace
A string value having a meaning that helps to distinguish a group of services, for example the team name that owns a group of services.

I think the namespace is more akin to a "Kubernetes Namespace".
Meaning users may want to set something like "icinga-production" or "icinga-staging".

I'm not 100% sure if this is something that should be set via the actual Icinga Service Objects or
once for the Icinga OTLP Writer.

  4. Host and Service resource attributes

Currently the Icinga Host/Service information is set as an attribute, maybe these should be resource attributes.

From the OpenTelemetry Conventions:

A Resource is a representation of the entity producing telemetry as Attributes. For example, You could have a process producing telemetry that is running in a container on Kubernetes, which is associated to a Pod running on a Node that is a VM but also is in a namespace and possibly is part of a Deployment. Resource could have attributes to denote information about the Container, the Pod, the Node, the VM or the Deployment.

There are Host and Service conventions for this.

I think it's ok if they are "namespaced" like this "icinga2.host.name" and "icinga2.service.name".

  5. icinga2_perfdata Metric Name

Maybe we want a more generic metric name than "icinga2_perfdata", since the source (Icinga) can be determined from the (resource) attributes.

icinga2_perfdata{icinga2_check_command="procs", icinga2_host="674c37a9881b", icinga2_perfdata_label="procs", icinga2_service="procs", instance="44a9cbd1-cbf1-4728-a352-b32727382928", job="icinga/icinga2", service_instance_id="44a9cbd1-cbf1-4728-a352-b32727382928", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

For example there are proposals for a health_check.status and a health_check.threshold metric. Personally I think "state_check" is a good namespace to start with. Then we can have "state_check.perfdata", "state_check.threshold", "state_check.min", and so on.

See also open-telemetry/semantic-conventions#1106

  6. Thresholds should be metrics

When enable_send_thresholds = true is set the thresholds are added as attributes.

icinga2_perfdata{icinga2_check_command="load", icinga2_host="674c37a9881b", icinga2_perfdata_crit="6", icinga2_perfdata_label="load5", icinga2_perfdata_min="0", icinga2_perfdata_warn="4", icinga2_service="load", instance="f3028916-6b63-4357-97c3-2e281e1e4b2f", job="icinga/icinga2", service_instance_id="f3028916-6b63-4357-97c3-2e281e1e4b2f", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

This makes it hard to work with them for example when plotting them. They should be encoded as metrics with the same attributes as the perfdata metric. For example:

state_check.perfdata{
  service.name=icinga
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 3

state_check.threshold{
  service.name=icinga
  icinga2.threshold=warning
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 10

state_check.threshold{
  service.name=icinga
  icinga2.threshold=critical
  icinga2.host.name=node1
  icinga2.check.name=checkload
  icinga2.service.name=load} 20
  7. Prometheus OTLP

When I send the data to Prometheus via its OTLP receiver, the Icinga 2 daemon logs some warnings (due to the response headers, I think):

[2026-01-16 10:27:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'otel-collector:4318'.
[2026-01-16 10:27:19 +0000] information/OTelWriter: 'prometheus' resumed.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'prometheus:9090'.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-16 10:27:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-16 10:27:19 +0000] information/CheckerComponent: 'checker' started.
[2026-01-16 10:27:19 +0000] information/ConfigItem: Activated all objects.
[2026-01-16 10:27:29 +0000] information/WorkQueue: #6 (OTelWriter, otel) items: 0, rate: 0.116667/s (7/min 7/5min 7/15min);
[2026-01-16 10:27:29 +0000] information/WorkQueue: #7 (OTelWriter, prometheus) items: 0, rate: 0.116667/s (7/min 7/5min 7/15min);
[2026-01-16 10:27:34 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:27:49 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:28:04 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
[2026-01-16 10:28:19 +0000] warning/OTelExporter: Unexpected Content-Type from OpenTelemetry collector:  (OK).
Here is my test setup
---
services:
  icinga:
    image: localhost/icinga/icinga2:otel
    entrypoint: sleep infinity
    user: root
  prometheus:
    image: docker.io/prom/prometheus
    privileged: true
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-otlp-receiver'
    ports:
      - '9090:9090'
  otel-collector:
    image: docker.io/otel/opentelemetry-collector
    volumes:
      - ./otel.yml:/etc/otelcol/config.yaml
      - ./metrics.json:/app/metrics.json
    ports:
      - 4318:4318
  grafana:
    image: docker.io/grafana/grafana
    ports:
      - '3000:3000'
OTel Collector
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
exporters:
  debug:
    verbosity: detailed
  file:
    path: /app/metrics.json
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: []
      exporters: [debug,file]
prometheus.yml
global:
  scrape_interval: 5s
otlp:
  promote_resource_attributes:
    - service.instance.id
    - service.name
    - service.namespace
    - service.version

@martialblog (Member) commented Jan 16, 2026

Also tested it with Grafana Mimir. Works great

Grafana Mimir
object OTelWriter "prometheus" {
  host = "prometheus"
  port = 9090
  metrics_endpoint = "/api/v1/otlp/v1/metrics"
}

object OTelWriter "mimir" {
  host = "mimir"
  port = 8080
  metrics_endpoint = "/otlp/v1/metrics"
}
services:
  mimir:
    image: docker.io/grafana/mimir:latest
    command: ["-config.file=/etc/mimir.yaml"]
    ports:
      - '8080:8080'
    volumes:
      - ./mimir.yaml:/etc/mimir.yaml

cat mimir.yaml 
---
multitenancy_enabled: false

blocks_storage:
  backend: filesystem
  bucket_store:
    sync_dir: /tmp/mimir/tsdb-sync
  filesystem:
    dir: /tmp/mimir/data/tsdb
  tsdb:
    dir: /tmp/mimir/tsdb

compactor:
  data_dir: /tmp/mimir/compactor
  sharding_ring:
    kvstore:
      store: memberlist

distributor:
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: memberlist

ingester:
  ring:
    instance_addr: 0.0.0.0
    kvstore:
      store: memberlist
    replication_factor: 1

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /tmp/mimir/rules

server:
  http_listen_port: 8080
  log_level: error

store_gateway:
  sharding_ring:
    replication_factor: 1

@martialblog (Member):

Another small note: I think icinga2_perfdata_unit can simply be "unit":

icinga2_perfdata{icinga2_check_command="disk", icinga2_host="938b0ea145df", icinga2_perfdata_crit="903362170060", icinga2_perfdata_label="/data", icinga2_perfdata_max="1003735744512", icinga2_perfdata_min="0", icinga2_perfdata_unit="bytes", icinga2_perfdata_warn="802988595609", icinga2_service="disk", instance="28e0fd2c-ddbc-48e3-8e6f-d158943bdc1d", job="icinga/icinga2", service_instance_id="28e0fd2c-ddbc-48e3-8e6f-d158943bdc1d", service_name="icinga2", service_namespace="icinga", service_version="v2.15.0-235-gb35b335f2"}

@yhabteab (Member, Author):

Thanks for the feedback and compose file you provided!

  1. OTLPMetricsWriter

I will bring this into our weekly meetings next week for discussion.

  2. enable_send_metadata = true

Aye, that was an oversight on my part. Though, I've dropped it completely now, so there shouldn't be any daemon crashes anymore :).

  3. Resource service.namespace

Maybe the resource attribute service.namespace should not be hard coded to "icinga".
I'm not 100% sure if this is something that should be set via the actual Icinga Service Objects or once for the Icinga OTLP Writer.

We must look into this from the Icinga 2 side and not from an OTel perspective. There is no way we're going to introduce
a new attribute to all host and service objects just for this purpose. So, the alternative is to at least make it configurable
so that users can set it on a per-instance basis. That way, each OTelWriter instance can have its own namespace.

  4. Host and Service resource attributes

Currently the Icinga Host/Service information is set as an attribute, maybe these should be resource attributes.

Ack! Will change that.

  5. icinga2_perfdata Metric Name

For example there are proposals for a health_check.status and a health_check.threshold metric. Personally I think "state_check" is a good namespace to start with. Then we can have "state_check.perfdata", "state_check.threshold", "state_check.min", and so on.

That makes sense. I'll update the metric names accordingly, especially since there's a proposal for this.

  6. Thresholds should be metrics

This makes it hard to work with them for example when plotting them. They should be encoded as metrics with the same attributes as the perfdata metric.

Good point! I'm going to transform all thresholds into separate metric streams then, i.e., state_check.threshold.crit, state_check.threshold.warn, etc.

  7. Prometheus OTLP

I didn't know about this, so thanks for testing it out! I'll fix it.
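
Just to illustrate one possible direction for the fix (a sketch under assumptions, not the actual change): the OTLP/HTTP binary encoding uses Content-Type: application/x-protobuf, but an absent Content-Type on an otherwise successful response could simply be tolerated instead of warned about:

```cpp
#include <iostream>

#include <boost/beast/http.hpp>

namespace http = boost::beast::http;

// Warn only if a Content-Type is present and differs from the OTLP binary encoding.
void CheckContentType(const http::response<http::string_body>& res)
{
    auto contentType = res[http::field::content_type];

    if (!contentType.empty() && contentType != "application/x-protobuf") {
        std::cerr << "Unexpected Content-Type from OpenTelemetry collector: "
                  << contentType << " (" << res.result() << ")\n";
    }
}

int main()
{
    http::response<http::string_body> res;
    res.result(http::status::ok); // no Content-Type set, as in the responses observed above
    CheckContentType(res);        // prints nothing with the relaxed check
}
```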

@yhabteab (Member, Author):

I've addressed all your feedback apart from the OTelWriter naming thing. I've also updated the PR description and included a JSON example that showcases what the metrics look like in an OTel collector. Please have another look when you get the chance. Thanks!

@yhabteab (Member, Author):

Sorry, I had to fix one issue introduced with my last push (thanks @martialblog for testing!). Apparently, namespace is a reserved keyword in the Icinga 2 DSL, so it can't be used as an attribute name.

@Al2Klimov (Member):

Hold my beer. 😉

Icinga 2 (version: v2.15.0-232-g6701edf6e)
Type $help to view available commands.
<1> => {namespace = 1}
                  ^
syntax error, unexpected = (T_SET)
<2> => {"namespace" = 1}
{
	@namespace = 1.000000
}
<3> => {@namespace = 1}
{
	@namespace = 1.000000
}
<4> => { {{{namespace}}} = 1 }
{
	@namespace = 1.000000
}
<5> =>

@yhabteab (Member, Author):

Thanks! I'm aware that DSL users can escape keywords but there's no point in using a reserved word as an attribute for a built-in config object.

@martialblog (Member):

I encountered a strange issue with the new code 71028d3a297844ce855052b72672b618d5179669.

The daemon did tell me it was flushing data, but then never did.

[2026-01-19 14:10:54 +0000] information/OTelWriter: Flushing OTel metrics to OpenTelemetry collector (timer expired).

@yhabteab and I did some debugging and isolated this area. When replacing the ASSERT with VERIFY it seems to work again. Yonas has the details.

 void OTel::Export(boost::asio::yield_context& yc)
 {
        AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
-       ASSERT(m_Request->SerializeToZeroCopyStream(&outputS));
+      VERIFY(m_Request->SerializeToZeroCopyStream(&outputS));

@yhabteab (Member, Author) commented Jan 19, 2026

@yhabteab and I did some debugging and isolated this area. When replacing the ASSERT with VERIFY it seems to work again. Yonas has the details.

Thanks for your help! Apparently, assert() isn't allowed to have any side effects, which I was not aware of (thanks @jschmidt-icinga for confirming this). I was always building my local images in debug mode, so the side effect of SerializeToZeroCopyStream() was always executed, but @martialblog was using release builds, where the ASSERT() with the function call in it was optimized away, leading to very confusing behavior.

	AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};
	ASSERT(m_Request->SerializeToZeroCopyStream(&outputS));

C++ code after the C++ preprocessor has run (clang++ -I=... -E otel.cpp > otel.tmp.cpp):

 AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};

 ((void)0);

On the other hand, when using VERIFY nothing is optimized away:

 AsioProtobufOutStream outputS{*m_Stream, m_ConnInfo, yc};

 ((m_Request->SerializeToZeroCopyStream(&outputS)) ? void(0) : icinga_assert_fail("m_Request->SerializeToZeroCopyStream(&outputS)", "otel.cpp", 340));

I don't know what I was thinking when I used ASSERT this way 🤦🏻‍♂️!

#ifndef I2_DEBUG
# define ASSERT(expr) ((void)0)

I've fixed it now and it should behave normally.
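
For anyone who hasn't run into this before, here's a tiny self-contained sketch (plain assert() instead of Icinga 2's ASSERT/VERIFY macros) of why an expression with side effects must not live inside an assertion:

```cpp
// Build normally and the export runs twice; build with -DNDEBUG (a "release" build)
// and the first call silently disappears, which is exactly the bug described above.
#include <cassert>
#include <cstdio>

static bool DoExport() // stands in for SerializeToZeroCopyStream()
{
    std::puts("exporting metrics"); // the side effect we actually rely on
    return true;
}

int main()
{
    assert(DoExport()); // wrong: evaluated only in debug builds

    bool ok = DoExport(); // correct: always evaluate, then check the result
    assert(ok);
    (void)ok;
    return 0;
}
```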

@martialblog (Member):

Did some testing with OpenSearch Data-Prepper; I did manage to send data successfully to OpenSearch like this:

object OTelWriter "data-prepper" {
  host = "data-prepper"
  port = 21893
  metrics_endpoint = "/opentelemetry.proto.collector.metrics.v1.MetricsService/Export"
}

However, I did see some "critical" errors in the Icinga2 logs:

[2026-01-19 14:59:19 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'data-prepper:21893'.
[2026-01-19 14:59:19 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-19 15:00:19 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 1, rate: 0.283333/s (17/min 82/5min 164/15min);
[2026-01-19 15:00:49 +0000] information/ConfigObject: Dumping program state to file '/data/var/lib/icinga2/icinga2.state'
[2026-01-19 15:00:49 +0000] critical/OTelExporter: Error: Error: end of stream [beast.http:1 at /usr/include/boost/beast/http/impl/read.hpp:231 in function 'operator()']
[2026-01-19 15:00:49 +0000] information/OTelExporter: Connecting to OpenTelemetry collector on host 'data-prepper:21893'.
[2026-01-19 15:00:49 +0000] information/OTelExporter: Successfully connected to OpenTelemetry collector.
[2026-01-19 15:00:59 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 0, rate: 0.266667/s (16/min 80/5min 169/15min);
[2026-01-19 15:01:49 +0000] information/WorkQueue: #6 (OTelWriter, data-prepper) items: 1, rate: 0.25/s (15/min 79/5min 184/15min); empty in 9 seconds
[2026-01-19 15:01:49 +0000] critical/OTelExporter: Error: Error: end of stream [beast.http:1 at /usr/include/boost/beast/http/impl/read.hpp:231 in function 'operator()']
Compose with OpenSearch Data-Prepper
---
version: '3'
services:
  icinga:
    image: localhost/icinga/icinga2
    entrypoint: sleep infinity
    user: root

  data-prepper:
    image: docker.io/opensearchproject/data-prepper
    container_name: data-prepper
    volumes:
      - ./metric_pipeline.yaml:/usr/share/data-prepper/pipelines/metric_pipeline.yaml
      - ./data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml
    ports:
      - 2021:2021
      - 21891:21891
      - 21893:21893
      - 4900:4900

  opensearch:
    container_name: opensearch
    image: docker.io/opensearchproject/opensearch:3.4.0
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m"
      - "OPENSEARCH_INITIAL_ADMIN_PASSWORD=Developer@123"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    ports:
      - 9200:9200

  dashboards:
    image: docker.io/opensearchproject/opensearch-dashboards:3.1.0
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch:9200"]'
# cat metric_pipeline.yaml 
metric-pipeline:
  source:
    otlp:
      unframed_requests: true
      health_check_service: true
      authentication:
        unauthenticated:
      ssl: false
  sink:
    - stdout:
    - opensearch:
        hosts: [ "https://opensearch:9200" ]
        insecure: true
        username: admin
        password: Developer@123
        index: otel_metrics

# cat data-prepper-config.yaml 
ssl: false

@yhabteab (Member, Author):

However, I did see some "critical" errors in the Icinga2 logs:

I don't know how OpenSearch behaves and whether their OTLP receiver fully conforms to the OTLP specs, but that looks like OpenSearch is closing the connection after some time (no persistent HTTP connection support?). I'll try to go through their docs and see if I can find something about that.

@yhabteab (Member, Author) commented Jan 19, 2026

However, I did see some "critical" errors in the Icinga2 logs:

I don't know how OpenSearch behaves and whether their OTLP receiver fully conforms to the OTLP specs, but that looks like OpenSearch is closing the connection after some time (no persistent HTTP connection support?). I'll try to go through their docs and see if I can find something about that.

I can't find anything about persistent connections in the Data Prepper docs [1] so far, but the OTel spec [2] clearly says:

The client SHOULD keep the connection alive between requests.

However, OpenSearch doesn't seem to honor that, so it closes the connection after each request; I guess that's because the requirement is phrased as SHOULD and not MUST. Nonetheless, I will try to detect such cases and downgrade from a critical to some other log severity instead.

$ netstat -ant | grep 21893
tcp4       0      0  127.0.0.1.21893        127.0.0.1.51712        FIN_WAIT_2 # OpenSearch closed the connection and is waiting for the remote peer to close it.
tcp4       0      0  127.0.0.1.51712        127.0.0.1.21893        CLOSE_WAIT # OpenSearch has closed the connection, but Icinga 2 hasn't closed it yet.
tcp4       0      0  *.21893                *.*                    LISTEN

Footnotes

  1. https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/otel-metrics-source/#metrics

  2. https://opentelemetry.io/docs/specs/otlp/#otlphttp-connection

@martialblog (Member):

Yeah, that makes sense. If the client only SHOULD keep the connection alive, then a less severe log level is alright.

@yhabteab (Member, Author):

I've fixed the critical logs shown in #10685 (comment) by degrading that specific http::end_of_stream error to a debug log.
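
For reference, a minimal sketch (assumed, not the actual change) of how a peer-initiated close of the keep-alive connection can be told apart from real failures using Boost.Beast's error codes:

```cpp
#include <iostream>

#include <boost/asio/error.hpp>
#include <boost/beast/http.hpp>
#include <boost/system/error_code.hpp>

namespace http = boost::beast::http;

// An idle keep-alive connection closed by the collector is a normal event:
// log it at debug severity and let the exporter reconnect on the next export.
void LogExportError(const boost::system::error_code& ec)
{
    if (ec == http::error::end_of_stream) {
        std::cout << "debug/OTelExporter: Collector closed the connection: " << ec.message() << "\n";
        return;
    }

    std::cout << "critical/OTelExporter: Error: " << ec.message() << "\n";
}

int main()
{
    LogExportError(make_error_code(http::error::end_of_stream));           // -> debug
    LogExportError(make_error_code(boost::asio::error::connection_reset)); // -> critical
}
```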

@yhabteab added the area/opentelemetry label Jan 20, 2026
@yhabteab added this to the 2.16.0 milestone Jan 20, 2026
@yhabteab changed the title from "Add OTelWriter" to "Add OTLPMetricsWriter" Jan 21, 2026
@yhabteab (Member, Author):

The newly pushed commits include the following changes:

  • I've renamed the writer to OTLPMetricsWriter, as suggested by @martialblog in his first comment, to better reflect its purpose. It can now be enabled via the otlpmetrics feature.
  • Instead of using a randomly generated UUID for the service.instance.id attribute (which changes on every restart), I've switched to a SHA1 hash composed of the checkable name and service namespace. This ensures uniqueness while maintaining consistency across restarts, as per the OTel specifications (see the sketch after this list).
  • I've added the missing docs for the writer.
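
For illustration only, a minimal sketch (plain OpenSSL, a hypothetical helper, and an assumed input format rather than the PR's actual code) of how such a stable, restart-proof instance id can be derived:

```cpp
#include <cstdio>
#include <string>

#include <openssl/sha.h>

// The same checkable + namespace always yields the same hex-encoded SHA1,
// so the service.instance.id stays constant across daemon restarts.
std::string StableInstanceId(const std::string& checkable, const std::string& serviceNamespace)
{
    std::string input = serviceNamespace + "/" + checkable; // separator chosen for illustration

    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1(reinterpret_cast<const unsigned char*>(input.data()), input.size(), digest);

    std::string hex;
    char buf[3];
    for (unsigned char byte : digest) {
        std::snprintf(buf, sizeof(buf), "%02x", byte);
        hex += buf;
    }
    return hex;
}

int main()
{
    // e.g. a service checkable "something!something-service" in the "icinga" namespace
    std::printf("%s\n", StableInstanceId("something!something-service", "icinga").c_str());
}
```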

Comment on lines +226 to +238
/**
* A zero-copy output stream that writes directly to an Asio [TLS] stream.
*
* This class implements the @c google::protobuf::io::ZeroCopyOutputStream interface, allowing Protobuf
* serializers to write data directly to an Asio [TLS] stream without unnecessary copying of data. It
* doesn't buffer data internally, but instead writes it in chunks to the underlying stream using an HTTP
* request writer (@c HttpRequestWriter) in a Protobuf binary format. It is not safe to be reused across
* multiple export calls.
*
* @ingroup otel
*/
class AsioProtobufOutStream final : public google::protobuf::io::ZeroCopyOutputStream
{
Contributor:

I think this should be moved to its own non-Otel-specific header so it's easier for other classes to use it, now that we have the dependency on protobuf. I get that it's still behind a build switch, so we can't just put it in "remote/protobuf.hpp". Maybe we can put it in "lib/protobuf/protobuf.hpp" or something like that?

Comment on lines 146 to 156
/**
* HTTP request serializer with support for efficient streaming of the body.
*
* This class is similar to @c HttpResponse but is specifically designed for sending HTTP requests with
* potentially large bodies that are generated on-the-fly. Just as with HTTP responses, requests can use
* chunk encoding too if the server on the other end supports it.
*
* @ingroup remote
*/
class HttpRequestWriter : public boost::beast::http::request<SerializableBody<boost::beast::flat_buffer>>
{
Contributor:

I dislike the naming and asymmetry of this class, and that it essentially needs to copy code because the existing class is not general enough.

I've got an idea how to fix that and make templated (Incoming|Outgoing)HttpMessage classes that work for more general use-cases. I wanted to do this originally in #10516, but couldn't justify it because nothing else was using that code. But with this PR I think there's an argument for that here and it could help with my #10668 as well.

I'll make a refactor PR with no functional changes for master and link that here, then we have a consistent class hierarchy and this PR can get a little smaller. I'll add all the functions you additionally need here (Commit(), Prepare() and the Stream as a variant), too.

@yhabteab (Member, Author):

Sounds like a good idea! I initially wanted to generalize the existing class as well, but then decided against it as it would have made the PR even larger.

@yhabteab (Member, Author):

Done.

Comment on lines +5 to +17
if(NOT ICINGA2_OPENTELEMETRY_PROTOS_DIR STREQUAL "" AND NOT EXISTS "${ICINGA2_OPENTELEMETRY_PROTOS_DIR}")
message(FATAL_ERROR "The provided ICINGA2_OPENTELEMETRY_PROTOS_DIR '${ICINGA2_OPENTELEMETRY_PROTOS_DIR}' does not exist!")
elseif(ICINGA2_OPENTELEMETRY_PROTOS_DIR STREQUAL "")
message(STATUS "Fetching OpenTelemetry proto files...")
include(FetchContent)
FetchContent_Declare(
opentelemetry-proto
GIT_REPOSITORY https://github.com/open-telemetry/opentelemetry-proto.git
GIT_TAG v1.9.0
)
FetchContent_MakeAvailable(opentelemetry-proto)
set(ICINGA2_OPENTELEMETRY_PROTOS_DIR "${opentelemetry-proto_SOURCE_DIR}")
endif()
Contributor:

Just a side-note:

I actually didn't realize that we have FetchContent in our minimum CMake version, but I guess we do since it was raised recently; it's 3.11 now. This opens up the possibility of pulling in header-only libraries directly from the source, for example nlohmann_json (I might make a PR for that) or at some point maybe even doctest🥲.

This module is copied from CMake's official module repository[^1] and
contains only minor changes as outlined below.

```diff
--- a/third-party/cmake/protobuf/FindProtobuf.cmake
+++ b/third-party/cmake/protobuf/FindProtobuf.cmake
@@ -218,9 +218,6 @@ Example:
         GENERATE_EXTENSIONS .grpc.pb.h .grpc.pb.cc)
 #]=======================================================================]

-cmake_policy(PUSH)
-cmake_policy(SET CMP0159 NEW) # file(STRINGS) with REGEX updates CMAKE_MATCH_<n>
-
 function(protobuf_generate)
        set(_options APPEND_PATH DESCRIPTORS)
        set(_singleargs LANGUAGE OUT_VAR EXPORT_MACRO PROTOC_OUT_DIR PLUGIN PLUGIN_OPTIONS DEPENDENCIES)
@@ -503,7 +500,7 @@ if( Protobuf_USE_STATIC_LIBS )
        endif()
 endif()

-include(${CMAKE_CURRENT_LIST_DIR}/SelectLibraryConfigurations.cmake)
+include(SelectLibraryConfigurations)

 # Internal function: search for normal library as well as a debug one
 #    if the debug one is specified also include debug/optimized keywords
@@ -768,7 +765,7 @@ if(Protobuf_INCLUDE_DIR)
        endif()
 endif()

-include(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake)
+include(FindPackageHandleStandardArgs)
 FIND_PACKAGE_HANDLE_STANDARD_ARGS(Protobuf
        REQUIRED_VARS Protobuf_LIBRARIES Protobuf_INCLUDE_DIR
        VERSION_VAR Protobuf_VERSION
@@ -805,5 +802,3 @@ foreach(Camel
        string(TOUPPER ${Camel} UPPER)
        set(${UPPER} ${${Camel}})
 endforeach()
-
-cmake_policy(POP)
```

[^1]: https://github.com/Kitware/CMake/blob/v3.31.0/Modules/FindProtobuf.cmake
@yhabteab (Member, Author):

Since I had to rebase this, force push was unavoidable, so while force-pushing anyway, I've cleaned up the commits a bit.

@martialblog (Member):

@yhabteab As discussed, we probably need some attributes to identify multiple metrics for one check command.

The load check for example returns load1, load5, load15 and has different warn/crit thresholds for each:
https://icinga.com/docs/icinga-2/latest/doc/10-icinga-template-library/#load

@Al2Klimov (Member) left a comment:

Before merging this PR, make sure the GHA are green.

we will probably end up having to provide our own Protobuf packages for these Distros

No. Instead, disable OTLPMetricsWriter on the problematic distros.

If we can require PHP 8.2 for the whole Icinga Web 2 (Icinga/icingaweb2#5444) and *nix for any perfdata writer (#9704), we can surely require existing Protobuf packages for this shiny new feature.

Additional packages are just an additional burden. Even if they're feasible by themselves, not all of our repos are available to the GHA.

For the record:

Amazon Linux 2 (will be EOL soon, so not a big deal)

Standard Support: Ends in 5 months (30 Jun 2026)

https://endoflife.date/amazon-linux

Debian 11 (Bullseye) - will be EOL soon as well

EOL LTS: 2026-08-31

https://wiki.debian.org/DebianReleases

Ubuntu 22.04 LTS has the same issue

END OF STANDARD SUPPORT: June 2027

https://documentation.ubuntu.com/project/release-team/list-of-releases/

And finally, the big one: RHEL 8

Maintenance support ends: May 31, 2029

https://access.redhat.com/support/policy/updates/errata#Life_Cycle_Dates
