You may find OATs interesting for your use case. TL;DR: it runs your application as a Docker container with the OTel collector address overridden to point at a self-contained Grafana instance, then queries that instance for logs/traces/metrics/profiles to assert that they arrived. Assuming your application is containerised, it might provide a relatively simple way to verify that your application is producing the correct/expected telemetry signals. You can find examples of its use in the Grafana .NET OpenTelemetry distro, and one of my own applications.
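If it helps, a rough sketch of what an OATs test definition looks like is below. This is from memory rather than the docs, so treat the field names as approximate and check the repos linked above for working examples:

```yaml
# Approximate oats.yaml sketch; the exact schema may differ, and the
# service name, queries, and log line are placeholders.
docker-compose:
  files:
    - ./docker-compose.yaml   # starts your containerised app
expected:
  traces:
    - traceql: '{ resource.service.name = "my-service" }'
      spans:
        - name: 'GET /'
  metrics:
    - promql: 'http_server_request_duration_seconds_count{}'
      value: '> 0'
  logs:
    - logql: '{ service_name = "my-service" }'
      contains:
        - 'Application started'
```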
---
We have a simple OTel collector setup today, with a single collector instance per environment (one for beta, one for staging, and one for production). These "hub" collectors receive telemetry from all of our applications, using the OTLP exporter on each app and the OTLP receiver in the collector, and then push it to our Datadog instance using the `datadogexporter` and `datadogconnector` combo.

We've recently enabled internal logs and metrics from the collector itself, which lets us monitor the collector instances closely. This is super nice, of course, but we still have a gap that is hard to monitor properly: what if some service is not propagating telemetry to the collector, for whatever reason?
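For reference, the hub setup described above corresponds roughly to a collector config like the following. This is a minimal sketch, not our actual config; the API key handling and telemetry levels are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

connectors:
  # Computes APM stats from traces and feeds them back in as metrics.
  datadog/connector:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}   # placeholder secret injection

service:
  telemetry:
    logs:
      level: info      # the collector's own internal logs
    metrics:
      level: detailed  # the collector's own internal metrics
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog/connector, datadog]
    metrics:
      receivers: [otlp, datadog/connector]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      exporters: [datadog]
```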
We can monitor for the presence of telemetry over time and alert if some service X hasn't sent any logs, metrics, or spans, but this feels like a very fragile and finicky kind of test. We would like something more reliable and more precise/deterministic.

Additionally, even if I enabled internal OTel logs in the application, I would still be unable to push them out, since our only path to Datadog is the OTLP exporter itself; it would basically be a catch-22. Similarly, outputting to a path other than the OTLP exporter (such as the console) would not help either, as nothing is scraping that console output today.
I was thinking about creating a health check implementation (using `Microsoft.Extensions.Diagnostics.HealthChecks`) that would try to send a log, a span, or a metric and see whether any exceptions occur. This is nice because the health check is a separate communication path, over HTTP, so I can report that logs/metrics/traces are not working over there without relying on OTel itself for the reporting. I can then have Datadog poll our health check endpoints to actively monitor for issues, and alert if any problems are detected.

But before I go and create something like that from scratch, I wanted to check with the community whether there is something else I could be leveraging for this purpose.
How are you monitoring your applications today, and making sure telemetry is actually flowing properly?
I assume that a more elaborate collector setup, where each application gets its own sidecar collector that then pushes to a centralized collector, could help with this: we could use those sidecar collectors to also scrape console logs, and monitor the sidecars to see whether they are receiving traffic (a sketch of such a sidecar is below). But even then, the application could somehow fail to send telemetry to its sidecar collector, and we would be back to the same issue again.
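For illustration, a hypothetical sidecar config along those lines; the endpoints and log path are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # Scrape the app's console/file output as a second signal path.
  filelog:
    include: [/var/log/app/*.log]   # placeholder path

exporters:
  otlp:
    endpoint: central-collector:4317   # placeholder gateway address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp, filelog]
      exporters: [otlp]
```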
Any thoughts/ideas will be appreciated here.