You may find OATs interesting for your use case. TL;DR: it runs your application as a Docker container with the OTel collector address overridden to point at a self-contained Grafana instance, then queries that instance for logs/traces/metrics/profiles to assert that they arrived. Assuming your application is containerised, it might provide a relatively simple way to verify that your application is producing the correct/expected telemetry signals. You can find examples of its use in the Grafana .NET OpenTelemetry distro, and one of my own applications.
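If it helps, a rough sketch of what an OATs test definition looks like is below. This is from memory rather than the docs, so treat the field names as approximate and check the repos linked above for working examples:

```yaml
# Approximate oats.yaml sketch; the exact schema may differ, and the
# service name, queries, and log line are placeholders.
docker-compose:
  files:
    - ./docker-compose.yaml   # starts your containerised app
expected:
  traces:
    - traceql: '{ resource.service.name = "my-service" }'
      spans:
        - name: 'GET /'
  metrics:
    - promql: 'http_server_request_duration_seconds_count{}'
      value: '> 0'
  logs:
    - logql: '{ service_name = "my-service" }'
      contains:
        - 'Application started'
```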
---
We have a simple OTel collector setup today, with a single collector instance per environment (one for beta, one for staging, and one for production). These "hub" collectors receive telemetry from all of our applications, using the OTLP exporter on each app and the OTLP receiver in the collector, and then push it to our Datadog instance using the `datadogexporter` and `datadogconnector` combo.

We've recently enabled internal logs and metrics from the collector itself, which lets us monitor the collector instances closely. This is super nice, of course, but we still have a gap that is hard to monitor properly: what if some service is not propagating telemetry to the collector, for whatever reason?
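For reference, the hub setup described above corresponds roughly to a collector config like the following. This is a minimal sketch, not our actual config; the API key handling and telemetry levels are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

connectors:
  # Computes APM stats from traces and feeds them back in as metrics.
  datadog/connector:

exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}   # placeholder secret injection

service:
  telemetry:
    logs:
      level: info      # the collector's own internal logs
    metrics:
      level: detailed  # the collector's own internal metrics
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog/connector, datadog]
    metrics:
      receivers: [otlp, datadog/connector]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      exporters: [datadog]
```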
We can monitor for the presence of telemetry over time and alert if some service X hasn't sent any logs, metrics, or spans, but this feels like a very fragile and finicky kind of test. We would like something more reliable and more precise/deterministic.

Additionally, even if I enabled internal OTel logs in the application, I would still be unable to push them out, since our only path to Datadog is the OTLP exporter itself; it would basically be a catch-22. Similarly, outputting to a path other than the OTLP exporter (such as the console) would not help either, as nothing is scraping that console output today.
I was thinking about creating a health check implementation (using `Microsoft.Extensions.Diagnostics.HealthChecks`) that would try to send a log, a span, or a metric and see whether any exceptions occur. This is nice because the health check is a separate communication path, over HTTP, so I can report that logs/metrics/traces are not working over there without relying on OTel itself for the reporting. I can then have Datadog poll our health check endpoints to actively monitor for issues, and alert if any problems are detected.

But before I go and create something like that from scratch, I wanted to check with the community whether there is something else I could be leveraging for this purpose.
How are you monitoring your applications today, and making sure telemetry is actually flowing properly?
I assume that a more elaborate collector setup, where each application gets its own sidecar collector that then pushes to a centralized collector, could help with this: we could use those sidecar collectors to also scrape console logs, and monitor the sidecars to see whether they are receiving traffic (a sketch of such a sidecar is below). But even then, the application could somehow fail to send telemetry to its sidecar collector, and we would be back to the same issue again.
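For illustration, a hypothetical sidecar config along those lines; the endpoints and log path are placeholders:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # Scrape the app's console/file output as a second signal path.
  filelog:
    include: [/var/log/app/*.log]   # placeholder path

exporters:
  otlp:
    endpoint: central-collector:4317   # placeholder gateway address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      exporters: [otlp]
    logs:
      receivers: [otlp, filelog]
      exporters: [otlp]
```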
Any thoughts/ideas will be appreciated here.