Description
Bug description
We use docker compose to run signoz. I just rebooted our signoz server (after applying security updates), and when it came back up, the signoz-otel-collector service loaded before signoz-clickhouse. This meant that the collector, on init, failed to connect to clickhouse. Despite this, the collector service remained in a running/ready state in docker, but when our hosts sent metrics to it, they got:
Feb 08 19:57:05 otelcol-contrib[612]: 2026-02-08T19:57:05.093Z info internal/retry_sender.go:133 Exporting failed. Will retry the request after interval. {"resource": {"service.instance.id": "bb80bc8d-875b-47a6-8b3d-002c24f6024a", "service.name": "otelcol-contrib", "service.version": "0.140.1"}, "otelcol.component.id": "otlp", "otelcol.component.kind": "exporter", "otelcol.signal": "logs", "error": "rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp REDACTED:4317: connect: connection refused"", "interval": "36.837733592s"}
Looking at https://github.com/SigNoz/signoz/blob/main/deploy/docker/docker-compose.yaml, the collector service has a dependency on "signoz", which in turn depends on "clickhouse", so in theory it should be waiting on clickhouse. However, it seems the clickhouse container is reporting healthy before it actually is, because the collector startup had this in the logs:
{"level":"info","ts":"2026-02-08T19:54:17.622Z","caller":"service@v0.128.0/service.go:282","msg":"Everything is ready. Begin running and processing data.","resource":{"service.instance.id":"77ac44d3-3144-4621-ae48-097fed6561c0","service.name":"/signoz-otel-collector","service.version":"dev"}}
{"level":"info","timestamp":"2026-02-08T19:54:17.816Z","caller":"signozcol/collector.go:120","msg":"Collector service is running"}
{"level":"error","timestamp":"2026-02-08T19:54:17.816Z","caller":"opamp/server_client.go:281","msg":"failed to apply config","component":"opamp-server-client","error":"failed to reload config: /var/tmp/collector-config.yaml: collector failed to restart: failed to build pipelines: failed to create "clickhouselogsexporter" exporter for data type "logs": cannot configure clickhouse logs exporter: dial tcp 172.18.0.4:9000: connect: connection refused","stacktrace":"github.com/SigNoz/signoz-otel-collector/opamp.(*serverClient).onRemoteConfigHandler\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/opamp/server_client.go:281\ngithub.com/SigNoz/signoz-otel-collector/opamp.(*serverClient).onMessageFuncHandler\n\t/home/runner/work/signoz-otel-collector/signoz-otel-collector/opamp/server_client.go:265\ngithub.com/open-telemetry/opamp-go/client/internal.(*receivedProcessor).ProcessReceivedMessage\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.19.0/client/internal/receivedprocessor.go:160\ngithub.com/open-telemetry/opamp-go/client/internal.(*wsReceiver).ReceiverLoop\n\t/home/runner/go/pkg/mod/github.com/open-telemetry/opamp-go@v0.19.0/client/internal/wsreceiver.go:94"}
Note that it reports "Collector service is running", but the very next thing it tries to do is connect to clickhouse and fails with connection refused, indicating clickhouse was not actually ready to receive connections.
For now, I will run docker restart signoz-otel-collector some time after the clickhouse container reports ready, and that seems to allow the collector to boot up properly.
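The manual workaround above could be scripted roughly as follows. This is only a sketch under assumptions: it treats "clickhouse is ready" as "a TCP connection to its native port succeeds", which is weaker than a real query check, and the host and port you pass in depend on how the port is reachable from wherever the script runs. The function names and the `run` parameter are hypothetical, not part of SigNoz.

```python
import socket
import subprocess
import time


def wait_for_tcp(host, port, timeout_s=120.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves the port is accepting connections.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # connection refused, unreachable, or timed out: try again
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def restart_collector_when_ready(host, port, container="signoz-otel-collector",
                                 timeout_s=120.0, interval_s=2.0,
                                 run=subprocess.run):
    """Wait for clickhouse to accept connections, then restart the collector.

    `run` is injectable (hypothetical parameter) so the docker call can be
    replaced when testing. Returns False if clickhouse never became reachable.
    """
    if not wait_for_tcp(host, port, timeout_s, interval_s):
        return False
    run(["docker", "restart", container], check=True)
    return True
```

For example, `restart_collector_when_ready("127.0.0.1", 9000)` would apply if the clickhouse native port were published on the host, which is an assumption about the deployment and not something the stock compose file is known to do.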
Expected behavior
- signoz-otel-collector should wait for signoz-clickhouse to fully initialize (this would require changing the clickhouse health check requirements).
- The collector should not report healthy to docker unless it can connect to clickhouse. If it did this, docker would auto-restart it and things would have come back up automatically; as it is, the collector says healthy even though the clickhouse connection fails.
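To illustrate the second point, a health probe along these lines would report unhealthy whenever clickhouse refuses connections. It is a minimal sketch, not how the collector currently works: the function name is hypothetical, a TCP connect is only an approximation of "can connect to clickhouse", and the collector image may not ship a Python interpreter at all. The return values follow Docker's healthcheck convention, where exit code 0 means healthy and 1 means unhealthy.

```python
import socket


def clickhouse_probe_exit_code(host, port, timeout_s=3.0):
    """Return 0 if host:port accepts a TCP connection, else 1.

    Intended to be used as a process exit code by a healthcheck command.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return 0
    except OSError:
        # Covers "connection refused" (as in the log above) and timeouts.
        return 1
```

A wrapper script could end with `sys.exit(clickhouse_probe_exit_code("clickhouse", 9000))`, where the hostname is an assumption based on the compose service name and 9000 is the port shown in the error above.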
Version information
- Signoz version: 0.110.1