Description
Problem
Docker compose validator deployments have no CPU/memory limits, no restart policy, and a broken OTLP metrics export configuration:
- No resource limits: containers consume all available resources. On the OVH test validator (8-core/30GB), ScyllaDB alone used 195% CPU and 14GB RAM, causing load averages of 14+ and degraded performance for all services.
- No restart policy: when shards are OOM-killed or crash, they stay down until manually restarted, leaving the validator partially unavailable.
- OTLP metrics export broken: the Alloy config sets an explicit `Content-Type: application/x-protobuf` header and `compression = "gzip"` in the Prometheus OTLP exporter, causing HTTP 400 errors from the remote endpoint. The `otelcol.exporter.otlphttp` component sets these automatically; the explicit values conflict.
Solution
- Add configurable CPU/memory limits via `LIMIT_CPUS_*`/`LIMIT_MEM_*` env vars with sane defaults for an 8-core/32GB machine
- Add `restart: unless-stopped` to all long-running services
- Remove the explicit Content-Type header and gzip compression from `alloy-config.river`
- Increase shard memory default to 2560M (1536M caused OOM kills under normal cross-chain message load)
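
For illustration, the limits and restart policy described above might look roughly like the following Compose fragment. This is a sketch, not the actual change: the service names, images, the specific variable suffixes (`LIMIT_CPUS_SCYLLA`, `LIMIT_MEM_SHARD`, and so on), and every default except the 2560M shard memory limit are assumptions.

```yaml
# Hypothetical docker-compose fragment. Service names, images, variable
# suffixes, and all defaults except the 2560M shard limit are assumptions.
services:
  scylla:
    image: scylladb/scylla            # assumed image
    restart: unless-stopped           # restart after a crash or OOM kill
    deploy:
      resources:
        limits:
          cpus: "${LIMIT_CPUS_SCYLLA:-2.0}"    # assumed default
          memory: "${LIMIT_MEM_SCYLLA:-8G}"    # assumed default

  shard:
    image: example/validator-shard    # placeholder image name
    restart: unless-stopped
    deploy:
      resources:
        limits:
          cpus: "${LIMIT_CPUS_SHARD:-1.0}"     # assumed default
          memory: "${LIMIT_MEM_SHARD:-2560M}"  # default named in this description
```

With the `${VAR:-default}` form, a limit can be overridden per host without editing the file, for example `LIMIT_MEM_SHARD=3G docker compose up -d`.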