Open
Labels
breaking change (Changes that break backward compatibility), chart:signoz (Issue related to signoz helm chart)
Description
Background
The signoz-schema-migrator (exposed in the chart as schemaMigrator) was a Job responsible for running ClickHouse schema migrations before the rest of the SigNoz stack could start. Otel-collector used an init container that waited for this Job to exist and complete before starting.
Users have repeatedly hit install/upgrade failures, confusion, and workarounds tied to this design.
Current Issues
User impact and reported problems
- #363 – Error on run otel-collector-migrate-init: jobs.batch "signoz-schema-migrator" not found
  - Otel-collector (and otel-collector-metrics) pods fail on init with "signoz-schema-migrator-init" or "signoz-schema-migrator" Job not found.
  - Root cause: Otel-collector depends on the migrator Job existing and completing before it can start. That dependency should not exist.
  - Affects plain Helm, Terraform Helm provider, and Pulumi. Workarounds reported: wait = false (Terraform), WaitForJobs: false (Pulumi), or manually creating a placeholder Job (signoz-schema-migrator) so the init containers can pass.
- #538 – Replace k8s wait job with CH based decision
  - Hooks have been painful to reason about and operate. The proposal replaces the “wait for a Kubernetes Job” pattern with a status marker in the ClickHouse schema (e.g. a migrator table/flag), so that readiness is based on store state instead of Job existence.
  - Community feedback: Hooks were flaky (the Job was not created when expected, or the pod was scheduled after the Job was gone); Zookeeper/EBS remounts can cause the sync Job to fail 5 times before the cluster is ready.
  - The migrate-sync / CH-based check also lets you verify that all migrations have completed successfully.
  - When a mutation is stuck in ClickHouse, CREATE TABLE IF NOT EXISTS (or the sync that depends on it) can block and stall the entire migration process.
- #505, #747 – BusyBox-based init containers
  - The init containers used a BusyBox image to check ClickHouse readiness. This image was difficult to keep patched and had limited networking support. Removing the init containers eliminates the need to maintain a separate image.
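The store-state decision described in #538 can be sketched as a small polling loop. This is a minimal illustration, not the chart's or the collector's implementation: `wait_for_store_ready` and the `is_migrated` probe are hypothetical names, and the probe stands in for whatever query reports migration status in ClickHouse.

```python
import time
from typing import Callable


def wait_for_store_ready(
    is_migrated: Callable[[], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the telemetry store until migrations are reported complete.

    `is_migrated` is a hypothetical probe (for example, a query against a
    migration status table in ClickHouse). Connection errors are treated as
    "not ready yet" so a restarting store does not abort the wait.
    """
    for attempt in range(1, attempts + 1):
        try:
            if is_migrated():
                return True
        except ConnectionError:
            pass  # store unreachable: retry instead of failing outright
        if attempt < attempts:
            sleep(delay_seconds)
    return False


if __name__ == "__main__":
    # Simulated store that reports completed migrations on the third probe.
    answers = iter([False, False, True])
    print(wait_for_store_ready(lambda: next(answers), sleep=lambda _: None))
```

The point of the sketch is that the caller never asks whether a Kubernetes Job exists; it only asks the store whether the schema is in the expected state.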
Architectural and operational issues
- Ordering and lifecycle: Init containers that block on “Job exists and completes” run as part of Deployments that are applied in the same release, while the migrator Job is created in a separate phase. That makes success dependent on install/upgrade ordering and tooling (e.g. wait/WaitForJobs) in ways that are easy to misconfigure and hard to debug.
- Multiple Jobs and naming: The chart had both signoz-schema-migrator-sync and signoz-schema-migrator-async Jobs. Users saw “job not found” for either name depending on install vs upgrade, adding confusion and brittle workarounds (e.g. creating both Jobs manually).
- Operational fragility: In environments with slow storage (e.g. EBS remounts, Zookeeper restarts), the migration Job can hit timeouts or retries and fail, leaving the release in a bad state. Coupling startup to a short-lived Job makes the system more sensitive to cluster conditions.
Proposed solution
Replace schema-migrator with telemetryStoreMigrator and CH-based readiness
- Consolidate: Deprecate schemaMigrator in favor of telemetryStoreMigrator. A single Job (e.g. signoz-telemetrystore-migrator) now runs the migration steps (ready, bootstrap, sync, async) using the built-in migrate command in signoz-otel-collector, instead of a separate schema-migrator component.
- Bootstrap command: The migrator checks whether the schema_migration table exists and runs CREATE TABLE only if it does not. This avoids issuing CREATE TABLE IF NOT EXISTS, which can block when a mutation is stuck in ClickHouse, so the sync process does not stall.
- Decouple startup from Job existence: Otel-collector (and related components) no longer wait for the schema-migrator Job to exist. Instead, they use a ClickHouse-based check (e.g. migrate sync check) in an init container. Readiness is determined by the state of the telemetry store (e.g. migration/sync status in ClickHouse), not by the presence or completion of a Kubernetes Job.
- Clearer lifecycle: The telemetryStoreMigrator Job can still use Helm/Argo CD hooks (e.g. pre-upgrade, Sync hooks) where needed, but the rest of the stack does not depend on that Job’s creation order for startup. This removes the chicken-and-egg failure seen in #363 and avoids the need for manual placeholder Jobs.
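The bootstrap step above (check first, create only when missing) can be sketched as follows. This is an illustration under stated assumptions rather than the migrator's actual code: the database name `example_db`, the column list, and the `execute` callable are hypothetical, and an in-memory fake stands in for ClickHouse so the example runs on its own. `EXISTS TABLE` is a ClickHouse statement that returns a single 0/1 value.

```python
from typing import Callable, List, Tuple

# Hypothetical names: only the table name schema_migration comes from the issue.
TABLE = "example_db.schema_migration"
CREATE_SQL = (
    f"CREATE TABLE {TABLE} "
    "(migration_id UInt64, status String) "
    "ENGINE = MergeTree ORDER BY migration_id"
)


def bootstrap(execute: Callable[[str], List[Tuple]]) -> bool:
    """Create the migration table only when it is missing.

    Returns True if CREATE TABLE was issued, False if the table already existed.
    """
    rows = execute(f"EXISTS TABLE {TABLE}")
    exists = bool(rows) and rows[0][0] == 1
    if exists:
        return False  # skip DDL entirely, so nothing can block on it
    execute(CREATE_SQL)
    return True


class FakeStore:
    """In-memory stand-in for a ClickHouse client, for demonstration only."""

    def __init__(self) -> None:
        self.tables = set()
        self.statements = []

    def execute(self, sql: str) -> List[Tuple]:
        self.statements.append(sql)
        if sql.startswith("EXISTS TABLE "):
            name = sql[len("EXISTS TABLE "):]
            return [(1 if name in self.tables else 0,)]
        if sql.startswith("CREATE TABLE "):
            self.tables.add(sql.split()[2])
            return []
        raise ValueError(f"unsupported statement: {sql}")


if __name__ == "__main__":
    store = FakeStore()
    print(bootstrap(store.execute))  # first run issues CREATE TABLE
    print(bootstrap(store.execute))  # second run finds the table and skips DDL
```

Running bootstrap twice issues the CREATE statement once, which is the property that keeps repeated installs and upgrades from re-running DDL against a busy store.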
Migration for users
- Configuration: Any overrides under schemaMigrator.* should be moved to telemetryStoreMigrator.*. The chart and NOTES/README document this (see upgrade guide for 0.113.0).
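For users with many override files, the key move can be scripted. The sketch below works on an already-parsed values mapping and assumes that sub-keys carry over unchanged, which should be confirmed against the upgrade guide; `migrate_values` and the sample sub-keys (`enabled`, `timeout`) are hypothetical.

```python
import copy
from typing import Any, Dict

OLD_KEY = "schemaMigrator"
NEW_KEY = "telemetryStoreMigrator"


def migrate_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `values` with schemaMigrator overrides moved.

    Keys already set under telemetryStoreMigrator take precedence over the
    moved ones. The merge is shallow and the input is left unmodified.
    """
    result = copy.deepcopy(values)
    old = result.pop(OLD_KEY, None)
    if old is None:
        return result
    merged = dict(old)
    merged.update(result.get(NEW_KEY, {}))
    result[NEW_KEY] = merged
    return result


if __name__ == "__main__":
    # Hypothetical overrides; real sub-keys depend on the chart version.
    overrides = {
        "schemaMigrator": {"enabled": True, "timeout": "10m"},
        "telemetryStoreMigrator": {"timeout": "20m"},
    }
    print(migrate_values(overrides))
```

Letting existing telemetryStoreMigrator keys win is a design choice for this sketch: it treats anything already written under the new key as the more recent intent.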