[deprecation] signoz-schema-migrator from signoz chart #828

@Nageshbansal

Background

The signoz-schema-migrator (exposed in the chart as schemaMigrator) was a Job responsible for running ClickHouse schema migrations before the rest of the SigNoz stack could start. Otel-collector used an init container that waited for this Job to exist and complete before starting.

There have been several instances of users facing install/upgrade failures, confusion, and workarounds tied to this design.

Current Issues

User impact and reported problems

  • #363: Error on run otel-collector-migrate-init: jobs.batch "signoz-schema-migrator" not found

    • Otel-collector (and otel-collector-metrics) pods fail on init with "signoz-schema-migrator-init" or "signoz-schema-migrator" Job not found.
    • Root cause: Otel-collector depends on the migrator Job existing and completing before it can start. That dependency should not exist.
    • Affects plain Helm, Terraform Helm provider, and Pulumi. Workarounds reported: wait = false (Terraform), WaitForJobs: false (Pulumi), or manually creating a placeholder Job (signoz-schema-migrator) so the init containers can pass.
  • #538: Replace k8s wait job with CH based decision

    • Hooks have been painful to reason about and operate. The issue proposes replacing the “wait for a Kubernetes Job” pattern with a status marker in the ClickHouse schema (e.g. a migrator table/flag), so that readiness is based on store state instead of Job existence.
    • Community feedback: Hooks were flaky (job not created when expected, or pod placed after job was gone); Zookeeper/EBS remounts can cause the sync job to fail 5 times before the cluster is ready.
    • The migrate-sync / CH-based check also lets you verify that all migrations have completed successfully.
    • When a mutation is stuck in ClickHouse, CREATE TABLE IF NOT EXISTS (or the sync that depends on it) can block and stall the entire migration process.
  • #505, #747: BusyBox-based init containers

    • The init containers used a BusyBox image to check ClickHouse readiness. This image was difficult to keep patched and had limited networking support. Removing the BusyBox-based init containers eliminates the need to maintain a separate image.

Architectural and operational issues

  • Ordering and lifecycle: Init containers that block on “Job exists and completes” run as part of Deployments that are applied in the same release, while the migrator Job is created in a separate phase. That makes success dependent on install/upgrade ordering and tooling (e.g. wait/WaitForJobs) in ways that are easy to misconfigure and hard to debug.
  • Multiple Jobs and naming: The chart had both signoz-schema-migrator-sync and signoz-schema-migrator-async Jobs. Users saw “job not found” for either name depending on install vs upgrade, adding confusion and brittle workarounds (e.g. creating both Jobs manually).
  • Operational fragility: In environments with slow storage or coordination (e.g. EBS remounts, Zookeeper restarts), the migration Job can time out or exhaust its retries and fail, leaving the release in a bad state. Coupling startup to a short-lived Job makes the system more sensitive to cluster conditions.

Proposed solution

Replace schema-migrator with telemetryStoreMigrator and CH-based readiness

  • Consolidate: Deprecate schemaMigrator in favor of telemetryStoreMigrator. A single Job (e.g. signoz-telemetrystore-migrator) now runs the migration steps (ready, bootstrap, sync, async) using the built-in migrate command in signoz-otel-collector, instead of a separate schema-migrator component.
  • Bootstrap command: The migrator first checks whether the schema_migration table exists and runs CREATE TABLE only if it does not. When the table is already present, no CREATE TABLE IF NOT EXISTS statement is issued, so a mutation stuck in ClickHouse cannot block it and the sync process does not get stuck.
  • Decouple startup from Job existence: Otel-collector (and related components) no longer wait for the schema-migrator Job to exist. Instead, they use a ClickHouse-based check (e.g. migrate sync check) in an init container. Readiness is determined by the state of the telemetry store (e.g. migration/sync status in ClickHouse), not by the presence or completion of a Kubernetes Job.
  • Clearer lifecycle: The telemetryStoreMigrator Job can still use Helm/Argo CD hooks (e.g. pre-upgrade, Sync hooks) where needed, but the rest of the stack does not depend on that Job’s creation order for startup. This removes the chicken-and-egg failure seen in #363 and avoids the need for manual placeholder Jobs.
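
The bootstrap and readiness ideas above can be sketched in a few lines. This is only an illustration under stated assumptions, not the migrator's actual code: FakeStore is a hypothetical in-memory stand-in for ClickHouse, and bootstrap, is_ready, and the integer migration IDs are names invented for the example. Only the schema_migration table name comes from the issue text.

```python
from dataclasses import dataclass, field

SCHEMA_TABLE = "schema_migration"  # table name as given in the issue


@dataclass
class FakeStore:
    """Hypothetical in-memory stand-in for the telemetry store (ClickHouse)."""
    tables: set = field(default_factory=set)
    applied: set = field(default_factory=set)    # migration IDs recorded as done
    ddl_log: list = field(default_factory=list)  # DDL statements actually issued

    def table_exists(self, name: str) -> bool:
        return name in self.tables

    def create_table(self, name: str) -> None:
        self.ddl_log.append(f"CREATE TABLE {name}")
        self.tables.add(name)


def bootstrap(store: FakeStore) -> bool:
    """Check first, create only if missing. Returns True if DDL was issued."""
    if store.table_exists(SCHEMA_TABLE):
        return False  # no DDL sent, so nothing can queue behind a stuck mutation
    store.create_table(SCHEMA_TABLE)
    return True


def is_ready(store: FakeStore, required: set) -> bool:
    """Readiness from store state: tracking table exists and all required
    migrations are recorded. No Kubernetes Job lookup is involved."""
    return store.table_exists(SCHEMA_TABLE) and required <= store.applied


if __name__ == "__main__":
    store = FakeStore()
    print(is_ready(store, {1, 2}))  # store is empty, so not ready
    bootstrap(store)                # issues one CREATE TABLE
    bootstrap(store)                # second call is a no-op
    store.applied.update({1, 2})    # pretend sync migrations completed
    print(is_ready(store, {1, 2}))
```

The point of the sketch is that a component gated on is_ready behaves the same whether the migrator Job ran seconds or days ago, or has already been garbage-collected, because the decision reads only what is recorded in the store.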

Migration for users

  • Configuration: Any overrides under schemaMigrator.* should be moved to telemetryStoreMigrator.*. The chart and NOTES/README document this (see upgrade guide for 0.113.0).
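
As a rough aid for the rename, the helper below moves overrides from schemaMigrator to telemetryStoreMigrator in a values dictionary that has already been loaded (for example from a values file). It is a sketch under assumptions: migrate_values is a hypothetical name, sub-keys are copied verbatim, and whether every sub-key keeps the same name under the new block should be checked against the upgrade guide.

```python
import copy

OLD_KEY = "schemaMigrator"
NEW_KEY = "telemetryStoreMigrator"


def migrate_values(values: dict) -> dict:
    """Return a copy of `values` with overrides moved from OLD_KEY to NEW_KEY.

    Sub-keys are copied as-is (assumption: names are unchanged). If a sub-key
    is already set under NEW_KEY, the new-style value is kept.
    """
    out = copy.deepcopy(values)  # never mutate the caller's dictionary
    if OLD_KEY not in out:
        return out
    old = out.pop(OLD_KEY) or {}  # tolerate an empty/null block
    new = out.setdefault(NEW_KEY, {})
    for key, value in old.items():
        new.setdefault(key, value)
    return out


if __name__ == "__main__":
    # "enabled" is a placeholder sub-key used only for illustration.
    print(migrate_values({"schemaMigrator": {"enabled": True}}))
```

Keeping existing telemetryStoreMigrator values on conflict is a design choice for the sketch; a stricter variant could raise an error instead so conflicts are reviewed by hand.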

Metadata

Labels: breaking change (Changes that break backward compatibility), chart:signoz (Issue related to signoz helm chart)
