[deprecation] signoz-schema-migrator from signoz chart

## Background

The signoz-schema-migrator (exposed in the chart as `schemaMigrator`) was a Job responsible for running ClickHouse schema migrations before the rest of the SigNoz stack could start. Otel-collector used an init container that waited for this Job to exist and complete before starting.

There have been several instances of users facing install/upgrade failures, confusion, and workarounds tied to this design.

## Current Issues

### User impact and reported problems

- **[#363](https://github.com/SigNoz/charts/issues/363)** – *Error on run otel-collector-migrate-init: jobs.batch "signoz-schema-migrator" not found*
  - Otel-collector (and otel-collector-metrics) pods fail on init with "signoz-schema-migrator-init" or "signoz-schema-migrator" Job not found.
  - Root cause: Otel-collector depends on the migrator Job existing and completing before it can start. That dependency should not exist.
  - Affects plain Helm, Terraform Helm provider, and Pulumi. Workarounds reported: `wait = false` (Terraform), `WaitForJobs: false` (Pulumi), or manually creating a placeholder Job (`signoz-schema-migrator`) so the init containers can pass.

- **[#538](https://github.com/SigNoz/charts/issues/538)** – *Replace k8s wait job with CH based decision*
  - Hooks have been painful to reason about and operate. The proposal is to replace the “wait for a Kubernetes Job” pattern with a status marker in the ClickHouse schema (e.g. a migrator table/flag), so that readiness is based on store state instead of Job existence.
  - Community feedback: Hooks were flaky (job not created when expected, or pod placed after job was gone); Zookeeper/EBS remounts can cause the sync job to fail 5 times before the cluster is ready.
  - The migrate-sync / CH-based check also lets you verify that all migrations have completed successfully.
  - When a mutation is stuck in ClickHouse, `CREATE TABLE IF NOT EXISTS` (or the sync that depends on it) can block and stall the entire migration process.

- **[#505](https://github.com/SigNoz/charts/issues/505)**, **[#747](https://github.com/SigNoz/charts/issues/747)** – *BusyBox-based init containers*
  - The init containers used a BusyBox image to check ClickHouse readiness. This image was difficult to keep patched and had limited networking support. Removing the init containers eliminates the need to maintain a separate image.
 
### Architectural and operational issues

- **Ordering and lifecycle**: Init containers that block on “Job exists and completes” run as part of Deployments that are applied in the same release, while the migrator Job is created in a separate phase. That makes success dependent on install/upgrade ordering and tooling (e.g. `wait`/`WaitForJobs`) in ways that are easy to misconfigure and hard to debug.
- **Multiple Jobs and naming**: The chart had both `signoz-schema-migrator-sync` and `signoz-schema-migrator-async` Jobs. Users saw “job not found” for either name depending on install vs upgrade, adding confusion and brittle workarounds (e.g. creating both Jobs manually).
- **Operational fragility**: In environments with slow storage (e.g. EBS remounts, Zookeeper restarts), the migration Job can hit timeouts or retries and fail, leaving the release in a bad state. Coupling startup to a short-lived Job makes the system more sensitive to cluster conditions.

## Proposed solution

### Replace schema-migrator with telemetryStoreMigrator and CH-based readiness

- **Consolidate**: Deprecate `schemaMigrator` in favor of **`telemetryStoreMigrator`**. A single Job (e.g. `signoz-telemetrystore-migrator`) now runs the migration steps (ready, bootstrap, sync, async) using the built-in `migrate` command in signoz-otel-collector, instead of a separate schema-migrator component.
- **Bootstrap command**: The migrator first checks whether the `schema_migration` table exists and issues `CREATE TABLE` only when it does not. Because `CREATE TABLE IF NOT EXISTS` is no longer run against a table that already exists, a mutation stuck in ClickHouse cannot block it and stall the sync process.
- **Decouple startup from Job existence**: Otel-collector (and related components) no longer wait for the schema-migrator Job to exist. Instead, they use a **ClickHouse-based check** (e.g. `migrate sync check`) in an init container. Readiness is determined by the state of the telemetry store (e.g. migration/sync status in ClickHouse), not by the presence or completion of a Kubernetes Job.
- **Clearer lifecycle**: The telemetryStoreMigrator Job can still use Helm/Argo CD hooks (e.g. `pre-upgrade`, Sync hooks) where needed, but the rest of the stack does not depend on that Job’s creation order for startup. This removes the chicken-and-egg failure seen in [#363](https://github.com/SigNoz/charts/issues/363) and avoids the need for manual placeholder Jobs.
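
To make the bootstrap and readiness ideas above concrete, here is a minimal sketch in Python. It is not the migrator's actual implementation: `FakeStore`, `bootstrap`, `wait_until_synced`, and the `pending_migrations` counter are hypothetical names, and an in-memory object stands in for ClickHouse so the logic can be run without a cluster. It only illustrates the two decisions described: issue DDL only when the bookkeeping table is missing, and decide readiness from store state rather than from a Kubernetes Job.

```python
import time
from typing import List, Set


class FakeStore:
    """In-memory stand-in for ClickHouse (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.tables: Set[str] = set()
        self.ddl_log: List[str] = []   # every DDL statement that was issued
        self.pending_migrations = 0    # assumed marker for unfinished migrations

    def table_exists(self, name: str) -> bool:
        return name in self.tables

    def create_table(self, name: str) -> None:
        self.ddl_log.append(f"CREATE TABLE {name}")
        self.tables.add(name)


def bootstrap(store: FakeStore, table: str = "schema_migration") -> bool:
    """Create the bookkeeping table only when it is missing.

    Returns True if DDL was issued, False if the table already existed,
    in which case no CREATE statement is sent at all.
    """
    if store.table_exists(table):
        return False
    store.create_table(table)
    return True


def wait_until_synced(store: FakeStore, attempts: int = 5, delay_s: float = 0.0) -> bool:
    """Poll store state; ready means the table exists and nothing is pending."""
    for _ in range(attempts):
        if store.table_exists("schema_migration") and store.pending_migrations == 0:
            return True
        time.sleep(delay_s)
    return False
```

Calling `bootstrap` twice on the same store issues DDL only once, and `wait_until_synced` never looks at whether any Job exists, which is the property that removes the ordering dependency.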

### Migration for users

- **Configuration**: Any overrides under `schemaMigrator.*` should be moved to `telemetryStoreMigrator.*`. The chart and NOTES/README document this (see [upgrade guide for 0.113.0](https://signoz.io/docs/operate/migration/upgrade-0.113)).
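
As a rough illustration of the rename, the sketch below moves a top-level `schemaMigrator` block to `telemetryStoreMigrator` in a values dictionary that has already been loaded from YAML. The helper name `migrate_values` and the `annotations` sub-key in the usage line are hypothetical, and the sketch assumes sub-keys carry over unchanged, which may not hold for every option; the upgrade guide remains the authority on which keys map across.

```python
import copy
from typing import Any, Dict


def migrate_values(values: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `values` with `schemaMigrator` moved to `telemetryStoreMigrator`.

    The merge is shallow: if both blocks define the same top-level key,
    the one already under `telemetryStoreMigrator` wins.
    """
    out = copy.deepcopy(values)  # leave the caller's dictionary untouched
    if "schemaMigrator" not in out:
        return out
    legacy = out.pop("schemaMigrator") or {}
    current = out.get("telemetryStoreMigrator") or {}
    out["telemetryStoreMigrator"] = {**legacy, **current}
    return out
```

For example, `migrate_values({"schemaMigrator": {"annotations": {"a": "b"}}})` yields a dictionary whose only key is `telemetryStoreMigrator`, holding the same `annotations` mapping.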

