ref(migrations): Use ON CLUSTER for DDL statements #7668
Conversation
Refactor migration DDL operations to use ClickHouse's ON CLUSTER syntax instead of executing the same SQL on each node individually. This is more efficient and atomic for multi-node clusters.

Changes:
- Add `_get_on_cluster_clause()` helper to SqlOperation that returns the ON CLUSTER clause for multi-node clusters
- Add `_get_execution_node()` helper to get a single node for execution (ON CLUSTER handles distribution to the other nodes)
- Modify `execute()` to use single-node execution instead of per-node iteration
- Add `alter_sync=2` to MIGRATE settings so ClickHouse blocks until all replicas confirm completion (removes the need for mutation polling)
- Update all DDL operations (CreateTable, DropTable, AddColumn, etc.) to include the ON CLUSTER clause in their format_sql() methods
- Keep InsertIntoSelect on per-node execution (it's DML, not DDL)
- Remove _block_on_mutations() polling since alter_sync=2 handles this

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
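A minimal sketch of the idea, assuming a hypothetical `Cluster` stand-in with `is_single_node()` and a `cluster_name` attribute (the real Snuba cluster abstraction and method names may differ):

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical stand-in for the ClickHouse cluster abstraction."""

    cluster_name: str
    single_node: bool

    def is_single_node(self) -> bool:
        return self.single_node


def get_on_cluster_clause(cluster: Cluster) -> str:
    # On a single node there is nothing to coordinate, so no clause is emitted;
    # multi-node clusters get " ON CLUSTER <name>" spliced into the DDL.
    return "" if cluster.is_single_node() else f" ON CLUSTER {cluster.cluster_name}"


def format_add_column(cluster: Cluster, table: str, column_ddl: str) -> str:
    # With ON CLUSTER, ClickHouse distributes the DDL to every node itself,
    # so the migration only needs to execute the statement on one node.
    return f"ALTER TABLE {table}{get_on_cluster_clause(cluster)} ADD COLUMN {column_ddl}"


print(format_add_column(Cluster("snuba_cluster", single_node=False), "errors_local", "trace_id UUID"))
# -> ALTER TABLE errors_local ON CLUSTER snuba_cluster ADD COLUMN trace_id UUID
```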
Fix the ON CLUSTER clause to use the appropriate cluster name based on target:
- LOCAL target: uses cluster_name (for storage nodes)
- DISTRIBUTED target: uses distributed_cluster_name (for query nodes)

This fixes the test_distributed_migrations failures where distributed tables (like migrations_dist) weren't being created because we were using the wrong cluster name.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
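As a rough illustration of the target-based selection (the enum mirrors the LOCAL/DISTRIBUTED targets above; the function name is made up for this sketch):

```python
from enum import Enum


class OperationTarget(Enum):
    LOCAL = "local"
    DISTRIBUTED = "distributed"


def cluster_name_for_target(
    target: OperationTarget, cluster_name: str, distributed_cluster_name: str
) -> str:
    # LOCAL DDL (e.g. *_local tables) is coordinated across the storage cluster,
    # while DISTRIBUTED DDL (e.g. *_dist tables) targets the query-node cluster.
    if target == OperationTarget.DISTRIBUTED:
        return distributed_cluster_name
    return cluster_name
```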
Increase the MIGRATE client settings timeout from 10 seconds to 5 minutes (300000ms) to allow ON CLUSTER DDL operations to complete across all replicas. This is needed because alter_sync=2 blocks until all replicas confirm completion, which can take longer than the previous 10 second timeout on larger clusters. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
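Conceptually, the MIGRATE settings end up looking something like the sketch below (`alter_sync` is a real ClickHouse setting; the constant names here are placeholders, not the actual Snuba code):

```python
# alter_sync=2 makes ALTER/DDL block until every replica has applied the change,
# which is why the client-side timeout also has to grow from 10s to 5 minutes.
MIGRATE_CLIENT_SETTINGS = {"alter_sync": 2}
MIGRATE_TIMEOUT_MS = 300_000  # 5 minutes, up from the previous 10_000 ms
```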
RunSql runs arbitrary SQL statements that may not support ON CLUSTER syntax (e.g., queries, DML, or statements that already contain ON CLUSTER). Override execute() to use per-node execution similar to InsertIntoSelect. This fixes the distributed_migrations test failures where RunSql was only executing on one node but subsequent AddIndex operations using ON CLUSTER expected the column to exist on all nodes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add _execute_per_node() method to SqlOperation base class and use it in both RunSql and InsertIntoSelect instead of duplicating the same logic. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
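A sketch of what such a shared helper could look like, with generic node/connection types standing in for Snuba's real ones:

```python
from typing import Any, Callable, Iterable


def execute_per_node(
    nodes: Iterable[Any],
    get_connection: Callable[[Any], Any],
    statement: str,
) -> None:
    # Run the same statement once per node. Used by operations that cannot rely
    # on ON CLUSTER: InsertIntoSelect (DML) and RunSql statements without it.
    for node in nodes:
        get_connection(node).execute(statement)
```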
The previous approach of patching ClickhouseCluster class methods didn't work reliably because get_cluster() returns cached cluster instances. Instead, patch get_cluster directly to return a mock cluster with the desired configuration. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update all tests in test_operations.py to mock get_cluster so they work consistently in both single-node and multi-node test environments. This ensures tests produce deterministic results regardless of cluster config. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
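A sketch of the mocking approach, assuming the lookup path `snuba.migrations.operations.get_cluster` and the helper name used for illustration; the real test code may differ:

```python
from unittest import mock


def _make_mock_cluster(single_node: bool, cluster_name: str = "test_cluster") -> mock.Mock:
    cluster = mock.Mock()
    cluster.is_single_node.return_value = single_node
    cluster.get_clickhouse_cluster_name.return_value = cluster_name
    return cluster


def test_add_column_uses_on_cluster() -> None:
    # Patch get_cluster where the operations module looks it up, so the cached
    # cluster instances returned by the real get_cluster() are never consulted.
    with mock.patch(
        "snuba.migrations.operations.get_cluster",  # assumed lookup path
        return_value=_make_mock_cluster(single_node=False),
    ):
        ...  # build the operation and assert "ON CLUSTER test_cluster" appears in its SQL
```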
snuba/migrations/operations.py
Outdated
def execute(self) -> None:
    self._execute_per_node()
If the statement contains ON CLUSTER syntax but you execute it per node, will that cause a problem?
The query will fail.
The class docstring is incorrect then, and this seems strange to me because it implies that RunSql migrations can only be run on a single-node basis. If that's the case, we should either validate at instantiation that the operation does not contain ON CLUSTER, or make running the query per node an optional behavior rather than the default.
Good point.
- Remove SQL truncation in migration logs (was truncating to 32 chars)
- Add test verifying ON CLUSTER DDL fails on single-node without Zookeeper

The test documents that ON CLUSTER operations require Zookeeper/Keeper for distributed DDL coordination, validating that our migration code correctly uses is_single_node() to avoid ON CLUSTER on single nodes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
RunSql now checks whether the SQL statement contains ON CLUSTER:
- If present: uses single-node execution (the parent's execute())
- If absent: uses per-node execution (_execute_per_node())

This allows RunSql to properly handle both cases:
- SQL with an explicit ON CLUSTER that should only run once
- SQL without ON CLUSTER that needs to run on each node

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
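The detection itself boils down to something like this sketch (the two callables are placeholders for the parent's single-node execute() and the per-node helper):

```python
from typing import Callable


def run_sql(
    statement: str,
    execute_single_node: Callable[[str], None],
    execute_per_node: Callable[[str], None],
) -> None:
    # SQL that already carries ON CLUSTER must run exactly once and let
    # ClickHouse fan it out; everything else is executed on each node.
    if "ON CLUSTER" in statement.upper():
        execute_single_node(statement)
    else:
        execute_per_node(statement)
```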
volokluev
left a comment
Once the tests pass, feel free to merge
The test assumed Zookeeper wasn't configured, which isn't true in CI. The ON CLUSTER behavior is already covered by unit tests that mock the cluster configuration. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Refactor migration DDL operations to use ClickHouse's ON CLUSTER syntax instead of executing the same SQL on each node individually. This is more efficient and atomic for multi-node clusters since ClickHouse handles distributing the DDL to all nodes.

Changes
Core refactoring:
- Add `_get_on_cluster_clause()` helper that returns the appropriate ON CLUSTER clause based on target type:
  - `LOCAL` target → uses `cluster_name` (for storage nodes)
  - `DISTRIBUTED` target → uses `distributed_cluster_name` (for query nodes)
- Add `_get_execution_node()` helper to get a single node for execution (ON CLUSTER handles distribution)
- Modify `execute()` to use single-node execution with ON CLUSTER syntax
- Add `_execute_per_node()` helper for operations that don't support ON CLUSTER

DDL operations updated to include ON CLUSTER:
- `CreateTable`, `CreateMaterializedView`, `RenameTable`, `DropTable`, `TruncateTable`
- `AddColumn`, `DropColumn`, `ModifyColumn`
- `ModifyTableTTL`, `RemoveTableTTL`
- `ModifyTableSettings`, `ResetTableSettings`
- `AddIndex`, `AddIndices`, `DropIndex`, `DropIndices`

Operations with smart execution:
- `RunSql` - detects whether the SQL contains `ON CLUSTER` (case-insensitive) and picks single-node or per-node execution accordingly
- `InsertIntoSelect` - DML operation, always uses per-node execution

Settings changes:
- Add `alter_sync=2` to MIGRATE settings to ensure ClickHouse blocks until all replicas confirm completion

Logging improvements:
- Remove SQL truncation in migration logs (was truncating to 32 chars)
Test improvements:
- Tests in `test_operations.py` now mock `get_cluster` for deterministic behavior
- Added `_make_single_node_mock_cluster()` and `_make_mock_cluster()` helpers
- Added tests for `RunSql` ON CLUSTER detection behavior

Test plan
- `pytest tests/migrations/test_operations.py -v`
- `pytest tests/migrations/ -v`
- `mypy`

🤖 Generated with Claude Code