Conversation

@jmcarp jmcarp (Contributor) commented Oct 9, 2025

On start, ClickHouse compares the local state of each distributed table to its distributed state. If it finds a discrepancy, it starts the table in read-only mode. When this happens, oximeter can't write new records to the relevant table(s). In the past, we've worked around this by manually instructing ClickHouse using the `force_restore_data` sentinel file, but this requires manual detection and intervention each time a table starts up in read-only mode. This patch sets the `replicated_max_ratio_of_wrong_parts` flag to 1.0 so that ClickHouse always accepts local state, and never starts tables in read-only mode.

As described in ClickHouse/ClickHouse#66527, this appears to be a bug, or at least an ergonomic flaw, in ClickHouse. One replica of a table can routinely fall behind the others, e.g. due to restart or network partition, and shouldn't require manual intervention to start back up.

Part of #8595.

Note: I'm not sure how best to test this. It sounds like we have reasonably high confidence that the fix will work, so we could just merge and deploy to dogfood, and revert if necessary. Or is there a clickhouse cluster running on another rack that we can test on?
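
For context, this is roughly what the change amounts to in the server config. A minimal sketch, assuming the setting lives in the server-wide `<merge_tree>` section alongside the other MergeTree settings; the surrounding structure here is illustrative, not the actual omicron config file:

```xml
<clickhouse>
    <merge_tree>
        <!-- Tolerate any ratio of locally "wrong" parts on startup, so the
             replica starts writable and catches up, instead of entering
             read-only mode when local and shared state disagree. -->
        <replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
    </merge_tree>
</clickhouse>
```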

@jmcarp jmcarp requested a review from bnaecker October 9, 2025 15:24
@jmcarp jmcarp force-pushed the jmcarp/clickhouse-readonly-workaround branch from d84e49b to 11b96f2 October 9, 2025 15:29
```xml
<!-- Disable sparse column serialization, which we expect to not need -->
<ratio_of_defaults_for_sparse_serialization>1.0</ratio_of_defaults_for_sparse_serialization>
<!-- Prevent ClickHouse from setting distributed tables to read-only. -->
<replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
```
Collaborator commented on this diff:

Doc nit, this is setting the local replica to read-only. I.e., it avoids updating the local state with the shared state in the Raft cluster.
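
A wording along the lines of this nit might be the following; this phrasing is an illustrative suggestion, not the text that landed in the PR:

```xml
<!-- Prevent ClickHouse from starting the local replica in read-only mode
     when its local parts disagree with the shared state in the Raft cluster. -->
<replicated_max_ratio_of_wrong_parts>1.0</replicated_max_ratio_of_wrong_parts>
```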

@bnaecker bnaecker (Collaborator) left a comment:

Looks good. As far as testing, I think we should figure out how to check if this is being used on the Dogfood rack. That is, can we see in the logs that it would have set the table to read-only mode, but did not because of this setting?

Dogfood is the only place we run the replicated cluster.
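
One way to check for this condition directly, rather than via the logs, might be to query each replica. A sketch, assuming `clickhouse-client` access to the Dogfood replicas:

```sql
-- List any replicated tables that the local server has placed in
-- read-only mode, along with the last Keeper/ZooKeeper exception,
-- using the built-in system.replicas table.
SELECT database, table, is_readonly, zookeeper_exception
FROM system.replicas
WHERE is_readonly = 1;
```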

@bnaecker bnaecker (Collaborator) commented Oct 9, 2025

> so that ClickHouse always accepts local state, and never starts tables in read-only mode.

Just a nit here, it's so that ClickHouse always accepts the shared state, not the local state.
