
Conversation

@jakeloo jakeloo commented Sep 3, 2025

Summary by CodeRabbit

  • New Features
    • Async inserts and compression are now enabled by default for ClickHouse across orchestrator, main, staging, and migrator components.
    • Introduced default async insert settings: a 3s busy timeout, a 100MB maximum data size, and a 100k cap on insert queries buffered before a flush.

coderabbitai bot commented Sep 3, 2025

Walkthrough

Default values for several ClickHouse-related CLI flags were switched to true, enabling async inserts and compression by default. When async inserts are enabled, additional ClickHouse settings (busy timeout, max data size, max query number) are applied during storage configuration.

Changes

Cohort / File(s) | Summary
CLI defaults: async insert & compression
cmd/root.go
Changed default values of multiple boolean flags to true: storage-orchestrator-clickhouse-asyncInsert, storage-orchestrator-clickhouse-enableCompression, storage-main-clickhouse-asyncInsert, storage-main-clickhouse-enableCompression, storage-staging-clickhouse-asyncInsert, storage-staging-clickhouse-enableCompression, migrator-destination-clickhouse-asyncInsert.
ClickHouse async insert tuning
internal/storage/clickhouse.go
When AsyncInsert is enabled, set ClickHouse settings: async_insert_busy_timeout_ms=3000, async_insert_max_data_size=104857600, async_insert_max_query_number=100000.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant CLI as CLI Flags
    participant Config as Config Loader
    participant Orchestrator as Storage Orchestrator
    participant CH as ClickHouse Client

    CLI->>Config: Load defaults (asyncInsert=true, compression=true)
    Config->>Orchestrator: Provide storage config
    alt AsyncInsert enabled
        Orchestrator->>CH: Apply async_insert_* settings<br/>(timeout=3000ms, max_data=100MB, max_queries=100k)
        Orchestrator->>CH: Insert (async)
        note right of CH: Server queues async inserts
    else AsyncInsert disabled
        Orchestrator->>CH: Insert (sync)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

@jakeloo jakeloo merged commit 63e4887 into main Sep 3, 2025
4 of 5 checks passed
@jakeloo jakeloo deleted the jl/optimize-clickhouse-async-insert branch September 3, 2025 16:31

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/storage/clickhouse.go (1)

130-133: Use canonical setting name, numeric types, and surface these values via config

  • Prefer the canonical key async_insert_busy_timeout_max_ms (for which async_insert_busy_timeout_ms is an alias) to match the docs and avoid confusion. (github.com, fossies.org)
  • Pass numeric values (ints) instead of strings for numeric settings; the driver accepts interface{} and this removes quoting ambiguity.
  • Optional: log the enabled async-insert parameters once to aid runtime diagnostics.
  • Optional: consider exposing these as cfg fields (with CLI flags) so they can be tuned per env without code changes.

Apply:

-        settings["async_insert_busy_timeout_ms"] = "3000"    // 3 seconds
-        settings["async_insert_max_data_size"] = "104857600" // 100MB
-        settings["async_insert_max_query_number"] = "100000" // 100k
+        // Canonical name; numeric values
+        settings["async_insert_busy_timeout_max_ms"] = 3000      // 3s
+        settings["async_insert_max_data_size"] = 100 << 20       // 100 MiB
+        settings["async_insert_max_query_number"] = 100_000
+        zLog.Debug().
+          Int("async_insert_busy_timeout_ms", 3000).
+          Int("async_insert_max_data_size", 100<<20).
+          Int("async_insert_max_query_number", 100_000).
+          Msg("ClickHouse async insert enabled")

Note: Keeping wait_for_async_insert = 1 is aligned with ClickHouse guidance for production reliability. (clickhouse.com)

cmd/root.go (1)

83-83: Behavior change: asyncInsert and compression default to true — plan rollout and safeguards

Great defaults for throughput, but this flips behavior for users who relied on the previous defaults.

Actions:

  • Announce in release notes and config docs; include opt-out examples (flags/env).
  • Validate ClickHouse capacity (CPU for LZ4, memory for async buffers) before broad enablement.
  • Consider adding flags for async-insert tunables (busy timeout/data size/query number) to complement the new defaults.
  • Add a feature flag or env (e.g., TW_DISABLE_CLICKHOUSE_ASYNC_DEFAULTS) if you need a quick rollback path; a possible shape is sketched right after this list.

Checklist to run before merge:

  • Staging soak with these defaults on realistic write rates; watch system.asynchronous_insert_log, insert latencies, parts/sec, merges, and memory.
  • Confirm no unexpected changes to ingest error semantics due to wait_for_async_insert=1.

Also applies to: 89-89, 106-106, 112-112, 117-117, 123-123, 201-201

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2914e3f and baba962.

📒 Files selected for processing (2)
  • cmd/root.go (3 hunks)
  • internal/storage/clickhouse.go (1 hunks)

Comment on lines +130 to 133
settings["async_insert_busy_timeout_ms"] = "3000" // 3 seconds
settings["async_insert_max_data_size"] = "104857600" // 100MB
settings["async_insert_max_query_number"] = "100000" // 100k
}

💡 Verification agent

🧩 Analysis chain

Double-check interaction with Native batch inserts

This connector predominantly uses PrepareBatch (Native protocol). Please verify your ClickHouse version and driver behavior to ensure async_insert settings apply as intended to your insert path; if not, these settings won’t influence batch.Send() semantics. If needed, document the expected behavior or gate settings by mode.

You can confirm server-side support and effective values with:

-- Run on the target cluster as the same user this app uses
SELECT name, value, changed, origin
FROM system.settings
WHERE name IN (
  'async_insert','wait_for_async_insert',
  'async_insert_busy_timeout_max_ms','async_insert_max_data_size','async_insert_max_query_number'
);

Also check that the ClickHouse version you run documents these settings.


🌐 Web query:

Verify whether ClickHouse async_insert settings (busy_timeout_max_ms, max_data_size, max_query_number) apply to Native protocol inserts, and whether busy_timeout_ms is aliased to busy_timeout_max_ms.

💡 Result:

Short answers:

  • The async_insert settings (async_insert_busy_timeout_max_ms, async_insert_max_data_size, async_insert_max_query_number) control ClickHouse's asynchronous-insert buffering/flush behavior — they apply to async INSERT processing, not to every insert mode. [1][2]

  • Native/binary-protocol inserts do not universally use the async_insert path: many ClickHouse clients/SDKs only support async_insert when sending SQL/text INSERTs (the native Insert API in some clients bypasses async buffering). See the clickhouse-cpp notes that async inserts work for SQL/text but the native Block Insert path does not use it. [2][4]

  • async_insert_busy_timeout_ms is an alias for async_insert_busy_timeout_max_ms. [3]

Sources:

  1. ClickHouse settings docs (async_insert, async_insert_max_data_size, async_insert_max_query_number, busy_timeout descriptions). [clickhouse.com docs].
  2. clickhouse-cpp GitHub README/issues explaining async_insert requires SQL/text inserts and native Insert() (binary Block) doesn’t use async insert. [github.com/ClickHouse/clickhouse-cpp].
  3. ClickHouse source / settings history showing async_insert_busy_timeout_ms aliased to async_insert_busy_timeout_max_ms. [fossies / ClickHouse source].

Document or gate async_insert settings for native protocol

Current async_insert settings (async_insert_busy_timeout_ms/async_insert_busy_timeout_max_ms, async_insert_max_data_size, async_insert_max_query_number) apply only to ClickHouse’s async INSERT (SQL/text) path and are ignored by the native binary protocol used by PrepareBatch/batch.Send(). Remove or conditionally apply these settings for the native batch insert path, or clearly document that they have no effect in native mode.

🤖 Prompt for AI Agents
In internal/storage/clickhouse.go around lines 130 to 133, the async_insert_*
settings are being applied unconditionally but only affect ClickHouse’s SQL/text
async INSERT path and are ignored by the native binary protocol used by
PrepareBatch/batch.Send(); either stop applying these settings when using the
native client or add a clear comment documenting they have no effect in native
mode. Change the logic to detect the client/protocol (native vs SQL/text) and
only set
async_insert_busy_timeout_ms/async_insert_max_data_size/async_insert_max_query_number
for the SQL/text path, otherwise omit them; if detection is not practical, add
an inline comment and documentation near this block stating these settings do
not apply to native batch inserts so maintainers won’t expect them to affect
PrepareBatch/batch.Send().
