Add s3_tier storage policy with renamed disks #589
Conversation
PR Review posted via separate message - see next comment for full review
PR Review: Add s3_tier storage policy with renamed disks

Overall this is a clean, well-structured implementation. The architecture is solid -- constant -> config -> event -> manifest builder -> K8s env var -> XML -- and the disk renames (default -> local, s3_cache -> s3) improve self-documentation. A few things worth addressing.
Bug / Correctness

Missing input validation for s3TierMoveFactor (ClickHouseInit.kt). The valid range for ClickHouse move_factor is 0.0 to 1.0. Nothing prevents a user from passing a value outside this range (e.g. 1.5 or a negative value), which would produce a broken ClickHouse config at runtime. Recommend adding a guard in execute().
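The suggested guard code did not survive in this comment; below is a minimal sketch of what such a check could look like. The function and parameter names are hypothetical, not the actual ClickHouseInit API.

```kotlin
// Hypothetical sketch of the suggested guard; not the project's actual code.
// ClickHouse requires move_factor to lie in the closed interval [0.0, 1.0].
fun validateMoveFactor(s3TierMoveFactor: Double): Double {
    require(s3TierMoveFactor in 0.0..1.0) {
        "--s3-tier-move-factor must be in [0.0, 1.0], got $s3TierMoveFactor"
    }
    return s3TierMoveFactor
}

fun main() {
    println(validateMoveFactor(0.2))                            // 0.2
    println(runCatching { validateMoveFactor(1.5) }.isFailure)  // true
}
```

Boundary values 0.0 and 1.0 are deliberately accepted here, matching the closed interval ClickHouse documents for move_factor.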
Clarity

Option description is slightly backwards (ClickHouseInit.kt line ~43). Current: "Fraction of local disk free space that triggers data move to S3 tier". This reads as though the amount of free space triggers the move, but ClickHouse moves data when free space falls below this fraction. At 0.2, data moves when less than 20% of local disk is free (80% full). Suggested wording: "Move data to S3 tier when local disk free space falls below this fraction (0.0-1.0, default: DEFAULT-VALUE)"
Minor

No table cleanup in E2E test (bin/end-to-end-test). test_s3_tier is created but never dropped. Not a functional problem, but adding a DROP TABLE IF EXISTS test_s3_tier at the end keeps the cluster state tidy between step reruns.

Everything else looks correct: Fabric8 for K8s manifests, typed events, constants in Constants.ClickHouse, meaningful tests (not mock-echo), OpenSpec docs all present.
@claude assess the failing tests and propose fixes |
Claude finished @rustyrazorblade's task in 2m 2s

Failing Tests Fixed
Root cause: Fix (commit
Force-pushed from e0ff13f to 4fd5060
PR Review: add s3_tier storage policy with renamed disks

The overall structure is solid and follows established patterns well. The data flow (constant to config data class to event to manifest builder to ConfigMap key to env var to XML from_env attribute) is consistent with how existing ClickHouse parameters like s3CacheSize and s3CacheOnWrite are handled. Fabric8 is used correctly, constants live in Constants.ClickHouse, and tests exercise real logic. A few issues to address before merging:

BUG: No input validation on --s3-tier-move-factor. ClickHouse requires move_factor to be in [0.0, 1.0]. Nothing currently prevents passing -1.0 or 2.0, which would produce a broken runtime config with no CLI-level error. Add a guard in ClickHouseInit.execute(); a unit test covering the invalid range case should accompany it.

UX: Option description is semantically backwards. The current wording implies that more free space triggers a move, but ClickHouse move_factor semantics are the opposite: 0.2 means move data when less than 20 percent of local disk is free (80% full). Suggested replacement: "Move data to S3 tier when local disk free space falls below this fraction (0.0-1.0)".

DOCS: docs/user-guide/clickhouse.md not updated. The storage policies section documents only local and s3_main. The new s3_tier policy and --s3-tier-move-factor option should be added to the policies comparison table and the clickhouse init options table. Per project convention, user-facing feature changes require doc updates.

MINOR: E2E test table not cleaned up. test_s3_tier is created in step_clickhouse_s3_tier_test but never dropped. Adding DROP TABLE IF EXISTS test_s3_tier at the end keeps reruns clean and avoids leftover state between runs.

TESTS: Coverage is good but missing a validation test. The added tests are meaningful:
What is missing: a test verifying that an invalid move_factor (e.g. 1.5) is rejected at the CLI level. This pairs with the validation fix above. The disk rename (s3_cache to s3) and the new tiered policy are architecturally sound. The validation gap and the docs update are the main blockers.
Force-pushed from 4fd5060 to dcd1ca2
Review Summary

A previous review pass has already identified the main issues. Confirming and prioritizing:

Must fix before merge:
Should fix:
Minor:
The Kotlin implementation is correct - the constant to config to event to ConfigMap to env var to XML from_env chain follows existing patterns, Fabric8 is used throughout, and the two new manifest builder tests exercise real logic. |
Add a new s3_tier storage policy that provides automatic data movement from local disks to S3 based on disk space availability. This combines local performance for hot data with S3's unlimited capacity for cold data.

Key features:
- Starts with local NVMe storage for best write performance
- Automatically moves oldest data to S3 when disk free space falls below configured threshold (default: 20%)
- Configurable via --s3-tier-move-factor (range: 0.0-1.0)
- Data on S3 remains queryable with cache-assisted reads

Implementation:
- Add s3_tier policy with local and s3 disk volumes to storage.xml
- Add --s3-tier-move-factor option to clickhouse init command
- Include input validation to reject values outside [0.0, 1.0]
- Update ClickHouse config generation to use renamed disk identifiers (local_nvme, s3) for consistency across all policies
- Add comprehensive unit tests for boundary validation
- Add E2E test verifying data movement to S3

Documentation:
- Update docs/user-guide/clickhouse.md with s3_tier policy guide
- Add policy comparison table including all three policies
- Document when to use tiered storage and how it works
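As an illustration of the commit above, the s3_tier policy section of the generated storage XML might look roughly like this. This is a sketch only; the volume names and surrounding file layout are assumptions, not the actual generated file.

```xml
<storage_configuration>
  <policies>
    <s3_tier>
      <volumes>
        <!-- Hot tier: local NVMe, filled first -->
        <hot>
          <disk>local</disk>
        </hot>
        <!-- Cold tier: S3-backed disk -->
        <cold>
          <disk>s3</disk>
        </cold>
      </volumes>
      <!-- Move parts to the next volume when free space on the current
           volume falls below this fraction (the --s3-tier-move-factor value) -->
      <move_factor>0.2</move_factor>
    </s3_tier>
  </policies>
</storage_configuration>
```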
Force-pushed from 11479f3 to 3360b74
PR Review: Add s3_tier storage policy with renamed disks

Good implementation overall -- clean layering, proper use of the event bus, constants in the right place, and solid test coverage for the validation logic. A few items worth discussing:

Potential Breaking Change: Disk Rename

The rename is a breaking change for existing clusters; it is worth noting in the PR description so users know to recreate rather than restart.

End-to-End Test: Cleanup on Failure

In step_clickhouse_s3_tier_test, cleanup only runs when the step succeeds, so a failure leaves the test table behind.

Borderline Mock-Echo Test

The test asserting that the default s3TierMoveFactor is saved borders on the banned mock-echo pattern per CLAUDE.md.
Minor:
Nit: The docs table now includes the new policy.

The core feature (config.xml policy, env var injection, ConfigMap wiring, Kotlin config class, validation) looks correct and follows existing patterns well. The design doc is a useful addition.
Code Review: The data flow from CLI option to state to ConfigMap to env var to config.xml is clean. A few things worth addressing:

1. Disk renames are a breaking change for existing clusters; worth noting in the PR so users know to recreate rather than restart.
2. Test cleanup in step_clickhouse_s3_tier_test only runs on success; consider a trap.
3. Validation via require() throws a stack trace; prefer eventBus.emit() per project conventions.
4. The test asserting the default s3TierMoveFactor is saved is a banned mock-echo pattern per CLAUDE.md.
5. Minor: the default parameter value on s3TierMoveFactor in buildAllResources/buildClusterConfigMap lets callers silently omit it.
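The trap suggestion in point (2) can be sketched as follows. The step and table names come from this review; the clickhouse-client call is shown only as a comment so the sketch stays self-contained and runnable.

```shell
#!/usr/bin/env bash
# Sketch: trap-based cleanup so the DROP runs even when the step fails.
step_clickhouse_s3_tier_test() {
  # The real script would use something like:
  #   trap 'clickhouse-client --query "DROP TABLE IF EXISTS test_s3_tier"' RETURN
  trap 'echo "cleanup ran"' RETURN
  false  # simulate a failing assertion in the step
}
step_clickhouse_s3_tier_test || echo "step failed"
```

The RETURN trap fires when the function returns regardless of its exit status, so the cleanup runs on both the pass and fail paths.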
@claude do a thorough review of the branch. Clickhouse pod crashing on startup during end to end test. Look for configuration related errors. Can the same disk be configured as a cold tier and the main disk in different policies? Also look at the logs and double check the mount paths |
Claude finished @rustyrazorblade's task in 4m 49s

Thorough Review: ClickHouse Startup Crash Investigation
Root Cause of the Startup Crash
In the generated storage configuration, the renamed disk is defined as:

```xml
<local>
    <type>local</type>
    <path>/mnt/db1/clickhouse/</path>
    <keep_free_space_bytes>1073741824</keep_free_space_bytes>
</local>
```

ClickHouse always creates an implicit default disk from the server-level <path> setting, which points at the same /mnt/db1/clickhouse/ directory. Every local disk must have a unique path, so defining a second disk on that path makes the server crash on startup.

Why the old code worked: the original disk entry was named default, so it merged with the implicit default disk instead of conflicting with it.

Fix

Revert the disk definition to use the default name. Replace:

```xml
<local>
    <type>local</type>
    <path>/mnt/db1/clickhouse/</path>
    <keep_free_space_bytes>1073741824</keep_free_space_bytes>
</local>
```

with:

```xml
<default>
    <keep_free_space_bytes>1073741824</keep_free_space_bytes>
</default>
```

Then update the two policy volume references from <disk>local</disk> to <disk>default</disk>.

Can the Same Disk Be in Multiple Policies?

Yes — this is valid ClickHouse behavior. The same disk can serve as the main disk in one policy and the cold tier in another; policies reference disks, they do not own them.

Mount Path Audit — No Issues Found

All paths in config.xml are under the expected mount points.
Mount paths are correct.

Secondary Issues

E2E test cleanup is now present but only runs on success; a trap would make it run on failure as well.

Summary: the crash is caused by the renamed local disk defining the same path as the implicit default disk; renaming the entry back to default resolves it.
@claude do the fix in this branch |
Claude finished @rustyrazorblade's task in 1m 31s

Fix: ClickHouse startup crash (disk path conflict)
Root cause: the renamed local disk defined the path /mnt/db1/clickhouse/, which conflicts with the implicit default disk created from the server-level <path> setting.

Fix: rename the disk entry back to default (dropping the redundant <type> and <path>, which are implicit for the default disk) and update the two <disk>local</disk> policy references to <disk>default</disk>.

The policy names (local, s3_tier) are unchanged.
…flict
ClickHouse enforces unique paths for all local disks. The renamed 'local'
disk was defining path '/mnt/db1/clickhouse/' which conflicts with the
implicit 'default' disk (created from the server-level <path> setting).
This caused ClickHouse to crash on startup with a path conflict error.
Fix: rename the disk entry back to 'default' (removing the redundant
<type> and <path> which are implicit for the default disk) and update
the two <disk>local</disk> policy references to <disk>default</disk>.
The policy names ('local', 's3_tier') are unchanged.
Co-authored-by: Jon Haddad <rustyrazorblade@users.noreply.github.com>
Adds five new skills to enhance development workflow and knowledge access.

Execution Skills:
- debug-environment: Active cluster debugging with auto-diagnosis
  - Validates environment files and connectivity
  - Analyzes pod/service failures with structured reports
  - Provides 50+ common issue troubleshooting scenarios
- e2e-test: Intelligent test runner with auto-detection
  - Auto-detects test scope from code changes
  - Runs non-interactively with --no-teardown support
  - Automatically debugs failures using debug-environment

Expert Skills (Q&A agents):
- easy-db-lab-expert: General architecture and feature knowledge
  - Commands, configuration, observability, development patterns
- k8-expert: Kubernetes and Fabric8 manifest builder specialist
  - K3s architecture, StatefulSets, pod troubleshooting
- e2e-test-expert: Test suite specialist
  - Test scope selection, failure interpretation, cost estimates

Script Enhancement:
- bin/end-to-end-test: Add --no-teardown flag for automation
  - Enables non-interactive test execution
  - Exit code indicates pass (0) or fail (1)
  - Leaves cluster running for debugging

Each skill includes comprehensive documentation:
- SKILL.md: Agent instructions with dynamic context
- README.md: User documentation with examples
- reference/: Detailed troubleshooting and command guides
- templates/: Structured output formats

Integration:
- e2e-test automatically invokes debug-environment on failures
- Skills prevent duplicate cluster creation
- Experts reference each other for specialized questions
- Supports full automated test-diagnose-fix workflow
The end-to-end-test script handles building automatically and uses --clean flag to remove old files. No need for the skill to check builds or suggest cleaning - the script does everything.
Code Review - PR 589 (s3_tier storage policy)

Summary of findings:
The skill now actively monitors test execution and reports:
- Step transitions as they occur
- Step outcomes (pass/fail)
- Progress updates every few steps
- Immediate error detection
- Milestone completions

Provides full transparency during 15-60 minute test runs instead of silent execution.
CRITICAL SAFETY RULE: The e2e-test skill is now explicitly prohibited from using rm commands for any reason.

Why:
- Test script handles cleanup via --clean flag
- Files are never stale (clusters are fresh)
- Manual deletion can break active clusters
- User may need files for debugging

Added prominent warnings in both SKILL.md and README.md with a no-exceptions clause.
When tests fail, the e2e-test skill now uses agent teams (if available) for immediate parallel investigation.

Main agent:
- Processes test results
- Summarizes failed steps
- Prepares preliminary report

Team member agent (in parallel):
- Checks K8s pods and services
- Reviews logs and events
- SSHs to nodes to check systemd services
- Inspects cluster resources
- Identifies failure patterns

The team member is read-only (can inspect but not modify). Combined findings from both agents provide faster root cause diagnosis.

Fallback: if agent teams are unavailable, uses sequential debugging with the debug-environment skill.
Changed from single-agent execution to a coordinated multi-agent pattern.

Agent Roles:
1. Main Agent (Coordinator):
   - Determines test scope from code changes
   - Delegates test execution to team member
   - Relays progress to user
   - Coordinates investigation on failures
   - Synthesizes findings from multiple agents
   - Provides final recommendations
2. Test Runner Team Member:
   - Executes bin/end-to-end-test
   - Monitors output in real-time
   - Reports step transitions and outcomes
   - Provides test-level failure context
3. Investigation Team Member (on failure):
   - Investigates live cluster state (read-only)
   - Checks pods, services, logs, events
   - SSHs to nodes for systemd status
   - Reports cluster-level findings

Benefits:
- Parallel investigation starts immediately
- Multiple perspectives (test + cluster state)
- Faster time to diagnosis
- Coordinated single analysis from multiple sources

Fallback to sequential debugging if agent teams are unavailable.
PR Review: ClickHouse S3 Tier Storage Policy

Overall a clean, well-structured implementation consistent with existing patterns.

Issues:
1. Inaccurate PR description: only the s3_cache -> s3 rename happened, not default -> local.
2. Test using .first should use firstOrNull for clearer failures.
3. E2E test should verify the row count after the S3 move.
4. move_factor=1.0 semantics should be documented.

Positives: require() validation is idiomatic, boundary tests are thorough, s3TierMoveFactor has a correct default for backward compat, the from_env pattern is consistent, the no-teardown flag is useful for CI, and the docs are comprehensive.
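Issue (2) above in sketch form. The list contents and env var names are illustrative, not the project's actual test code.

```kotlin
// Sketch of the .first vs firstOrNull point from the review.
// With .first(), a missing element fails with a bare NoSuchElementException;
// firstOrNull plus an explicit message names exactly what was missing.
fun main() {
    val envVars = listOf("CLICKHOUSE_S3_CACHE_SIZE", "CLICKHOUSE_S3_TIER_MOVE_FACTOR")

    val moveFactor = envVars.firstOrNull { it == "CLICKHOUSE_S3_TIER_MOVE_FACTOR" }
        ?: error("CLICKHOUSE_S3_TIER_MOVE_FACTOR not found in manifest env vars")
    println(moveFactor)  // CLICKHOUSE_S3_TIER_MOVE_FACTOR
}
```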
Adds S3 cold tier support for ClickHouse:
- s3_tier storage policy: local NVMe as hot tier, S3 as cold tier (data moves automatically when local disk fills up)
- Disk renames: default → local, s3_cache → s3
- New --s3-tier-move-factor option on clickhouse init (default: 0.2 — moves data when local disk is 80% full)

Closes #582