Bulk write via S3 with replication #601
PR Review: Bulk Write via S3 with Replication

Overall this is a solid implementation. The

Issues

AWS dependencies applied to

In
The cassandra-analytics framework may call

LocalStack IAM service declared but never used

.withServices(Service.S3, Service.IAM)

IAM is never exercised in any of the tests — credentials come from system properties. This adds unnecessary LocalStack startup overhead. Drop Service.IAM.
Suggestions
OpenSpec docs are thorough.

Minor
Code Review

Overall this is a solid, well-documented implementation. The design doc is thorough and the spec-driven approach is clear. A few things worth addressing:

Issues
private DefaultCredentialsProvider credentialsProvider;
@Override
public void initialize(String jobId, SparkConf conf, boolean isOnDriver) {
// ...
this.credentialsProvider = DefaultCredentialsProvider.create();
}
The fix:
try {
this.region = new DefaultAwsRegionProviderChain().getRegion().id();
} catch (SdkClientException e) {
throw new IllegalStateException("Unable to detect AWS region. Set AWS_REGION...", e);
}

Minor Notes
This is a string constant, so there's no compile-time circular dependency, but it's a coupling smell: the shared common module now knows the FQCN of a consumer module's class. Consider whether this constant belongs in
Added to

LocalStack IAM service started but never used

The test starts LocalStack with
The integration tests cover missing and empty bucket names, but not the length boundaries (2-char and 64-char inputs). These are trivial to add as unit tests and would cover the logic in

What's Done Well
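The boundary tests suggested above might look like the following sketch. The `validateBucketName` here is a stand-in reimplementation of the length rule the review implies (3–63 characters valid, so 2 and 64 are the invalid boundaries) — it is not the PR's actual method, and the class name is made up for illustration:

```java
// Sketch of the missing length-boundary cases. validateBucketName is a
// stand-in for the PR's validator, enforcing only the 3-63 length rule
// the review refers to.
public class BucketNameBoundaryDemo {
    static void validateBucketName(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("bucket name must not be empty");
        }
        if (name.length() < 3 || name.length() > 63) {
            throw new IllegalArgumentException(
                "bucket name must be 3-63 characters, got " + name.length());
        }
    }

    // Helper so boundary checks read as plain booleans in tests.
    static boolean rejects(String name) {
        try {
            validateBucketName(name);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The four length boundaries: 2 and 64 invalid, 3 and 63 valid.
        System.out.println(rejects("ab"));            // true
        System.out.println(rejects("abc"));           // false
        System.out.println(rejects("a".repeat(63)));  // false
        System.out.println(rejects("a".repeat(64)));  // true
    }
}
```

Because they exercise pure logic, these cases need no LocalStack container, which is the reviewer's point about them being trivial to add as unit tests.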
Code Review - Overall this is a solid implementation. The design is well-thought-out, the constants refactoring in SparkJobConfig is clean, and the LocalStack integration tests are a good choice. A few things worth addressing.

ISSUES:

1. DefaultAwsRegionProviderChain.getRegion() never returns null - In EasyDbLabStorageExtension.java the null check on the result of getRegion() is unreachable. The SDK method throws SdkClientException when no region can be resolved - it never returns null. Fix by catching the exception instead of checking for null.

2. Stale credentials for long-running jobs - setCredentialChangeListener is a no-op with the comment about credentials being static from instance profile. This is inaccurate - instance profile credentials rotate approximately every hour. DefaultCredentialsProvider handles refresh internally when resolveCredentials() is called, but each getStorageConfiguration() call creates a new StorageCredentials snapshot. If cassandra-analytics calls getStorageConfiguration() once at job start and caches the result, credentials will expire mid-job without notification. The CredentialChangeListener exists precisely for this scenario. If the library supports it, it should be wired up. At minimum, update the comment to accurately describe the risk.

3. TRANSPORT_DIRECT constant is unused - SparkJobConfig.java adds TRANSPORT_DIRECT but this constant is not referenced anywhere in the PR. CLAUDE.md is explicit about not adding things that were not asked for. Remove it unless there is a concrete use in this change.

4. getBooleanProperty error handling - The new generic getProperty helper catches NumberFormatException, but Boolean::parseBoolean never throws - it silently returns false for any non-true string. Wrapping it in a NumberFormatException catch creates the misleading impression that invalid boolean values produce an error. A comment clarifying the silent-false behavior would help.

MINOR OBSERVATIONS:

5. Questionable test - lifecycleEvents_executeWithoutErrors only verifies that log-only methods do not throw exceptions. Per CLAUDE.md guidance on no mock-echo tests, tests that would not fail if you deleted the code under test are a concern.

6. getObjectTags API assumption - initialize_withCustomEndpoint_storesEndpoint asserts that getObjectTags() contains the endpoint entry, which depends on StorageTransportConfiguration constructor argument mapping to getObjectTags(). A comment noting this API contract would help future maintainers.

7. multipleExtensionInstances_canCoexist - assertions only check isNotNull(). Stronger assertions verifying the two configs reference different bucket names would make this test more meaningful.

POSITIVES:

Extracting all magic strings to constants in SparkJobConfig is a clear improvement. DefaultCredentialsProvider + DefaultAwsRegionProviderChain is the right approach, no credentials in Spark config. Error messages are specific and actionable, listing what to check when credentials or region detection fails. LocalStack via TestContainers is the right infrastructure choice here. The generic getProperty refactor eliminates real duplication in SparkJobConfig. Multi-DC expansion path is well-thought-out in the design doc.
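Item 4's claim that `Boolean::parseBoolean` never throws is easy to demonstrate. This standalone snippet is illustrative only (not code from the PR):

```java
// Boolean.parseBoolean returns true only for a case-insensitive "true";
// every other input -- including null -- silently yields false, so a
// NumberFormatException catch wrapped around it can never fire.
public class ParseBooleanDemo {
    public static void main(String[] args) {
        System.out.println(Boolean.parseBoolean("true"));  // true
        System.out.println(Boolean.parseBoolean("TRUE"));  // true (case-insensitive)
        System.out.println(Boolean.parseBoolean("yes"));   // false, no exception
        System.out.println(Boolean.parseBoolean("1"));     // false
        System.out.println(Boolean.parseBoolean(null));    // false, no NPE
    }
}
```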
PR Review: Bulk Write via S3 with Replication

This PR delivers a well-architected implementation that finally makes the S3_COMPAT bulk write transport functional. The design decisions are sound (instance profile credentials, auto-detected region, fail-fast validation), the OpenSpec artifacts are thorough, and the documentation is excellent. Here are my findings:

Bugs / Correctness

1. Credential rotation risk in

The most substantive concern: if

If cassandra-analytics calls

2. README references non-existent constants

The README states: "All property keys and transport mode constants are defined in

Neither

Code Quality

3. Hardcoded dependency versions in

Several test dependencies bypass the version catalog:

4. NPE risk in

The error message construction calls

5. Behavior change in

Previously

Test Coverage

6. System property cleanup in region detection test

The test that validates region detection via

7. Live integration tasks (5.1-5.13) still unchecked

The tasks.md correctly flags these as requiring a live EMR environment. That is acceptable, but the E2E step has been added to

End-to-End Test

8. Missing prerequisite checks in

The step checks

9. Unclear

The fourth positional argument to

Minor
Summary

The core implementation is solid and the approach is correct. Items 1 (credential rotation) and 2 (stale README constant names) should be addressed before merge. Items 3-5 are quality improvements worth fixing. The rest are minor or informational.
PR Review - Part 1 of 3: Issues 1-3

1. Credentials not validated at initialize() time (design doc violation)

The design doc says to fail fast during initialize() for missing credentials. But credentialsProvider.resolveCredentials() is only called inside getStorageConfiguration(), not during initialize(). If the instance profile is missing, the error surfaces too late. Fix: call resolveCredentials() once in initialize() to validate credentials are obtainable, then rely on the provider for fresh tokens at runtime.

2. Misleading Javadoc on credential caching

The comment on the credentialsProvider field says it avoids repeated IMDS calls if getStorageConfiguration() is called multiple times. But resolveCredentials() is still called on every invocation - the provider is cached, not the credentials. Clarify that the provider is cached so session tokens refresh automatically.

3. DefaultCredentialsProvider resource leak

DefaultCredentialsProvider implements Closeable but is never closed. This can leak background credential refresh threads. Call credentialsProvider.close() in onJobSucceeded/onJobFailed, or implement Closeable on the extension.
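The three issues combine into one lifecycle pattern: validate once at initialize, cache the provider (not the credentials), and close plus null the field when the job ends. This is a minimal sketch with a hand-rolled `CredentialsProvider` stand-in — the real code uses the AWS SDK's `DefaultCredentialsProvider`, and all names here are illustrative, not the PR's:

```java
// Stand-in for software.amazon.awssdk.auth.credentials.AwsCredentialsProvider.
// resolveCredentials() is assumed to throw if no credentials are obtainable.
interface CredentialsProvider extends java.io.Closeable {
    String resolveCredentials();

    @Override
    default void close() {}
}

class StorageExtensionSketch {
    private CredentialsProvider credentialsProvider;

    // Issue 1: resolve once during initialize() so a missing instance
    // profile fails the job immediately, not at the first upload.
    void initialize(CredentialsProvider provider) {
        provider.resolveCredentials(); // fail fast; result intentionally discarded
        this.credentialsProvider = provider; // issue 2: cache the provider, not the credentials
    }

    // Later calls get fresh (possibly rotated) tokens from the provider.
    String getCredentials() {
        if (credentialsProvider == null) {
            throw new IllegalStateException(
                "initialize() was not called, or the extension was already closed");
        }
        return credentialsProvider.resolveCredentials();
    }

    // Issue 3: release the provider when the job ends, and null the field
    // so any late caller fails fast with a clear message instead of using
    // a closed provider.
    void onJobFinished() {
        if (credentialsProvider != null) {
            credentialsProvider.close();
            credentialsProvider = null;
        }
    }
}
```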
PR Review Part 2: Issues 4-6 and Minor Observations

Issue 4 - Stale README reference to non-existent constant: spark/bulk-writer-s3/README.md line 839 states that EXTENSION_S3_STORAGE is a public constant in SparkJobConfig, but this constant does not exist. The extension class name lives in EasyDbLabStorageExtension.EXTENSION_CLASS_NAME. Update the README.

Issue 5 - Duplicate test cases: initialize_withShortBucketName_throwsIllegalArgumentException and initialize_withTwoCharBucketName_throwsIllegalArgumentException both use the exact same input. Remove the duplicate.

Issue 6 - getObjectTags() assumption in tests: Tests assert config.getObjectTags().containsEntry and config.getObjectTags().isEmpty(). This assumes StorageTransportConfiguration.getObjectTags() maps to the extraConfig constructor parameter. Confirm this is the correct accessor - if not, these tests pass trivially.

Minor: validateBucketName whitespace - a bucket name of spaces passes the empty check but fails at AWS. Consider .trim().isEmpty(). The getProperty null safety: defaultValue.getClass().getSimpleName() will NPE if defaultValue is null. EXTENSION_CLASS_NAME on EasyDbLabStorageExtension could move to SparkJobConfig as OPT_DATA_TRANSPORT_EXTENSION_CLASS_VALUE for consistency with other transport constants.
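The getProperty null-safety point can be sketched as follows. The signature is an assumption modeled on the review's description of SparkJobConfig's generic helper (the real method presumably reads from SparkConf, not a Map), so treat this as illustrative:

```java
import java.util.Map;
import java.util.function.Function;

// Sketch of a null-safe generic property getter. The key fix: never call
// defaultValue.getClass() in the error path, since defaultValue may be null.
final class PropertySketch {
    static <T> T getProperty(Map<String, String> conf, String key,
                             Function<String, T> parser, T defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue; // missing key: return the default, even if null
        }
        try {
            return parser.apply(raw);
        } catch (NumberFormatException e) {
            // Describe the expected type without dereferencing a possibly
            // null defaultValue -- this is the NPE the review points out.
            String typeName = defaultValue == null
                ? "expected type" : defaultValue.getClass().getSimpleName();
            throw new IllegalArgumentException(
                "Invalid " + typeName + " value for " + key + ": " + raw, e);
        }
    }
}
```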
PR Review Part 3: What's Good
PR Review: Bulk write via S3 with replication

Overall this is solid work. The design is well-thought-out, the documentation is thorough, and the test coverage with LocalStack is the right approach. A few issues worth addressing before merge.

Issues

1. Credentials not validated at initialize() time (design doc violation)

The design doc says to fail fast during

Fix: call

2. Misleading Javadoc on credential caching

The comment on the

3. DefaultCredentialsProvider resource leak
4. Stale README reference to non-existent constant
5. Duplicate test cases
6. getObjectTags() assumption in tests

Tests assert

Minor Observations
What's Good
Review posted below in individual sections. |
Code Review: Bulk write via S3 with replication

Overall this is a solid implementation. EasyDbLabStorageExtension is well-structured, the documentation is thorough, and using LocalStack integration tests over mocks is the right call. A few issues worth addressing:

Bug: Credentials provider used after close

closeCredentialsProvider() is called in both onJobSucceeded and onJobFailed. The lifecycle test lifecycleEvents_executeWithoutErrors calls onJobSucceeded (which closes the provider) before onStageFailed, onImportFailed, and onJobFailed. The null-check guards against NPE but if cassandra-analytics ever calls getStorageConfiguration() post-job-completion, the closed provider could throw. Consider setting credentialsProvider = null after closing so any subsequent getAwsCredentials() call fails fast with a clear message.

Resource leak: no cleanup on cancelled jobs

If the job is cancelled and neither onJobSucceeded nor onJobFailed is ever called, DefaultCredentialsProvider background threads are never released. Consider implementing java.io.Closeable on the extension or registering a JVM shutdown hook in initialize().

setCredentialChangeListener no-op comment is misleading

The comment says credentials are static from instance profile but instance profile credentials rotate every ~1 hour. DefaultCredentialsProvider handles rotation internally on each resolveCredentials() call, which is correct. The real question is whether cassandra-analytics caches the StorageCredentials object across stages or calls getStorageConfiguration() fresh each time. If it caches, long-running jobs could hit expired credentials. Worth clarifying this assumption in the comment.

Minor: AWS SDK version not in version catalog

software.amazon.awssdk auth and regions at 2.26.12 are hardcoded in build.gradle.kts. If the main project uses AWS SDK v2 anywhere else, these should come from the version catalog to avoid shadow JAR conflicts.

Minor: getBooleanProperty behavior change

The new getBooleanProperty throws IllegalArgumentException for non-true/false values where Boolean.parseBoolean() previously silently returned false. This is strictly better (fail fast), just worth noting it is a behavioral change.

Good things worth calling out

Constants in SparkJobConfig for all magic strings -- refactor across S3BulkWriter and DirectBulkWriter is clean. EXTENSION_CLASS_NAME constant prevents the class name string from drifting out of sync. Credentials provider is cached not the credentials, so session tokens refresh on each resolveCredentials() call. LocalStack integration tests are the right approach. Fail-fast credential validation in initialize() with actionable error messages is well done. docker compose up --wait plus Valkey healthcheck is a good fix for timing-dependent CI flakiness.
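The strict boolean parsing described above can be sketched like this. The method name and error text are assumptions, not the PR's exact code:

```java
// Strict boolean parsing: unlike Boolean.parseBoolean, anything other than
// "true"/"false" (case-insensitive) is rejected instead of silently
// becoming false.
final class StrictBoolean {
    static boolean parseStrict(String value) {
        if ("true".equalsIgnoreCase(value)) {
            return true;
        }
        if ("false".equalsIgnoreCase(value)) {
            return false;
        }
        throw new IllegalArgumentException(
            "Expected 'true' or 'false' but got: " + value);
    }
}
```

Under this scheme, inputs like `yes` or `1` that previously parsed leniently to `false` now fail fast, which is the behavioral change the review flags.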
Code Review

Overall this is a solid implementation of the S3 bulk write transport extension. Good use of fail-fast validation, proper credential provider lifecycle management, and TestContainers for integration testing. A few items to address:

Issues

1. Hardcoded dependency versions in
PR Review: Bulk write via S3 with replication (see full formatted review posted separately)
Need to get this change into sidecar to allow for IAM credentials: https://issues.apache.org/jira/browse/CASSSIDECAR-415 |
PR Review: Bulk write via S3 with replication

Good work overall. This implements a non-trivial cassandra-analytics integration with solid documentation, appropriate error handling, and a well-structured test suite. Here are my observations.

Strengths
Potential Issues

Issue 1: objectTags used to pass endpoint configuration

In getStorageConfiguration(), the endpoint is stored as an object tag via ImmutableMap.of("endpoint", endpoint). If cassandra-analytics applies objectTags as actual S3 object metadata tags (which is what the name implies), storing an endpoint URL here will not configure the S3 client to use that endpoint. The endpoint is correctly passed via OPT_STORAGE_CLIENT_ENDPOINT_OVERRIDE in S3BulkWriter.run(). Worth confirming what cassandra-analytics does with the objectTags parameter -- if it is purely for S3 object tags, this entry may be a no-op (harmless but confusing).

Issue 2: Endpoint not applied to StorageAccessConfiguration

The read/write StorageAccessConfiguration objects are constructed with (region, bucket, credentials) but no endpoint override. For LocalStack testing, the endpoint used by cassandra-analytics when Sidecar downloads SSTables would still point to real AWS -- unless cassandra-analytics picks it up from the write options. Worth verifying the LocalStack endpoint gets used end-to-end, not just for the upload path.

Issue 3 (potential blocker): PROP_S3_READ_BUCKETS is now mandatory

The extension throws IllegalArgumentException if spark.easydblab.s3.readBuckets is missing. However, the end-to-end test step (step_bulk_writer_s3) only passes --conf spark.easydblab.s3.bucket=$data_bucket with no readBuckets. This test may fail at extension initialize() with "Required property not set". Either the e2e test needs to pass readBuckets, or the extension should fall back to single-DC mode when readBuckets is omitted (write bucket = read bucket). The fallback approach would also make adoption easier for single-DC use cases.

Issue 4: Bucket name validation is incomplete

validateBucketName() only checks length and emptiness. AWS S3 bucket names must also be lowercase, contain no underscores, and not start/end with a hyphen. An invalid name would result in a confusing S3 API error rather than a clear validation message.

Minor Issues

Issue 5: Region detection order dependency in some tests

Tests like initialize_withEmptyBucketName_throwsIllegalArgumentException set PROP_S3_BUCKET to "" but not PROP_S3_READ_BUCKETS. Currently validateBucketName("") throws before region detection, so they pass. If check order ever changes, they would start failing with "Unable to detect AWS region" in CI. Setting AWS_REGION=us-east-1 via @BeforeAll would make this more robust.

Issue 6: Tasks.md has stale unchecked items

Tasks 2.3, 2.4, 2.8, 2.9, 2.10, 4.1, and 4.2 are marked unchecked but are actually implemented in the code. The unit tests in section 6 are also implemented in the integration test file. Worth checking these off before merging.

Summary

The core implementation is sound and well-documented. The main item to address before merging is issue 3 -- verify whether the e2e test step needs readBuckets added, or add a single-DC fallback to the extension. Everything else is minor or informational.
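A fuller validation along the lines of issue 4 might look like this sketch. The class name and regex are illustrative; the pattern covers the rules named in the review (3–63 characters, lowercase, no underscores, no leading/trailing hyphen) but AWS imposes a few more, e.g. rejecting IP-address-shaped names, which are omitted here:

```java
import java.util.regex.Pattern;

// Sketch of stricter S3 bucket name validation. The regex enforces:
// 3-63 chars total, lowercase letters/digits/dots/hyphens only, and a
// letter or digit at both ends (which also excludes leading/trailing
// hyphens and underscores anywhere).
final class S3BucketNames {
    private static final Pattern VALID =
        Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");

    static void validate(String name) {
        if (name == null || !VALID.matcher(name).matches()) {
            throw new IllegalArgumentException(
                "Invalid S3 bucket name: " + name
                + " (must be 3-63 characters: lowercase letters, digits,"
                + " dots or hyphens, starting and ending with a letter or digit)");
        }
    }
}
```

This turns the "confusing S3 API error" into a clear validation message at initialize() time.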
Full review below - posting in sections due to technical constraints. |
Code Review

This is a solid architectural change: replacing per-host SSH/systemd sidecar management with a K3s DaemonSet is cleaner and eliminates a lot of per-node coordination logic. The S3 bulk write transport extension is well-designed with good multi-DC support. A few issues worth addressing.

Bugs / Correctness Issues
Design / Code Quality
Test Coverage
Minor Nits
What Is Working Well
PR Review: Bulk write via S3 with replication

Substantial PR covering two major architectural changes: containerizing the Cassandra sidecar as a K3s DaemonSet, and adding EasyDbLabStorageExtension for S3_COMPAT bulk write transport. Direction is solid. A few things to look at:

Potential Bug: objectTags endpoint usage

In EasyDbLabStorageExtension.getStorageConfiguration(), objectTags is set to ImmutableMap.of("endpoint", endpoint) when a custom endpoint is configured. S3 object tags are metadata attached to stored objects, not routing configuration. The design doc shows an empty map here. Worth verifying how cassandra-analytics uses objectTags before merging.

Pinned Image Tags

Both the sidecar image default (ghcr.io/apache/cassandra-sidecar:latest) and the init container (busybox:latest) use floating latest tags. Image drift is silent - a sidecar upgrade could break bulk write compatibility without any config change. Consider pinning.

Silent no-ops when control host is missing

In Start.kt, Stop.kt, and Restart.kt, when getControlHost() returns null the sidecar operation is silently skipped with no user feedback. This should emit a warning event.

parseReadBuckets: empty keys/values accepted

The parser splits on colon but does not validate that clusterId or bucket are non-empty. A typo like dc1:bucket1,,dc2:bucket2 would silently produce an empty-string cluster ID.

Open tasks in tasks.md

Items 4.1, 4.2, 6.1-6.5, 7.1-7.2 are still unchecked. 4.1/4.2 appear done in the code. Unit tests 6.1-6.5 for parseReadBuckets edge cases would be fast and do not need containers.

Minor: DEFAULT_SIDECAR_IMAGE visibility

Start.DEFAULT_SIDECAR_IMAGE is public const but only used in Start.kt. SidecarManifestBuilder has its own identical DEFAULT_IMAGE private constant. Consolidate into Constants.kt or make the Start one private.

Positive notes
PR Review: Bulk write via S3 with replication

Good work overall. The migration from per-host systemd sidecar management to a K3s DaemonSet is the right architectural direction, and the new EasyDbLabStorageExtension is cleanly designed.

Issues

Start.kt - sidecar deploy failure is silently swallowed

onFailure emits a warning event but execution continues normally. If the sidecar DaemonSet fails to deploy, a subsequent bulk-write job will fail in a confusing way. Consider making this a hard failure if the sidecar is required.

Server.kt - VPC name resolution has a TOCTOU race

exists() and load() are two separate reads of cluster state. If the state file is deleted between them, load() will throw. Combine into a single runCatching block.

StatusCache.kt - exitProcess(0) bypasses graceful shutdown

The default onShutdown lambda calls exitProcess(0), which skips all Ktor lifecycle cleanup, log flushes, and in-flight request handling. Consider signaling the Ktor ApplicationEngine to shut down gracefully instead. The injectable onShutdown for testing is good design.

EasyDbLabStorageExtension.java - endpoint stored in object tags is unexplained

Using S3 object tags to carry an endpoint override is unconventional. It appears to be a workaround for a StorageTransportConfiguration API limitation. Add a comment explaining why this approach was chosen so future maintainers do not wonder if it was intentional.

EasyDbLabStorageExtension.java - parseReadBuckets does not validate empty parts

An entry like :mybucket or dc1: will silently produce an empty key or value. Add validation similar to validateBucketName.

Minor observations

SidecarManifestBuilder.kt - busybox:latest is unpinned

Pin to a stable version like busybox:1.36 to avoid surprises if the latest tag moves.

cassandra-sidecar.yaml - hardcoded default credentials

The username/password cassandra block is intentional for lab use, but a comment stating this explicitly would help future readers who might flag it as a security oversight.

SparkJobConfig.java - getBooleanProperty strictness is a breaking change

The old Boolean.parseBoolean silently returned false for non-true strings (e.g. yes, 1). The new implementation throws IllegalArgumentException for those values. This is the correct behavior, but worth noting in case any callers rely on the lenient parsing.

Positives
|
Code Review

This PR implements two major features: (1) containerizing the Cassandra sidecar as a K3s DaemonSet, and (2) adding S3 bulk write transport via a new EasyDbLabStorageExtension. Good overall direction.

Functional Bug: Pyroscope Hostname Label Will Not Expand

SidecarManifestBuilder.kt - buildJavaToolOptions

The method builds a JAVA_TOOL_OPTIONS string containing -Dpyroscope.labels=hostname:NODE_NAME_VAR,cluster:.... In Kotlin, the backslash-escaped NODE_NAME_VAR produces the literal string NODE_NAME_VAR in the container env. Kubernetes does not perform variable substitution inside env.value fields referencing other env vars - that only works in command/args arrays. So the Pyroscope hostname label will be the literal unexpanded string at runtime, not the actual node name. Fix: use a shell entrypoint wrapper that substitutes the NODE_NAME env var before starting the JVM.

Potential Bugs

Silent no-op on missing control host (Start.kt, Stop.kt, Restart.kt): If no control host exists in cluster state, the sidecar operation silently skips with zero user feedback. Should emit an event explaining why.

Race condition in VPC name resolution (Server.kt): clusterStateManager.exists() and clusterStateManager.load() are two separate calls - the state file could be deleted between them. Use a single try/catch around load().

getStorageConfiguration() NPE before initialize() (EasyDbLabStorageExtension.java): credentialsProvider, writeBucket, region etc. are all null until initialize() is called. Calling getStorageConfiguration() first throws an NPE with no meaningful message. Add a guard at the top of the method.

StatusCache.checkVpcExists() triggers on first refresh: The VPC check runs on the very first background cycle. A slow AWS response during startup could trigger spurious shutdown.

Missing Tests

SidecarManifestBuilder has no unit tests. The builder generates non-trivial K8s manifests (ConfigMap, DaemonSet with init container, hostNetwork, nodeSelector, Downward API env vars). At minimum, tests should verify: correct namespace, nodeSelector type=db, init container sed command, HOST_IP sourced from fieldRef status.hostIP, and volume name/mount path consistency.

Start command --sidecar-image flag is untested. StartTest verifies deploy() is called but never asserts the image argument matches the CLI flag value.

EasyDbLabStorageExtension.parseReadBuckets error paths are untested. No test exercises malformed input (no colon, empty cluster ID, empty bucket name).

StatusCacheTest - no test for autoShutdown=true with VPC found (should NOT shut down).

Architecture

DefaultSidecarService loads cluster state internally just to get clusterName. Since deploy() already takes controlHost and image, passing clusterName as a third parameter would remove the ClusterStateManager dependency, making the service simpler to test.

onShutdown defaults to exitProcess(0) in StatusCache. This bypasses Ktor graceful shutdown and JVM shutdown hooks. Consider signaling the Ktor engine to stop instead.

buildDaemonSet suppresses LongMethod at 130+ lines. Extract buildInitContainer(), buildMainContainer(), and buildVolumes() as private helpers.

Minor Issues

Summary

High-priority before merge:
Good overall direction -- containerizing the sidecar as a K3s DaemonSet is a cleaner architecture, and the S3 transport extension with LocalStack-based integration tests is solid work. A few things to flag before merging.

Potential Bug: Sidecar config path not passed to container

SidecarManifestBuilder mounts the processed config at /conf/sidecar.yaml but never tells the sidecar container where to find it. The original systemd service used -Dsidecar.config=file:///etc/cassandra-sidecar/cassandra-sidecar.yaml. Without an equivalent flag in the container args or JAVA_TOOL_OPTIONS, the sidecar falls back to the image default path -- which will not be /conf/sidecar.yaml. The main container needs .withArgs("-Dsidecar.config=file:///conf/sidecar.yaml") or equivalent.

Silent failure when control host is missing (Start.kt)

Stop and Restart emit a SidecarStopping/SidecarRestarting event before their early-return guard, so the user sees something. Start just silently skips deployment when controlHost is null. At minimum emit Event.Cassandra.SidecarStartFailed or a warning event in the else branch.

Floating image tags

Both busybox:latest (init container) and the default ghcr.io/apache/cassandra-sidecar:latest are unpinned. Consider pinning busybox to a specific version (e.g. busybox:1.36) so pod restarts do not silently pull a different image.

DEFAULT_SIDECAR_IMAGE constant duplication

The same image string appears in Start.DEFAULT_SIDECAR_IMAGE and SidecarManifestBuilder.DEFAULT_IMAGE. They should share one source of truth -- expose the constant from SidecarManifestBuilder and reference it from Start, or move both to Constants.

parseReadBuckets -- empty clusterId/bucket not validated

Input like 'dc1:bucket1,,dc2:bucket2' or 'dc1:' produces empty keys/values in the map. Worth adding a check after splitting on colon to throw IllegalArgumentException if either clusterId or bucket is empty.

Positive observations
PR Review: Bulk write via S3 with replication

Good work overall. The S3 transport is clean, the Fabric8-based SidecarManifestBuilder is well-structured, and the TestContainers/LocalStack integration tests are exactly right. A few issues to address:

Bug: Missing spark.easydblab.s3.readBuckets in e2e test step

EasyDbLabStorageExtension.initialize() unconditionally requires PROP_S3_READ_BUCKETS and throws IllegalArgumentException if missing. But step_bulk_writer_s3 only passes spark.easydblab.s3.bucket. For a single-cluster test the read bucket equals the write bucket, so either make PROP_S3_READ_BUCKETS optional for single-DC or add it to the step (e.g. --conf spark.easydblab.s3.readBuckets=DC:BUCKET). This will fail at runtime as-is.

Missing K3s integration test for SidecarManifestBuilder

configuration/CLAUDE.md is explicit: all manifest builders under configuration/ MUST be tested with K3s TestContainers (apply test, image-pull test, no-resource-limits test). SidecarManifestBuilder is not in K8sServiceIntegrationTest.collectAllResources(). SidecarServiceTest.kt covers service orchestration but does not satisfy the manifest builder requirement.

Duplicate default sidecar image constant

SidecarManifestBuilder.kt:36 declares private const val DEFAULT_IMAGE and Start.kt:40 duplicates the same string as DEFAULT_SIDECAR_IMAGE. Making it internal const val in the companion would let Start.kt reference it directly.

Minor: credential resolution on every getStorageConfiguration() call

EasyDbLabStorageExtension.java:130 calls credentialsProvider.resolveCredentials() on every invocation. The provider is cached but resolution can still trigger an IMDS call. Low impact for bulk writes, but worth noting if cassandra-analytics calls this method repeatedly.

What is good

docker compose --wait is a solid improvement over up -d. Port 1080 conflict check prevents confusing failures. bin/dev auto-starting the container is better UX. EasyDbLabStorageExtensionIntegrationTest with LocalStack is the right approach. SidecarManifestBuilder is clean Fabric8 with no raw YAML strings.
PR Review: Bulk write via S3 with replication. This is a solid PR. A few issues worth addressing:

1) POTENTIAL BUG: NODE_NAME may not resolve in JAVA_TOOL_OPTIONS - SidecarManifestBuilder.kt builds the Pyroscope label as a literal string dollar-sign(NODE_NAME) inside JAVA_TOOL_OPTIONS. The JVM does not perform shell-style env var substitution in system properties. Verify that the Pyroscope Java agent supports this interpolation in label values. If not, every sidecar pod will report hostname=dollar-sign(NODE_NAME) instead of the actual node name. Fix: use a separate PYROSCOPE_LABELS env var where K8s natively expands dollar-sign(NODE_NAME) references.

2) SILENT SKIP when control host is missing - Stop.kt and Restart.kt silently return with no event if no control host. Same issue in Start.kt. Should emit a warning event.

3) DUPLICATE DEFAULT IMAGE CONSTANT - Start.kt and SidecarManifestBuilder.kt both hardcode the same default image. SidecarManifestBuilder.DEFAULT_IMAGE is private so Start.kt cannot reference it. Move to Constants.

4) objectTags LOOKS WRONG - EasyDbLabStorageExtension passes the S3 endpoint URL as an object tag (ImmutableMap.of(endpoint, endpoint)). Object tags are S3 metadata tags, not an endpoint configuration mechanism. How does cassandra-analytics actually use this?

5) CREDENTIALS IN CONFIGMAP - cassandra-sidecar.yaml stores username and password in a ConfigMap, not a Secret. ConfigMaps are unencrypted. Move to a K8s Secret.

6) NO UNIT TESTS FOR SidecarManifestBuilder - SidecarServiceTest only checks that applyResource() is called twice but does not assert anything about the resources built. Deserves a dedicated test verifying DaemonSet properties (hostNetwork, nodeSelector, volumes, etc).

7) MINOR: parseReadBuckets with empty string throws a confusing error. Add an empty-string guard.

8) UNCHECKED TASKS in s3-bulk-write-transport/tasks.md (2.3, 2.4, 2.8-2.10, 4.1, 4.2, test tasks) appear implemented but are not checked.

Overall: solid direction. The auto-shutdown watchdog is well-tested with the injectable onShutdown lambda. The Pyroscope NODE_NAME label issue is the most critical thing to verify before merging.
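The empty-string guard from item 7 (and the empty-part validation other reviewers raise) might look like this sketch. The input format `clusterId:bucket,...` is taken from the review's examples; the method is written in Java for illustration and is not the PR's actual parser:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a validating readBuckets parser: reject a blank spec up front,
// and reject any entry with a missing/blank clusterId or bucket instead of
// letting "dc1:bucket1,,dc2:" slip through with empty keys or values.
final class ReadBucketsParser {
    static Map<String, String> parseReadBuckets(String spec) {
        if (spec == null || spec.trim().isEmpty()) {
            throw new IllegalArgumentException(
                "readBuckets must not be empty; expected 'clusterId:bucket[,clusterId:bucket...]'");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : spec.split(",")) {
            String[] parts = entry.split(":", 2);
            if (parts.length != 2
                    || parts[0].trim().isEmpty()
                    || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException(
                    "Malformed readBuckets entry: '" + entry
                    + "' (expected 'clusterId:bucket')");
            }
            result.put(parts[0].trim(), parts[1].trim());
        }
        return result;
    }
}
```

With this shape, `dc1:bucket1,dc2:bucket2` parses cleanly while `""`, `dc1:`, `:mybucket`, and the double-comma typo each fail with an error naming the offending entry.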
PR Review: Bulk write via S3 with replication

This is a well-structured PR. The architectural shift from SSH/systemd sidecar management to a K3s DaemonSet approach is clean and aligns with the project's fabric8-first design. The S3 transport extension is well-documented and the test coverage with LocalStack is solid. A few things worth addressing.

Medium Issues

1. EasyDbLabStorageExtension.getStorageConfiguration() will NPE if called before initialize()

credentialsProvider is uninitialized until initialize() runs. The cassandra-analytics framework might call getStorageConfiguration() before initialize() in some error paths. A guard or lazy init would prevent a confusing NPE.

2. Start.kt silently skips sidecar deploy when control host is absent

If getControlHost() returns null, sidecar deployment is silently skipped with no user-facing feedback. The user would see Cassandra start but not know the sidecar was not deployed. Should emit a warning event, or throw if sidecar is always expected.

3. parseReadBuckets does not validate empty cluster IDs or bucket names

After splitting on ':', clusterId and bucket can be empty strings (e.g. input ':bucket' or 'dc1:'). These would pass through and cause confusing failures downstream. Add a check after splitting.

4. System properties mutation in EasyDbLabStorageExtensionIntegrationTest is global state

System.setProperty is JVM-global. If tests run in parallel, LocalStack credentials could bleed into other tests that resolve AWS credentials. Consider a @ResourceLock annotation.

Minor Issues

5. busybox:latest in init container - pin the version

Using :latest can cause unexpected behavior if the image is updated. Pin to a specific version (e.g. busybox:1.36). Same applies to the default sidecar image in Start.kt.

6. Endpoint via object tags is non-obvious - add a comment

Passing the custom endpoint via object tags is an unusual pattern. A brief comment explaining how cassandra-analytics consumes it would prevent confusion for future contributors.

7. Removed host field from sidecar failure events is a serialization breaking change

SidecarStopFailed, SidecarStartFailed, SidecarRestartFailed all lost the host: String field. Since these are @serializable with @SerialName, any in-flight serialized event payloads (Redis, MCP) would fail deserialization. Fine for this tool's use case, just worth noting if Redis event history is consumed anywhere.

8. @Suppress(LongMethod) on buildDaemonSet

Suppressing the detekt warning rather than addressing it. The method could be split into buildInitContainerSpec() and buildMainContainerSpec() helpers to remove the suppression.

What is done well
…emonSet - New spark/bulk-writer-s3 module: bulk writes via S3-staged SSTables using cassandra-analytics S3_COMPAT transport - Cassandra sidecar replaced from packer-built systemd service to K3s DaemonSet; lifecycle (start/stop/restart) now managed via K8s instead of per-node SSH - Server gains --auto-shutdown flag for VPC-based lifecycle management
6f0077c to c63fc88
PR Review: Bulk write via S3 with replication

This is a substantial PR covering three distinct features: containerizing the Cassandra sidecar as a K3s DaemonSet, adding the

Mutable image tags in
Merging this now, even though it's not complete. Moving the sidecar to a container cuts 15 minutes off the build process, so it's a win.
No description provided.