feat: Add S3 bucket configuration for path-style URL support.#48
feat: Add S3 bucket configuration for path-style URL support.#4820001020ycx wants to merge 11 commits intoy-scope:presto-0.293-clp-connectorfrom
Conversation
…pe/velox into 2026-01-22-update-clp
|
code review pending merge of this pr: #47 |
📝 WalkthroughWalkthroughThis change adds S3 bucket configuration support to the ClpPackageS3AuthProvider by introducing a new bucket configuration parameter, modifying S3 URL construction logic to conditionally include the bucket in the path, and updating validation and environment variable handling accordingly. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)
32-54: Add validation to prevent empty region values from breaking AWS SigV4 signing.Configuration key
clp.s3-regioncan be explicitly set to an empty string, which overrides the default value and causes an emptyAWS_DEFAULT_REGIONenvironment variable to be exported. AWS SigV4 signing requires a non-empty region in the credential scope; SDKs treat an explicitly empty region as invalid and do not fall back to other region providers, causing authentication failures.Add a
VELOX_CHECKto reject empty regions, consistent with existing validation foraccessKeyIdandsecretAccessKey:Proposed fix
bucket_ = config_->get<std::string>(kBucket, ""); region_ = config_->get<std::string>(kRegion, kDefaultRegion); + VELOX_CHECK(!region_.empty(), fmt::format("{} cannot be empty", kRegion));
🤖 Fix all issues with AI agents
In `@CMake/resolve_dependency_modules/clp.cmake`:
- Around line 16-20: The FetchContent_Declare entry pulling clp at GIT_TAG
v0.8.0 may introduce breaking changes; verify compatibility by running connector
integration and unit tests against clp v0.8.0 and inspect codepaths that build
or parse CLP queries/configs (look for uses of KQL date(), any code expecting
database.name, and timestamp literal handling). If incompatible, either pin to a
compatible clp tag or update the connector: replace KQL date() usages with
timestamp(), migrate any code that reads/writes database.name to handle
database.names as a dictionary, and adjust parsing/validation logic that assumed
timestamp literals match string columns. Finally, update tests and CI matrix and
document the chosen clp version in the dependency declaration
(FetchContent_Declare for clp).
In `@velox/connectors/clp/tests/ClpConfigTest.cpp`:
- Around line 249-289: The file fails CI due to clang-format style changes; run
your project's clang-format (or run `clang-format -i`) on
velox/connectors/clp/tests/ClpConfigTest.cpp and commit the reformatted file so
the tests like TEST_F(ClpPackageS3AuthProviderTest, constructS3UrlWithBucket)
and constructS3UrlWithoutBucket, and helper usages such as
buildClpPackageS3AuthProvider and VELOX_CHECK_EQ are preserved but correctly
formatted according to the repository style; ensure no functional changes are
made and only formatting is updated before pushing.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp`:
- Around line 39-44: The check currently rejects any endpoint containing
"amazonaws.com" when clp.s3-bucket is set; instead parse the endpoint hostname
from endPoint_ and only error when the hostname is virtual-hosted (i.e., the
hostname starts with bucket_ + "." and contains "amazonaws.com"), leaving
path-style endpoints like "s3.us-east-1.amazonaws.com" allowed. Update the
VELOX_CHECK condition to parse the host from endPoint_ and replace the simple
substring test with a check that detects hostnames starting with bucket_ + "."
and containing "amazonaws.com" (still referencing bucket_, endPoint_, kBucket
and the VELOX_CHECK call).
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)
53-86: Address clang-format pipeline failure.The pipeline reports clang-format changes are required. Please run
clang-formaton this file and commit the formatted changes.The logic itself is correct:
- Endpoint is retrieved and validated for non-emptiness with trailing slash removal.
- Bucket validation correctly prevents setting
bucket_when using virtual-hosted style endpoints.- Environment variable handling properly unsets
AWS_SESSION_TOKENwhen empty.#!/bin/bash # Check the clang-format diff for this file clang-format --dry-run --Werror velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp 2>&1 || echo "Formatting changes required"
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp`:
- Around line 23-41: The isAwsVirtualHostedStyleEndpoint function currently
requires a scheme and the literal ".s3." which misses valid virtual-hosted
endpoints and scheme-less URIs; update it to robustly extract the hostname
(strip optional "scheme://" and optional ":port"), then split the hostname by
'.' and classify as virtual-hosted if the first label is NOT "s3" and does NOT
start with "s3-" (e.g., "s3-accelerate"), but still ensure the endpoint contains
"amazonaws.com" (or relevant AWS host suffix) before returning true; modify
isAwsVirtualHostedStyleEndpoint accordingly and ensure constructS3Url uses this
improved function to avoid appending the bucket twice.
| namespace { | ||
| // Detects AWS S3 virtual-hosted style endpoints where bucket is in the | ||
| // hostname. Virtual-hosted: https://<bucket>.s3.<region>.amazonaws.com (returns | ||
| // true) Path-style: https://s3.<region>.amazonaws.com (returns false) | ||
| bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) { | ||
| if (endpoint.find("amazonaws.com") == std::string::npos) { | ||
| return false; | ||
| } | ||
|
|
||
| auto schemeEnd = endpoint.find("://"); | ||
| if (schemeEnd == std::string::npos) { | ||
| return false; | ||
| } | ||
|
|
||
| // Virtual-hosted has ".s3." in hostname (bucket.s3.region.amazonaws.com) | ||
| // Path-style starts with "s3." (s3.region.amazonaws.com) | ||
| auto hostPortion = endpoint.substr(schemeEnd + 3); | ||
| return hostPortion.find(".s3.") != std::string::npos; | ||
| } |
There was a problem hiding this comment.
Strengthen virtual-hosted detection to avoid false negatives.
The current heuristic requires a scheme and a literal “.s3.” in the host. That misses valid AWS virtual‑hosted endpoints like bucket.s3-accelerate.amazonaws.com and any endpoint configured without a scheme. In those cases bucket_ is incorrectly allowed and constructS3Url will append the bucket again, producing a broken URL. Consider parsing the host with optional scheme and classifying virtual‑hosted when the first label is not s3/s3-*.
🔧 Suggested fix
-bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) {
- if (endpoint.find("amazonaws.com") == std::string::npos) {
- return false;
- }
-
- auto schemeEnd = endpoint.find("://");
- if (schemeEnd == std::string::npos) {
- return false;
- }
-
- // Virtual-hosted has ".s3." in hostname (bucket.s3.region.amazonaws.com)
- // Path-style starts with "s3." (s3.region.amazonaws.com)
- auto hostPortion = endpoint.substr(schemeEnd + 3);
- return hostPortion.find(".s3.") != std::string::npos;
-}
+bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) {
+ // Allow endpoints with or without a scheme.
+ size_t hostStart = endpoint.find("://");
+ hostStart = (hostStart == std::string::npos) ? 0 : hostStart + 3;
+ size_t hostEnd = endpoint.find('/', hostStart);
+ auto hostPort = endpoint.substr(hostStart, hostEnd - hostStart);
+ auto host = hostPort.substr(0, hostPort.find(':'));
+
+ if (host.find("amazonaws.com") == std::string::npos) {
+ return false;
+ }
+
+ auto firstDot = host.find('.');
+ if (firstDot == std::string::npos) {
+ return false;
+ }
+ auto firstLabel = host.substr(0, firstDot);
+ return firstLabel.rfind("s3", 0) != 0;
+}🤖 Prompt for AI Agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp` around lines 23
- 41, The isAwsVirtualHostedStyleEndpoint function currently requires a scheme
and the literal ".s3." which misses valid virtual-hosted endpoints and
scheme-less URIs; update it to robustly extract the hostname (strip optional
"scheme://" and optional ":port"), then split the hostname by '.' and classify
as virtual-hosted if the first label is NOT "s3" and does NOT start with "s3-"
(e.g., "s3-accelerate"), but still ensure the endpoint contains "amazonaws.com"
(or relevant AWS host suffix) before returning true; modify
isAwsVirtualHostedStyleEndpoint accordingly and ensure constructS3Url uses this
improved function to avoid appending the bucket twice.
This reverts commit 6476f4f.
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)
35-59: Re-checkendPoint_after trimming trailing slashes.
If the config is/(or all slashes),pop_back()makes it empty and later URL construction produces a broken path. Trim all trailing slashes and validate again.🛠️ Proposed fix
- if ('/' == endPoint_.back()) { - endPoint_.pop_back(); - } + while (!endPoint_.empty() && '/' == endPoint_.back()) { + endPoint_.pop_back(); + } + VELOX_CHECK(!endPoint_.empty(), fmt::format("{} cannot be empty", kEndPoint));
gibber9809
left a comment
There was a problem hiding this comment.
This PR LGTM. PR title as well.
I think it's worth coming back to how the actual credentials are configured + passed on to clp-s to fit better with the rest of the package, but that can be improved in a different PR.
Description
Background: S3 Support Before and After CLP-S v0.8.0 Update
Before CLP-S v0.8.0 Update: AWS S3 Only
The old CLP-S library only supported AWS S3 URLs with the following constraints:
https://onlyamazonaws.comonly (hardcoded validation)Supported URL formats:
https://logs.s3.us-east-1.amazonaws.com/archives/default/abc123
https://s3.us-east-1.amazonaws.com/logs/archives/default/abc123
Rejected URLs:
http://172.26.105.44:9000/logs/archives/default/abc123 ❌ (not amazonaws.com)
http://localhost:4566/logs/archives/default/abc123 ❌ (not https)
After CLP-S v0.8.0 Update: S3-Compatible Storage Support
The updated CLP-S library now supports any S3-compatible storage:
http://orhttps://Now supported URL formats:
https://logs.s3.us-east-1.amazonaws.com/archives/default/abc123 ✅ (AWS S3)
http://172.26.105.44:9000/logs/archives/default/abc123 ✅ (MinIO)
http://localhost:4566/logs/archives/default/abc123 ✅ (LocalStack)
Why This Change Is Needed
The Velox CLP connector constructs S3 URLs by concatenating the endpoint with the split path:
Before this PR:
URL = endpoint + "/" + splitPath
= "http://172.26.105.44:9000" + "/" + "archives/default/abc123"
= "http://172.26.105.44:9000/archives/default/abc123" ❌ Missing bucket!
CLP-S expects the URL to include the bucket name for path-style URLs. Without the bucket, the URL cannot be parsed correctly.
After this PR (explicit bucket configuration):
clp.s3-end-point=http://172.26.105.44:9000
clp.s3-bucket=logs
URL = endpoint + "/" + bucket + "/" + splitPath
= "http://172.26.105.44:9000" + "/" + "logs" + "/" + "archives/default/abc123"
= "http://172.26.105.44:9000/logs/archives/default/abc123" ✅
Checklist
breaking change.
Validation performed
All unit tests passed.
End to end test
SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100CLP package's clp-config.yaml:
Velox's clp.properties:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
✏️ Tip: You can customize this high-level summary in your review settings.