Skip to content

feat: Add S3 bucket configuration for path-style URL support.#48

Open
20001020ycx wants to merge 11 commits intoy-scope:presto-0.293-clp-connectorfrom
20001020ycx:2026-01-26-s3-support
Open

feat: Add S3 bucket configuration for path-style URL support.#48
20001020ycx wants to merge 11 commits intoy-scope:presto-0.293-clp-connectorfrom
20001020ycx:2026-01-26-s3-support

Conversation

@20001020ycx
Copy link

@20001020ycx 20001020ycx commented Jan 27, 2026

Description

Background: S3 Support Before and After CLP-S v0.8.0 Update

Before CLP-S v0.8.0 Update: AWS S3 Only

The old CLP-S library only supported AWS S3 URLs with the following constraints:

  • Protocol: https:// only
  • Endpoint: amazonaws.com only (hardcoded validation)

Supported URL formats:
https://logs.s3.us-east-1.amazonaws.com/archives/default/abc123
https://s3.us-east-1.amazonaws.com/logs/archives/default/abc123

Rejected URLs:
http://172.26.105.44:9000/logs/archives/default/abc123 ❌ (not amazonaws.com)
http://localhost:4566/logs/archives/default/abc123 ❌ (not https)

After CLP-S v0.8.0 Update: S3-Compatible Storage Support

The updated CLP-S library now supports any S3-compatible storage:

  • Protocol: http:// or https://
  • Endpoint: Any hostname with optional port

Now supported URL formats:
https://logs.s3.us-east-1.amazonaws.com/archives/default/abc123 ✅ (AWS S3)
http://172.26.105.44:9000/logs/archives/default/abc123 ✅ (MinIO)
http://localhost:4566/logs/archives/default/abc123 ✅ (LocalStack)

Why This Change Is Needed

The Velox CLP connector constructs S3 URLs by concatenating the endpoint with the split path:

Before this PR:
URL = endpoint + "/" + splitPath
= "http://172.26.105.44:9000" + "/" + "archives/default/abc123"
= "http://172.26.105.44:9000/archives/default/abc123" ❌ Missing bucket!

CLP-S expects the URL to include the bucket name for path-style URLs. Without the bucket, the URL cannot be parsed correctly.

After this PR (explicit bucket configuration):
clp.s3-end-point=http://172.26.105.44:9000
clp.s3-bucket=logs

URL = endpoint + "/" + bucket + "/" + splitPath
= "http://172.26.105.44:9000" + "/" + "logs" + "/" + "archives/default/abc123"
= "http://172.26.105.44:9000/logs/archives/default/abc123" ✅

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and that has been indicated in the PR title, OR this isn't a
    breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

  • All unit tests passed.

  • End to end test

    • Query: SELECT CLP_GET_JSON_STRING() from clp.default.default limit 100
    • Set up: official clp package v0.8.0, OSS version of Presto, velox with changes in this PR, and a locally set minio. The related configuration is shown below.
    • The result is returned as expected.

CLP package's clp-config.yaml:

package:
  storage_engine: "clp-s"
  query_engine: "presto"

archive_output:
  storage:
    # type: "fs"
    type: "s3"
    staging_directory: "var/data/staged-archives"
    s3_config:
      region_code: "us-east-1"
      endpoint_url: "http://172.26.105.44:9000"
      bucket: "logs"
      key_prefix: "archives/"
      aws_authentication:
        type: "credentials"
        credentials:
          access_key_id: "minioadmin"
          secret_access_key: "minioadmin"

Velox's clp.properties:

  connector.name=clp                                                                                                                                                                                                             
  clp.storage-type=s3                                                                                                                                                                                                            
  clp.s3-auth-provider=clp_package                                                                                                                                                                                               
  clp.s3-end-point=http://172.26.105.44:9000                                                                                                                                                                                     
  clp.s3-bucket=logs                                                                                                                                                                                                
  clp.s3-access-key-id=minioadmin                                                                                                                                                                                                
  clp.s3-secret-access-key=minioadmin   

Summary by CodeRabbit

Release Notes

  • New Features

    • Added S3 bucket configuration support to enable more flexible storage endpoint configuration.
    • Enhanced S3 URL construction to support both path-style and virtual-hosted-style endpoints.
  • Bug Fixes

    • Improved endpoint validation and URL formatting for S3-compatible storage systems.

✏️ Tip: You can customize this high-level summary in your review settings.

@20001020ycx
Copy link
Author

code review pending merge of this pr: #47

@coderabbitai
Copy link

coderabbitai bot commented Jan 27, 2026

📝 Walkthrough

Walkthrough

This change adds S3 bucket configuration support to the ClpPackageS3AuthProvider by introducing a new bucket configuration parameter, modifying S3 URL construction logic to conditionally include the bucket in the path, and updating validation and environment variable handling accordingly.

Changes

Cohort / File(s) Summary
S3 Auth Provider Header Updates
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.h
Adds public constant kBucket and private data member bucket_ to store configured S3 bucket name.
S3 Auth Provider Implementation Updates
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp
Modifies constructS3Url() to branch on empty bucket and format URLs as endPoint_/splitPath or endPoint_/bucket_/splitPath. Updates exportAuthEnvironmentVariables() to retrieve and validate endPoint_ from config, trim trailing slashes, retrieve bucket from config, and explicitly unset session token when empty.
Test Suite Updates
velox/connectors/clp/tests/ClpConfigTest.cpp
Adds tests for bucket configuration support and introduces three new URL construction test cases covering path-style and virtual-hosted-style S3 endpoints with and without bucket parameters.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding S3 bucket configuration to support path-style URL construction.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)

32-54: Add validation to prevent empty region values from breaking AWS SigV4 signing.

Configuration key clp.s3-region can be explicitly set to an empty string, which overrides the default value and causes an empty AWS_DEFAULT_REGION environment variable to be exported. AWS SigV4 signing requires a non-empty region in the credential scope; SDKs treat an explicitly empty region as invalid and do not fall back to other region providers, causing authentication failures.

Add a VELOX_CHECK to reject empty regions, consistent with existing validation for accessKeyId and secretAccessKey:

Proposed fix
   bucket_ = config_->get<std::string>(kBucket, "");
   region_ = config_->get<std::string>(kRegion, kDefaultRegion);
+  VELOX_CHECK(!region_.empty(), fmt::format("{} cannot be empty", kRegion));
🤖 Fix all issues with AI agents
In `@CMake/resolve_dependency_modules/clp.cmake`:
- Around line 16-20: The FetchContent_Declare entry pulling clp at GIT_TAG
v0.8.0 may introduce breaking changes; verify compatibility by running connector
integration and unit tests against clp v0.8.0 and inspect codepaths that build
or parse CLP queries/configs (look for uses of KQL date(), any code expecting
database.name, and timestamp literal handling). If incompatible, either pin to a
compatible clp tag or update the connector: replace KQL date() usages with
timestamp(), migrate any code that reads/writes database.name to handle
database.names as a dictionary, and adjust parsing/validation logic that assumed
timestamp literals match string columns. Finally, update tests and CI matrix and
document the chosen clp version in the dependency declaration
(FetchContent_Declare for clp).

In `@velox/connectors/clp/tests/ClpConfigTest.cpp`:
- Around line 249-289: The file fails CI due to clang-format style changes; run
your project's clang-format (or run `clang-format -i`) on
velox/connectors/clp/tests/ClpConfigTest.cpp and commit the reformatted file so
the tests like TEST_F(ClpPackageS3AuthProviderTest, constructS3UrlWithBucket)
and constructS3UrlWithoutBucket, and helper usages such as
buildClpPackageS3AuthProvider and VELOX_CHECK_EQ are preserved but correctly
formatted according to the repository style; ensure no functional changes are
made and only formatting is updated before pushing.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp`:
- Around line 39-44: The check currently rejects any endpoint containing
"amazonaws.com" when clp.s3-bucket is set; instead parse the endpoint hostname
from endPoint_ and only error when the hostname is virtual-hosted (i.e., the
hostname starts with bucket_ + "." and contains "amazonaws.com"), leaving
path-style endpoints like "s3.us-east-1.amazonaws.com" allowed. Update the
VELOX_CHECK condition to parse the host from endPoint_ and replace the simple
substring test with a check that detects hostnames starting with bucket_ + "."
and containing "amazonaws.com" (still referencing bucket_, endPoint_, kBucket
and the VELOX_CHECK call).

@20001020ycx 20001020ycx changed the title feat: Add S3 bucket and region configuration for S3-compatible storage support. feat: Add S3 bucket configuration for path-style URL support. Jan 27, 2026
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)

53-86: Address clang-format pipeline failure.

The pipeline reports clang-format changes are required. Please run clang-format on this file and commit the formatted changes.

The logic itself is correct:

  • Endpoint is retrieved and validated for non-emptiness with trailing slash removal.
  • Bucket validation correctly prevents setting bucket_ when using virtual-hosted style endpoints.
  • Environment variable handling properly unsets AWS_SESSION_TOKEN when empty.
#!/bin/bash
# Check the clang-format diff for this file
clang-format --dry-run --Werror velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp 2>&1 || echo "Formatting changes required"

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp`:
- Around line 23-41: The isAwsVirtualHostedStyleEndpoint function currently
requires a scheme and the literal ".s3." which misses valid virtual-hosted
endpoints and scheme-less URIs; update it to robustly extract the hostname
(strip optional "scheme://" and optional ":port"), then split the hostname by
'.' and classify as virtual-hosted if the first label is NOT "s3" and does NOT
start with "s3-" (e.g., "s3-accelerate"), but still ensure the endpoint contains
"amazonaws.com" (or relevant AWS host suffix) before returning true; modify
isAwsVirtualHostedStyleEndpoint accordingly and ensure constructS3Url uses this
improved function to avoid appending the bucket twice.

Comment on lines 23 to 41
namespace {
// Detects AWS S3 virtual-hosted style endpoints where bucket is in the
// hostname. Virtual-hosted: https://<bucket>.s3.<region>.amazonaws.com (returns
// true) Path-style: https://s3.<region>.amazonaws.com (returns false)
bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) {
if (endpoint.find("amazonaws.com") == std::string::npos) {
return false;
}

auto schemeEnd = endpoint.find("://");
if (schemeEnd == std::string::npos) {
return false;
}

// Virtual-hosted has ".s3." in hostname (bucket.s3.region.amazonaws.com)
// Path-style starts with "s3." (s3.region.amazonaws.com)
auto hostPortion = endpoint.substr(schemeEnd + 3);
return hostPortion.find(".s3.") != std::string::npos;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Strengthen virtual-hosted detection to avoid false negatives.

The current heuristic requires a scheme and a literal “.s3.” in the host. That misses valid AWS virtual‑hosted endpoints like bucket.s3-accelerate.amazonaws.com and any endpoint configured without a scheme. In those cases bucket_ is incorrectly allowed and constructS3Url will append the bucket again, producing a broken URL. Consider parsing the host with optional scheme and classifying virtual‑hosted when the first label is not s3/s3-*.

🔧 Suggested fix
-bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) {
-  if (endpoint.find("amazonaws.com") == std::string::npos) {
-    return false;
-  }
-
-  auto schemeEnd = endpoint.find("://");
-  if (schemeEnd == std::string::npos) {
-    return false;
-  }
-
-  // Virtual-hosted has ".s3." in hostname (bucket.s3.region.amazonaws.com)
-  // Path-style starts with "s3." (s3.region.amazonaws.com)
-  auto hostPortion = endpoint.substr(schemeEnd + 3);
-  return hostPortion.find(".s3.") != std::string::npos;
-}
+bool isAwsVirtualHostedStyleEndpoint(const std::string& endpoint) {
+  // Allow endpoints with or without a scheme.
+  size_t hostStart = endpoint.find("://");
+  hostStart = (hostStart == std::string::npos) ? 0 : hostStart + 3;
+  size_t hostEnd = endpoint.find('/', hostStart);
+  auto hostPort = endpoint.substr(hostStart, hostEnd - hostStart);
+  auto host = hostPort.substr(0, hostPort.find(':'));
+
+  if (host.find("amazonaws.com") == std::string::npos) {
+    return false;
+  }
+
+  auto firstDot = host.find('.');
+  if (firstDot == std::string::npos) {
+    return false;
+  }
+  auto firstLabel = host.substr(0, firstDot);
+  return firstLabel.rfind("s3", 0) != 0;
+}
🤖 Prompt for AI Agents
In `@velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp` around lines 23
- 41, The isAwsVirtualHostedStyleEndpoint function currently requires a scheme
and the literal ".s3." which misses valid virtual-hosted endpoints and
scheme-less URIs; update it to robustly extract the hostname (strip optional
"scheme://" and optional ":port"), then split the hostname by '.' and classify
as virtual-hosted if the first label is NOT "s3" and does NOT start with "s3-"
(e.g., "s3-accelerate"), but still ensure the endpoint contains "amazonaws.com"
(or relevant AWS host suffix) before returning true; modify
isAwsVirtualHostedStyleEndpoint accordingly and ensure constructS3Url uses this
improved function to avoid appending the bucket twice.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
velox/connectors/clp/search_lib/ClpPackageS3AuthProvider.cpp (1)

35-59: Re-check endPoint_ after trimming trailing slashes.
If the config is / (or all slashes), pop_back() makes it empty and later URL construction produces a broken path. Trim all trailing slashes and validate again.

🛠️ Proposed fix
-  if ('/' == endPoint_.back()) {
-    endPoint_.pop_back();
-  }
+  while (!endPoint_.empty() && '/' == endPoint_.back()) {
+    endPoint_.pop_back();
+  }
+  VELOX_CHECK(!endPoint_.empty(), fmt::format("{} cannot be empty", kEndPoint));

Copy link

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR LGTM. PR title as well.

I think it's worth coming back to how the actual credentials are configured + passed on to clp-s to fit better with the rest of the package, but that can be improved in a different PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants