Skip to content

Fixed s3 decoding#1516

Open
subkanthi wants to merge 8 commits intoantalya-26.1from
1338_fix_s3_key_decoding
Open

Fixed s3 decoding#1516
subkanthi wants to merge 8 commits intoantalya-26.1from
1338_fix_s3_key_decoding

Conversation

@subkanthi
Copy link
Copy Markdown
Collaborator

@subkanthi subkanthi commented Mar 12, 2026

closes: #1348

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Use key (parsed without URI decoding) so that percent-encoded characters in object keys are preserved. Fixes issues where there retrieving data returns a 404 when there are slashes in s3 path.

Documentation entry for user-facing changes

Use key (parsed without URI decoding) so that percent-encoded characters in object keys are preserved. Fixes issues where there retrieving data returns a 404 when there are slashes in s3 path.

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 12, 2026

Workflow [PR], commit [5caa8cf]

Signed-off-by: Kanthi Subramanian <subkanthi@gmail.com>
@subkanthi subkanthi force-pushed the 1338_fix_s3_key_decoding branch from b5bde22 to 2908d6d Compare March 12, 2026 15:41
@subkanthi subkanthi marked this pull request as ready for review March 14, 2026 20:45
@ianton-ru
Copy link
Copy Markdown

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9bcbbe60be

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


// Use key (parsed without URI decoding) so that percent-encoded
// characters in object keys (e.g. %2F in Iceberg partition paths) are preserved.
std::string key_to_use = target_decomposed.key;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Parse object key with S3::URI for HTTP URLs

Deriving key_to_use from target_decomposed.key regresses path-style S3 locations like https://s3.amazonaws.com/<bucket>/<key> (and MinIO-style http(s)://endpoint/<bucket>/<key>): SchemeAuthorityKey leaves the bucket in the key, while bucket selection is already taken from s3_uri.bucket. That makes reads use a bucket-prefixed key against a storage already scoped to that bucket, so objects are looked up under the wrong path and can return 404 for valid files.

Useful? React with 👍 / 👎.

@alsugiliazova
Copy link
Copy Markdown
Member

@alsugiliazova
Copy link
Copy Markdown
Member

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1516 (S3 key decoding in resolveObjectStorageForPath):

Confirmed defects

Medium: Path-style HTTP(S) S3 URLs build an incorrect object key (bucket/key instead of key)

  • Impact: Reads can return 404 for valid objects when metadata paths use path-style URLs (https://endpoint/<bucket>/<key>), because lookup is performed with a bucket-prefixed key against bucket-scoped storage.
  • Anchor: src/Storages/ObjectStorage/Utils.cpp / resolveObjectStorageForPath (key_to_use assignment).
  • Trigger: Iceberg metadata path like https://s3.amazonaws.com/my-bucket/partition=us%2Fwest/file.parquet (or MinIO equivalent), with storage resolved through this branch.
  • Why defect: For path-style URLs, SchemeAuthorityKey keeps <bucket>/<key> in .key, while S3::URI correctly splits bucket and key. Using target_decomposed.key reintroduces the bucket into the key.
  • Smallest logical repro:
    1. Provide absolute path-style URL in metadata: https://<endpoint>/<bucket>/<object>.
    2. Enter resolveObjectStorageForPath S3 branch.
    3. key_to_use = target_decomposed.key becomes <bucket>/<object>.
    4. Storage already scoped to <bucket> receives key <bucket>/<object>, causing wrong lookup path and potential 404.
  • Transition mapping: absolute metadata path -> URI decomposition -> key derivation -> storage/key resolution -> read request; defect occurs at key derivation transition.
  • Logical fault-injection mapping: Inject path-style absolute URL with encoded partition segment (%2F) and bucket-scoped storage; observed behavior is bucket duplication in key.
  • Fix direction (short): Derive key_to_use from S3::URI (s3_uri.key) for HTTP(S) path-style parsing path.
  • Regression test direction (short): Add integration test with explicit path-style URL (https://endpoint/<bucket>/<key-with-%2F>) and assert successful read.
  • Affected subsystem and blast radius: Iceberg/ObjectStorage path resolution for S3-compatible endpoints; impacts reads that rely on path-style absolute URLs in metadata.

Code evidence:

// src/Storages/ObjectStorage/Utils.cpp (PR change)
S3::URI s3_uri(normalized_path);

// Use key (parsed without URI decoding) so that percent-encoded
// characters in object keys (e.g. %2F in Iceberg partition paths) are preserved.
std::string key_to_use = target_decomposed.key;
// src/Storages/ObjectStorage/Utils.cpp
// https://host/bucket/key => authority="host", key="bucket/key"
authority = std::string(rest.substr(0, slash));
key = std::string(rest.substr(++slash));
// src/IO/S3/URI.cpp (path-style extraction)
else if (re2::RE2::PartialMatch(uri.getPath(), path_style_pattern, &bucket, &key))
{
    is_virtual_hosted_style = false;
    endpoint = uri.getScheme() + "://" + uri.getAuthority();
}

Coverage summary

  • Scope reviewed: Full PR delta (src/Storages/ObjectStorage/Utils.cpp, tests/integration/test_database_iceberg/test.py) and dependent URI decomposition/parsing code in SchemeAuthorityKey and S3::URI.
  • Call-graph nodes covered: resolveObjectStorageForPath S3 branch, scheme normalization, base/secondary storage selection, cache-key derivation, endpoint construction, credentials propagation gate, path parsing helpers.
  • Transitions reviewed: entry path classification, URI normalization (s3a/s3n/gcs remap), key derivation, base-storage match, secondary storage creation, request key handoff.
  • Categories failed: URI/key decoding consistency for HTTP(S) path-style S3 URLs.
  • Categories passed: percent-encoding preservation intent (%2F), storage reuse/cache behavior, error-contract consistency for same logical URI style, concurrency/locking safety in touched path, rollback/partial-update safety (no in-function persistent mutation).
  • Fault-category completion matrix:
    • URI parsing and normalization mismatch: Executed -> Fail (1 defect).
    • Encoded character preservation (%2F, %20, query handling interactions): Executed -> Pass (no additional confirmed defect in reviewed path).
    • Storage selection and endpoint equivalence: Executed -> Pass.
    • Credential propagation and cross-endpoint behavior: Executed -> Pass in reviewed logic.
    • Concurrency/interleaving on shared storage cache map: Executed -> Pass (mutex-protected critical section observed in touched flow).
    • Exception/partial-update behavior: Executed -> Pass (no partial persistent updates in this path).
    • C++ bug-type sweep (lifetime, invalidation, race, UB, overflow, leaks): Executed -> Pass for reviewed scope.
  • Assumptions/limits: Static source audit only; no runtime execution against PR branch in this report.

Signed-off-by: Kanthi Subramanian <subkanthi@gmail.com>
@subkanthi subkanthi force-pushed the 1338_fix_s3_key_decoding branch from 29c2450 to e8d4d20 Compare March 26, 2026 14:20
@subkanthi
Copy link
Copy Markdown
Collaborator Author

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1516 (S3 key decoding in resolveObjectStorageForPath):

Confirmed defects

Medium: Path-style HTTP(S) S3 URLs build an incorrect object key (bucket/key instead of key)

  • Impact: Reads can return 404 for valid objects when metadata paths use path-style URLs (https://endpoint/<bucket>/<key>), because lookup is performed with a bucket-prefixed key against bucket-scoped storage.

  • Anchor: src/Storages/ObjectStorage/Utils.cpp / resolveObjectStorageForPath (key_to_use assignment).

  • Trigger: Iceberg metadata path like https://s3.amazonaws.com/my-bucket/partition=us%2Fwest/file.parquet (or MinIO equivalent), with storage resolved through this branch.

  • Why defect: For path-style URLs, SchemeAuthorityKey keeps <bucket>/<key> in .key, while S3::URI correctly splits bucket and key. Using target_decomposed.key reintroduces the bucket into the key.

  • Smallest logical repro:

    1. Provide absolute path-style URL in metadata: https://<endpoint>/<bucket>/<object>.
    2. Enter resolveObjectStorageForPath S3 branch.
    3. key_to_use = target_decomposed.key becomes <bucket>/<object>.
    4. Storage already scoped to <bucket> receives key <bucket>/<object>, causing wrong lookup path and potential 404.
  • Transition mapping: absolute metadata path -> URI decomposition -> key derivation -> storage/key resolution -> read request; defect occurs at key derivation transition.

  • Logical fault-injection mapping: Inject path-style absolute URL with encoded partition segment (%2F) and bucket-scoped storage; observed behavior is bucket duplication in key.

  • Fix direction (short): Derive key_to_use from S3::URI (s3_uri.key) for HTTP(S) path-style parsing path.

  • Regression test direction (short): Add integration test with explicit path-style URL (https://endpoint/<bucket>/<key-with-%2F>) and assert successful read.

  • Affected subsystem and blast radius: Iceberg/ObjectStorage path resolution for S3-compatible endpoints; impacts reads that rely on path-style absolute URLs in metadata.

Code evidence:

// src/Storages/ObjectStorage/Utils.cpp (PR change)
S3::URI s3_uri(normalized_path);

// Use key (parsed without URI decoding) so that percent-encoded
// characters in object keys (e.g. %2F in Iceberg partition paths) are preserved.
std::string key_to_use = target_decomposed.key;
// src/Storages/ObjectStorage/Utils.cpp
// https://host/bucket/key => authority="host", key="bucket/key"
authority = std::string(rest.substr(0, slash));
key = std::string(rest.substr(++slash));
// src/IO/S3/URI.cpp (path-style extraction)
else if (re2::RE2::PartialMatch(uri.getPath(), path_style_pattern, &bucket, &key))
{
    is_virtual_hosted_style = false;
    endpoint = uri.getScheme() + "://" + uri.getAuthority();
}

Coverage summary

  • Scope reviewed: Full PR delta (src/Storages/ObjectStorage/Utils.cpp, tests/integration/test_database_iceberg/test.py) and dependent URI decomposition/parsing code in SchemeAuthorityKey and S3::URI.

  • Call-graph nodes covered: resolveObjectStorageForPath S3 branch, scheme normalization, base/secondary storage selection, cache-key derivation, endpoint construction, credentials propagation gate, path parsing helpers.

  • Transitions reviewed: entry path classification, URI normalization (s3a/s3n/gcs remap), key derivation, base-storage match, secondary storage creation, request key handoff.

  • Categories failed: URI/key decoding consistency for HTTP(S) path-style S3 URLs.

  • Categories passed: percent-encoding preservation intent (%2F), storage reuse/cache behavior, error-contract consistency for same logical URI style, concurrency/locking safety in touched path, rollback/partial-update safety (no in-function persistent mutation).

  • Fault-category completion matrix:

    • URI parsing and normalization mismatch: Executed -> Fail (1 defect).
    • Encoded character preservation (%2F, %20, query handling interactions): Executed -> Pass (no additional confirmed defect in reviewed path).
    • Storage selection and endpoint equivalence: Executed -> Pass.
    • Credential propagation and cross-endpoint behavior: Executed -> Pass in reviewed logic.
    • Concurrency/interleaving on shared storage cache map: Executed -> Pass (mutex-protected critical section observed in touched flow).
    • Exception/partial-update behavior: Executed -> Pass (no partial persistent updates in this path).
    • C++ bug-type sweep (lifetime, invalidation, race, UB, overflow, leaks): Executed -> Pass for reviewed scope.
  • Assumptions/limits: Static source audit only; no runtime execution against PR branch in this report.

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1516 (Fixed s3 decoding):

Confirmed defects:

No confirmed defects in reviewed scope.

Coverage summary:

Scope reviewed: src/Storages/ObjectStorage/Utils.cpp (resolveObjectStorageForPath S3 branch), src/Storages/ObjectStorage/tests/gtest_scheme_authority_key.cpp, and tests/integration/test_database_iceberg/test.py additions in PR diff.
Categories failed: None confirmed.
Categories passed: URI parsing transition correctness (s3/virtual-hosted/path-style), key-derivation invariants (bucket not duplicated; %2F preserved), error-path safety for guarded substr, test coverage for key scenarios, C++ bug-type checks in changed code (lifetime/race/UB/overflow not implicated by this delta).
Assumptions/limits: Static audit only (no full runtime execution of unit/integration suites in this pass).

{
struct ClientFake : DB::S3::Client
{
explicit ClientFake()
Copy link
Copy Markdown
Collaborator Author

@subkanthi subkanthi Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to gtest_readbuffer_s3.cpp

// because the bucket lives in the URL path, not the hostname.
// Strip the bucket prefix so the key is relative to the bucket.
if (!s3_uri.is_virtual_hosted_style
&& key_to_use.starts_with(s3_uri.bucket + "/"))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this work correctly when path really starts with bucket name? Like s3://bucket/bucket/....?

Copy link
Copy Markdown
Collaborator Author

@subkanthi subkanthi Mar 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this s3://bucket/bucket/.... the s3::URI should set is_virtual_hosted_style to true and this if block should be skipped.
Added gtest https://github.com/Altinity/ClickHouse/pull/1516/changes#diff-a4536876f73836ceeb9f561d2a1ac710fc377262dfbfb9840cd395f948162c10R135

@alsugiliazova
Copy link
Copy Markdown
Member

@alsugiliazova
Copy link
Copy Markdown
Member

alsugiliazova commented Mar 26, 2026

PR #1516 CI Verification Report

CI Results Overview

Category Count
Success 55
Failure 1 (Stateless targeted)
Skipped 40
Total (result_pr.json) 96

PR's New Test Validation

Unit Tests (gtest) — All 5 Passed

All 5 new ResolveObjectStorageForPathKey unit tests passed in the unit tests (asan) job:

  • S3SchemeKeyNoBucketInPath — OK
  • VirtualHostedHttpsKeyNoBucket — OK
  • PathStyleHttpsBucketPrefixStripped — OK
  • PathStyleMinioPreservesPercentEncodedSlash — OK
  • S3SchemePreservesPercentEncodedSlash — OK

Integration Test — Passed in 2 Configurations

The new test_partition_value_with_slash was executed and passed in both integration test configurations:

Job Shard Result
Integration tests (amd_binary, 3/5) 3/5 OK
Integration tests (amd_asan, db disk, old analyzer, 6/6) 6/6 OK

The existing test_table_with_slash also passed in both configurations.

CI Failures

1. Stateless tests (arm_asan, targeted) — Known Flaky, Not PR-Related

Job: Stateless tests (arm_asan, targeted)

Two tests failed:

Test Upstream Status (30 days) Assessment
03321_clickhouse_local_initialization_not_too_slow_even_under_sanitizers 61,842 OK, 2 FAIL Rare flaky (0.003% fail rate) — timing-sensitive under sanitizers
03396_s3_do_not_retry_dns_errors 60,903 OK, 0 FAIL upstream Possibly environment-specific (DNS resolution in CI runner)

Neither test is related to S3 key decoding or Iceberg partition handling. The PR does not modify any stateless test files.

Related to PR: No — Pre-existing flaky tests in targeted runner

2. GrypeScan (-alpine) — CVE in Base Image

CVE-2026-2673 (High) in Alpine base image OpenSSL packages. Same failure across all PRs on antalya-26.1.

Related to PR: No — Base image vulnerability

Conclusion

Verdict: Ready to merge — No PR-related failures.

@subkanthi subkanthi requested a review from ianton-ru March 27, 2026 03:03
@alsugiliazova alsugiliazova added the verified Verified by QA label Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants