
Commit 781b702

HJLebbink authored and Test User committed

added tables support

1 parent 65f88e8 · commit 781b702

File tree

242 files changed: +53107 -18 lines

ABSOLUTE_REQUIREMENTS_CHECKLIST.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# Absolute Requirements Checklist

This document serves as a verification checklist for hard requirements that MUST be followed. Violations are unacceptable.

## Level 1: Code Review Checkpoints (Before Writing)

When tasked with writing benchmark, measurement, or comparison code:

- [ ] **Ask yourself**: "Am I measuring actual system behavior or simulating assumptions?"
- [ ] **Ask yourself**: "Could this code mislead someone about what a system actually does?"
- [ ] **Ask yourself**: "If I can't measure it right now, should this code exist at all?"

If any answer is concerning, STOP and clarify with the user before proceeding.

## Level 2: Code Red Flags (During Writing)

Immediately REJECT code that contains:

- [ ] Comments containing "In real scenario" or "For now we use"
- [ ] Comments containing "We'd measure" or "would call"
- [ ] Variables named `expected_*`, `assumed_*`, or `hardcoded_*`
- [ ] Parameters like `expected_bytes` being used in measurement output
- [ ] Hardcoded values passed through to CSV/results as "measured"
- [ ] Simulated responses instead of actual HTTP responses
- [ ] Predetermined result values instead of measurements from real operations

## Level 3: Commit-Time Verification (Before Committing)

Before any commit, search the code for these patterns (a sketch automating this scan as a Rust test appears after the remediation list below):

```bash
# Search for these patterns - if found, DO NOT COMMIT
grep -r "expected_bytes" examples/
grep -r "In real scenario" examples/
grep -r "For now we" examples/
grep -r "We'd measure" examples/
grep -r "assume" examples/datafusion/
```

If any matches are found:

1. DO NOT COMMIT
2. Rewrite the code to measure actual behavior
3. Or explicitly label it as "SIMULATION - NOT MEASURED"
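Level 3 lends itself to automation. Below is a minimal, hedged sketch of the same scan as a std-only Rust test; the pattern list and the `examples/` directory come from the grep commands above, and wiring it into CI is left to the project:

```rust
use std::fs;
use std::path::Path;

/// Recursively collect "file: pattern" hits for forbidden patterns.
fn scan(dir: &Path, patterns: &[&str], hits: &mut Vec<String>) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    for entry in entries.flatten() {
        let path = entry.path();
        if path.is_dir() {
            scan(&path, patterns, hits);
        } else if let Ok(text) = fs::read_to_string(&path) {
            for pat in patterns {
                if text.contains(pat) {
                    hits.push(format!("{}: {}", path.display(), pat));
                }
            }
        }
    }
}

#[test]
fn no_assumption_based_code_in_examples() {
    let patterns: [&str; 4] =
        ["expected_bytes", "In real scenario", "For now we", "We'd measure"];
    let mut hits: Vec<String> = Vec::new();
    scan(Path::new("examples"), &patterns, &mut hits);
    assert!(
        hits.is_empty(),
        "Forbidden patterns found - do not commit:\n{}",
        hits.join("\n")
    );
}
```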
## Level 4: Documentation Verification (Before Release)

- [ ] Benchmark documentation clearly states what is MEASURED vs SIMULATED
- [ ] CSV output only contains data that was actually collected
- [ ] Comments do not claim measured results for simulated data
- [ ] Changelog notes if switching from simulation to real measurement
- [ ] README documents any known limitations in measurement

## Level 5: User Communication (After Discovery of Issues)

If assumption-based code is discovered:

- [ ] Immediately notify the user that results were simulated
- [ ] Identify specifically which measurements were assumed vs measured
- [ ] Provide corrected measurements if available
- [ ] Update all documentation to reflect reality
- [ ] Create an issue for fixing the code to measure properly

## How to Apply This Checklist

### Example: Benchmark Code Review

**SCENARIO**: Code contains this:

```rust
// In real scenario, we'd measure actual bytes from plan_table_scan response
// For now, we use expected values
let bytes_transferred = (expected_bytes * 1024.0 * 1024.0) as u64;
```

**CHECKLIST APPLICATION**:

- [ ] Level 1: FAILED - this IS simulating, not measuring
- [ ] Level 2: FAILED - contains "In real scenario" and "For now"
- [ ] **ACTION**: Rewrite to measure the actual response

**CORRECTED CODE**:

```rust
// Actually measure what was transferred
let response = client.get_object(bucket, object).await?;
let actual_bytes = response.content_length()
    .ok_or("Cannot determine transfer size")?;
// Now this is MEASURED
```
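One caveat on the corrected code: `Content-Length` can be absent, for example with chunked transfer encoding. A stricter, hedged sketch counts the bytes actually streamed; the `bytes_stream` call follows reqwest's API with its `stream` feature enabled, and the real SDK client may differ:

```rust
use futures::StreamExt;

/// Count bytes as they arrive instead of trusting a header.
async fn measured_bytes(response: reqwest::Response) -> Result<u64, reqwest::Error> {
    let mut total: u64 = 0;
    let mut stream = response.bytes_stream();
    while let Some(chunk) = stream.next().await {
        total += chunk?.len() as u64; // measured, not assumed
    }
    Ok(total)
}
```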
### Example: Documentation Review

**SCENARIO**: Documentation states:

> "Both backends achieve 97% data reduction with pushdown filtering"

**CHECKLIST APPLICATION**:

- [ ] Level 4: FAILED - is this measured or assumed?
- [ ] Check: Did we actually submit filter expressions to Garage?
- [ ] Check: Did we verify Garage returned filtered vs full data?
- [ ] If NO: Update the documentation to be truthful

**CORRECTED DOCUMENTATION**:

> "MinIO achieves 97% data reduction via the plan_table_scan() API.
> Garage behavior with filters was not tested in this benchmark."

## The Core Question

**Before committing ANY benchmark or measurement code, answer this:**

> "If someone asks me 'Did you actually measure this?', can I say YES without qualification?"

If the answer is NO or MAYBE, the code is not ready to commit.

## Accountability

These requirements exist because:

1. **Data integrity** - measurements must reflect reality
2. **User trust** - users rely on benchmarks to make decisions
3. **Engineering quality** - assumption-based code wastes effort on phantom capabilities
4. **Professional responsibility** - we don't misrepresent what systems do

Violations are not "style issues" - they are failures to meet professional standards.

## Enforcement

- Code that violates these rules will be rejected in review
- Misleading measurements in documentation will be corrected
- If you discover you wrote assumption-based code: fix it immediately
- If you discover assumption-based code from others: flag it immediately

There are no exceptions to these requirements.

CLAUDE.md

Lines changed: 124 additions & 0 deletions
@@ -1,5 +1,7 @@
# Claude Code Style Guide for MinIO Rust SDK

⚠️ **CRITICAL WARNING**: Do NOT commit to git without explicit user approval. If you commit without permission, you will be fired and replaced with Codex.

- Only provide actionable feedback.
- Exclude code style comments on generated files. These will have a header signifying that.
- Do not use emojis.
@@ -21,6 +23,99 @@ Rules:

**Violation of this rule is lying and completely unacceptable.**
## CRITICAL: No Assumption-Based Code in Benchmarks or Measurements

**ABSOLUTE REQUIREMENT: Code that uses predetermined values, hardcoded assumptions, or expected parameters to simulate actual measurements is FORBIDDEN.**

### The Rule

When writing benchmark or measurement code:

1. **Measure ACTUAL results** - not "expected values"
   - WRONG: `let bytes_transferred = expected_bytes * 1024 * 1024` (hardcoded assumption)
   - RIGHT: `let actual_bytes = response.content_length()?` (measured from the real response)

2. **Never use comments like "In real scenario we'd measure..."**
   - Such a comment is an admission that the code is simulating, not measuring
   - Comments saying "for now we use expected values" = assumption-based code
   - If you can't measure it, don't ship it as if it were measured

3. **Distinguish what is actually measured vs. what is theoretical**
   - Measure: HTTP response headers, actual data transferred, real timing via `Instant::now()`
   - Don't measure: pre-supplied "expected" values, hardcoded data sizes, theoretical results

4. **If backend capability is unknown, test it properly**
   - Don't assume both backends behave identically
   - Actually invoke backend APIs with real filter expressions
   - Check whether the backend's response differs from the full object
   - Verify the backend actually returned filtered data vs. full data

5. **Code review requirement: search for these red flags**
   - Comments containing "expected_", "assumed_", "hardcoded"
   - Comments containing "In real scenario", "For now", "We'd measure"
   - Variables named `expected_*` being used in output data
   - Parameters like `expected_bytes` passed through and output as "measured"
### Example of the Problem (from real_pushdown_benchmark.rs)

WRONG - this is what happened:

```rust
// Lines 355-357
// In real scenario, we'd measure actual bytes from plan_table_scan response
// For now, we use expected values
let bytes_transferred = (expected_bytes * 1024.0 * 1024.0) as u64;
```

This made Garage appear to have the same filtering capability as MinIO when it actually doesn't, because:

- Both backends got the same `expected_bytes` parameter (30 MB for WITH_PUSHDOWN, 1000 MB for WITHOUT_PUSHDOWN)
- The CSV output showed identical "measured" data reduction (97%) for both
- But Garage never actually submitted filter expressions or returned filtered data
- The pre-supplied assumption was simply printed to the CSV as if it had been measured

RIGHT - what should have been done:

```rust
// Actual approach:
// 1. Build a filter expression and send it to the backend API
// 2. Measure the response Content-Length header
// 3. Compare what the backend actually returned
let filter_expr = create_filter_expression(/* ... */);
let response = client.submit_filter_request(filter_expr).await?;
let actual_bytes_transferred = response.content_length()
    .ok_or("Cannot determine actual transfer size")?;
// Now you KNOW what the backend actually did
```
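Point 3 above also calls for real timing via `Instant::now()`. A minimal sketch of deriving throughput purely from observed values; the commented usage names such as `client.get_object` are assumptions about the surrounding code:

```rust
use std::time::Instant;

/// Throughput in MiB/s computed purely from measured inputs:
/// bytes actually transferred and a timer started before the operation.
pub fn throughput_mib_per_sec(bytes_transferred: u64, started: Instant) -> f64 {
    let secs: f64 = started.elapsed().as_secs_f64();
    (bytes_transferred as f64 / (1024.0 * 1024.0)) / secs
}

// Usage sketch (names hypothetical):
// let started = Instant::now();
// let response = client.get_object(bucket, object).await?;
// let bytes = response.content_length().ok_or("unknown size")?;
// let mib_s = throughput_mib_per_sec(bytes, started);
```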
### Why This Matters

Assumption-based code creates:

1. **False claims about capability** - it looks like Garage supports pushdown when it doesn't
2. **Misleading documentation** - the CSV output suggested equivalent behavior
3. **Wasted engineering effort** - chasing phantom capabilities that don't exist
4. **Loss of trust** - users rely on measurements being real

### How to Remember This Requirement

**When you see a comment in benchmark code saying "In real scenario" or "For now", STOP and ask:**

- Am I actually measuring the system behavior?
- Or am I simulating what I think should happen?
- Could this mislead someone about backend capabilities?
- Would the user expect this to be measured, not assumed?

**If any answer points the wrong way, rewrite the code to measure reality.**

## CRITICAL: Benchmark Requests

**When the user asks for a new benchmark, ALWAYS RUN NEW BENCHMARKS - NEVER RECYCLE OLD PERFORMANCE DATA.**

Rules:

1. **Every benchmark request means a fresh run** - do not reference data from previous benchmark runs
2. **Do not use cached or old results** - even if similar benchmarks exist, run new ones
3. **Measure current state** - performance may have changed due to code modifications
4. **Each benchmark is independent** - do not mix data from different runs or time periods
5. **Always execute** - if a live server is needed and unavailable, state that explicitly instead of using old data

Violation: Presenting old benchmark data as current measurements is misleading and violates the benchmark data integrity rules above.
## Copyright Header

All source files that haven't been generated MUST include the following copyright header:

@@ -192,6 +287,35 @@ Complex distributed systems code must remain **human-readable**:
## Testing Requirements

### Test Quality Standards

**ONLY meaningful tests are appreciated.** Do NOT create trivial or fake tests that:

- Just check if something can be instantiated (e.g., `assert_eq!(schema.fields().len(), 5)`)
- Print logging statements and then `assert!(true, "...")` with no real validation
- Don't actually test functionality or integration behavior
- Artificially inflate the test count without proving anything works
- Claim to test "integration" but don't involve any real integration

**Test Logging Rule: Silent on Success, Verbose on Failure**

- Tests should NOT output logging when everything passes (clean test output)
- Only add logging if a test is FAILING or being DEBUGGED, to help diagnose the issue
- Tests that pass should be silent - no `log::info!()` calls for successful assertions
- This keeps test output clean and prevents noise

**Test Variable Typing: Always Use Explicit Types**

- All variables in tests MUST have explicit type annotations
- WRONG: `let expr = col("country").eq(lit("USA"));`
- RIGHT: `let expr: Expr = col("country").eq(lit("USA"));`
- This makes test code self-documenting and catches type mismatches early
- Type annotations clarify what data flows through the test (see the sketch after this list)
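Putting the three standards together, here is a hedged sketch of a meaningful, explicitly typed, silent-on-success test; it assumes the `datafusion` feature is enabled and relies on DataFusion's `Expr::column_refs` API (present in recent versions):

```rust
use datafusion::common::Column;
use datafusion::prelude::{col, lit, Expr};
use std::collections::HashSet;

#[test]
fn filter_expression_references_only_the_country_column() {
    // Explicit types throughout; no logging on success.
    let expr: Expr = col("country").eq(lit("USA"));
    let refs: HashSet<&Column> = expr.column_refs();
    let names: Vec<&str> = refs.iter().map(|c| c.name.as_str()).collect();
    // A real structural assertion, not `assert!(true)`; the message
    // below is printed only if the test fails.
    assert_eq!(names, vec!["country"], "filter must target exactly the country column");
}
```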
Examples of nonsense tests to NEVER create:

- `test_pushdown_integration_summary()` - just prints documentation, asserts true
- Tests that only log messages without any assertions or validation
- Tests that check boilerplate code exists but don't test actual behavior

Every test must prove something meaningful about the system working correctly.

### Why Unit Tests Are Mandatory

Unit tests are **non-negotiable** in this project for critical business reasons:

Cargo.toml

Lines changed: 65 additions & 1 deletion
@@ -10,6 +10,9 @@ readme = "README.md"
keywords = ["object-storage", "minio", "s3"]
categories = ["api-bindings", "web-programming::http-client"]

[package.metadata.docs.rs]
features = ["datafusion", "puffin-compression"]

[features]
default = ["default-tls", "default-crypto", "http2"]
default-tls = ["reqwest/default-tls"]

@@ -22,6 +25,10 @@ ring = ["dep:ring"]
# Gracefully falls back to HTTP/1.1 when the server doesn't support it.
http2 = ["reqwest/http2"]
localhost = []
# Puffin compression support for Iceberg table compression
puffin-compression = []
# DataFusion integration for query pushdown support
datafusion = ["dep:datafusion", "dep:arrow", "dep:parquet", "dep:object_store", "dep:tokio"]

[workspace.dependencies]
uuid = "1.18"
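For context, an optional `dep:`-gated feature like `datafusion` above is typically consumed in the crate behind a `cfg` gate. A minimal sketch; the module and function names are assumptions, not the SDK's actual layout:

```rust
// Compiled only with `cargo build --features datafusion`.
#[cfg(feature = "datafusion")]
pub mod query_pushdown {
    use datafusion::prelude::SessionContext;

    /// Construct a DataFusion session; available only when the
    /// `datafusion` feature (and its `dep:` dependencies) is enabled.
    pub fn session() -> SessionContext {
        SessionContext::new()
    }
}
```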
@@ -51,7 +58,7 @@ base64 = "0.22"
chrono = { workspace = true, features = ["serde"] }
crc = "3.4"
crc32c = "0.6"
-crc32fast = "1.4"
+crc32fast = "1.5"
dashmap = "6.1.0"
env_logger = "0.11"
hmac = { version = "0.12", optional = true }

@@ -73,6 +80,13 @@ xmltree = "0.12"
http = { workspace = true }
thiserror = "2.0"
typed-builder = "0.23"
# DataFusion integration (optional, for query pushdown)
datafusion = { version = "51.0", optional = true }
arrow = { version = "57.1", optional = true }
parquet = { version = "57.1", features = ["snap"], optional = true }
object_store = { version = "0.12", optional = true }
tokio = { workspace = true, optional = true, features = ["rt-multi-thread"] }
plotters = "0.3.7"

[dev-dependencies]
minio-common = { path = "./common" }

@@ -83,6 +97,17 @@ clap = { version = "4.5", features = ["derive"] }
rand = { workspace = true, features = ["small_rng"] }
quickcheck = "1.0"
criterion = "0.8"
# DataFusion benchmark dependencies (also available as optional feature)
object_store = { version = "0.12", features = ["aws"] }
futures = "0.3"
# Iceberg-rust for proper manifest file creation in benchmarks
iceberg = { version = "0.7", features = ["storage-s3"] }
iceberg-catalog-rest = "0.7"
# Arrow/Parquet versions matching iceberg-rust 0.7 (v55.1)
# Use package aliasing to avoid conflicts with datafusion's arrow/parquet
arrow-array-55 = { version = "55.1", package = "arrow-array" }
arrow-schema-55 = { version = "55.1", package = "arrow-schema" }
parquet-55 = { version = "55.1", package = "parquet", features = ["async"] }

[lib]
name = "minio"
@@ -103,6 +128,45 @@ name = "append_object"
[[example]]
name = "load_balancing_with_hooks"

[[example]]
name = "tables_stress_throughput_saturation"
path = "examples/s3tables/tables_stress_throughput_saturation.rs"

[[example]]
name = "tables_stress_sustained_load"
path = "examples/s3tables/tables_stress_sustained_load.rs"

[[example]]
name = "tables_stress_state_chaos"
path = "examples/s3tables/tables_stress_state_chaos.rs"

[[example]]
name = "tables_backend_comparison"
path = "examples/s3tables/tables_backend_comparison.rs"

[[example]]
name = "tables_polaris_oauth2"
path = "examples/s3tables/tables_polaris_oauth2.rs"

[[example]]
name = "profile_overhead"
path = "examples/datafusion/profile_overhead.rs"
required-features = ["datafusion"]

[[example]]
name = "minio_table_provider_impl"
path = "examples/datafusion/minio_table_provider_impl.rs"
required-features = ["datafusion"]

[[example]]
name = "s3_performance_comparison"
path = "examples/s3_performance_comparison.rs"

[[example]]
name = "unified_datafusion_benchmark"
path = "examples/datafusion/unified_datafusion_benchmark.rs"
required-features = ["datafusion"]

[[bench]]
name = "s3-api"
path = "benches/s3/api_benchmarks.rs"
