Commit 17e9993

Fix chDB v4 datastore docs: add performance mode and improve content
- Add performance-mode.md configuration doc
- Fix image reference in datastore overview
- Enhance execution engine and configuration docs
- Improve pandas differences and performance guides
1 parent ce03f6a commit 17e9993

File tree

6 files changed (+368 additions, −16 deletions)


docs/chdb/configuration/execution-engine.md

Lines changed: 8 additions & 0 deletions
````diff
@@ -297,4 +297,12 @@ config.use_chdb()
 
 # Filter early to reduce data size
 result = ds.filter(ds['date'] >= '2024-01-01').to_df()
+
+# For maximum throughput on large datasets, use performance mode
+# which enables parallel Parquet reading and single-SQL aggregation
+config.use_performance_mode()
 ```
+
+:::tip Performance Mode
+If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using [Performance Mode](performance-mode.md). It automatically sets the engine to `chdb` and removes all pandas compatibility overhead.
+:::
````

docs/chdb/configuration/index.md

Lines changed: 36 additions & 6 deletions
````diff
@@ -9,19 +9,21 @@ doc_type: 'reference'
 
 # DataStore Configuration
 
-DataStore provides comprehensive configuration options for execution engine selection, logging, caching, profiling, and dtype correction.
+DataStore provides comprehensive configuration options for execution engine selection, compatibility mode, logging, caching, profiling, and dtype correction.
 
 ## Quick Reference {#quick-reference}
 
 ```python
 from chdb.datastore.config import config
 
 # Quick setup presets
-config.enable_debug()       # Enable verbose logging
-config.use_chdb()           # Force ClickHouse engine
-config.use_pandas()         # Force pandas engine
-config.use_auto()           # Auto-select engine (default)
-config.enable_profiling()   # Enable performance profiling
+config.enable_debug()           # Enable verbose logging
+config.use_chdb()               # Force ClickHouse engine
+config.use_pandas()             # Force pandas engine
+config.use_auto()               # Auto-select engine (default)
+config.use_performance_mode()   # SQL-first, max throughput
+config.use_pandas_compat()      # Full pandas compatibility (default)
+config.enable_profiling()       # Enable performance profiling
 ```
 
 ## All Configuration Options {#all-options}
@@ -34,6 +36,7 @@ config.enable_profiling()   # Enable performance profiling
 | | `cache_ttl` | float (seconds) | 0.0 | Cache time-to-live |
 | **Engine** | `execution_engine` | "auto", "chdb", "pandas" | "auto" | Execution engine |
 | | `cross_datastore_engine` | "auto", "chdb", "pandas" | "auto" | Cross-DataStore operations |
+| **Compat** | `compat_mode` | "pandas", "performance" | "pandas" | Pandas compatibility vs SQL-first throughput |
 | **Profiling** | `profiling_enabled` | True/False | False | Enable profiling |
 | **Dtype** | `correction_level` | NONE/CRITICAL/HIGH/MEDIUM/ALL | HIGH | Dtype correction level |
 
@@ -103,6 +106,23 @@ print(config.execution_engine)
 
 See [Execution Engine](execution-engine.md) for details.
 
+### Compatibility Mode {#compat-mode}
+
+```python
+# Performance mode: SQL-first, no pandas compatibility overhead
+config.use_performance_mode()
+# or: config.set_compat_mode('performance')
+
+# Pandas compatibility mode (default)
+config.use_pandas_compat()
+# or: config.set_compat_mode('pandas')
+
+# Check current mode
+print(config.compat_mode)  # 'pandas' or 'performance'
+```
+
+See [Performance Mode](performance-mode.md) for details.
+
 ### Profiling Configuration {#profiling}
 
 ```python
@@ -209,6 +229,15 @@ config.set_cache_enabled(True) # Enable caching
 config.set_profiling_enabled(False) # Disable profiling overhead
 ```
 
+### Maximum Throughput {#max-throughput-config}
+
+```python
+from chdb.datastore.config import config
+
+config.use_performance_mode()  # SQL-first, no pandas overhead
+config.set_cache_enabled(False)  # Disable cache for streaming
+```
+
 ### Performance Testing {#perf-config}
 
 ```python
@@ -233,6 +262,7 @@ config.enable_debug() # See what operations are used
 ## Related Documentation {#related}
 
 - [Execution Engine](execution-engine.md) - Engine selection details
+- [Performance Mode](performance-mode.md) - SQL-first mode for maximum throughput
 - [Function Config](function-config.md) - Per-function engine configuration
 - [Logging](../debugging/logging.md) - Logging configuration
 - [Profiling](../debugging/profiling.md) - Performance profiling
````
docs/chdb/configuration/performance-mode.md

Lines changed: 285 additions & 0 deletions (new file)
---
title: 'Performance Mode (compat_mode)'
sidebar_label: 'Performance Mode'
slug: /chdb/configuration/performance-mode
description: 'SQL-first performance mode that disables pandas compatibility overhead for maximum throughput'
keywords: ['chdb', 'datastore', 'performance', 'mode', 'compat', 'sql-first', 'optimization']
doc_type: 'guide'
---

# Performance Mode

DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.

## Overview {#overview}

| Mode | `compat_mode` value | Description |
|------|---------------------|-------------|
| **Pandas** (default) | `"pandas"` | Full pandas behavior compatibility. Row order preserved, MultiIndex, set_index, dtype corrections, stable sort tiebreakers, `-If`/`isNaN` wrappers. |
| **Performance** | `"performance"` | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |
### What Performance Mode Disables {#what-it-disables}

| Overhead | Pandas mode behavior | Performance mode behavior |
|----------|---------------------|--------------------------|
| **Row-order preservation** | `_row_id` injection, `rowNumberInAllBlocks()`, `__orig_row_num__` subqueries | Disabled — row order not guaranteed |
| **Stable sort tiebreaker** | `rowNumberInAllBlocks() ASC` appended to ORDER BY | Disabled — ties may have arbitrary order |
| **Parquet preserve_order** | `input_format_parquet_preserve_order=1` | Disabled — parallel Parquet reading allowed |
| **GroupBy auto ORDER BY** | `ORDER BY group_key` added (pandas default `sort=True`) | Disabled — groups returned in arbitrary order |
| **GroupBy dropna WHERE** | `WHERE key IS NOT NULL` added (pandas default `dropna=True`) | Disabled — NULL groups included |
| **GroupBy set_index** | Group keys set as index | Disabled — group keys stay as columns |
| **MultiIndex columns** | `agg({'col': ['sum','mean']})` returns MultiIndex columns | Disabled — flat column names (`col_sum`, `col_mean`) |
| **`-If`/`isNaN` wrappers** | `sumIf(col, NOT isNaN(col))` for skipna | Disabled — plain `sum(col)` (ClickHouse natively skips NULL) |
| **`toInt64` on count** | `toInt64(count())` to match pandas int64 | Disabled — native SQL dtype returned |
| **`fillna(0)` for all-NaN sum** | Sum of all-NaN returns 0 (pandas behavior) | Disabled — returns NULL |
| **Dtype corrections** | `abs()` unsigned→signed, etc. | Disabled — native SQL dtypes |
| **Index preservation** | Restores original index after SQL execution | Disabled |
| **`first()`/`last()`** | `argMin/argMax(col, rowNumberInAllBlocks())` | `any(col)` / `anyLast(col)` — faster but non-deterministic |
| **Single-SQL aggregation** | ColumnExpr groupby materializes intermediate DataFrame | Injects `LazyGroupByAgg` into lazy ops chain — single SQL query |
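The MultiIndex row in the table above can be illustrated in plain pandas (this sketch uses pandas only, not DataStore): multi-aggregation produces MultiIndex columns in pandas mode, while performance mode returns flat `col_func` names, equivalent to joining the tuple levels:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

# Pandas-mode shape: multi-aggregation yields MultiIndex columns
multi = df.groupby("cat").agg({"val": ["sum", "mean"]})
print(list(multi.columns))  # [('val', 'sum'), ('val', 'mean')]

# Performance-mode shape: flat names such as val_sum / val_mean,
# equivalent to joining the MultiIndex levels with "_"
flat_names = ["_".join(c) for c in multi.columns]
print(flat_names)  # ['val_sum', 'val_mean']
```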
---

## Enabling Performance Mode {#enabling}

### Using config object {#using-config}

```python
from chdb.datastore.config import config

# Enable performance mode
config.use_performance_mode()

# Back to pandas compatibility
config.use_pandas_compat()

# Check current mode
print(config.compat_mode)  # 'pandas' or 'performance'
```

### Using module-level functions {#using-functions}

```python
from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode

# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)

# Check
print(is_performance_mode())  # True

# Back to default
set_compat_mode(CompatMode.PANDAS)
```

### Using convenience imports {#using-imports}

```python
from chdb import use_performance_mode, use_pandas_compat

use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
```

:::note
Setting performance mode automatically sets the execution engine to `chdb`. You do not need to call `config.use_chdb()` separately.
:::

---
## When to Use Performance Mode {#when-to-use}

**Use performance mode when:**
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)

**Stay in pandas mode when:**
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on `first()`/`last()` returning the true first/last row
- You use `shift()`, `diff()`, `cumsum()` that depend on row order
- You're writing tests that compare DataStore output with pandas
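The point about order-dependent operations can be seen in plain pandas (shown here without DataStore): `cumsum()` gives different per-row results when the same values arrive in a different order, which is why such operations need pandas mode's row-order guarantee:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Same values, different arrival order -> different running sums
print(s.cumsum().tolist())              # [1, 3, 6]
print(s.iloc[::-1].cumsum().tolist())   # [3, 5, 6]
```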
---

## Behavior Differences {#behavior-differences}

### Row Order {#row-order}

In performance mode, row order is **not guaranteed** for any operation. This includes:

- Filter results
- GroupBy aggregation results
- `head()` / `tail()` without explicit `sort_values()`
- `first()` / `last()` aggregations

If you need ordered results, add an explicit `sort_values()`:

```python
config.use_performance_mode()

ds = pd.read_csv("data.csv")

# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()

# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
```

### GroupBy Results {#groupby-results}

| Aspect | Pandas mode | Performance mode |
|--------|------------|-----------------|
| Group key location | Index (via `set_index`) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default `dropna=True`) | Included |
| Column format | MultiIndex for multi-agg | Flat names (`col_func`) |
| `first()`/`last()` | Deterministic (row order) | Non-deterministic (`any()`/`anyLast()`) |

### Aggregation {#aggregation}

```python
config.use_performance_mode()

# Sum of all-NaN group returns NULL (not 0)
# Count returns native uint64 (not forced int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
```
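The all-NaN case in the comments above is the pandas-mode behavior being emulated; in plain pandas (this sketch does not use DataStore), a group whose values are all NaN sums to 0.0, whereas performance mode returns the SQL result, NULL:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [np.nan, np.nan, 1.0]})

sums = df.groupby("cat")["val"].sum()
print(sums.loc["a"])  # 0.0 -- pandas fills the all-NaN group with 0
print(sums.loc["b"])  # 1.0
```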
### Single-SQL Execution {#single-sql}

In performance mode, `ColumnExpr` groupby aggregation (e.g., `ds[condition].groupby('col')['val'].sum()`) is executed as a **single SQL query** instead of the two-step process used in pandas mode:

```python
config.use_performance_mode()

# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()

# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
```

This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
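The equivalence of the fused query and the two-step process can be sketched outside chDB; this illustration uses SQLite (not ClickHouse) purely to show that one WHERE + GROUP BY query matches filter-then-group:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "rating": [4.0, 2.0, 5.0],
                   "revenue": [10.0, 20.0, 30.0]})

con = sqlite3.connect(":memory:")
df.to_sql("data", con, index=False)

# Single fused query (SQLite stand-in for the ClickHouse SQL above)
sql = ("SELECT category, sum(revenue) AS revenue FROM data "
       "WHERE rating > 3.5 GROUP BY category")
via_sql = pd.read_sql(sql, con).set_index("category")["revenue"]

# Two-step pandas equivalent: filter, materialize, then group
via_pandas = df[df["rating"] > 3.5].groupby("category")["revenue"].sum()

print(via_sql.sort_index().equals(via_pandas.sort_index()))  # True
```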
---

## Comparison with Execution Engine {#vs-execution-engine}

Performance mode (`compat_mode`) and execution engine (`execution_engine`) are **independent configuration axes**:

| Config | Controls | Values |
|--------|----------|--------|
| `execution_engine` | **Which engine** runs the computation | `auto`, `chdb`, `pandas` |
| `compat_mode` | **Whether** to reshape output for pandas compatibility | `pandas`, `performance` |

Setting `compat_mode='performance'` automatically sets `execution_engine='chdb'`, since performance mode is designed for SQL execution.

```python
from chdb.datastore.config import config

# These are independent
config.use_chdb()              # Force chDB engine, keep pandas compat
config.use_performance_mode()  # Force chDB + remove pandas overhead
```

---

## Testing with Performance Mode {#testing}

When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:

### Sort-then-compare (aggregations, filters) {#sort-then-compare}

```python
# Sort both sides by the same columns before comparing
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()

ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)
```
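A self-contained version of this strategy (plain pandas stands in for both sides here, purely for illustration):

```python
import numpy as np
import pandas as pd

# Two logically equal aggregation results whose row order differs,
# as can happen between performance mode and pandas
left = pd.Series([30.0, 10.0], index=["b", "a"], name="val")
right = pd.Series([10.0, 30.0], index=["a", "b"], name="val")

# A direct positional comparison would fail; sort both sides first
np.testing.assert_array_equal(left.sort_index().values,
                              right.sort_index().values)
print("match")
```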
### Value-range check (first/last) {#value-range-check}

```python
# first() with any() returns an arbitrary element from the group,
# so assert membership in the group rather than an exact value.
# group_values maps each group key to the set of values in that group.
result = ds.groupby("cat")["val"].first()
for group_key, values in group_values.items():
    assert result.loc[group_key] in values
```
### Schema-and-count (LIMIT without ORDER BY) {#schema-and-count}

```python
# head() without sort_values: row set is non-deterministic
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
```

---

## Best Practices {#best-practices}

### 1. Enable early in your script {#enable-early}

```python
from chdb.datastore.config import config

config.use_performance_mode()

# All subsequent operations benefit
ds = pd.read_parquet("data.parquet")
result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()
```

### 2. Add explicit sorting when order matters {#explicit-sort}

```python
# For display or downstream processing that expects order
result = (ds
    .groupby("region")["revenue"].sum()
    .sort_values(ascending=False)
)
```

### 3. Use for batch/ETL workloads {#batch-etl}

```python
config.use_performance_mode()

# ETL pipeline — order doesn't matter, throughput does
summary = (ds
    .filter(ds["date"] >= "2024-01-01")
    .groupby(["region", "product"])
    .agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
)
summary.to_df().to_parquet("summary.parquet")
```

### 4. Switch modes within a session {#switch-modes}

```python
# Performance mode for heavy computation
config.use_performance_mode()
aggregated = ds.groupby("cat")["val"].sum()

# Back to pandas mode for exact-match comparison
config.use_pandas_compat()
detailed = ds[ds["val"] > 100].head(10)
```

---

## Related Documentation {#related}

- [Execution Engine](execution-engine.md) — Engine selection (auto/chdb/pandas)
- [Performance Guide](../guides/pandas-performance.md) — General optimization tips
- [Key Differences from pandas](../guides/pandas-differences.md) — Behavioral differences

docs/chdb/datastore/index.md

Lines changed: 2 additions & 1 deletion
````diff
@@ -22,7 +22,7 @@ DataStore is chDB's pandas-compatible API that combines the familiar pandas Data
 ## Architecture {#architecture}
 
 <div style={{textAlign: 'center'}}>
-  <img src={require('../images/datastore_architecture.png').default} alt="DataStore Architecture" style={{maxWidth: '700px', width: '100%'}} />
+  <img src="../images/datastore_architecture.png" alt="DataStore Architecture" style={{maxWidth: '700px', width: '100%'}} />
 </div>
 
 DataStore uses **lazy evaluation** with **dual-engine execution**:
@@ -123,6 +123,7 @@ DataStore delivers significant performance improvements over pandas, especially
 
 ### Configuration & Debugging {#configuration-debugging}
 - [Configuration](../configuration/index.md) - All configuration options
+- [Performance Mode](../configuration/performance-mode.md) - SQL-first mode for maximum throughput
 - [Debugging](../debugging/index.md) - Explain, profiling, and logging
 
 ### Pandas User Guides {#pandas-user-guides}
````
