Commit 17e9993

Fix chDB v4 datastore docs: add performance mode and improve content
- Add performance-mode.md configuration doc
- Fix image reference in datastore overview
- Enhance execution engine and configuration docs
- Improve pandas differences and performance guides
1 parent ce03f6a commit 17e9993

File tree

6 files changed (+368 additions, −16 deletions)


docs/chdb/configuration/execution-engine.md

Lines changed: 8 additions & 0 deletions
````diff
@@ -297,4 +297,12 @@ config.use_chdb()
 
 # Filter early to reduce data size
 result = ds.filter(ds['date'] >= '2024-01-01').to_df()
+
+# For maximum throughput on large datasets, use performance mode
+# which enables parallel Parquet reading and single-SQL aggregation
+config.use_performance_mode()
 ```
+
+:::tip Performance Mode
+If you are running heavy aggregation workloads and don't need exact pandas output compatibility (row order, MultiIndex, dtype corrections), consider using [Performance Mode](performance-mode.md). It automatically sets the engine to `chdb` and removes all pandas compatibility overhead.
+:::
````

docs/chdb/configuration/index.md

Lines changed: 36 additions & 6 deletions
````diff
@@ -9,19 +9,21 @@ doc_type: 'reference'
 
 # DataStore Configuration
 
-DataStore provides comprehensive configuration options for execution engine selection, logging, caching, profiling, and dtype correction.
+DataStore provides comprehensive configuration options for execution engine selection, compatibility mode, logging, caching, profiling, and dtype correction.
 
 ## Quick Reference {#quick-reference}
 
 ```python
 from chdb.datastore.config import config
 
 # Quick setup presets
-config.enable_debug()       # Enable verbose logging
-config.use_chdb()           # Force ClickHouse engine
-config.use_pandas()         # Force pandas engine
-config.use_auto()           # Auto-select engine (default)
-config.enable_profiling()   # Enable performance profiling
+config.enable_debug()           # Enable verbose logging
+config.use_chdb()               # Force ClickHouse engine
+config.use_pandas()             # Force pandas engine
+config.use_auto()               # Auto-select engine (default)
+config.use_performance_mode()   # SQL-first, max throughput
+config.use_pandas_compat()      # Full pandas compatibility (default)
+config.enable_profiling()       # Enable performance profiling
 ```
 
 ## All Configuration Options {#all-options}
@@ -34,6 +36,7 @@ config.enable_profiling()   # Enable performance profiling
 | | `cache_ttl` | float (seconds) | 0.0 | Cache time-to-live |
 | **Engine** | `execution_engine` | "auto", "chdb", "pandas" | "auto" | Execution engine |
 | | `cross_datastore_engine` | "auto", "chdb", "pandas" | "auto" | Cross-DataStore operations |
+| **Compat** | `compat_mode` | "pandas", "performance" | "pandas" | Pandas compatibility vs SQL-first throughput |
 | **Profiling** | `profiling_enabled` | True/False | False | Enable profiling |
 | **Dtype** | `correction_level` | NONE/CRITICAL/HIGH/MEDIUM/ALL | HIGH | Dtype correction level |
 
@@ -103,6 +106,23 @@ print(config.execution_engine)
 
 See [Execution Engine](execution-engine.md) for details.
 
+### Compatibility Mode {#compat-mode}
+
+```python
+# Performance mode: SQL-first, no pandas compatibility overhead
+config.use_performance_mode()
+# or: config.set_compat_mode('performance')
+
+# Pandas compatibility mode (default)
+config.use_pandas_compat()
+# or: config.set_compat_mode('pandas')
+
+# Check current mode
+print(config.compat_mode)  # 'pandas' or 'performance'
+```
+
+See [Performance Mode](performance-mode.md) for details.
+
 ### Profiling Configuration {#profiling}
 
 ```python
@@ -209,6 +229,15 @@ config.set_cache_enabled(True) # Enable caching
 config.set_profiling_enabled(False) # Disable profiling overhead
 ```
 
+### Maximum Throughput {#max-throughput-config}
+
+```python
+from chdb.datastore.config import config
+
+config.use_performance_mode()  # SQL-first, no pandas overhead
+config.set_cache_enabled(False)  # Disable cache for streaming
+```
+
 ### Performance Testing {#perf-config}
 
 ```python
@@ -233,6 +262,7 @@ config.enable_debug() # See what operations are used
 ## Related Documentation {#related}
 
 - [Execution Engine](execution-engine.md) - Engine selection details
+- [Performance Mode](performance-mode.md) - SQL-first mode for maximum throughput
 - [Function Config](function-config.md) - Per-function engine configuration
 - [Logging](../debugging/logging.md) - Logging configuration
 - [Profiling](../debugging/profiling.md) - Performance profiling
````
docs/chdb/configuration/performance-mode.md

Lines changed: 285 additions & 0 deletions (new file)
---
title: 'Performance Mode (compat_mode)'
sidebar_label: 'Performance Mode'
slug: /chdb/configuration/performance-mode
description: 'SQL-first performance mode that disables pandas compatibility overhead for maximum throughput'
keywords: ['chdb', 'datastore', 'performance', 'mode', 'compat', 'sql-first', 'optimization']
doc_type: 'guide'
---

# Performance Mode

DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.

## Overview {#overview}

| Mode | `compat_mode` value | Description |
|------|---------------------|-------------|
| **Pandas** (default) | `"pandas"` | Full pandas behavior compatibility. Row order preserved, MultiIndex, set_index, dtype corrections, stable sort tiebreakers, `-If`/`isNaN` wrappers. |
| **Performance** | `"performance"` | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |
### What Performance Mode Disables {#what-it-disables}

| Overhead | Pandas mode behavior | Performance mode behavior |
|----------|---------------------|--------------------------|
| **Row-order preservation** | `_row_id` injection, `rowNumberInAllBlocks()`, `__orig_row_num__` subqueries | Disabled — row order not guaranteed |
| **Stable sort tiebreaker** | `rowNumberInAllBlocks() ASC` appended to ORDER BY | Disabled — ties may have arbitrary order |
| **Parquet preserve_order** | `input_format_parquet_preserve_order=1` | Disabled — parallel Parquet reading allowed |
| **GroupBy auto ORDER BY** | `ORDER BY group_key` added (pandas default `sort=True`) | Disabled — groups returned in arbitrary order |
| **GroupBy dropna WHERE** | `WHERE key IS NOT NULL` added (pandas default `dropna=True`) | Disabled — NULL groups included |
| **GroupBy set_index** | Group keys set as index | Disabled — group keys stay as columns |
| **MultiIndex columns** | `agg({'col': ['sum','mean']})` returns MultiIndex columns | Disabled — flat column names (`col_sum`, `col_mean`) |
| **`-If`/`isNaN` wrappers** | `sumIf(col, NOT isNaN(col))` for skipna | Disabled — plain `sum(col)` (ClickHouse natively skips NULL) |
| **`toInt64` on count** | `toInt64(count())` to match pandas int64 | Disabled — native SQL dtype returned |
| **`fillna(0)` for all-NaN sum** | Sum of all-NaN returns 0 (pandas behavior) | Disabled — returns NULL |
| **Dtype corrections** | `abs()` unsigned→signed, etc. | Disabled — native SQL dtypes |
| **Index preservation** | Restores original index after SQL execution | Disabled |
| **`first()`/`last()`** | `argMin/argMax(col, rowNumberInAllBlocks())` | `any(col)` / `anyLast(col)` — faster but non-deterministic |
| **Single-SQL aggregation** | ColumnExpr groupby materializes intermediate DataFrame | Injects `LazyGroupByAgg` into lazy ops chain — single SQL query |
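The MultiIndex row in the table above can be illustrated in plain pandas (this sketch uses pandas only, not DataStore): multi-aggregation produces MultiIndex columns in pandas mode, while performance mode returns flat `col_func` names, equivalent to joining the tuple levels:

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

# Pandas-mode shape: multi-aggregation yields MultiIndex columns
multi = df.groupby("cat").agg({"val": ["sum", "mean"]})
print(list(multi.columns))  # [('val', 'sum'), ('val', 'mean')]

# Performance-mode shape: flat names such as val_sum / val_mean,
# equivalent to joining the MultiIndex levels with "_"
flat_names = ["_".join(c) for c in multi.columns]
print(flat_names)  # ['val_sum', 'val_mean']
```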
---

## Enabling Performance Mode {#enabling}

### Using config object {#using-config}

```python
from chdb.datastore.config import config

# Enable performance mode
config.use_performance_mode()

# Back to pandas compatibility
config.use_pandas_compat()

# Check current mode
print(config.compat_mode)  # 'pandas' or 'performance'
```

### Using module-level functions {#using-functions}

```python
from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode

# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)

# Check
print(is_performance_mode())  # True

# Back to default
set_compat_mode(CompatMode.PANDAS)
```

### Using convenience imports {#using-imports}

```python
from chdb import use_performance_mode, use_pandas_compat

use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
```

:::note
Setting performance mode automatically sets the execution engine to `chdb`. You do not need to call `config.use_chdb()` separately.
:::

---
## When to Use Performance Mode {#when-to-use}

**Use performance mode when:**
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)

**Stay in pandas mode when:**
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on `first()`/`last()` returning the true first/last row
- You use `shift()`, `diff()`, `cumsum()` that depend on row order
- You're writing tests that compare DataStore output with pandas
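The point about order-dependent operations can be seen in plain pandas (shown here without DataStore): `cumsum()` gives different per-row results when the same values arrive in a different order, which is why such operations need pandas mode's row-order guarantee:

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Same values, different arrival order -> different running sums
print(s.cumsum().tolist())              # [1, 3, 6]
print(s.iloc[::-1].cumsum().tolist())   # [3, 5, 6]
```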
---

## Behavior Differences {#behavior-differences}

### Row Order {#row-order}

In performance mode, row order is **not guaranteed** for any operation. This includes:

- Filter results
- GroupBy aggregation results
- `head()` / `tail()` without explicit `sort_values()`
- `first()` / `last()` aggregations

If you need ordered results, add an explicit `sort_values()`:

```python
config.use_performance_mode()

ds = pd.read_csv("data.csv")

# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()

# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
```

### GroupBy Results {#groupby-results}

| Aspect | Pandas mode | Performance mode |
|--------|------------|-----------------|
| Group key location | Index (via `set_index`) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default `dropna=True`) | Included |
| Column format | MultiIndex for multi-agg | Flat names (`col_func`) |
| `first()`/`last()` | Deterministic (row order) | Non-deterministic (`any()`/`anyLast()`) |

### Aggregation {#aggregation}

```python
config.use_performance_mode()

# Sum of all-NaN group returns NULL (not 0)
# Count returns native uint64 (not forced int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
```
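The all-NaN case in the comments above is the pandas-mode behavior being emulated; in plain pandas (this sketch does not use DataStore), a group whose values are all NaN sums to 0.0, whereas performance mode returns the SQL result, NULL:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b"], "val": [np.nan, np.nan, 1.0]})

sums = df.groupby("cat")["val"].sum()
print(sums.loc["a"])  # 0.0 -- pandas fills the all-NaN group with 0
print(sums.loc["b"])  # 1.0
```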
### Single-SQL Execution {#single-sql}

In performance mode, `ColumnExpr` groupby aggregation (e.g., `ds[condition].groupby('col')['val'].sum()`) is executed as a **single SQL query** instead of the two-step process used in pandas mode:

```python
config.use_performance_mode()

# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()

# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
```

This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
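The equivalence of the fused query and the two-step process can be sketched outside chDB; this illustration uses SQLite (not ClickHouse) purely to show that one WHERE + GROUP BY query matches filter-then-group:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"category": ["a", "a", "b"],
                   "rating": [4.0, 2.0, 5.0],
                   "revenue": [10.0, 20.0, 30.0]})

con = sqlite3.connect(":memory:")
df.to_sql("data", con, index=False)

# Single fused query (SQLite stand-in for the ClickHouse SQL above)
sql = ("SELECT category, sum(revenue) AS revenue FROM data "
       "WHERE rating > 3.5 GROUP BY category")
via_sql = pd.read_sql(sql, con).set_index("category")["revenue"]

# Two-step pandas equivalent: filter, materialize, then group
via_pandas = df[df["rating"] > 3.5].groupby("category")["revenue"].sum()

print(via_sql.sort_index().equals(via_pandas.sort_index()))  # True
```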
---

## Comparison with Execution Engine {#vs-execution-engine}

Performance mode (`compat_mode`) and execution engine (`execution_engine`) are **independent configuration axes**:

| Config | Controls | Values |
|--------|----------|--------|
| `execution_engine` | **Which engine** runs the computation | `auto`, `chdb`, `pandas` |
| `compat_mode` | **Whether** to reshape output for pandas compatibility | `pandas`, `performance` |

Setting `compat_mode='performance'` automatically sets `execution_engine='chdb'`, since performance mode is designed for SQL execution.

```python
from chdb.datastore.config import config

# These are independent
config.use_chdb()              # Force chDB engine, keep pandas compat
config.use_performance_mode()  # Force chDB + remove pandas overhead
```

---

## Testing with Performance Mode {#testing}

When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:

### Sort-then-compare (aggregations, filters) {#sort-then-compare}

```python
# Sort both sides by the same columns before comparing
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()

ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)
```
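A self-contained version of this strategy (plain pandas stands in for both sides here, purely for illustration):

```python
import numpy as np
import pandas as pd

# Two logically equal aggregation results whose row order differs,
# as can happen between performance mode and pandas
left = pd.Series([30.0, 10.0], index=["b", "a"], name="val")
right = pd.Series([10.0, 30.0], index=["a", "b"], name="val")

# A direct positional comparison would fail; sort both sides first
np.testing.assert_array_equal(left.sort_index().values,
                              right.sort_index().values)
print("match")
```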
### Value-range check (first/last) {#value-range-check}

```python
# first() with any() returns an arbitrary element from the group,
# so assert membership in the group rather than an exact value.
# group_values maps each group key to the set of values in that group.
result = ds.groupby("cat")["val"].first()
for group_key, values in group_values.items():
    assert result.loc[group_key] in values
```
### Schema-and-count (LIMIT without ORDER BY) {#schema-and-count}

```python
# head() without sort_values: row set is non-deterministic
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
```

---

## Best Practices {#best-practices}

### 1. Enable early in your script {#enable-early}

```python
from chdb.datastore.config import config

config.use_performance_mode()

# All subsequent operations benefit
ds = pd.read_parquet("data.parquet")
result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()
```

### 2. Add explicit sorting when order matters {#explicit-sort}

```python
# For display or downstream processing that expects order
result = (ds
    .groupby("region")["revenue"].sum()
    .sort_values(ascending=False)
)
```

### 3. Use for batch/ETL workloads {#batch-etl}

```python
config.use_performance_mode()

# ETL pipeline — order doesn't matter, throughput does
summary = (ds
    .filter(ds["date"] >= "2024-01-01")
    .groupby(["region", "product"])
    .agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
)
summary.to_df().to_parquet("summary.parquet")
```

### 4. Switch modes within a session {#switch-modes}

```python
# Performance mode for heavy computation
config.use_performance_mode()
aggregated = ds.groupby("cat")["val"].sum()

# Back to pandas mode for exact-match comparison
config.use_pandas_compat()
detailed = ds[ds["val"] > 100].head(10)
```

---

## Related Documentation {#related}

- [Execution Engine](execution-engine.md) — Engine selection (auto/chdb/pandas)
- [Performance Guide](../guides/pandas-performance.md) — General optimization tips
- [Key Differences from pandas](../guides/pandas-differences.md) — Behavioral differences

docs/chdb/datastore/index.md

Lines changed: 2 additions & 1 deletion
````diff
@@ -22,7 +22,7 @@ DataStore is chDB's pandas-compatible API that combines the familiar pandas Data
 ## Architecture {#architecture}
 
 <div style={{textAlign: 'center'}}>
-  <img src={require('../images/datastore_architecture.png').default} alt="DataStore Architecture" style={{maxWidth: '700px', width: '100%'}} />
+  <img src="../images/datastore_architecture.png" alt="DataStore Architecture" style={{maxWidth: '700px', width: '100%'}} />
 </div>
 
 DataStore uses **lazy evaluation** with **dual-engine execution**:
@@ -123,6 +123,7 @@ DataStore delivers significant performance improvements over pandas, especially
 
 ### Configuration & Debugging {#configuration-debugging}
 - [Configuration](../configuration/index.md) - All configuration options
+- [Performance Mode](../configuration/performance-mode.md) - SQL-first mode for maximum throughput
 - [Debugging](../debugging/index.md) - Explain, profiling, and logging
 
 ### Pandas User Guides {#pandas-user-guides}
````
