---
title: 'Performance Mode (compat_mode)'
sidebar_label: 'Performance Mode'
slug: /chdb/configuration/performance-mode
description: 'SQL-first performance mode that disables pandas compatibility overhead for maximum throughput'
keywords: ['chdb', 'datastore', 'performance', 'mode', 'compat', 'sql-first', 'optimization']
doc_type: 'guide'
---

# Performance Mode

DataStore has two compatibility modes that control whether output is shaped for pandas compatibility or optimized for raw SQL performance.

## Overview {#overview}

| Mode | `compat_mode` value | Description |
|------|---------------------|-------------|
| **Pandas** (default) | `"pandas"` | Full pandas behavior compatibility. Row order preserved, MultiIndex, set_index, dtype corrections, stable sort tiebreakers, `-If`/`isNaN` wrappers. |
| **Performance** | `"performance"` | SQL-first execution. All pandas compatibility overhead removed. Maximum throughput, but results may differ structurally from pandas. |

### What Performance Mode Disables {#what-it-disables}

| Overhead | Pandas mode behavior | Performance mode behavior |
|----------|---------------------|--------------------------|
| **Row-order preservation** | `_row_id` injection, `rowNumberInAllBlocks()`, `__orig_row_num__` subqueries | Disabled — row order not guaranteed |
| **Stable sort tiebreaker** | `rowNumberInAllBlocks() ASC` appended to ORDER BY | Disabled — ties may have arbitrary order |
| **Parquet preserve_order** | `input_format_parquet_preserve_order=1` | Disabled — parallel Parquet reading allowed |
| **GroupBy auto ORDER BY** | `ORDER BY group_key` added (pandas default `sort=True`) | Disabled — groups returned in arbitrary order |
| **GroupBy dropna WHERE** | `WHERE key IS NOT NULL` added (pandas default `dropna=True`) | Disabled — NULL groups included |
| **GroupBy set_index** | Group keys set as index | Disabled — group keys stay as columns |
| **MultiIndex columns** | `agg({'col': ['sum','mean']})` returns MultiIndex columns | Disabled — flat column names (`col_sum`, `col_mean`) |
| **`-If`/`isNaN` wrappers** | `sumIf(col, NOT isNaN(col))` for skipna | Disabled — plain `sum(col)` (ClickHouse natively skips NULL) |
| **`toInt64` on count** | `toInt64(count())` to match pandas int64 | Disabled — native SQL dtype returned |
| **`fillna(0)` for all-NaN sum** | Sum of all-NaN returns 0 (pandas behavior) | Disabled — returns NULL |
| **Dtype corrections** | `abs()` unsigned→signed, etc. | Disabled — native SQL dtypes |
| **Index preservation** | Restores original index after SQL execution | Disabled |
| **`first()`/`last()`** | `argMin/argMax(col, rowNumberInAllBlocks())` | `any(col)` / `anyLast(col)` — faster but non-deterministic |
| **Single-SQL aggregation** | ColumnExpr groupby materializes intermediate DataFrame | Injects `LazyGroupByAgg` into lazy ops chain — single SQL query |

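The flat column naming in performance mode follows a simple `col_func` scheme. A minimal sketch of that mapping (the helper name is illustrative, not part of the chdb API):

```python
# Illustrative sketch, not chdb internals: how a multi-aggregation spec
# maps to the flat column names performance mode returns.
def flatten_agg_columns(agg_spec):
    """Map {'col': ['sum', 'mean']} to flat names like 'col_sum'."""
    return [f"{col}_{fn}" for col, fns in agg_spec.items() for fn in fns]

print(flatten_agg_columns({"revenue": ["sum", "mean"]}))
# ['revenue_sum', 'revenue_mean']
```
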
---

## Enabling Performance Mode {#enabling}

### Using config object {#using-config}

```python
from chdb.datastore.config import config

# Enable performance mode
config.use_performance_mode()

# Back to pandas compatibility
config.use_pandas_compat()

# Check current mode
print(config.compat_mode)  # 'pandas' or 'performance'
```

### Using module-level functions {#using-functions}

```python
from chdb.datastore.config import set_compat_mode, CompatMode, is_performance_mode

# Enable performance mode
set_compat_mode(CompatMode.PERFORMANCE)

# Check
print(is_performance_mode())  # True

# Back to default
set_compat_mode(CompatMode.PANDAS)
```

### Using convenience imports {#using-imports}

```python
from chdb import use_performance_mode, use_pandas_compat

use_performance_mode()
# ... high-performance operations ...
use_pandas_compat()
```

:::note
Setting performance mode automatically sets the execution engine to `chdb`. You do not need to call `config.use_chdb()` separately.
:::

---

## When to Use Performance Mode {#when-to-use}

**Use performance mode when:**
- Processing large datasets (hundreds of thousands to millions of rows)
- Running aggregation-heavy workloads (groupby, sum, mean, count)
- Row order does not matter (e.g., aggregated results, reports, dashboards)
- You want maximum SQL throughput and minimal overhead
- Memory usage is a concern (parallel Parquet reading, no intermediate DataFrames)

**Stay in pandas mode when:**
- You need exact pandas behavior (row order, MultiIndex, dtypes)
- You rely on `first()`/`last()` returning the true first/last row
- You use `shift()`, `diff()`, or `cumsum()`, which depend on row order
- You're writing tests that compare DataStore output with pandas

---

## Behavior Differences {#behavior-differences}

### Row Order {#row-order}

In performance mode, row order is **not guaranteed** for any operation. This includes:

- Filter results
- GroupBy aggregation results
- `head()` / `tail()` without explicit `sort_values()`
- `first()` / `last()` aggregations

If you need ordered results, add an explicit `sort_values()`:

```python
config.use_performance_mode()

ds = pd.read_csv("data.csv")

# Unordered (fast)
result = ds.groupby("region")["revenue"].sum()

# Ordered (still fast, just adds ORDER BY)
result = ds.groupby("region")["revenue"].sum().sort_values()
```

### GroupBy Results {#groupby-results}

| Aspect | Pandas mode | Performance mode |
|--------|------------|-----------------|
| Group key location | Index (via `set_index`) | Regular column |
| Group order | Sorted by key (default) | Arbitrary order |
| NULL groups | Excluded (default `dropna=True`) | Included |
| Column format | MultiIndex for multi-agg | Flat names (`col_func`) |
| `first()`/`last()` | Deterministic (row order) | Non-deterministic (`any()`/`anyLast()`) |

### Aggregation {#aggregation}

```python
config.use_performance_mode()

# Sum of all-NaN group returns NULL (not 0)
# Count returns native uint64 (not forced int64)
# No -If wrappers: sum() instead of sumIf()
result = ds.groupby("cat")["val"].sum()
```
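
The all-NaN sum difference can be modeled in plain Python, with `None` standing in for SQL NULL (these helpers are illustrative sketches, not chdb code):

```python
# Illustrative only: SQL-style sum returns NULL (None) for an all-NULL group,
# while pandas-compat mode rewrites that NULL to 0.
def sql_sum(values):
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None  # NULL when nothing to sum

def pandas_compat_sum(values):
    s = sql_sum(values)
    return 0 if s is None else s  # pandas: sum of an all-NaN group is 0

print(sql_sum([None, None]))            # None
print(pandas_compat_sum([None, None]))  # 0
```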

### Single-SQL Execution {#single-sql}

In performance mode, `ColumnExpr` groupby aggregation (e.g., `ds[condition].groupby('col')['val'].sum()`) is executed as a **single SQL query** instead of the two-step process used in pandas mode:

```python
config.use_performance_mode()

# Pandas mode: two SQL queries (filter → materialize → groupby)
# Performance mode: one SQL query (WHERE + GROUP BY in same query)
result = ds[ds["rating"] > 3.5].groupby("category")["revenue"].sum()

# Generated SQL (single query):
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
```

This eliminates the intermediate DataFrame materialization and can significantly reduce memory usage and execution time.
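
The fusion itself is plain SQL composition. A minimal sketch of the idea (the function is hypothetical, not the `LazyGroupByAgg` implementation):

```python
# Illustrative sketch: push a filter and an aggregation into one query
# instead of materializing the filtered rows first.
def fused_groupby_sql(table, where, key, agg_expr):
    return (f"SELECT {key}, {agg_expr} FROM {table} "
            f"WHERE {where} GROUP BY {key}")

sql = fused_groupby_sql("data", "rating > 3.5", "category", "sum(revenue)")
print(sql)
# SELECT category, sum(revenue) FROM data WHERE rating > 3.5 GROUP BY category
```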

---

## Comparison with Execution Engine {#vs-execution-engine}

Performance mode (`compat_mode`) and execution engine (`execution_engine`) are **independent configuration axes**:

| Config | Controls | Values |
|--------|----------|--------|
| `execution_engine` | **Which engine** runs the computation | `auto`, `chdb`, `pandas` |
| `compat_mode` | **Whether** to reshape output for pandas compatibility | `pandas`, `performance` |

Setting `compat_mode='performance'` automatically sets `execution_engine='chdb'`, since performance mode is designed for SQL execution.

```python
from chdb.datastore.config import config

# These are independent
config.use_chdb()              # Force chDB engine, keep pandas compat
config.use_performance_mode()  # Force chDB + remove pandas overhead
```

---

## Testing with Performance Mode {#testing}

When writing tests for performance mode, results may differ from pandas in row order and structural format. Use these strategies:

### Sort-then-compare (aggregations, filters) {#sort-then-compare}

```python
# Sort both sides by the same columns before comparing
ds_result = ds.groupby("cat")["val"].sum()
pd_result = pd_df.groupby("cat")["val"].sum()

ds_sorted = ds_result.sort_index()
pd_sorted = pd_result.sort_index()
np.testing.assert_array_equal(ds_sorted.values, pd_sorted.values)
```

### Value-range check (first/last) {#value-range-check}

```python
# first() with any() returns an arbitrary element from the group,
# so assert membership rather than an exact value
result = ds.groupby("cat")["val"].first()
group_values = {key: set(group["val"]) for key, group in pd_df.groupby("cat")}
for group_key, values in group_values.items():
    assert result.loc[group_key] in values
```

### Schema-and-count (LIMIT without ORDER BY) {#schema-and-count}

```python
# head() without sort_values: row set is non-deterministic
result = ds.head(5)
assert len(result) == 5
assert set(result.columns) == expected_columns
```

---

## Best Practices {#best-practices}

### 1. Enable early in your script {#enable-early}

```python
from chdb.datastore.config import config

config.use_performance_mode()

# All subsequent operations benefit
ds = pd.read_parquet("data.parquet")
result = ds[ds["amount"] > 100].groupby("region")["amount"].sum()
```

### 2. Add explicit sorting when order matters {#explicit-sort}

```python
# For display or downstream processing that expects order
result = (ds
    .groupby("region")["revenue"].sum()
    .sort_values(ascending=False)
)
```

### 3. Use for batch/ETL workloads {#batch-etl}

```python
config.use_performance_mode()

# ETL pipeline — order doesn't matter, throughput does
summary = (ds
    .filter(ds["date"] >= "2024-01-01")
    .groupby(["region", "product"])
    .agg({"revenue": "sum", "quantity": "sum", "rating": "mean"})
)
summary.to_df().to_parquet("summary.parquet")
```

### 4. Switch modes within a session {#switch-modes}

```python
# Performance mode for heavy computation
config.use_performance_mode()
aggregated = ds.groupby("cat")["val"].sum()

# Back to pandas mode for exact-match comparison
config.use_pandas_compat()
detailed = ds[ds["val"] > 100].head(10)
```

---

## Related Documentation {#related}

- [Execution Engine](execution-engine.md) — Engine selection (auto/chdb/pandas)
- [Performance Guide](../guides/pandas-performance.md) — General optimization tips
- [Key Differences from pandas](../guides/pandas-differences.md) — Behavioral differences