Commit 6d57023

jqnatividad and claude authored
feat: new scoresql cmd (#3612)
* feat: add `scoresql` command to analyze SQL query performance before execution

  Analyzes SQL queries against stats, moarstats, and frequency caches of input CSV files to produce a performance score (0-100) with actionable optimization suggestions. Supports both Polars (default) and DuckDB query plan analysis. Scoring covers type optimization, join cardinality, filter selectivity, data distribution, and query anti-pattern detection (SELECT *, ORDER BY without LIMIT, cartesian joins, etc.). Caches are auto-generated when missing.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(scoresql): address review findings: feature gate, dead code, robustness

  - Fix test feature gate to require both `polars` and `feature_capable` (matching main.rs registration), preventing test failures on polars-only builds
  - Remove unused SqlInfo fields (_has_group_by, _referenced_tables, _select_columns) and the extract_select_columns function
  - Improve subquery detection: check for SELECT inside parentheses instead of counting all SELECT occurrences, avoiding false positives from string literals
  - Fix DuckDB plan table name substitution: sort replacements longest-first to prevent partial matches (e.g., "data" matching inside "data2")
  - Surface user-visible warnings (wwarn!) when cache generation fails, not just log::warn, so users know scoring may be inaccurate

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): harden subquery detection and deduplicate replacements

  - Track single-quote state in subquery detector to avoid false positives from SELECT appearing inside string literals (e.g. WHERE col = '(SELECT ...')
  - Add word-boundary check after SELECT to prevent matching identifiers like SELECTIVITY
  - Deduplicate table-name replacements to avoid redundant substitutions when a table name happens to equal an alias

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): add pre-SELECT word boundary check and document quote handling

  Add a preceding word-boundary check so identifiers like PRESELECT are not falsely detected as subqueries. Also add a comment explaining why SQL '' escaped quotes work correctly via toggle symmetry.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): address Copilot review: stats cache, alias ordering, SQL parsing, regex

  - Remove incorrect first-line skip in load_stats_cache (stats JSONL has no metadata header)
  - Sort alias replacements by length descending to prevent partial matches (_t_1 inside _t_10)
  - Add split_on_operators() to correctly parse column names from `col=value` patterns
  - Use word-boundary regex in get_duckdb_plan instead of naive String::replace

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): labeled break in join parsing, word-boundary regex in Polars plan

  - Use labeled loop (`'on_clause`) in `extract_join_columns` so that stop-keywords (WHERE, ORDER, etc.) break the outer token loop, not just the inner operator-split loop.
  - Apply word-boundary regex replacement in `get_polars_plan` (matching the existing `get_duckdb_plan` strategy) to prevent alias partial matches (e.g., alias "data" inside "metadata_col").

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): address Copilot review round 2: delimiter passthrough, cache freshness, dedup

  - Use >= in is_cache_fresh to avoid churn on coarse-timestamp filesystems
  - Deduplicate extracted join columns to prevent double-counting
  - Pass --delimiter to qsv stats/frequency when generating caches
  - Use canonical path for cache generation (matches cache lookup path)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(scoresql): make dedup truly case-insensitive and add ASCII delimiter assertions

  - Use sort_unstable_by/dedup_by with case-insensitive comparison to match the existing comment's claim of case-insensitive deduplication
  - Add debug_assert!(delim.is_ascii()) guards in stats/freq cache generation to document the ASCII delimiter assumption

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
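
Several of the fixes above revolve around one idea: substitute table names or aliases only on whole-word matches, longest name first, after deduplicating the replacement list. The following is a minimal sketch of that idea, not the code from this commit. The commit message says the real implementation uses a word-boundary regex; this sketch hand-rolls the boundary check with the standard library only so it has no dependencies. The function names `replace_whole_word` and `substitute_tables`, and the table names `orders` and `customers`, are hypothetical.

```rust
/// True if `c` can be part of an SQL identifier (ASCII letters, digits, underscore).
/// Assumption for this sketch: identifiers are ASCII-only.
fn is_word_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Replace occurrences of `from` in `text` with `to`, but only where the match
/// is not directly preceded or followed by another identifier character.
fn replace_whole_word(text: &str, from: &str, to: &str) -> String {
    if from.is_empty() {
        return text.to_string();
    }
    let mut out = String::with_capacity(text.len());
    let mut pos = 0;
    while let Some(offset) = text[pos..].find(from) {
        let start = pos + offset;
        let end = start + from.len();
        let before_ok = text[..start]
            .chars()
            .next_back()
            .map_or(true, |c| !is_word_char(c));
        let after_ok = text[end..]
            .chars()
            .next()
            .map_or(true, |c| !is_word_char(c));
        if before_ok && after_ok {
            // Whole-word match: copy the text before it, then the replacement.
            out.push_str(&text[pos..start]);
            out.push_str(to);
            pos = end;
        } else {
            // Partial match (e.g. "_t_1" inside "_t_10"): keep the original
            // text and resume the search one character further on.
            let step = text[start..].chars().next().map_or(1, char::len_utf8);
            out.push_str(&text[pos..start + step]);
            pos = start + step;
        }
    }
    out.push_str(&text[pos..]);
    out
}

/// Apply all replacements, longest name first, with exact duplicates removed.
fn substitute_tables(sql: &str, mut replacements: Vec<(String, String)>) -> String {
    // Longest first; ties are ordered so identical pairs end up adjacent for dedup.
    replacements.sort_by(|a, b| b.0.len().cmp(&a.0.len()).then_with(|| a.cmp(b)));
    replacements.dedup();
    let mut out = sql.to_string();
    for (from, to) in &replacements {
        out = replace_whole_word(&out, from, to);
    }
    out
}
```

Under these assumptions, calling `substitute_tables` on `SELECT * FROM _t_1 JOIN _t_10 ON _t_1.id = _t_10.id` with `_t_1` mapped to `orders` and `_t_10` mapped to `customers` leaves `_t_10` intact while `_t_1` is rewritten, whereas a plain `String::replace` of `_t_1` would turn `_t_10` into `orders0`. The sketch applies replacements sequentially, so it also assumes no replacement text contains another name as a whole word.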
1 parent 81d028c commit 6d57023

5 files changed: +1581 -0 lines changed


src/cmd/mod.rs

Lines changed: 5 additions & 0 deletions
@@ -87,6 +87,11 @@ pub mod safenames;
 pub mod sample;
 #[cfg(any(feature = "feature_capable", feature = "lite"))]
 pub mod schema;
+#[cfg(all(
+    feature = "polars",
+    any(feature = "feature_capable", feature = "datapusher_plus")
+))]
+pub mod scoresql;
 pub mod search;
 pub mod searchset;
 pub mod select;
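
The gate added in this hunk compiles `scoresql` only when `polars` is enabled together with at least one of `feature_capable` or `datapusher_plus`. As a rough illustration of which builds that includes, the sketch below restates the predicate as a plain function and also evaluates it for the current build with the standard `cfg!` macro. Both function names are hypothetical and do not appear in the commit.

```rust
/// Hypothetical helper that mirrors the boolean structure of the gate:
/// all(polars, any(feature_capable, datapusher_plus)).
pub fn scoresql_gate(polars: bool, feature_capable: bool, datapusher_plus: bool) -> bool {
    polars && (feature_capable || datapusher_plus)
}

/// Evaluates the same predicate at compile time for the build this code is in.
/// `cfg!` expands to `true` or `false`; unlike `#[cfg]`, it removes no code.
pub fn scoresql_compiled_in() -> bool {
    cfg!(all(
        feature = "polars",
        any(feature = "feature_capable", feature = "datapusher_plus")
    ))
}
```

Read this way, a polars-only build (`scoresql_gate(true, false, false)`) does not include the module, which matches the commit message's remark about test failures on polars-only builds.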
