Commit 42e2208 (parent: 9ce7503)

docs: update help markdown
[skip ci]

3 files changed, +133 −0 lines changed

docs/help/TableOfContents.md

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@
| [safenames](safenames.md)<br>[![CKAN](../images/ckan.png)](#legend "has CKAN-aware integration options.") | Modify headers of a CSV to only have ["safe" names](../../src/cmd/safenames.rs#L5-L14) - guaranteed "database-ready"/"CKAN-ready" names. |
| [sample](sample.md)<br>[📇](#legend "uses an index when available.")[🌐](#legend "has web-aware options.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.") | Randomly draw rows (with optional seed) from a CSV using seven different sampling methods - [reservoir](https://en.wikipedia.org/wiki/Reservoir_sampling) (default), [indexed](https://en.wikipedia.org/wiki/Random_access), [bernoulli](https://en.wikipedia.org/wiki/Bernoulli_sampling), [systematic](https://en.wikipedia.org/wiki/Systematic_sampling), [stratified](https://en.wikipedia.org/wiki/Stratified_sampling), [weighted](https://doi.org/10.1016/j.ipl.2005.11.003) & [cluster sampling](https://en.wikipedia.org/wiki/Cluster_sampling). Supports sampling from CSVs on remote URLs. |
| [schema](schema.md)<br>[📇](#legend "uses an index when available.")[😣](#legend "uses additional memory proportional to the cardinality of the columns in the CSV.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.")[🪄](#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".")[🐻‍❄️](#legend "command powered/accelerated by vectorized query engine.") | Infer either a [JSON Schema Validation Draft 2020-12](https://json-schema.org/draft/2020-12/json-schema-validation) ([Example](https://github.com/dathere/qsv/blob/master/resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json)) or [Polars Schema](https://docs.pola.rs/user-guide/lazy/schemas/) ([Example](https://github.com/dathere/qsv/blob/master/resources/test/NYC_311_SR_2010-2020-sample-1M.pschema.json)) from CSV data. In JSON Schema Validation mode, it produces a `.schema.json` file replete with inferred data type & domain/range validation rules derived from [`stats`](../../README.md#stats_deeplink). Uses multithreading to go faster if an index is present. See [`validate`](../../README.md#validate_deeplink) command to use the generated JSON Schema to validate if similar CSVs comply with the schema. With the `--polars` option, it produces a `.pschema.json` file that all polars commands (`sqlp`, `joinp` & `pivotp`) use to determine the data type of each column & to optimize performance. Both schemas are editable and can be fine-tuned. For JSON Schema, to refine the inferred validation rules. For Polars Schema, to change the inferred Polars data types. |
| [scoresql](scoresql.md)<br>[🐻‍❄️](#legend "command powered/accelerated by vectorized query engine.")[🪄](#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".") | Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions BEFORE running the query. Supports Polars (default) and DuckDB modes. |
| [search](search.md)<br>[📇](#legend "uses an index when available.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.") | Run a regex over a CSV. Applies the regex to selected fields & shows only matching rows. |
| [searchset](searchset.md)<br>[📇](#legend "uses an index when available.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.") | _Run multiple regexes over a CSV in a single pass._ Applies the regexes to each field individually & shows only matching rows. |
| [select](select.md)<br>[👆](#legend "has powerful column selector support. See `select` for syntax.") | Select, re-order, reverse, duplicate or drop columns. |

docs/help/pragmastat.md

Lines changed: 27 additions & 0 deletions
@@ -151,6 +151,30 @@ qsv pragmastat --compare1 center:42.0,spread:0.5 --select latitude data.csv
qsv pragmastat --compare2 shift:0,disparity:0.8 --select latency_ms,price data.csv
```

> Fast exploratory analysis with subsampling (~100x speedup on large datasets)

```console
qsv pragmastat --standalone --subsample 10000 data.csv
```

> Reproducible subsampling with a specific seed

```console
qsv pragmastat --standalone --subsample 10000 --seed 123 data.csv
```

> Skip confidence bounds for ~2x speedup

```console
qsv pragmastat --standalone --no-bounds data.csv
```

> Combined: ~200x speedup for large datasets

```console
qsv pragmastat --standalone --subsample 10000 --no-bounds data.csv
```

Full Pragmastat manual:
<https://github.com/AndreyAkinshin/pragmastat/releases/download/v12.0.0/pragmastat-v12.0.0.pdf>
<https://pragmastat.dev/> (latest version)

@@ -179,6 +203,9 @@ qsv pragmastat --help
| &nbsp;`--stats-options`&nbsp; | string | Options to pass to the stats command if baseline stats need to be generated. The options are passed as a single string that will be split by whitespace. | `--infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl` |
| &nbsp;`--round`&nbsp; | string | Round statistics to <n> decimal places. Rounding follows Midpoint Nearest Even (Bankers Rounding) rule. | `4` |
| &nbsp;`--force`&nbsp; | flag | Force recomputing ps_* columns even if they already exist in the stats cache. | |
| &nbsp;`--subsample`&nbsp; | string | Randomly subsample N values per column before computing. Speeds up large datasets while maintaining statistical robustness. Recommended: 10000-50000 for exploratory analysis. | |
| &nbsp;`--seed`&nbsp; | string | Seed for reproducible subsampling. If not specified, defaults to 42 when --subsample is used. | |
| &nbsp;`--no-bounds`&nbsp; | flag | Skip confidence bounds computation (~2x faster). Incompatible with --compare1/--compare2. | |
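
> Since `--seed` defaults to 42 when `--subsample` is given, these two invocations should draw the same subsample (a sketch; `data.csv` is a placeholder file name)

```console
qsv pragmastat --standalone --subsample 10000 data.csv
qsv pragmastat --standalone --subsample 10000 --seed 42 data.csv
```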

<a name="common-options"></a>

docs/help/scoresql.md

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
# scoresql

> Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions BEFORE running the query. Supports Polars (default) and DuckDB modes.

**[Table of Contents](TableOfContents.md)** | **Source: [src/cmd/scoresql.rs](https://github.com/dathere/qsv/blob/master/src/cmd/scoresql.rs)** | [🐻‍❄️](TableOfContents.md#legend "command powered/accelerated by vectorized query engine.")[🪄](TableOfContents.md#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".")

<a name="nav"></a>
[Description](#description) | [Examples](#examples) | [Usage](#usage) | [Scoresql Options](#scoresql-options) | [Common Options](#common-options)

<a name="description"></a>

## Description [](#nav)

Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a
performance score with actionable optimization suggestions BEFORE running the query.

Accepts the same input/SQL arguments as sqlp. Outputs a human-readable performance report
(default) or JSON (--json). Supports Polars mode (default) and DuckDB mode (--duckdb).

Scoring factors include:
* Query plan analysis (EXPLAIN output from Polars or DuckDB)
* Type optimization (column types vs. usage in query)
* Join key cardinality and data distribution
* Filter selectivity from frequency cache
* Query anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, etc.)
* Infrastructure checks (index files, cache freshness)

Caches are auto-generated when missing:
* stats cache via `qsv stats --everything --stats-jsonl`
* frequency cache via `qsv frequency --frequency-jsonl`

<a name="examples"></a>

## Examples [](#nav)

> Score a simple filter query against a single CSV file

```console
qsv scoresql data.csv "SELECT * FROM data WHERE col1 > 10"
```

> Output the score report as JSON instead of the default human-readable format

```console
qsv scoresql --json data.csv "SELECT col1, col2 FROM data ORDER BY col1"
```

> Score a join query across two CSV files

```console
qsv scoresql data.csv data2.csv "SELECT * FROM data JOIN data2 ON data.id = data2.id"
```

> Use DuckDB for query plan analysis instead of Polars

```console
qsv scoresql --duckdb data.csv "SELECT * FROM data WHERE status = 'active'"
```

> Use _t_N aliases just like sqlp (see sqlp documentation)

```console
qsv scoresql data.csv data2.csv "SELECT * FROM _t_1 JOIN _t_2 ON _t_1.id = _t_2.id"
```

For more examples, see [tests](https://github.com/dathere/qsv/blob/master/tests/test_scoresql.rs).

<a name="usage"></a>

## Usage [](#nav)

```console
qsv scoresql [options] <input>... <sql>
qsv scoresql --help
```

<a name="scoresql-options"></a>

## Scoresql Options [](#nav)

| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Option&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Type | Description | Default |
|--------|------|-------------|--------|
| &nbsp;`--json`&nbsp; | flag | Output results as JSON instead of human-readable report. | |
| &nbsp;`--duckdb`&nbsp; | flag | Use DuckDB for query plan analysis instead of Polars. Requires the QSV_DESCRIBEGPT_DB_ENGINE environment variable to be set to the path of the DuckDB binary. | |
| &nbsp;`--try-parsedates`&nbsp; | flag | Automatically try to parse dates/datetimes and time. | |
| &nbsp;`--infer-len`&nbsp; | string | Number of rows to scan when inferring schema. | `10000` |
| &nbsp;`--ignore-errors`&nbsp; | flag | Ignore errors when parsing CSVs. | |
| &nbsp;`--truncate-ragged-lines`&nbsp; | flag | Truncate lines with more fields than the header. | |
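
> The options above can be combined; a hedged sketch that writes the JSON report to a file via `-o` (the `score_report.json` name is hypothetical) while scanning more rows for schema inference

```console
qsv scoresql --json --infer-len 50000 -o score_report.json data.csv "SELECT col1, col2 FROM data ORDER BY col1"
```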

<a name="common-options"></a>

## Common Options [](#nav)

| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Option&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Type | Description | Default |
|--------|------|-------------|--------|
| &nbsp;`-h,`<br>`--help`&nbsp; | flag | Display this message | |
| &nbsp;`-o,`<br>`--output`&nbsp; | string | Write output to <file> instead of stdout. | |
| &nbsp;`-d,`<br>`--delimiter`&nbsp; | string | The field delimiter for reading CSV data. Must be a single character. | `,` |
| &nbsp;`-q,`<br>`--quiet`&nbsp; | flag | Do not print informational messages to stderr. | |

---
**Source:** [`src/cmd/scoresql.rs`](https://github.com/dathere/qsv/blob/master/src/cmd/scoresql.rs)
| **[Table of Contents](TableOfContents.md)** | **[README](../../README.md)**
