Commit 42e2208 (parent: 9ce7503)

docs: update help markdown
[skip ci]

3 files changed, +133 −0 lines changed

docs/help/TableOfContents.md

Lines changed: 1 addition & 0 deletions
@@ -55,6 +55,7 @@
| [safenames](safenames.md)<br>[![CKAN](../images/ckan.png)](#legend "has CKAN-aware integration options.") | Modify headers of a CSV to only have ["safe" names](../../src/cmd/safenames.rs#L5-L14) - guaranteed "database-ready"/"CKAN-ready" names. |
| [sample](sample.md)<br>[📇](#legend "uses an index when available.")[🌐](#legend "has web-aware options.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.") | Randomly draw rows (with optional seed) from a CSV using seven different sampling methods - [reservoir](https://en.wikipedia.org/wiki/Reservoir_sampling) (default), [indexed](https://en.wikipedia.org/wiki/Random_access), [bernoulli](https://en.wikipedia.org/wiki/Bernoulli_sampling), [systematic](https://en.wikipedia.org/wiki/Systematic_sampling), [stratified](https://en.wikipedia.org/wiki/Stratified_sampling), [weighted](https://doi.org/10.1016/j.ipl.2005.11.003) & [cluster sampling](https://en.wikipedia.org/wiki/Cluster_sampling). Supports sampling from CSVs on remote URLs. |
| [schema](schema.md)<br>[📇](#legend "uses an index when available.")[😣](#legend "uses additional memory proportional to the cardinality of the columns in the CSV.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.")[🪄](#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".")[🐻‍❄️](#legend "command powered/accelerated by vectorized query engine.") | Infer either a [JSON Schema Validation Draft 2020-12](https://json-schema.org/draft/2020-12/json-schema-validation) ([Example](https://github.com/dathere/qsv/blob/master/resources/test/311_Service_Requests_from_2010_to_Present-2022-03-04.csv.schema.json)) or [Polars Schema](https://docs.pola.rs/user-guide/lazy/schemas/) ([Example](https://github.com/dathere/qsv/blob/master/resources/test/NYC_311_SR_2010-2020-sample-1M.pschema.json)) from CSV data. In JSON Schema Validation mode, it produces a `.schema.json` file replete with inferred data type & domain/range validation rules derived from [`stats`](../../README.md#stats_deeplink). Uses multithreading to go faster if an index is present. See [`validate`](../../README.md#validate_deeplink) command to use the generated JSON Schema to validate if similar CSVs comply with the schema. With the `--polars` option, it produces a `.pschema.json` file that all polars commands (`sqlp`, `joinp` & `pivotp`) use to determine the data type of each column & to optimize performance. Both schemas are editable and can be fine-tuned. For JSON Schema, to refine the inferred validation rules. For Polars Schema, to change the inferred Polars data types. |
| [scoresql](scoresql.md)<br>[🐻‍❄️](#legend "command powered/accelerated by vectorized query engine.")[🪄](#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".") | Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions BEFORE running the query. Supports Polars (default) and DuckDB modes. |
| [search](search.md)<br>[📇](#legend "uses an index when available.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.") | Run a regex over a CSV. Applies the regex to selected fields & shows only matching rows. |
| [searchset](searchset.md)<br>[📇](#legend "uses an index when available.")[🏎️](#legend "multithreaded and/or faster when an index (📇) is available.")[👆](#legend "has powerful column selector support. See `select` for syntax.") | _Run multiple regexes over a CSV in a single pass._ Applies the regexes to each field individually & shows only matching rows. |
| [select](select.md)<br>[👆](#legend "has powerful column selector support. See `select` for syntax.") | Select, re-order, reverse, duplicate or drop columns. |

docs/help/pragmastat.md

Lines changed: 27 additions & 0 deletions
@@ -151,6 +151,30 @@ qsv pragmastat --compare1 center:42.0,spread:0.5 --select latitude data.csv
qsv pragmastat --compare2 shift:0,disparity:0.8 --select latency_ms,price data.csv
```

> Fast exploratory analysis with subsampling (~100x speedup on large datasets)

```console
qsv pragmastat --standalone --subsample 10000 data.csv
```

> Reproducible subsampling with a specific seed

```console
qsv pragmastat --standalone --subsample 10000 --seed 123 data.csv
```

> Skip confidence bounds for ~2x speedup

```console
qsv pragmastat --standalone --no-bounds data.csv
```

> Combined: ~200x speedup for large datasets

```console
qsv pragmastat --standalone --subsample 10000 --no-bounds data.csv
```

Full Pragmastat manual:
<https://github.com/AndreyAkinshin/pragmastat/releases/download/v12.0.0/pragmastat-v12.0.0.pdf>
<https://pragmastat.dev/> (latest version)

@@ -179,6 +203,9 @@ qsv pragmastat --help
| &nbsp;`--stats-options`&nbsp; | string | Options to pass to the stats command if baseline stats need to be generated. The options are passed as a single string that will be split by whitespace. | `--infer-dates --infer-boolean --mad --quartiles --force --stats-jsonl` |
| &nbsp;`--round`&nbsp; | string | Round statistics to <n> decimal places. Rounding follows Midpoint Nearest Even (Bankers Rounding) rule. | `4` |
| &nbsp;`--force`&nbsp; | flag | Force recomputing ps_* columns even if they already exist in the stats cache. | |
| &nbsp;`--subsample`&nbsp; | string | Randomly subsample N values per column before computing. Speeds up large datasets while maintaining statistical robustness. Recommended: 10000-50000 for exploratory analysis. | |
| &nbsp;`--seed`&nbsp; | string | Seed for reproducible subsampling. If not specified, defaults to 42 when --subsample is used. | |
| &nbsp;`--no-bounds`&nbsp; | flag | Skip confidence bounds computation (~2x faster). Incompatible with --compare1/--compare2. | |
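
> Since `--seed` defaults to 42 when `--subsample` is given, these two invocations should draw the same subsample (a sketch; `data.csv` is a placeholder file name)

```console
qsv pragmastat --standalone --subsample 10000 data.csv
qsv pragmastat --standalone --subsample 10000 --seed 42 data.csv
```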

<a name="common-options"></a>

docs/help/scoresql.md

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
# scoresql

> Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a performance score with actionable optimization suggestions BEFORE running the query. Supports Polars (default) and DuckDB modes.

**[Table of Contents](TableOfContents.md)** | **Source: [src/cmd/scoresql.rs](https://github.com/dathere/qsv/blob/master/src/cmd/scoresql.rs)** | [🐻‍❄️](TableOfContents.md#legend "command powered/accelerated by vectorized query engine.")[🪄](TableOfContents.md#legend "\"automagical\" commands that uses stats and/or frequency tables to work \"smarter\" & \"faster\".")

<a name="nav"></a>
[Description](#description) | [Examples](#examples) | [Usage](#usage) | [Scoresql Options](#scoresql-options) | [Common Options](#common-options)

<a name="description"></a>

## Description [](#nav)

Analyze a SQL query against CSV file caches (stats, moarstats, frequency) to produce a
performance score with actionable optimization suggestions BEFORE running the query.

Accepts the same input/SQL arguments as sqlp. Outputs a human-readable performance report
(default) or JSON (--json). Supports Polars mode (default) and DuckDB mode (--duckdb).

Scoring factors include:
* Query plan analysis (EXPLAIN output from Polars or DuckDB)
* Type optimization (column types vs. usage in query)
* Join key cardinality and data distribution
* Filter selectivity from frequency cache
* Query anti-pattern detection (SELECT *, missing LIMIT, cartesian joins, etc.)
* Infrastructure checks (index files, cache freshness)

Caches are auto-generated when missing:
* stats cache via `qsv stats --everything --stats-jsonl`
* frequency cache via `qsv frequency --frequency-jsonl`

<a name="examples"></a>

## Examples [](#nav)

> Score a simple filter query against a single CSV file

```console
qsv scoresql data.csv "SELECT * FROM data WHERE col1 > 10"
```

> Output the score report as JSON instead of the default human-readable format

```console
qsv scoresql --json data.csv "SELECT col1, col2 FROM data ORDER BY col1"
```

> Score a join query across two CSV files

```console
qsv scoresql data.csv data2.csv "SELECT * FROM data JOIN data2 ON data.id = data2.id"
```

> Use DuckDB for query plan analysis instead of Polars

```console
qsv scoresql --duckdb data.csv "SELECT * FROM data WHERE status = 'active'"
```

> Use _t_N aliases just like sqlp (see sqlp documentation)

```console
qsv scoresql data.csv data2.csv "SELECT * FROM _t_1 JOIN _t_2 ON _t_1.id = _t_2.id"
```

For more examples, see [tests](https://github.com/dathere/qsv/blob/master/tests/test_scoresql.rs).

<a name="usage"></a>

## Usage [](#nav)

```console
qsv scoresql [options] <input>... <sql>
qsv scoresql --help
```

<a name="scoresql-options"></a>

## Scoresql Options [](#nav)

| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Option&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Type | Description | Default |
|--------|------|-------------|--------|
| &nbsp;`--json`&nbsp; | flag | Output results as JSON instead of human-readable report. | |
| &nbsp;`--duckdb`&nbsp; | flag | Use DuckDB for query plan analysis instead of Polars. Requires the QSV_DESCRIBEGPT_DB_ENGINE environment variable to be set to the path of the DuckDB binary. | |
| &nbsp;`--try-parsedates`&nbsp; | flag | Automatically try to parse dates/datetimes and time. | |
| &nbsp;`--infer-len`&nbsp; | string | Number of rows to scan when inferring schema. | `10000` |
| &nbsp;`--ignore-errors`&nbsp; | flag | Ignore errors when parsing CSVs. | |
| &nbsp;`--truncate-ragged-lines`&nbsp; | flag | Truncate lines with more fields than the header. | |
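
> The options above can be combined; a hedged sketch that writes the JSON report to a file via `-o` (the `score_report.json` name is hypothetical) while scanning more rows for schema inference

```console
qsv scoresql --json --infer-len 50000 -o score_report.json data.csv "SELECT col1, col2 FROM data ORDER BY col1"
```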

<a name="common-options"></a>

## Common Options [](#nav)

| &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Option&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; | Type | Description | Default |
|--------|------|-------------|--------|
| &nbsp;`-h,`<br>`--help`&nbsp; | flag | Display this message | |
| &nbsp;`-o,`<br>`--output`&nbsp; | string | Write output to <file> instead of stdout. | |
| &nbsp;`-d,`<br>`--delimiter`&nbsp; | string | The field delimiter for reading CSV data. Must be a single character. | `,` |
| &nbsp;`-q,`<br>`--quiet`&nbsp; | flag | Do not print informational messages to stderr. | |

---
**Source:** [`src/cmd/scoresql.rs`](https://github.com/dathere/qsv/blob/master/src/cmd/scoresql.rs)
| **[Table of Contents](TableOfContents.md)** | **[README](../../README.md)**
