Commit 46a20b2

jatorre and claude committed
Add per-tile statistics columns (v0.5.0 spec, v0.9.0 CLI)
Optional pre-computed per-tile statistics (count, min, max, sum, mean, stddev) as plain Parquet columns alongside each band. Enables UDF-free analytics on any SQL engine — no decompression needed.

- Spec bumped to v0.5.0 with tile statistics section
- CLI: --tile-stats flag for convert raster command
- Validator updated to recognize stats columns
- Docs/changelog updated across site

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 015b6aa commit 46a20b2

13 files changed: +299 -71 lines changed

deforest_carbon.parquet

Lines changed: 0 additions & 3 deletions
This file was deleted.

deforest_carbon.vrt

Lines changed: 0 additions & 15 deletions
This file was deleted.

docs/_config.yml

Lines changed: 2 additions & 2 deletions
@@ -4,8 +4,8 @@ url: https://raquet.io
 baseurl: ""

 # Versions (auto-updated by CI)
-cli_version: "0.8.1"
-spec_version: "0.4.0"
+cli_version: "0.9.0"
+spec_version: "0.5.0"

 # Build settings
 markdown: kramdown

docs/cli.md

Lines changed: 75 additions & 4 deletions
@@ -28,6 +28,7 @@ uv add raquet-io
 | `inspect` | Display metadata and statistics |
 | `validate` | Validate file structure and data integrity |
 | `export geotiff` | Export RaQuet back to GeoTIFF |
+| `partition` | Spatially partition a file for cloud storage |
 | `split-zoom` | Split by zoom level for optimized remote access |

 ---
@@ -56,6 +57,11 @@ raquet-io convert raster INPUT_FILE OUTPUT_FILE [OPTIONS]
 | `--block-size` | `256` | Block size in pixels: `256`, `512`, or `1024` (see [Block Size](#block-size)) |
 | `--target-size` || Target size for auto zoom calculation |
 | `--row-group-size` | `200` | Rows per Parquet row group (smaller = better remote pruning) |
+| `--overviews` | `auto` | Overview generation: `auto` (full pyramid) or `none` (native resolution only) |
+| `--min-zoom` || Minimum zoom level for overviews (overrides auto calculation) |
+| `--streaming` || Two-pass streaming mode for memory-safe conversion of large files |
+| `--workers` | `1` | Parallel worker processes (requires `--overviews none`) |
+| `--tile-stats` || Include per-tile statistics columns (count, min, max, sum, mean, stddev) |
 | `--band-layout` | `sequential` | Band storage: `sequential` or `interleaved` |
 | `--compression` | `gzip` | Compression: `gzip`, `jpeg`, `webp`, or `none` |
 | `--compression-quality` | `85` | Quality for lossy compression (1-100) |
@@ -76,8 +82,17 @@ raquet-io convert raster dem.tif dem.parquet --resampling bilinear
 # Larger blocks for dense data
 raquet-io convert raster satellite.tif output.parquet --block-size 512

-# Verbose output to monitor progress
-raquet-io convert raster large.tif output.parquet -v
+# Native resolution only (no overview pyramid), faster conversion
+raquet-io convert raster large.tif output.parquet --overviews none -v
+
+# Streaming mode for very large files (lower memory usage)
+raquet-io convert raster huge.tif output.parquet --streaming -v
+
+# Parallel conversion (4 workers, requires --overviews none)
+raquet-io convert raster huge.tif output.parquet --streaming --workers 4 --overviews none -v
+
+# Include per-tile statistics columns for UDF-free analytics
+raquet-io convert raster slope.tif slope.parquet --tile-stats --overviews none -v

 # Lossy compression for RGB satellite imagery (10-15x smaller files)
 raquet-io convert raster satellite.tif output.parquet \
@@ -86,6 +101,21 @@ raquet-io convert raster satellite.tif output.parquet \
 --compression-quality 85
 ```

+### Tile Statistics
+
+The `--tile-stats` flag adds pre-computed per-tile statistics as plain Parquet columns alongside each band. For each band, six columns are added: `{band}_count`, `{band}_min`, `{band}_max`, `{band}_sum`, `{band}_mean`, `{band}_stddev`.
+
+This enables **UDF-free analytics** on any SQL engine — no decompression needed:
+
+```sql
+-- Works on DuckDB, Snowflake, BigQuery, Databricks — no extensions required
+SELECT AVG(band_1_mean) AS avg_slope, MAX(band_1_max) AS steepest
+FROM 'slope.parquet'
+WHERE block != 0;
+```
+
+The overhead is negligible (typically <1% file size increase). See the [specification](https://github.com/CartoDB/raquet/blob/master/format-specs/raquet.md) for details.
+
 ### Block Size

 The `--block-size` option controls the pixel dimensions of each tile. The default is 256px (the web map standard), but 512px can be beneficial in certain scenarios.
@@ -197,7 +227,7 @@ raquet-io inspect landcover.parquet -v

 ```
 RaQuet File: spain_solar_ghi.parquet
-Version: 0.4.0
+Version: 0.5.0
 Size: 15.2 MB

 Dimensions: 9216 x 7936 pixels
@@ -259,7 +289,7 @@ raquet-io validate raster.parquet --json
 ```
 Validating: spain_solar_ghi.parquet
 ✓ Schema valid
-✓ Metadata valid (v0.4.0)
+✓ Metadata valid (v0.5.0)
 ✓ Pyramid complete (zoom 3-9)
 ✓ Band statistics valid
 ✓ Data integrity OK
@@ -299,6 +329,47 @@ raquet-io export geotiff raster.parquet output.tif -v

 ---

+## partition
+
+Spatially partition a RaQuet file into multiple files for optimized cloud storage access.
+
+```bash
+raquet-io partition INPUT_FILE OUTPUT_DIR [OPTIONS]
+```
+
+### Arguments
+
+| Argument | Description |
+|----------|-------------|
+| `INPUT_FILE` | Path to source RaQuet file |
+| `OUTPUT_DIR` | Directory for output partition files |
+
+### Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--partition-zoom` | `auto` | QUADBIN zoom level for partitioning, or `auto` |
+| `--target-size-mb` | `128` | Target partition file size in MB (used with `auto`) |
+| `--row-group-size` | `200` | Rows per Parquet row group |
+| `-v, --verbose` || Enable verbose output |
+
+### Examples
+
+```bash
+# Auto partition (targets ~128 MB files)
+raquet-io partition slope.parquet ./partitioned/
+
+# Custom target size
+raquet-io partition slope.parquet ./partitioned/ --target-size-mb 256
+
+# Explicit partition zoom
+raquet-io partition slope.parquet ./partitioned/ --partition-zoom 12
+```
+
+Partitioning is recommended for large datasets (>1 GB) that will be queried from cloud storage. Each partition file is a valid standalone RaQuet file with its own metadata. Tile statistics columns are preserved automatically.
+
+---
+
 ## split-zoom

 Split a RaQuet file by zoom level for optimized remote access.

docs/engines.md

Lines changed: 30 additions & 4 deletions
@@ -324,13 +324,39 @@ See [CARTO Analytics Toolbox documentation](https://docs.carto.com/data-and-anal

 ---

+## UDF-Free Analytics with Tile Statistics
+
+RaQuet v0.5.0 files converted with `--tile-stats` include pre-computed per-tile statistics as plain Parquet columns. This means **any SQL engine** can perform raster analytics without UDFs, extensions, or decompression:
+
+```sql
+-- Works on ANY Parquet-compatible engine — no extensions needed
+-- Find average slope and steepest point across all tiles
+SELECT
+AVG(band_1_mean) AS avg_slope,
+MAX(band_1_max) AS steepest_slope,
+SUM(band_1_count) AS total_pixels
+FROM 'slope.parquet'
+WHERE block != 0;
+
+-- Filter tiles by statistics (e.g., find flat areas for data centers)
+SELECT block, band_1_mean, band_1_max
+FROM 'slope.parquet'
+WHERE block != 0 AND band_1_max < 5.0;
+```
+
+Available columns per band: `{band}_count` (int64), `{band}_min`, `{band}_max`, `{band}_sum`, `{band}_mean`, `{band}_stddev` (all float64).
+
+---
+
 ## Performance Tips

-1. **Use spatial filtering** — Always include `ST_RasterIntersects` or equivalent to enable row group pruning
+1. **Use tile statistics for aggregates** — When you only need summary stats (mean, min, max), query the `{band}_mean` etc. columns directly instead of decompressing tile data
+
+2. **Use spatial filtering** — Always include `ST_RasterIntersects` or equivalent to enable row group pruning

-2. **Query remote files directly** — Parquet's columnar format enables efficient range requests; no need to download first
+3. **Query remote files directly** — Parquet's columnar format enables efficient range requests; no need to download first

-3. **Split by zoom for large files** — Use `raquet-io split-zoom` to create per-zoom files for optimal remote queries
+4. **Partition large files** — Use `raquet-io partition` to spatially split files for optimal cloud storage parallelism

-4. **Small row groups for remote access** — Use `--row-group-size 100-200` when converting files that will be queried remotely
+5. **Small row groups for remote access** — Use `--row-group-size 100-200` when converting files that will be queried remotely
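
Note on the aggregate queries in this hunk: `AVG(band_1_mean)` gives every tile equal weight, so it matches the pixel-level mean only when all tiles hold the same number of valid pixels. A pixel-weighted mean can be derived from the `{band}_sum` and `{band}_count` columns instead. The sketch below illustrates the difference using Python's built-in `sqlite3` module. The in-memory `tiles` table and its values are hypothetical stand-ins for a RaQuet file; only the column names come from the diff.

```python
import sqlite3

# Hypothetical stand-in for a RaQuet file: an in-memory table with made-up
# values. Only the column names (block, band_1_*) follow the diff.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tiles ("
    "block INTEGER, band_1_count INTEGER, band_1_sum REAL, band_1_mean REAL)"
)
conn.executemany(
    "INSERT INTO tiles VALUES (?, ?, ?, ?)",
    [
        (0, None, None, None),      # metadata row: statistics are NULL
        (101, 100, 1000.0, 10.0),   # full tile with 100 valid pixels
        (102, 10, 300.0, 30.0),     # edge tile, mostly nodata
    ],
)

unweighted, weighted = conn.execute(
    """
    SELECT
        AVG(band_1_mean),                    -- one vote per tile
        SUM(band_1_sum) / SUM(band_1_count)  -- one vote per valid pixel
    FROM tiles
    WHERE block != 0
    """
).fetchone()
conn.close()

print(f"mean of tile means:  {unweighted:.2f}")
print(f"pixel-weighted mean: {weighted:.2f}")
```

The two figures diverge whenever valid-pixel counts differ between tiles, which is common along raster edges. Which one is appropriate depends on the question being asked.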

docs/faq.md

Lines changed: 3 additions & 3 deletions
@@ -112,12 +112,12 @@ Several options:

 ## What's the metadata format?

-RaQuet v0.4.0 stores metadata as JSON in the `block=0` row:
+RaQuet v0.5.0 stores metadata as JSON in the `block=0` row:

 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 32768,
 "height": 14848,
 "crs": "EPSG:3857",
@@ -186,7 +186,7 @@ RaQuet is ideal when you need to query rasters outside PostgreSQL, join with dat

 Yes. RaQuet is used in production at CARTO and is supported by the [Analytics Toolbox](https://carto.com/analytics-toolbox) across BigQuery, Snowflake, Databricks, and PostgreSQL.

-The format specification is at v0.4.0. Version 0.3.0 is stable for production use; v0.4.0 adds experimental interleaved band layout and lossy compression support.
+The format specification is at v0.5.0. The format is stable for production use. v0.4.0 added interleaved band layout and lossy compression; v0.5.0 adds optional per-tile statistics columns for UDF-free analytics on data warehouses.

 ---

docs/index.md

Lines changed: 7 additions & 1 deletion
@@ -187,7 +187,13 @@ GeoParquet brought vector data into the lakehouse. RaQuet does the same for rast

 ## Changelog

-### v0.4.0 (Experimental)
+### v0.5.0
+- **Per-Tile Statistics Columns**: Optional pre-computed statistics (`count`, `min`, `max`, `sum`, `mean`, `stddev`) as plain Parquet columns alongside each band. Enables UDF-free analytics on any SQL engine — no decompression needed.
+- **New CLI options**: `--tile-stats` flag for conversion, `--overviews none`, `--streaming`, `--workers` for parallel conversion, and `partition` command for spatial partitioning.
+- **Metadata signal**: `tile_statistics` and `tile_statistics_columns` fields in metadata JSON when tile stats are present.
+- **Negligible overhead**: Typically <1% file size increase for the statistics columns.
+
+### v0.4.0
 - **Interleaved Band Layout**: New `band_layout: "interleaved"` option stores all bands in a single `pixels` column using Band Interleaved by Pixel (BIP) format. This can reduce HTTP requests for RGB visualization by ~40%.
 - **Lossy Compression**: Support for JPEG and WebP compression for photographic imagery. Achieves 10-15x smaller files compared to gzip for satellite imagery.
 - **New metadata fields**: `band_layout`, `compression_quality` for controlling lossy compression.
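
The metadata signal listed in the v0.5.0 changelog entry lets a reader decide up front whether statistics columns exist. Below is a minimal sketch of that check, assuming the metadata JSON string has already been read from the `block = 0` row. The helper `expected_stats_columns` is hypothetical (not part of `raquet-io`), and band names are passed in explicitly because this diff does not show how they are laid out in the metadata.

```python
import json


def expected_stats_columns(metadata_json: str, band_names: list[str]) -> list[str]:
    """Return the `{band}_{stat}` column names a reader could expect.

    Hypothetical helper for illustration; not part of the raquet-io package.
    """
    metadata = json.loads(metadata_json)
    if not metadata.get("tile_statistics", False):
        return []  # file was written without tile statistics
    stats = metadata.get("tile_statistics_columns", [])
    return [f"{band}_{stat}" for band in band_names for stat in stats]


# Trimmed, made-up metadata; real files carry many more fields.
example = json.dumps({
    "file_format": "raquet",
    "version": "0.5.0",
    "tile_statistics": True,
    "tile_statistics_columns": ["count", "min", "max", "sum", "mean", "stddev"],
})

columns = expected_stats_columns(example, ["band_1"])
print(columns)
```

Treating a missing `tile_statistics` field as "no statistics" keeps the check compatible with files that omit the columns, which the spec says remain valid RaQuet.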

format-specs/raquet.md

Lines changed: 49 additions & 10 deletions
@@ -1,4 +1,4 @@
-# RaQuet Specification v0.4.0
+# RaQuet Specification v0.5.0

 ## Overview

@@ -86,6 +86,45 @@ When `band_layout` is `"interleaved"`:
 **Primary Key with Time Dimension:**
 When `time_cf` is present, the combination of (`block`, `time_cf`) forms the unique key for each row. Multiple rows may have the same `block` value but different timestamps. Without time columns, `block` alone is unique (excluding the metadata row at `block = 0`).

+#### Tile Statistics Columns (Per-Tile Statistics)
+
+Optional pre-computed per-tile statistics for each band. These columns enable analytical queries without decompressing tile pixel data, which is especially valuable for cloud data warehouses (Snowflake, BigQuery, Databricks) where UDF-based decompression is expensive.
+
+**Column naming convention**: `{band_name}_{stat}` where `{band_name}` matches the band column name and `{stat}` is one of the supported statistics.
+
+**Supported statistics**:
+
+| Stat | Type | Description |
+|------|------|-------------|
+| `count` | int64 | Number of valid (non-nodata) pixels in the tile |
+| `min` | float64 | Minimum pixel value (excluding nodata) |
+| `max` | float64 | Maximum pixel value (excluding nodata) |
+| `sum` | float64 | Sum of all valid pixel values |
+| `mean` | float64 | Mean of valid pixel values |
+| `stddev` | float64 | Population standard deviation of valid pixel values |
+
+**Example columns** for a file with `band_1`:
+- `band_1_count` (int64)
+- `band_1_min` (float64)
+- `band_1_max` (float64)
+- `band_1_sum` (float64)
+- `band_1_mean` (float64)
+- `band_1_stddev` (float64)
+
+**Rules**:
+- The metadata row (`block = 0`) MUST have NULL values for all tile statistics columns.
+- Statistics are computed from raw (pre-compression) pixel data. For lossy compression (JPEG/WebP), statistics reflect source values, not post-decode values.
+- Empty tiles (all nodata) are excluded during conversion and have no statistics rows.
+- For time-series data with `time_cf`, statistics are per-tile-per-timestep (one set of stats per row).
+- Overview tiles have their own statistics computed from their (coarser resolution) pixel data.
+- For interleaved band layout, statistics columns use the original band names from metadata (e.g., `band_1_mean`, `band_2_mean`), not the `pixels` column name.
+
+**Metadata signal**: When tile statistics columns are present, the metadata JSON includes:
+- `"tile_statistics": true` — indicates that per-tile statistics columns are present.
+- `"tile_statistics_columns": ["count", "min", "max", "sum", "mean", "stddev"]` — lists which statistics are included.
+
+**Compatibility**: Tile statistics columns are plain Parquet columns. Any Parquet reader (pandas, DuckDB, Snowflake, BigQuery, Spark, Polars) can query them directly without UDFs or custom extensions. Files without tile statistics columns remain valid RaQuet.
+
 ## Tiling Scheme

 RaQuet uses the **QUADBIN** tiling scheme for spatial indexing. QUADBIN is a hierarchical geospatial index that encodes Web Mercator tile coordinates `(x, y, z)` into a single 64-bit integer. This encoding enables efficient spatial queries and Parquet row group pruning.
@@ -158,7 +197,7 @@ The metadata is stored as a JSON string in the `metadata` column where `block =
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 9216,
 "height": 7936,
 "crs": "EPSG:3857",
@@ -213,7 +252,7 @@ The metadata is stored as a JSON string in the `metadata` column where `block =

 - **Format Identification**
 - `file_format`: String identifying this as a RaQuet file. MUST be `"raquet"`.
-- `version`: String indicating the RaQuet specification version. Current version is "0.4.0".
+- `version`: String indicating the RaQuet specification version. Current version is "0.5.0".

 - **Raster Dimensions**
 - `width`, `height`: Integers specifying full resolution raster dimensions in pixels.
@@ -338,7 +377,7 @@ The metadata is stored as a JSON string in the `metadata` column where `block =
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 9216,
 "height": 7936,
 "crs": "EPSG:3857",
@@ -380,7 +419,7 @@ The metadata is stored as a JSON string in the `metadata` column where `block =
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 1024,
 "height": 1024,
 "crs": "EPSG:3857",
@@ -435,7 +474,7 @@ The metadata is stored as a JSON string in the `metadata` column where `block =
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 10980,
 "height": 10980,
 "crs": "EPSG:3857",
@@ -485,7 +524,7 @@ This example shows a Sentinel-2 TCI (True Color Image) stored with interleaved b
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 32768,
 "height": 14848,
 "crs": "EPSG:3857",
@@ -521,7 +560,7 @@ This example shows a Sentinel-2 TCI (True Color Image) stored with interleaved b
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 1440,
 "height": 721,
 "crs": "EPSG:3857",
@@ -571,7 +610,7 @@ This example represents 36 years (1980-2015) of monthly sea surface temperature
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "width": 400752,
 "height": 131072,
 "crs": "EPSG:3857",
@@ -686,7 +725,7 @@ Producers MAY extend the metadata with custom fields. To avoid conflicts with fu
 ```json
 {
 "file_format": "raquet",
-"version": "0.4.0",
+"version": "0.5.0",
 "custom": {
 "organization": "ACME Corp",
 "project_id": "climate-2024",
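
For reference, the six statistics defined in the Tile Statistics Columns section of this file can be sketched in a few lines. This is an illustration of the definitions only, not the converter's actual implementation. The function name `tile_stats` is hypothetical, pixels are taken as a flat list of floats, and nodata is assumed to be a finite sentinel value (a NaN nodata value would need an explicit `math.isnan` check).

```python
import math
from typing import Optional


def tile_stats(pixels: list[float], nodata: Optional[float]) -> Optional[dict]:
    """Compute count, min, max, sum, mean, stddev over valid pixels.

    Returns None for an all-nodata tile, mirroring the rule that empty
    tiles are excluded and have no statistics row.
    """
    valid = [p for p in pixels if nodata is None or p != nodata]
    if not valid:
        return None
    count = len(valid)
    total = math.fsum(valid)
    mean = total / count
    # Population variance: divide by N, not N - 1, as the spec table states.
    variance = math.fsum((p - mean) ** 2 for p in valid) / count
    return {
        "count": count,
        "min": float(min(valid)),
        "max": float(max(valid)),
        "sum": total,
        "mean": mean,
        "stddev": math.sqrt(variance),
    }


# Made-up tile: eight valid pixels plus one nodata pixel.
stats = tile_stats(
    [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, -9999.0], nodata=-9999.0
)
print(stats)
```

Because the spec requires statistics to be computed from raw pixel data, a function like this would run before any JPEG or WebP encoding of the tile.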

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "raquet-io"
-version = "0.8.1"
+version = "0.9.0"
 description = "RaQuet - Raster data in Parquet format with QUADBIN spatial indexing"
 readme = "README.md"
 license = {text = "BSD-3-Clause"}
