2 changes: 2 additions & 0 deletions .github/workflows/continuous-integration.yml
@@ -32,3 +32,5 @@ jobs:
run: uv run pytest tests -v
- name: Check docs
run: uv run mkdocs build --strict
- name: Check jsonschema
run: check-jsonschema --schemafile spec/json-schema/metadata.json spec/example-metadata.json
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -14,3 +14,11 @@ repos:
- id: trailing-whitespace
- id: end-of-file-fixer
exclude: tests/.*\.json
- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: v0.11.8
hooks:
# Run the linter.
- id: ruff
# Run the formatter.
- id: ruff-format
6 changes: 6 additions & 0 deletions README.md
@@ -40,3 +40,9 @@ uv run pre-commit install
uv run pytest
scripts/lint
```

Validate the example collection metadata against the jsonschema:

```shell
check-jsonschema --schemafile spec/json-schema/metadata.json spec/example-metadata.json
```
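
For programmatic validation (for example, in a test suite), a minimal sketch using the `jsonschema` package from the dev dependencies:

```python
import json

from jsonschema import validate

# Validate the example metadata against the stac-geoparquet metadata schema.
with open("spec/json-schema/metadata.json") as f:
    schema = json.load(f)
with open("spec/example-metadata.json") as f:
    instance = json.load(f)

validate(instance=instance, schema=schema)  # raises ValidationError on failure
```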
6 changes: 4 additions & 2 deletions pyproject.toml
@@ -44,17 +44,19 @@ pc = ["adlfs", "azure-data-tables", "psycopg[binary,pool]", "pypgstac", "tqdm"]

[dependency-groups]
dev = [
"check-jsonschema",
"jsonschema",
"mypy",
"numpy>=2",
"ruff",
"pre-commit",
"pytest-recording>=0.13.2",
"pytest",
"requests",
"ruff",
"stac-geoparquet[pc]",
"stac-geoparquet[pgstac]",
"types-python-dateutil",
"types-requests",
"pytest-recording>=0.13.2",
"vcrpy>=7.0.0",
]
docs = [
3 changes: 1 addition & 2 deletions scripts/lint
@@ -2,6 +2,5 @@

set -e

uv run ruff check
uv run ruff format --check
uv run pre-commit run --all-files
uv run mypy stac_geoparquet
40 changes: 40 additions & 0 deletions spec/example-metadata.json
@@ -0,0 +1,40 @@
{
"version": "1.0.0",
"collection": {
"id": "simple-collection",
"type": "Collection",
"stac_extensions": [],
"stac_version": "1.1.0",
"description": "A simple collection demonstrating core catalog fields with links to a couple of items",
"title": "Simple Example Collection",
"keywords": [
"simple",
"example",
"collection"
],
"providers": [],
"extent": {
"spatial": {
"bbox": [
[
172.91173669923782,
1.3438851951615003,
172.95469614953714,
1.3690476620161975
]
]
},
"temporal": {
"interval": [
[
"2020-12-11T22:38:32.125Z",
"2020-12-14T18:02:31.437Z"
]
]
}
},
"license": "CC-BY-4.0",
"summaries": {},
"links": []
}
}
19 changes: 19 additions & 0 deletions spec/json-schema/metadata.json
@@ -0,0 +1,19 @@
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "https://stac-utils.github.io/stac-geoparquet/json-schema/metadata.json",
"title": "STAC GeoParquet Metadata",
"description": "JSON Schema for STAC GeoParquet metadata stored in Parquet file metadata",
"type": "object",
"properties": {
"version": {
"type": "string",
"const": "1.0.0",
"description": "The stac-geoparquet metadata version."
},
"collection": {
"type": "object",
"description": "This object represents a Collection in a SpatioTemporal Asset Catalog. Note that this object is not validated against the STAC Collection schema. You'll need to validate it separately from stac-geoparquet."
}
},
"required": ["version"]
}
49 changes: 41 additions & 8 deletions spec/stac-geoparquet-spec.md
@@ -31,11 +31,11 @@ most of the fields should be the same in STAC and in GeoParquet.
| _property columns_ | _varies_ | - | Each property should use the relevant Parquet type, and be pulled out of the properties object to be a top-level Parquet field |

- Must be valid GeoParquet, with proper metadata. Ideally the geometry types are defined and as narrow as possible.
- Strongly recommend to only have one GeoParquet per STAC 'Collection'. Not doing this will lead to an expanded GeoParquet schema (the union of all the schemas of the collection) with lots of empty data
- Strongly recommend storing items that are mostly homogeneous (i.e. have the same fields). Parquet is a columnar format; storing items with many different fields will lead to an expanded Parquet schema with lots of empty data. In practice, this means storing a single collection, or only collections with very similar item properties, in a single stac-geoparquet dataset.
- Any field in 'properties' of the STAC item should be moved up to be a top-level field in the GeoParquet.
- STAC GeoParquet does not support properties that are named such that they collide with a top-level key.
- datetime columns should be stored as a [native timestamp][timestamp], not as a string
- The Collection JSON should be included in the Parquet metadata. See [Collection JSON](#including-a-stac-collection-json-in-a-stac-geoparquet-collection) below.
- The Collection JSON object should be included in the Parquet metadata. See [Collection JSON](#stac-collection-object) below.
- Any other properties that would be stored as GeoJSON in a STAC JSON Item (e.g. `proj:geometry`) should be stored as a binary column with WKB encoding. This simplifies the handling of collections with multiple geometry types.
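
For illustration, one way to produce such a WKB value with shapely (a sketch; the `proj:geometry` value here is a stand-in):

```python
import shapely
from shapely.geometry import shape

# A GeoJSON geometry as it might appear under an item's "proj:geometry".
geojson_geom = {"type": "Point", "coordinates": [172.91, 1.34]}

geom = shape(geojson_geom)        # GeoJSON dict -> shapely geometry
wkb_value = shapely.to_wkb(geom)  # bytes suitable for a binary WKB column
```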

### Link Struct
@@ -69,17 +69,48 @@ To take advantage of Parquet's columnar nature and compression, the assets shoul

See [Asset Object][asset] for more.

## Including a STAC Collection JSON in a STAC Geoparquet Collection
### Parquet Metadata

stac-geoparquet uses Parquet [File Metadata](https://parquet.apache.org/docs/file-format/metadata/) to store metadata about the dataset.
All stac-geoparquet metadata is stored under the key `stac-geoparquet` in the Parquet file metadata.

See [`example-metadata.json`](https://github.com/stac-utils/stac-geoparquet/blob/main/spec/example-metadata.json) for an example.

A [JSON Schema file][schema] is provided for tools to validate against.
Note that this schema does *not* validate the `collection` object against the
STAC Collection schema; you'll need to validate that separately.


| Field Name | Type | Description |
| -------------| -----------------------| ----------------------------------------------------------------------- |
| `version` | string | The stac-geoparquet metadata version. Currently only `"1.0.0"` is allowed. |
| `collection` | STAC Collection object | STAC Collection metadata. |

Note that this metadata is distinct from the file metadata required by
[geoparquet].
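
As an illustration, this metadata can be read back with pyarrow; a sketch, assuming
a stac-geoparquet file at the placeholder path `items.parquet`:

```python
import json

import pyarrow.parquet as pq

# Parquet key-value metadata lives on the file's schema; keys and values are bytes.
file_metadata = pq.read_schema("items.parquet").metadata
stac_meta = json.loads(file_metadata[b"stac-geoparquet"])

print(stac_meta["version"])               # e.g. "1.0.0"
collection = stac_meta.get("collection")  # STAC Collection object, if embedded
```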

#### stac-geoparquet Version

The `version` field stores the version of the stac-geoparquet specification that
the data complies with. Readers can use this field to determine which features
and fields are available.

Currently, the only allowed value is the string `"1.0.0"`.

Note: early versions of this specification didn't include a `version` field. Readers
aiming for maximum compatibility may attempt to read files where this key is absent,
even though it is required from 1.0.0 onwards.
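
For instance, a tolerant reader might treat a missing key as a pre-1.0.0 file. A
hypothetical helper, not part of this library:

```python
import json


def read_stac_geoparquet_version(file_metadata: dict[bytes, bytes]) -> str | None:
    """Return the stac-geoparquet version, or None for pre-1.0.0 files."""
    raw = file_metadata.get(b"stac-geoparquet")
    if raw is None:
        # Early files carry no "stac-geoparquet" metadata entry at all.
        return None
    # "version" is required from 1.0.0 onwards, but stay defensive anyway.
    return json.loads(raw).get("version")
```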

#### STAC Collection Object

To make a stac-geoparquet file a fully self-contained representation, you can
include the Collection JSON in the Parquet metadata. If present in the [Parquet
file metadata][parquet-metadata], the key must be `stac:collection` and the
value must be a JSON string with the Collection JSON.
include the Collection JSON document in the Parquet metadata under the
`collection` key. This should contain a STAC [Collection].
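
With this library, a sketch of writing a self-contained file, reusing the example
Collection from this repo (`items` is assumed to be a list of STAC item dicts):

```python
import json

from stac_geoparquet.arrow import parse_stac_items_to_arrow, to_parquet

with open("spec/example-metadata.json") as f:
    collection = json.load(f)["collection"]

# `items` is assumed to be a list of STAC item dicts.
table = parse_stac_items_to_arrow(items)

# Embeds the Collection under the "collection" key of the
# "stac-geoparquet" entry in the Parquet file metadata.
to_parquet(table, "items.parquet", collection_metadata=collection)
```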

## Referencing a STAC Geoparquet Collections in a STAC Collection JSON

A common use case of stac-geoparquet is to create a mirror of a STAC collection. To refer to this mirror in the original collection, use an [Asset Object](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#asset-object) at the collection level of the STAC JSON that includes the `application/vnd.apache.parquet` Media type and `collection-mirror` Role type to describe the function of the Geoparquet STAC Collection Asset.

A common use case of stac-geoparquet is to create a mirror of a STAC collection. To refer to this mirror in the original collection, use an [Asset Object](https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#asset-object) at the collection level of the STAC JSON that includes the `application/vnd.apache.parquet` Media type and `collection-mirror` Role type to describe the function of the Geoparquet STAC Collection Asset.
For example:

| Field Name | Type | Value |
@@ -105,3 +136,5 @@ The principles here can likely be used to map into other geospatial data formats
[common-media-types]: https://github.com/radiantearth/stac-spec/blob/master/best-practices.md#common-media-types-in-stac
[timestamp]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
[parquet-metadata]: https://github.com/apache/parquet-format#metadata
[Collection]: https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md
[schema]: https://github.com/stac-utils/stac-geoparquet/blob/main/spec/json-schema/metadata.json
4 changes: 2 additions & 2 deletions stac_geoparquet/arrow/_delta_lake.py
@@ -12,7 +12,7 @@
from stac_geoparquet.arrow._to_parquet import (
DEFAULT_PARQUET_SCHEMA_VERSION,
SUPPORTED_PARQUET_SCHEMA_VERSIONS,
create_geoparquet_metadata,
create_parquet_metadata,
)

if TYPE_CHECKING:
@@ -51,7 +51,7 @@ def parse_stac_ndjson_to_delta_lake(
input_path, chunk_size=chunk_size, schema=schema, limit=limit
)
schema = record_batch_reader.schema.with_metadata(
create_geoparquet_metadata(
create_parquet_metadata(
record_batch_reader.schema, schema_version=schema_version
)
)
47 changes: 43 additions & 4 deletions stac_geoparquet/arrow/_to_parquet.py
@@ -3,7 +3,7 @@
import json
from collections.abc import Iterable
from pathlib import Path
from typing import Any
from typing import Any, Literal

import pyarrow as pa
import pyarrow.parquet as pq
@@ -18,6 +18,9 @@
from stac_geoparquet.arrow._schema.models import InferredSchema
from stac_geoparquet.arrow.types import ArrowStreamExportable

STAC_GEOPARQUET_VERSION: Literal["1.0.0"] = "1.0.0"
STAC_GEOPARQUET_METADATA_KEY = b"stac-geoparquet"


def parse_stac_ndjson_to_parquet(
input_path: str | Path | Iterable[str | Path],
@@ -27,6 +30,7 @@ def parse_stac_ndjson_to_parquet(
schema: pa.Schema | InferredSchema | None = None,
limit: int | None = None,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
collection_metadata: dict[str, Any] | None = None,
**kwargs: Any,
) -> None:
"""Convert one or more newline-delimited JSON STAC files to GeoParquet
@@ -45,6 +49,9 @@
limit: The maximum number of JSON records to convert.
schema_version: GeoParquet specification version; if not provided will default
to latest supported version.
collection_metadata: A dictionary representing a STAC Collection. This will be
stored under the `collection` key of the `stac-geoparquet` entry in the
Parquet file metadata.

All other keyword args are passed on to
[`pyarrow.parquet.ParquetWriter`][pyarrow.parquet.ParquetWriter].
@@ -57,6 +64,7 @@
output_path=output_path,
schema_version=schema_version,
**kwargs,
collection_metadata=collection_metadata,
)


@@ -65,6 +73,7 @@ def to_parquet(
output_path: str | Path,
*,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS = DEFAULT_PARQUET_SCHEMA_VERSION,
collection_metadata: dict[str, Any] | None = None,
**kwargs: Any,
) -> None:
"""Write an Arrow table with STAC data to GeoParquet
@@ -82,6 +91,9 @@
Keyword Args:
schema_version: GeoParquet specification version; if not provided will default
to latest supported version.
collection_metadata: A dictionary representing a STAC Collection. This will be
stored under the `collection` key of the `stac-geoparquet` entry in the
Parquet file metadata.

All other keyword args are passed on to
[`pyarrow.parquet.ParquetWriter`][pyarrow.parquet.ParquetWriter].
@@ -90,17 +102,22 @@
reader = pa.RecordBatchReader.from_stream(table)

schema = reader.schema.with_metadata(
create_geoparquet_metadata(reader.schema, schema_version=schema_version)
create_parquet_metadata(
reader.schema,
schema_version=schema_version,
collection_metadata=collection_metadata,
)
)
with pq.ParquetWriter(output_path, schema, **kwargs) as writer:
for batch in reader:
writer.write_batch(batch)


def create_geoparquet_metadata(
def create_parquet_metadata(
schema: pa.Schema,
*,
schema_version: SUPPORTED_PARQUET_SCHEMA_VERSIONS,
collection_metadata: dict[str, Any] | None = None,
) -> dict[bytes, bytes]:
# TODO: include bbox of geometries
column_meta = {
@@ -141,7 +158,12 @@ def create_geoparquet_metadata(
"crs": None,
}

return {b"geo": json.dumps(geo_meta).encode("utf-8")}
geoparquet_metadata = create_stac_geoparquet_metadata(collection_metadata)

return {
b"geo": json.dumps(geo_meta).encode("utf-8"),
STAC_GEOPARQUET_METADATA_KEY: json.dumps(geoparquet_metadata).encode("utf-8"),
}


def schema_version_has_bbox_mapping(
@@ -152,3 +152,20 @@
metadata.
"""
return int(schema_version.split(".")[1]) >= 1


def create_stac_geoparquet_metadata(
collection_metadata: dict[str, Any] | None = None,
) -> dict[str, Any]:
"""
Create the stac-geoparquet metadata object for the Parquet file.

This will be stored under the key `stac-geoparquet` in the Parquet file metadata.
It must be compatible with the metadata spec.
"""
result: dict[str, Any] = {
"version": STAC_GEOPARQUET_VERSION,
}
if collection_metadata:
result["collection"] = collection_metadata
return result
1 change: 1 addition & 0 deletions tests/data/3dep-lidar-copc-pc-collection.json
@@ -0,0 +1 @@
{"id":"3dep-lidar-copc","type":"Collection","links":[{"rel":"items","type":"application/geo+json","href":"https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-lidar-copc/items"},{"rel":"parent","type":"application/json","href":"https://planetarycomputer.microsoft.com/api/stac/v1/"},{"rel":"root","type":"application/json","href":"https://planetarycomputer.microsoft.com/api/stac/v1/"},{"rel":"self","type":"application/json","href":"https://planetarycomputer.microsoft.com/api/stac/v1/collections/3dep-lidar-copc"},{"rel":"license","href":"https://www.usgs.gov/3d-elevation-program/about-3dep-products-services","title":"About 3DEP Products & Services"},{"rel":"describedby","href":"https://planetarycomputer.microsoft.com/dataset/3dep-lidar-copc","title":"Human readable dataset overview and reference","type":"text/html"}],"title":"USGS 3DEP Lidar Point Cloud","assets":{"thumbnail":{"href":"https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/3dep-lidar-copc-thumbnail.png","type":"image/png","roles":["thumbnail"],"title":"Thumbnail"}},"extent":{"spatial":{"bbox":[[-166.8546920006028,17.655357747708283,-64.56116757979399,71.39330810146807],[144.60180842809473,13.21774453924126,146.08202179248926,18.18369664008955]]},"temporal":{"interval":[["2012-01-01T00:00:00Z","2022-01-01T00:00:00Z"]]}},"license":"proprietary","keywords":["USGS","3DEP","COG","Point cloud"],"providers":[{"name":"Landrush","roles":["processor","producer"]},{"url":"https://www.usgs.gov/core-science-systems/ngp/3dep/","name":"USGS","roles":["processor","producer","licensor"]},{"url":"https://planetarycomputer.microsoft.com","name":"Microsoft","roles":["host","processor"]}],"summaries":{"gsd":[2.0]},"description":"This collection contains source data from the [USGS 3DEP program](https://www.usgs.gov/3d-elevation-program) reformatted into the [COPC](https://copc.io) format. A COPC file is a LAZ 1.4 file that stores point data organized in a clustered octree. It contains a VLR that describes the octree organization of data that are stored in LAZ 1.4 chunks. The end product is a one-to-one mapping of LAZ to UTM-reprojected COPC files.\n\nLAZ data is geospatial [LiDAR point cloud](https://en.wikipedia.org/wiki/Point_cloud) (LPC) content stored in the compressed [LASzip](https://laszip.org?) format. Data were reorganized and stored in LAZ-compatible [COPC](https://copc.io) organization for use in Planetary Computer, which supports incremental spatial access and cloud streaming.\n\nLPC can be summarized for construction of digital terrain models (DTM), filtered for extraction of features like vegetation and buildings, and visualized to provide a point cloud map of the physical spaces the laser scanner interacted with. LPC content from 3DEP is used to compute and extract a variety of landscape characterization products, and some of them are provided by Planetary Computer, including Height Above Ground, Relative Intensity Image, and DTM and Digital Surface Models.\n\nThe LAZ tiles represent a one-to-one mapping of original tiled content as provided by the [USGS 3DEP program](https://www.usgs.gov/3d-elevation-program), with the exception that the data were reprojected and normalized into appropriate UTM zones for their location without adjustment to the vertical datum. 
In some cases, vertical datum description may not match actual data values, especially for pre-2010 USGS 3DEP point cloud data.\n\nIn addition to these COPC files, various higher-level derived products are available as Cloud Optimized GeoTIFFs in [other collections](https://planetarycomputer.microsoft.com/dataset/group/3dep-lidar).","item_assets":{"data":{"type":"application/vnd.laszip+copc","roles":["data"],"title":"COPC data","pc:type":"lidar","pc:encoding":"application/vnd.laszip+copc"},"thumbnail":{"type":"image/png","roles":["thumbnail"],"title":"3DEP Lidar COPC"}},"stac_version":"1.1.0","msft:group_id":"3dep-lidar","msft:container":"usgs-3dep-copc","stac_extensions":["https://stac-extensions.github.io/item-assets/v1.0.0/schema.json","https://stac-extensions.github.io/pointcloud/v1.0.0/schema.json"],"msft:storage_account":"usgslidareuwest","msft:short_description":"Nationwide Lidar point cloud data in COPC format.","msft:region":"westeurope"}