Commit ea6c54c (1 parent: b32c39b)

stac-geoparquet exports fixes (#342)

Authored by liam, legoghidalgo3, liammorrison0, and Copilot.

* Fix stac_geoparquet export
* Fix HLS2 collection metadata
* Update to latest stac-geoparquet
* Consume latest version now without cast
* Use pgstac partitioning for partitioned collections
* save
* update stac-geoparquet partitioned export and usage of stac-geoparquet
* run off of bitners commit
* update to not use pgstac_to_arrow
* ensure passing of tmpdir
* some changes based on latest main from stac-geoparquet
* export of all collections
* remove comments from dockerfile
* Apply suggestions from code review (committing copilot suggestions)

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Gustavo Hidalgo <zambrano.hidalgo@gmail.com>
Co-authored-by: Liam Morrison <liammorrison@microsoft.com>

File tree: 9 files changed (+1859, -109 lines)


datasets/hls2/collection/hls2-l30/template.json (24 additions, 2 deletions)

```diff
@@ -4,8 +4,14 @@
   "id": "hls2-l30",
   "title": "Harmonized Landsat Sentinel-2 (HLS) Version 2.0, Landsat Data",
   "description": "{{ collection.description }}",
-  "license": "Data Citation Guidance: https://lpdaac.usgs.gov/data/data-citations-and-guidelines",
-  "links": [],
+  "license": "proprietary",
+  "links": [
+    {
+      "rel": "license",
+      "href": "https://lpdaac.usgs.gov/data/data-citation-and-policies/",
+      "title": "LP DAAC - Data Citation and Policies"
+    }
+  ],
   "stac_extensions": [
     "https://stac-extensions.github.io/item-assets/v1.0.0/schema.json",
     "https://stac-extensions.github.io/table/v1.2.0/schema.json",
@@ -49,6 +55,22 @@
       "type": "image/webp",
       "href": "https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/hls2-l30.webp",
       "title": "HLS2 Landsat Collection Thumbnail"
+    },
+    "geoparquet-items": {
+      "href": "abfs://items/hls2-l30.parquet",
+      "type": "application/x-parquet",
+      "roles": [
+        "stac-items"
+      ],
+      "title": "GeoParquet STAC items",
+      "description": "Snapshot of the collection's STAC items exported to GeoParquet format.",
+      "msft:partition_info": {
+        "is_partitioned": true,
+        "partition_frequency": "W-MON"
+      },
+      "table:storage_options": {
+        "account_name": "pcstacitems"
+      }
     }
   },
   "summaries": {
```
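The `geoparquet-items` asset added here is plain collection metadata, so consuming it is just dictionary access plus a parquet reader. A minimal Python sketch (the helper name and the stand-in collection dict are illustrative, not part of this commit; only the commented-out `read_parquet` call would actually touch Azure storage):

```python
from typing import Any


def resolve_geoparquet_asset(collection: dict[str, Any]) -> tuple[str, dict[str, str]]:
    """Return (href, storage_options) for a collection's geoparquet-items asset."""
    asset = collection["assets"]["geoparquet-items"]
    return asset["href"], asset.get("table:storage_options", {})


# Minimal stand-in mirroring the fields added to template.json above.
collection = {
    "id": "hls2-l30",
    "assets": {
        "geoparquet-items": {
            "href": "abfs://items/hls2-l30.parquet",
            "type": "application/x-parquet",
            "roles": ["stac-items"],
            "msft:partition_info": {"is_partitioned": True, "partition_frequency": "W-MON"},
            "table:storage_options": {"account_name": "pcstacitems"},
        }
    },
}

href, storage_options = resolve_geoparquet_asset(collection)
print(href)  # abfs://items/hls2-l30.parquet
# With pandas + adlfs installed, the snapshot could then be loaded with:
# df = pandas.read_parquet(href, storage_options=storage_options)
```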

datasets/hls2/collection/hls2-s30/template.json (24 additions, 2 deletions)

```diff
@@ -4,8 +4,14 @@
   "id": "hls2-s30",
   "title": "Harmonized Landsat Sentinel-2 (HLS) Version 2.0, Sentinel-2 Data",
   "description": "{{ collection.description }}",
-  "license": "Data Citation Guidance: https://lpdaac.usgs.gov/data/data-citations-and-guidelines",
-  "links": [],
+  "license": "proprietary",
+  "links": [
+    {
+      "rel": "license",
+      "href": "https://lpdaac.usgs.gov/data/data-citation-and-policies/",
+      "title": "LP DAAC - Data Citation and Policies"
+    }
+  ],
   "stac_extensions": [
     "https://stac-extensions.github.io/item-assets/v1.0.0/schema.json",
     "https://stac-extensions.github.io/table/v1.2.0/schema.json",
@@ -56,6 +62,22 @@
       "type": "image/webp",
       "href": "https://ai4edatasetspublicassets.blob.core.windows.net/assets/pc_thumbnails/hls2-s30.webp",
       "title": "HLS2 Sentinel Collection Thumbnail"
+    },
+    "geoparquet-items": {
+      "href": "abfs://items/hls2-s30.parquet",
+      "type": "application/x-parquet",
+      "roles": [
+        "stac-items"
+      ],
+      "title": "GeoParquet STAC items",
+      "description": "Snapshot of the collection's STAC items exported to GeoParquet format.",
+      "msft:partition_info": {
+        "is_partitioned": true,
+        "partition_frequency": "W-MON"
+      },
+      "table:storage_options": {
+        "account_name": "pcstacitems"
+      }
     }
   },
   "summaries": {
```
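Both HLS2 collections declare `"partition_frequency": "W-MON"`, which looks like a pandas-style offset alias. Assuming it denotes weekly partitions anchored on Monday (an interpretation, not stated in this commit), the window a given timestamp falls into can be sketched with the standard library:

```python
from datetime import date, timedelta


def week_partition(day: date) -> tuple[date, date]:
    """Return the (start, end) of the Monday-anchored week containing `day`.

    Assumption: "W-MON" means one partition per Monday-to-Sunday week.
    """
    start = day - timedelta(days=day.weekday())  # date.weekday(): Monday == 0
    return start, start + timedelta(days=6)


start, end = week_partition(date(2024, 7, 10))  # a Wednesday
print(start, end)  # 2024-07-08 2024-07-14
```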

datasets/stac-geoparquet/Dockerfile (12 additions, 38 deletions)

```diff
@@ -1,72 +1,46 @@
-FROM ubuntu:20.04
+FROM mcr.microsoft.com/azurelinux/base/python:3.12
 
 # Setup timezone info
 ENV TZ=UTC
 
 ENV LC_ALL=C.UTF-8
 ENV LANG=C.UTF-8
+ENV UV_SYSTEM_PYTHON=TRUE
 
 RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
 
-RUN apt-get update && apt-get install -y software-properties-common
+RUN tdnf install build-essential jq unzip ca-certificates awk wget curl git azure-cli -y \
+    && tdnf clean all
 
-RUN add-apt-repository ppa:ubuntugis/ppa && \
-    apt-get update && \
-    apt-get install -y build-essential python3-dev python3-pip \
-    jq unzip ca-certificates wget curl git && \
-    apt-get autoremove && apt-get autoclean && apt-get clean
-
-RUN update-alternatives --install /usr/bin/python python /usr/bin/python3 10
-
-# See https://github.com/mapbox/rasterio/issues/1289
-ENV CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
-
-# Install Python 3.11
-RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh" \
-    && bash "Mambaforge-$(uname)-$(uname -m).sh" -b -p /opt/conda \
-    && rm -rf "Mambaforge-$(uname)-$(uname -m).sh"
-
-ENV PATH /opt/conda/bin:$PATH
-ENV LD_LIBRARY_PATH /opt/conda/lib/:$LD_LIBRARY_PATH
-
-RUN mamba install -y -c conda-forge python=3.11 gdal pip setuptools cython numpy
-
-RUN python -m pip install --upgrade pip
+# RUN python3 -m pip install --upgrade pip
+RUN pip install --upgrade uv
 
 # Install common packages
 COPY requirements-task-base.txt /tmp/requirements.txt
-RUN python -m pip install --no-build-isolation -r /tmp/requirements.txt
+RUN uv pip install --no-build-isolation -r /tmp/requirements.txt
 
 #
 # Copy and install packages
 #
 
 COPY pctasks/core /opt/src/pctasks/core
 RUN cd /opt/src/pctasks/core && \
-    pip install .
+    uv pip install .
 
 COPY pctasks/cli /opt/src/pctasks/cli
 RUN cd /opt/src/pctasks/cli && \
-    pip install .
+    uv pip install .
 
 COPY pctasks/task /opt/src/pctasks/task
 RUN cd /opt/src/pctasks/task && \
-    pip install .
+    uv pip install .
 
 COPY pctasks/client /opt/src/pctasks/client
 RUN cd /opt/src/pctasks/client && \
-    pip install .
-
-# COPY pctasks/ingest /opt/src/pctasks/ingest
-# RUN cd /opt/src/pctasks/ingest && \
-#     pip install .
-
-# COPY pctasks/dataset /opt/src/pctasks/dataset
-# RUN cd /opt/src/pctasks/dataset && \
-#     pip install .
+    uv pip install .
 
 COPY datasets/stac-geoparquet /opt/src/datasets/stac-geoparquet
-RUN python3 -m pip install -r /opt/src/datasets/stac-geoparquet/requirements.txt
+RUN uv pip install -r /opt/src/datasets/stac-geoparquet/requirements.txt
 
 # Setup Python Path to allow import of test modules
 ENV PYTHONPATH=/opt/src:$PYTHONPATH
```

datasets/stac-geoparquet/README.md (35 additions, 4 deletions)

````diff
@@ -4,27 +4,58 @@ Generates the `stac-geoparquet` collection-level assets for the [Planetary Compu
 
 ## Container Images
 
+Test the build with:
 ```shell
-$ az acr build -r pccomponents -t pctasks-stac-geoparquet:latest -t pctasks-stac-geoparquet:2023.7.10.0 -f datasets/stac-geoparquet/Dockerfile .
+docker build -t stac-geoparquet -f datasets/stac-geoparquet/Dockerfile .
+```
+
+Then publish to the ACR with:
+
+```shell
+az acr build -r pccomponents -t pctasks-stac-geoparquet:latest -t pctasks-stac-geoparquet:2023.7.10.0 -f datasets/stac-geoparquet/Dockerfile .
 ```
 
 ## Permissions
 
 This requires the following permissions
 
-* Storage Data Table Reader on the config tables (`pcapi/bluecollectoinconfig`, `pcapi/greencollectionconfig`)
+* Storage Data Table Reader on the config tables (`pcapi/bluecollectionconfig`, `pcapi/greencollectionconfig`)
 * Storage Blob Data Contributor on the `pcstacitems` container.
 
 ## Arguments
+
 By default, this workflow will generate geoparquet assets for all collections.
 If you want to select a subset of collections, you can use either:
+
 1. `extra_skip`: This will skip certain collections
 1. `collections`: This will only generate geoparquet for the specified collection(s).
 
 ## Updates
 
 The workflow used for updates was registered with
 
+```shell
+pctasks workflow update datasets/stac-geoparquet/workflow.yaml
+```
+
+It can be manually invoked with:
+
+```shell
+pctasks workflow submit stac-geoparquet
 ```
-pctasks workflow update datasets/workflows/stac-geoparquet.yaml
-```
+
+## Run Locally
+
+You can debug the geoparquet export locally like this:
+
+```shell
+export STAC_GEOPARQUET_CONNECTION_INFO="secret"
+export STAC_GEOPARQUET_TABLE_NAME="greencollectionconfig"
+export STAC_GEOPARQUET_TABLE_ACCOUNT_URL="https://pcapi.table.core.windows.net"
+export STAC_GEOPARQUET_STORAGE_OPTIONS_ACCOUNT_NAME="pcstacitems"
+
+python3 pc_stac_geoparquet.py --collection hls2-l30
+```
+
+Apart from the Postgres connection string, you will need PIM activations for
+`Storage Blob Data Contributor` to be able to write to the production storage account.
````
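The environment variables in the new `Run Locally` section can be folded into one settings dict. A hypothetical helper sketch (this is not the actual `pc_stac_geoparquet.py` code; the default values simply reuse the example values from the README's export commands):

```python
def load_settings(env: dict[str, str]) -> dict[str, str]:
    """Collect STAC_GEOPARQUET_* settings, failing fast on the required secret.

    Hypothetical helper; defaults mirror the README example, not real defaults.
    """
    required = "STAC_GEOPARQUET_CONNECTION_INFO"
    if required not in env:
        raise RuntimeError(f"{required} must be set")
    return {
        "connection_info": env[required],
        "table_name": env.get("STAC_GEOPARQUET_TABLE_NAME", "greencollectionconfig"),
        "table_account_url": env.get(
            "STAC_GEOPARQUET_TABLE_ACCOUNT_URL", "https://pcapi.table.core.windows.net"
        ),
        "account_name": env.get(
            "STAC_GEOPARQUET_STORAGE_OPTIONS_ACCOUNT_NAME", "pcstacitems"
        ),
    }


# In the real script this would be load_settings(dict(os.environ)).
settings = load_settings({"STAC_GEOPARQUET_CONNECTION_INFO": "secret"})
print(settings["table_name"])  # greencollectionconfig
```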

0 commit comments
