Incomplete harvesting of EO STAC datasets? #1

@anthonyfok

Since mid-October 2025, the weekly harvesting of EO STAC datasets has been incomplete, resulting in a reduced number of datasets on GEO.ca.

Logs

Thu 2025-10-16:

The GEO.ca dataset count dropped from about 94589 to 57246.

Fri 2025-10-17:

Initial investigations revealed that s3://webpresence-geocore-geojson-to-parquet-prod/sentinel1.parquet had shrunk and contained only 35089 datasets, far fewer than the 72432 datasets in the 2025-01-28 backup.

  • sentinel1_2025-01-28_0331.parquet: 72432 records
  • sentinel1_2025-10-14_0409.parquet: 35089 records

Fortunately, reverting to the 2025-01-28 sentinel1.parquet backup brought the total dataset count back up to 94589.

Tue 2025-10-21:

After the scheduled Sentinel-1 harvesting on Tuesday, the total dataset count dropped to 67650. It was restored to 95080 datasets a few hours later.

  • sentinel1_2025-01-28_0331.parquet: 72432 records
  • sentinel1_2025-10-21_0411.parquet: 45002 records

Tue 2025-10-28:

The total dataset count dropped to 86088, but not because of Sentinel-1: sentinel1.parquet is still dated 2025-10-21, i.e. the scheduled 2025-10-28 Sentinel-1 harvesting did not complete. This time, it is rcm-ard.parquet that is incomplete:

  • rcm-ard_2025-10-14_0404.parquet: 12971 records
  • rcm-ard_2025-10-21_0402.parquet: 13457 records
  • rcm-ard_2025-10-28_0403.parquet: 4440 records

Reverting to the 2025-10-21 backup of rcm-ard.parquet restored the total dataset count to 95105.
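
The record counts logged above suggest a simple guard. This is only a sketch of an idea, not something the pipeline is known to do: compare each new Parquet snapshot's record count with the previous one and flag it if the count fell by more than some threshold. The function names and the 10% threshold are hypothetical; the counts are the ones from the logs.

```python
def relative_drop(previous: int, current: int) -> float:
    """Return the fraction of records lost since the previous snapshot (0.0 if none)."""
    if previous <= 0:
        raise ValueError("previous record count must be positive")
    return max(0.0, (previous - current) / previous)


def looks_incomplete(previous: int, current: int, max_drop: float = 0.10) -> bool:
    """Flag a snapshot whose count fell by more than max_drop (hypothetical threshold)."""
    return relative_drop(previous, current) > max_drop


# Record counts taken from the logs in this issue: (label, previous, current).
observed = [
    ("sentinel1 2025-01-28 -> 2025-10-14", 72432, 35089),
    ("sentinel1 2025-01-28 -> 2025-10-21", 72432, 45002),
    ("rcm-ard   2025-10-14 -> 2025-10-21", 12971, 13457),
    ("rcm-ard   2025-10-21 -> 2025-10-28", 13457, 4440),
]

for label, previous, current in observed:
    status = "INCOMPLETE?" if looks_incomplete(previous, current) else "ok"
    drop = relative_drop(previous, current)
    print(f"{label}: {previous} -> {current} ({drop:.1%} drop) {status}")
```

A check like this could run before the new file replaces the published one, so that a short harvest leaves the previous snapshot in place instead of requiring a manual revert.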

Initial investigations

  • Upstream EO collections STAC server seems intact:

    $ python3 stac-record-counter.py
    Querying STAC API for collection: rcm-ard...
    --- Result ---
    Collection: rcm-ard
    Total number of records (Items): 14,537
    --------------
    
    Querying STAC API for collection: sentinel-1...
    --- Result ---
    Collection: sentinel-1
    Total number of records (Items): 75,087
    --------------
  • Apparently, no sentinel1.parquet was generated between February 2025 and early October 2025.

  • collector.py seems OK (links for batches of 5000 items look OK in s3://eo-sg-datacube-item-links-prod)

  • (which step?) ends up with an insufficient number of GeoJSON files.

  • GeoJSON-to-Parquet step: no problem.
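
stac-record-counter.py itself is not included in this issue. As a rough sketch of how the upstream count could be cross-checked independently, the following walks a STAC API items endpoint page by page, following `rel="next"` links and summing the returned features. It assumes the server paginates with plain GET `next` links; `fetch_json`, `count_items`, and the URL in the comment are placeholders, not the actual EO collections endpoint.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Fetch a URL and parse the response body as JSON."""
    with urlopen(url, timeout=60) as response:
        return json.load(response)


def count_items(first_url: str, fetch=fetch_json, max_pages: int = 10_000) -> int:
    """Count features across all pages, following rel="next" links."""
    total = 0
    url = first_url
    visited = set()  # guards against a server that links a page to itself
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        total += len(page.get("features", []))
        # Assumption: the "next" link is a plain GET href (some servers use POST bodies).
        url = next(
            (link.get("href") for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
    return total


# Hypothetical usage (placeholder URL, substitute the real STAC API root):
# count_items("https://stac.example.org/collections/rcm-ard/items?limit=500")
```

Comparing this count with the number of records in the generated Parquet file would show how much of the upstream collection a given harvest actually captured.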

Some hypotheses:

  • Are RCM-ARD and Sentinel-1 harvesting both running on the same day (Tuesday)? It seems the RCM-ARD harvesting was intended to run on Monday.
  • More error-checking, download retries, etc. might help (e.g. verifying that the number of GeoJSON files downloaded matches the expected dataset count).
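
To make the second hypothesis concrete, here is a minimal sketch of a download loop with retries and a final count check. It is an illustration under assumptions, not the harvester's actual code: `download_all`, `IncompleteHarvestError`, and the `fetch` callable are hypothetical names, and only `OSError`-type failures are retried.

```python
import time


class IncompleteHarvestError(RuntimeError):
    """Raised when some items could not be downloaded after all retries."""


def download_all(urls, fetch, retries: int = 3, backoff: float = 1.0) -> dict:
    """Download every URL with retries; raise if the result count does not match."""
    urls = list(urls)
    results = {}
    failed = []
    for url in urls:
        for attempt in range(1, retries + 1):
            try:
                results[url] = fetch(url)
                break
            except OSError:  # urllib.error.URLError and socket timeouts derive from OSError
                if attempt == retries:
                    failed.append(url)
                else:
                    time.sleep(backoff * attempt)  # linear backoff between attempts
    expected = len(set(urls))
    if failed or len(results) != expected:
        raise IncompleteHarvestError(
            f"downloaded {len(results)} of {expected} items; {len(failed)} failed"
        )
    return results
```

Raising instead of returning a partial result means a short harvest stops before the GeoJSON-to-Parquet step, rather than silently producing a smaller Parquet file.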

Tools used:

Python scripts written with help from Google Gemini:
