Description
Since mid-October 2025, the weekly harvesting of EO STAC datasets has been incomplete, resulting in a reduced number of datasets on GEO.ca.
Logs
Thu 2025-10-16:
GEO.ca datasets count dropped from about 94589 to 57246.
Fri 2025-10-17:
Initial investigations revealed that s3://webpresence-geocore-geojson-to-parquet-prod/sentinel1.parquet had shrunk: it contained only 35089 datasets, far fewer than the 72432 datasets in the 2025-01-28 backup.
- sentinel1_2025-01-28_0331.parquet: 72432 records
- sentinel1_2025-10-14_0409.parquet: 35089 records
Fortunately, reverting to the 2025-01-28 sentinel1.parquet backup brought the total dataset count back up to 94589.
Tue 2025-10-21:
After the scheduled Sentinel-1 harvesting on Tuesday, the total dataset count dropped to 67650. It was restored to 95080 datasets a few hours later.
- sentinel1_2025-01-28_0331.parquet: 72432 records
- sentinel1_2025-10-21_0411.parquet: 45002 records
Tue 2025-10-28:
The total dataset count dropped to 86088, but not because of Sentinel-1: sentinel1.parquet is still dated 2025-10-21, i.e. the scheduled 2025-10-28 Sentinel-1 harvesting did not complete. This time, it is rcm-ard.parquet that is incomplete:
- rcm-ard_2025-10-14_0404.parquet: 12971 records
- rcm-ard_2025-10-21_0402.parquet: 13457 records
- rcm-ard_2025-10-28_0403.parquet: 4440 records
Reverting to the 2025-10-21 backup of rcm-ard.parquet restored the total datasets count to 95105.
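The drops logged above (for example, sentinel1.parquet going from 72432 to 35089 records) could in principle be caught before publication by comparing each new Parquet file's record count with that of the previous backup. The following is only a minimal sketch: the helper name `is_suspicious_drop` and the 20% threshold are assumptions for illustration and are not part of the existing pipeline.

```python
def is_suspicious_drop(previous_count: int, new_count: int,
                       max_drop_ratio: float = 0.2) -> bool:
    """Return True if new_count fell by more than max_drop_ratio
    relative to previous_count (the threshold is an assumed value)."""
    if previous_count <= 0:
        # Nothing meaningful to compare against, so do not flag.
        return False
    drop_ratio = (previous_count - new_count) / previous_count
    return drop_ratio > max_drop_ratio


# Record counts taken from the log entries above.
print(is_suspicious_drop(72432, 35089))  # sentinel1: backup vs 2025-10-14
print(is_suspicious_drop(13457, 4440))   # rcm-ard: 2025-10-21 vs 2025-10-28
print(is_suspicious_drop(12971, 13457))  # rcm-ard: 2025-10-14 vs 2025-10-21
```

With these inputs the first two comparisons are flagged and the third is not, which matches the manual reverts described in the logs.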
Initial investigations
- Upstream EO collections STAC server seems intact:
  $ python3 stac-record-counter.py
  Querying STAC API for collection: rcm-ard...
  --- Result ---
  Collection: rcm-ard
  Total number of records (Items): 14,537
  --------------
  Querying STAC API for collection: sentinel-1...
  --- Result ---
  Collection: sentinel-1
  Total number of records (Items): 75,087
  --------------
- Apparently, no sentinel1.parquet was generated between February 2025 and early October 2025.
- collector.py seems OK (links for batches of 5000 items look OK in s3://eo-sg-datacube-item-links-prod).
- Some step (which one?) ends up with an insufficient number of GeoJSON files.
- GeoJSON-to-Parquet step: no problem.
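To narrow down which step loses records, one option is to record the item count after every stage and report the first stage whose count falls noticeably below the previous one. The sketch below assumes such per-stage counts can be collected; the function name `first_lossy_stage`, the stage names, the 5% tolerance, and the numbers are all hypothetical and are not measurements from the pipeline.

```python
from typing import Optional, Sequence, Tuple


def first_lossy_stage(stage_counts: Sequence[Tuple[str, int]],
                      tolerance: float = 0.05) -> Optional[str]:
    """Return the name of the first stage whose count is more than
    `tolerance` (as a fraction) below the preceding stage's count."""
    for (_, prev_count), (name, count) in zip(stage_counts, stage_counts[1:]):
        if prev_count > 0 and (prev_count - count) / prev_count > tolerance:
            return name
    return None  # no stage lost more than the tolerated fraction


# Synthetic numbers for illustration only, not measured values.
counts = [
    ("stac_api", 100),
    ("item_links", 100),
    ("geojson_files", 40),
    ("parquet", 40),
]
print(first_lossy_stage(counts))  # prints "geojson_files"
```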
Some hypotheses:
- Are the RCM-ARD and Sentinel-1 harvests conflicting because they run on the same day (Tuesday)? It seems the RCM-ARD harvesting was intended to run on Monday.
- Would more error checking, download retries, etc. help? For example, the number of GeoJSON files downloaded should match the dataset count.
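The second hypothesis could be explored with something like the sketch below: each download is retried with exponential backoff, and the batch fails loudly if the number of downloaded documents does not match the expected count. This is not the current harvester code. The names `fetch_with_retries` and `download_all`, and the caller-supplied `fetch` function, are assumptions for illustration.

```python
import time
from typing import Callable, List, Sequence


def fetch_with_retries(fetch: Callable[[str], bytes], url: str,
                       max_attempts: int = 3,
                       backoff_seconds: float = 1.0) -> bytes:
    """Call fetch(url), retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error instead of skipping
            time.sleep(backoff_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


def download_all(fetch: Callable[[str], bytes], urls: Sequence[str],
                 expected_count: int, **retry_options) -> List[bytes]:
    """Download every URL and verify that the total matches expected_count."""
    documents = [fetch_with_retries(fetch, url, **retry_options) for url in urls]
    if len(documents) != expected_count:
        raise RuntimeError(
            f"expected {expected_count} GeoJSON files, got {len(documents)}"
        )
    return documents
```

Here `expected_count` would come from an independent source, such as the item total reported by the STAC API, so that an incomplete link list is detected rather than silently producing a smaller Parquet file.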
Tools used:
Python scripts written with help from Google Gemini: