
Conversation


@ghidalgo3 ghidalgo3 commented Apr 28, 2025

This is an intermediate step necessary to rewrite the in-memory processing step to use Arrow.

@ghidalgo3 ghidalgo3 changed the title Support NDJSON export pgstac_reader NDJSON export Apr 28, 2025
Collaborator

@TomAugspurger TomAugspurger left a comment


Looks good overall, though some linting issues.

IMO, it'd be better to have a test that exercises the new format="ndjson" option than testing _write_ndjson directly.
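A round-trip test against the public option is usually enough; this is a minimal, self-contained sketch of the pattern (`export_ndjson` here is a hypothetical stand-in, not the PR's actual API):

```python
import json
import tempfile
from pathlib import Path

def export_ndjson(items, path):
    # Hypothetical stand-in for a format="ndjson" export path:
    # one JSON document per line.
    with open(path, "w") as f:
        for item in items:
            f.write(json.dumps(item))
            f.write("\n")

# Exercise the public entry point and assert on the round-trip,
# rather than reaching into the private writer.
items = [{"id": "item-1", "collection": "c"}, {"id": "item-2", "collection": "c"}]
with tempfile.TemporaryDirectory() as d:
    out = Path(d) / "items.ndjson"
    export_ndjson(items, out)
    round_tripped = [json.loads(line) for line in out.read_text().splitlines()]
```

Testing through the public surface keeps the test valid even if `_write_ndjson` is later renamed or absorbed into an Arrow-based writer.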

@ghidalgo3
Author

ghidalgo3 commented Apr 28, 2025

Agreed, though I'm chatting offline with @kylebarron about some more changes; I don't think this PR is finished yet.

For context, I'm trying to revive the OpenPC geoparquet exports, and I think Kyle wants to switch to Arrow instead of geopandas for writing the parquet files. In the end, this NDJSON export might be just an internal intermediate step that isn't exposed to clients.

The geoparquet export broke for two reasons:

  1. SFI networking rules broke reading from the collection config table, easy to fix
  2. The stac-geoparquet exports kept getting OOM-killed. I think this is because we held the STAC query results and the GeoDataFrame in memory at the same time, which put too much memory pressure on the export machine for the larger partitions.
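The OOM in item 2 is what a streaming NDJSON write avoids: each item is serialized and flushed as it arrives, so the full query result and a GeoDataFrame are never resident together. A minimal sketch of that pattern (the paging and item shapes here are invented for illustration):

```python
import io
import json

def iter_items(pages):
    # Stand-in for a paged pgstac query: yield one STAC item dict
    # at a time so the full result set is never resident in memory.
    for page in pages:
        yield from page

def stream_ndjson(items, fh):
    # Write each item as soon as it arrives; peak memory is roughly
    # one item, not the whole query result plus a GeoDataFrame.
    count = 0
    for item in items:
        fh.write(json.dumps(item) + "\n")
        count += 1
    return count

pages = [[{"id": f"item-{i}-{j}"} for j in range(3)] for i in range(2)]
buf = io.StringIO()
written = stream_ndjson(iter_items(pages), buf)  # written == 6
```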

I'll eventually consume this change in PCTasks when I revive the export.

@gadomski gadomski self-requested a review May 5, 2025 12:26
@ghidalgo3
Author

@bitner 👀

import pystac
import shapely.wkb
import tqdm.auto
from tenacity import before_sleep_log, retry, stop_after_attempt, wait_fixed
Collaborator


Is this a new dependency? I don't see it in

dependencies = [
    "ciso8601",
    "deltalake",
    "geopandas",
    "packaging",
    "pandas",
    # Needed for RecordBatch.append_column
    # Below 19 b/c https://github.com/apache/arrow/issues/45283
    "pyarrow>=16,<19",
    "pyproj",
    "pystac",
    "shapely",
    "orjson",
    'typing_extensions; python_version < "3.11"',
]

@bitner
Copy link
Collaborator

bitner commented May 6, 2025

@ghidalgo3 Let me mess around with a few things; we can definitely make this more streamable/memory-friendly on the pg access side.
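On the pg access side, the usual memory-friendly pattern is a server-side cursor consumed with `fetchmany` in fixed-size batches, so only one batch of rows is ever held in Python at a time. Sketched here against a stand-in cursor (a real version would use a named/server-side cursor from the Postgres driver):

```python
def iter_batches(cursor, batch_size=1000):
    # Pull rows in fixed-size chunks so only one batch is
    # ever held in Python memory at a time.
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            return
        yield rows

class FakeCursor:
    # Minimal stand-in for a DB-API cursor, for illustration only.
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, n):
        batch, self._rows = self._rows[:n], self._rows[n:]
        return batch

batches = list(iter_batches(FakeCursor(range(10)), batch_size=4))
```

With a client-side cursor the driver would still buffer the whole result set; the server-side cursor is what actually bounds memory on large partitions.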

@bitner
Copy link
Collaborator

bitner commented May 6, 2025

Didn't quite get there today, but I'll have something tomorrow that cleans up a lot of the Postgres stuff.

@bitner bitner mentioned this pull request May 8, 2025
@gadomski gadomski removed their request for review May 13, 2025 11:42