
[Question] Memory management in long-lived processes #842

@cmwilhelm

Description

Hi. We're using pystac and pystac-client in our applications to build production data ingestion pipelines against Planetary Computer and other STAC backends. We have noticed that the library's memory footprint is large and grows linearly over time, often quite quickly. For typical workflows in our system, growth can reach 100 MB/min per worker process, which quickly exhausts available memory.

I spent some time last week trying to figure out why. After looking closely at both this project and the parent pystac codebase, the root issue appears to be the resolved object caches in pystac Catalogs (of which the clients are subclasses), and the tightly connected object graphs they house. From what I can see, there is no way to opt out of the cache or to bound its growth. Our workarounds have been either to quarantine pystac in short-lived worker subprocesses that are periodically recycled (sketched below), or to manually nuke the client and all of its members at regular intervals. Neither of these is particularly ideal.
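
A minimal sketch of the subprocess-quarantine workaround, assuming the Planetary Computer endpoint and a hypothetical `process_page` task (names and query values are illustrative, not our actual pipeline code):

```python
from multiprocessing import Pool

def process_page(query: dict) -> int:
    # Import inside the worker so the Client (and its resolved object
    # cache) lives and dies with the subprocess.
    from pystac_client import Client

    client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = client.search(**query)
    return sum(1 for _ in search.items())

if __name__ == "__main__":
    # Illustrative queries; a real pipeline would generate these.
    queries = [{"collections": ["sentinel-2-l2a"], "max_items": 100}]
    # maxtasksperchild recycles each worker after N tasks, so the
    # resolved object cache never outlives N tasks' worth of growth.
    with Pool(processes=4, maxtasksperchild=10) as pool:
        counts = pool.map(process_page, queries)
```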

I did not readily find any discussion about managing memory growth with pystac or pystac-client, but it seems like it would be a big problem for anyone working with large catalogs. I'm wondering if we're just missing an option or flag that would help deal with this.

From what I have seen of the library implementation, however, the object graph and resolved object cache are core to the design and functionality of the whole system. From my limited reading, I have wondered what the consequences would be of having the client pass root=None when deserializing results. This would opt out of caching altogether (I think?).
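
To make the idea concrete, here is a hedged sketch of what root=None deserialization looks like at the pystac layer (the item payload is a minimal illustrative stand-in for an API response):

```python
import pystac

# Minimal illustrative Item payload; in practice this would come from
# a pystac-client search response.
item_dict = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "links": [],
    "assets": {},
}

# With root=None the item is never registered in a Catalog's resolved
# object cache, so it can be garbage-collected once we drop our reference.
item = pystac.Item.from_dict(item_dict, root=None)
```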

In terms of bounding growth rather than opting out altogether, implementing an LRU cache with an eviction policy would be nice but seems very difficult at the pystac graph layer. Ensuring the graph isn't corrupted during eviction, handling hierarchical caches with respect to catalog merges, and how item identity is calculated would all be complex concerns distributed across the underlying pystac codebase. I wonder if pystac-client could instead offer its own HTTP response cache with LRU eviction: it would cache the IO but repeat the deserialization and graph assembly work from request to request (sketched below).
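
A rough sketch of what that HTTP-layer cache could look like, using functools.lru_cache for bounded eviction; this is a design illustration, not an existing pystac-client feature:

```python
import json
from functools import lru_cache

import requests

@lru_cache(maxsize=256)  # LRU eviction bounds memory, unlike the object cache
def cached_get(url: str) -> str:
    """Cache raw response bodies keyed on URL."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

# Deserialization and graph assembly are repeated on every call,
# trading CPU for bounded memory.
catalog_dict = json.loads(
    cached_get("https://planetarycomputer.microsoft.com/api/stac/v1")
)
```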
