
[Question] Memory management in long-lived processes #842

@cmwilhelm

Description

Hi. We're using pystac and pystac-client in our applications to build production data ingestion pipelines against Planetary Computer and other STAC backends. We have noticed that the library's memory footprint is large and grows linearly over time, often quite quickly. For typical workflows in our system, growth can reach 100 MB/min per worker process, which quickly exhausts available memory.

I spent some time last week trying to figure out why. After looking closely at both this project and the parent pystac codebase, the root issue appears to be the resolved object caches in pystac Catalogs (of which the clients are subclasses), and the tightly connected object graphs they house. From what I can see, there is no way to opt out of the cache or to bound its growth. Our workarounds have been either to quarantine pystac in short-lived worker subprocesses that are periodically recycled (sketched below), or to manually nuke the client and all of its members at regular intervals. Neither of these is particularly ideal.
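
A minimal sketch of the subprocess-quarantine workaround, assuming the Planetary Computer endpoint and a hypothetical `process_page` task (names and query values are illustrative, not our actual pipeline code):

```python
from multiprocessing import Pool

def process_page(query: dict) -> int:
    # Import inside the worker so the Client (and its resolved object
    # cache) lives and dies with the subprocess.
    from pystac_client import Client

    client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    search = client.search(**query)
    return sum(1 for _ in search.items())

if __name__ == "__main__":
    # Illustrative queries; a real pipeline would generate these.
    queries = [{"collections": ["sentinel-2-l2a"], "max_items": 100}]
    # maxtasksperchild recycles each worker after N tasks, so the
    # resolved object cache never outlives N tasks' worth of growth.
    with Pool(processes=4, maxtasksperchild=10) as pool:
        counts = pool.map(process_page, queries)
```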

I did not readily find any discussion about managing memory growth with pystac or pystac-client, but it seems like it would be a big problem for anyone working with large catalogs. I'm wondering if we're just missing an option or flag that would help deal with this.

From what I have seen of the library implementation, however, the object graph and resolved object cache are core to the design and functionality of the whole system. From my limited reading, I have wondered what the consequences would be of having the client pass root=None when deserializing results. This would opt out of caching altogether (I think?).
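
To make the idea concrete, here is a hedged sketch of what root=None deserialization looks like at the pystac layer (the item payload is a minimal illustrative stand-in for an API response):

```python
import pystac

# Minimal illustrative Item payload; in practice this would come from
# a pystac-client search response.
item_dict = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
    "bbox": [0.0, 0.0, 0.0, 0.0],
    "properties": {"datetime": "2024-01-01T00:00:00Z"},
    "links": [],
    "assets": {},
}

# With root=None the item is never registered in a Catalog's resolved
# object cache, so it can be garbage-collected once we drop our reference.
item = pystac.Item.from_dict(item_dict, root=None)
```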

In terms of bounding growth rather than opting out altogether, implementing an LRU cache with an eviction policy would be nice but seems very difficult at the pystac graph layer. Ensuring the graph isn't corrupted during eviction, handling hierarchical caches with respect to catalog merges, and how item identity is calculated would all be complex concerns distributed across the underlying pystac codebase. I wonder if pystac-client could instead offer its own HTTP response cache with LRU eviction: it would cache the IO but repeat the deserialization and graph assembly work from request to request (sketched below).
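
A rough sketch of what that HTTP-layer cache could look like, using functools.lru_cache for bounded eviction; this is a design illustration, not an existing pystac-client feature:

```python
import json
from functools import lru_cache

import requests

@lru_cache(maxsize=256)  # LRU eviction bounds memory, unlike the object cache
def cached_get(url: str) -> str:
    """Cache raw response bodies keyed on URL."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text

# Deserialization and graph assembly are repeated on every call,
# trading CPU for bounded memory.
catalog_dict = json.loads(
    cached_get("https://planetarycomputer.microsoft.com/api/stac/v1")
)
```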
