Conversation

@nikhilwoodruff
Collaborator

Fixes #151

@nikhilwoodruff nikhilwoodruff self-assigned this May 27, 2025
@anth-volk
Contributor

Doing some research into this to better understand needs and options before reviewing


@mikesmit mikesmit left a comment

Unless there is a really strong benefit, I would not bother with this right now.

Actually, side note: if we had integrated metrics in policyengine.py, we could easily see how frequently we're downloading and how long it takes.

I had mentioned "instrumentation" as a key requirement for policyengine.py a while back; this is a great example of a place where I wanted to add it but didn't have a mechanism.
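
A minimal sketch of the kind of download instrumentation meant here; the wrapper and logger names are hypothetical, not anything that exists in policyengine.py:

import logging
import time

logger = logging.getLogger("policyengine.metrics")  # hypothetical metrics logger

def timed_fetch(fetch, key):
    # Hypothetical wrapper: record how often we download and how long it takes.
    start = time.perf_counter()
    data = fetch(key)
    elapsed = time.perf_counter() - start
    logger.info("download key=%s bytes=%d seconds=%.2f", key, len(data), elapsed)
    return data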

Comment on lines -20 to 24
  self.client = SimplifiedGoogleStorageClient()
- self.cache = diskcache.Cache()
+ cache_folder = user_cache_dir("policyengine.py")
+ self.cache = diskcache.Cache(directory=cache_folder)

  def _data_key(
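
For context, a minimal sketch of the pattern the new lines set up, assuming user_cache_dir comes from the platformdirs package (the key and payload below are illustrative, not policyengine.py's real cache layout):

import diskcache
from platformdirs import user_cache_dir

# Resolve an OS-appropriate per-user cache directory, e.g.
# ~/.cache/policyengine.py on Linux or ~/Library/Caches/policyengine.py on macOS.
cache_folder = user_cache_dir("policyengine.py")

# diskcache persists entries on disk under that directory, so a fresh process
# (or a second process on the same machine) can reuse files an earlier process
# already downloaded instead of fetching them again.
cache = diskcache.Cache(directory=cache_folder)

key = "gs://bucket/dataset.h5"  # illustrative key
if key not in cache:
    cache[key] = b"...downloaded bytes..."  # placeholder for the real download
data = cache[key]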
Collaborator

I'm neutral-to-no on this PR. I don't think it adds much, and it introduces complexity.

What does it actually do?
I think this isn't really going to have much of a performance impact.

The only time this will make a difference is if a process spins up, there is already a cache on disk, and that cache already holds the requested file. In that case it will reuse the cache.

There are only two reasons I think that happens:

  1. A process that was running crashes and uvicorn restarts it (this should be infrequent)
  2. Multiple processes are running on the same container (currently we run 2) and this is the second one to try to get a specific file.

In all other cases, we're spinning up a new container, which does not share a disk with any other container.

Why I am not in a rush to do it
I could have used a shared directory originally, and I opted not to because:

  1. As noted above, the improvement is minimal (we download a file twice instead of once on any individual container)
  2. The risk is also minimal, but annoying. This should work, but why throw more variables into the "concurrency issues" mix by using a shared cache directory for multiple processes? If there are multi-process contention issues, they'll be hard to find, recognize, and address. The way it is now, we just don't have to worry about it. (The sketch below illustrates the kind of race involved.)
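
For illustration, a small sketch of the two-process case from point 2, assuming diskcache's documented process-safe (SQLite-backed) behavior; names and payloads are illustrative. The check-then-set race in the middle is the kind of contention being flagged:

import multiprocessing

import diskcache
from platformdirs import user_cache_dir

CACHE_DIR = user_cache_dir("policyengine.py")

def worker(name: str) -> None:
    # Each process opens its own handle on the same directory; diskcache
    # serializes individual reads and writes through SQLite.
    with diskcache.Cache(directory=CACHE_DIR) as cache:
        # Benign but real race: both processes can find the key missing and
        # both "download", so the saving is likely but not guaranteed.
        if "dataset" not in cache:
            cache["dataset"] = f"downloaded by {name}"
        print(name, "sees:", cache["dataset"])

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(f"p{i}",)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()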

@nikhilwoodruff
Collaborator Author

Re-upping because this is realistically a blocker to using policyengine.py for analysis: nobody wants to have seven copies of the 100 MB microdata polluting their file system.


Development

Successfully merging this pull request may close these issues.

Use a persistent location for caching data files
