Conversation

@nikhilwoodruff
Collaborator

Fixes #151

@nikhilwoodruff nikhilwoodruff self-assigned this May 27, 2025
@anth-volk
Contributor

Doing some research into this to better understand needs and options before reviewing


@mikesmit mikesmit left a comment

Unless there is a really strong benefit, I would not bother with this right now.

Actually, side note: if we had integrated metrics in policyengine.py, we could easily see how frequently we're downloading and how long it takes.

I had mentioned "instrumentation" as a key requirement for policyengine.py a while back; this is a great example of a place where I wanted to add it but didn't have a mechanism.
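
A minimal sketch of the kind of download instrumentation meant here; the wrapper and logger names are hypothetical, not anything that exists in policyengine.py:

import logging
import time

logger = logging.getLogger("policyengine.metrics")  # hypothetical metrics logger

def timed_fetch(fetch, key):
    # Hypothetical wrapper: record how often we download and how long it takes.
    start = time.perf_counter()
    data = fetch(key)
    elapsed = time.perf_counter() - start
    logger.info("download key=%s bytes=%d seconds=%.2f", key, len(data), elapsed)
    return data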

Comment on lines -20 to 24
  self.client = SimplifiedGoogleStorageClient()
- self.cache = diskcache.Cache()
+ cache_folder = user_cache_dir("policyengine.py")
+ self.cache = diskcache.Cache(directory=cache_folder)

  def _data_key(
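
For context, a minimal sketch of the pattern the new lines set up, assuming user_cache_dir comes from the platformdirs package (the key and payload below are illustrative, not policyengine.py's real cache layout):

import diskcache
from platformdirs import user_cache_dir

# Resolve an OS-appropriate per-user cache directory, e.g.
# ~/.cache/policyengine.py on Linux or ~/Library/Caches/policyengine.py on macOS.
cache_folder = user_cache_dir("policyengine.py")

# diskcache persists entries on disk under that directory, so a fresh process
# (or a second process on the same machine) can reuse files an earlier process
# already downloaded instead of fetching them again.
cache = diskcache.Cache(directory=cache_folder)

key = "gs://bucket/dataset.h5"  # illustrative key
if key not in cache:
    cache[key] = b"...downloaded bytes..."  # placeholder for the real download
data = cache[key]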
Collaborator

I'm neutral-to-no on this PR. I don't think it adds much, and it introduces complexity.

What does it actually do?
I think this isn't really going to have much of a performance impact.

The only time this will make a difference is if a process spins up, there is already a cache on disk, and that cache already holds the requested file. In that case it will reuse the cache.

There are only two reasons I think that happens:

  1. A process that was running crashes and uvicorn restarts it (this should be infrequent)
  2. Multiple processes are running on the same container (currently we run 2) and this is the second one to try to get a specific file.

In all other cases, we're spinning up a new container, which does not share a disk with any other container.

Why I am not in a rush to do it
I could have used a shared directory originally, and I opted not to because:

  1. As noted above, the improvement is minimal (we download a file twice instead of once on any individual container)
  2. The risk is also minimal, but annoying. This should work, but why throw more variables into the "concurrency issues" mix by using a shared cache directory for multiple processes? If there are multi-process contention issues, they'll be hard to find, recognize, and address. The way it is now, we just don't have to worry about it. (The sketch below illustrates the kind of race involved.)
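
For illustration, a small sketch of the two-process case from point 2, assuming diskcache's documented process-safe (SQLite-backed) behavior; names and payloads are illustrative. The check-then-set race in the middle is the kind of contention being flagged:

import multiprocessing

import diskcache
from platformdirs import user_cache_dir

CACHE_DIR = user_cache_dir("policyengine.py")

def worker(name: str) -> None:
    # Each process opens its own handle on the same directory; diskcache
    # serializes individual reads and writes through SQLite.
    with diskcache.Cache(directory=CACHE_DIR) as cache:
        # Benign but real race: both processes can find the key missing and
        # both "download", so the saving is likely but not guaranteed.
        if "dataset" not in cache:
            cache["dataset"] = f"downloaded by {name}"
        print(name, "sees:", cache["dataset"])

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(f"p{i}",)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()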

@nikhilwoodruff
Collaborator Author

Re-upping because this is realistically a blocker to using policyengine.py for analysis: nobody wants to have seven copies of the 100 MB microdata polluting their file system.


Development

Successfully merging this pull request may close these issues.

Use a persistent location for caching data files
