4 changes: 4 additions & 0 deletions changelog_entry.yaml
@@ -0,0 +1,4 @@
+- bump: patch
+  changes:
+    fixed:
+      - Persistent cache location for data files.
4 changes: 3 additions & 1 deletion policyengine/utils/data/caching_google_storage_client.py
@@ -5,6 +5,7 @@
 import logging
 from .simplified_google_storage_client import SimplifiedGoogleStorageClient
 from typing import Optional
+from platformdirs import user_cache_dir

 logger = logging.getLogger(__name__)

@@ -17,7 +18,8 @@ class CachingGoogleStorageClient(AbstractContextManager):

     def __init__(self):
         self.client = SimplifiedGoogleStorageClient()
-        self.cache = diskcache.Cache()
+        cache_folder = user_cache_dir("policyengine.py")
+        self.cache = diskcache.Cache(directory=cache_folder)

     def _data_key(
         self, bucket: str, key: str, version: Optional[str] = None
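For context, a minimal sketch (not part of the PR) of the behavioral difference this hunk introduces: diskcache.Cache() with no directory argument creates a fresh temporary directory per instance, while passing a platformdirs location yields a stable per-user path that survives process restarts. The printed paths are illustrative and vary by OS.

import diskcache
from platformdirs import user_cache_dir

# Before the change: no directory argument, so diskcache creates a fresh
# temporary directory and every new process starts with an empty cache.
ephemeral = diskcache.Cache()
print(ephemeral.directory)  # e.g. /tmp/diskcache-xxxxxxxx

# After the change: a stable per-user location, so a restarted process
# sees whatever an earlier process already downloaded.
persistent = diskcache.Cache(directory=user_cache_dir("policyengine.py"))
print(persistent.directory)  # e.g. ~/.cache/policyengine.py on Linux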
Collaborator comment on lines -20 to 24:
I'm neutral to no on this PR. I don't think it adds much, and it introduces complexity.

What does it actually do?
I think this isn't really going to have much of a performance impact.

The only time this will make a difference is if a process spins up, there is already a cache on disk, and that cache already contains the requested file. In that case it will reuse the cache.

There are only two situations where I think that happens:

  1. A process that was running crashes and uvicorn restarts it (this should be infrequent)
  2. Multiple processes are running on the same container (currently we run 2) and this is the second one to try to get a specific file.

In all other cases we're spinning up a new container, which does not share a disk with any other container.

Why I am not in a rush to do it
So I could have used a shared directory originally, and I opted not to because:

  1. As noted above, the improvement is minimal (we download a file twice instead of once on any individual container)
  2. The risk is also minimal, but annoying. This should work, but why throw more variables into the "concurrency issues" mix by using a shared cache directory for multiple processes? If there are multi-process contention issues, they'll be hard to find, recognize, and address. The way it is now, we just don't have to worry about it.
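A toy sketch of the shared-directory scenario debated above: two processes pointed at the same cache directory, where the second finds the file the first already stored. The cache key below is hypothetical (the real client builds keys via _data_key, not shown here). diskcache is SQLite-backed and documents itself as thread- and process-safe, though the concern in the comment is about subtler contention, not basic correctness.

import multiprocessing
import diskcache
from platformdirs import user_cache_dir

CACHE_DIR = user_cache_dir("policyengine.py")

def first_process():
    # Simulates the first process downloading and caching a file.
    with diskcache.Cache(directory=CACHE_DIR) as cache:
        cache.set("gs://bucket/data.h5", b"...downloaded bytes...")

def second_process():
    # Simulates a second process (or a restarted one) hitting the
    # shared cache instead of re-downloading.
    with diskcache.Cache(directory=CACHE_DIR) as cache:
        print(cache.get("gs://bucket/data.h5") is not None)

if __name__ == "__main__":
    p1 = multiprocessing.Process(target=first_process)
    p1.start(); p1.join()
    p2 = multiprocessing.Process(target=second_process)
    p2.start(); p2.join()  # prints True: the download happened once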
