-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix(gc): reuse fs in pull to avoid connections explosion #10888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements filesystem connection reuse in the Remote class to prevent connection exhaustion during garbage collection operations. When DVC pulls missing .dir
files individually during GC, it was creating separate filesystem connections for each file, overwhelming servers especially over SSH.
- Added filesystem caching mechanism using class-level cache and name-to-key mapping
- Implemented cache key generation based on remote name, filesystem class, configuration, and path
- Added proper cleanup logic to close unused filesystem connections
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
fs_config = dict(config) | ||
fs = cls(**fs_config) | ||
runtime_config = {**fs_config, "tmp_dir": self.repo.site_cache_dir} |
Copilot
AI
Oct 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The separation of fs_config and runtime_config creates confusion about which configuration is used where. Consider using more descriptive variable names like filesystem_config and remote_config to clarify their purposes.
Copilot uses AI. Check for mistakes.
config: dict, | ||
fs_path: str, | ||
) -> tuple[str, ...]: | ||
serialized_config = json.dumps(config, sort_keys=True, default=str) |
Copilot
AI
Oct 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using default=str
in json.dumps may produce inconsistent serialization for complex objects. Objects of the same type could serialize differently depending on their string representation, leading to cache misses for equivalent configurations.
Copilot uses AI. Check for mistakes.
_CACHE: ClassVar[dict[tuple[str, ...], "FileSystem"]] = {} | ||
_NAME_TO_KEY: ClassVar[dict[str, tuple[str, ...]]] = {} |
Copilot
AI
Oct 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Class-level caches without size limits or cleanup mechanisms can lead to memory leaks in long-running processes. Consider implementing cache size limits or periodic cleanup.
Copilot uses AI. Check for mistakes.
On GC we
pull
missing.dir
files one by one and thus we exhaust and overwhelm the server. Especially this is problematic on SSH. Her is an attempt to reuse the existingfs
(need to check if it will be enough to reuse pool or connection). I don't like this additional layer tbh, I can also probably simplify it further a bit - this is just a rough draft. If there are other ideas how to force it reuse the same pool - please let me know (may we on the fsspec level we can utilize some caching?)Note: AI was used here. Not every line is review yet. This is a draft to discuss.
cc @skshetry
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏