@shcheklein shcheklein commented Oct 10, 2025

During GC we pull missing .dir files one by one, which exhausts and overwhelms the server. This is especially problematic over SSH. Here is an attempt to reuse the existing fs (I still need to check whether reusing the pool or the connection will be enough). I don't like this additional layer tbh, and I can probably simplify it a bit further; this is just a rough draft. If there are other ideas for how to force it to reuse the same pool, please let me know (maybe we can utilize some caching at the fsspec level?).

Note: AI was used here. Not every line has been reviewed yet. This is a draft for discussion.

cc @skshetry


Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@github-project-automation (bot) moved this to Backlog in DVC Oct 10, 2025
@shcheklein added the enhancement (Enhances DVC) and p1-important (Important, aka current backlog of things to do) labels Oct 10, 2025

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements filesystem connection reuse in the Remote class to prevent connection exhaustion during garbage collection operations. When DVC pulls missing .dir files individually during GC, it was creating separate filesystem connections for each file, overwhelming servers especially over SSH.

  • Added filesystem caching mechanism using class-level cache and name-to-key mapping
  • Implemented cache key generation based on remote name, filesystem class, configuration, and path
  • Added proper cleanup logic to close unused filesystem connections
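
The three points above can be illustrated with a minimal sketch. It is not the PR's code: `FakeFS`, `FSCache`, `make_key`, and `get` are hypothetical names, and the key layout (remote name, filesystem class, serialized config, path) only follows the description in this overview.

```python
from __future__ import annotations

import json
from typing import Any, ClassVar


class FakeFS:
    """Hypothetical stand-in for a filesystem client; counts instances."""

    created = 0

    def __init__(self, **config: Any) -> None:
        FakeFS.created += 1
        self.config = config
        self.closed = False

    def close(self) -> None:
        self.closed = True


class FSCache:
    """Hypothetical class-level cache, loosely mirroring _CACHE / _NAME_TO_KEY."""

    _CACHE: ClassVar[dict[tuple[str, ...], FakeFS]] = {}
    _NAME_TO_KEY: ClassVar[dict[str, tuple[str, ...]]] = {}

    @staticmethod
    def make_key(name: str, fs_cls: type, config: dict, fs_path: str) -> tuple[str, ...]:
        # sort_keys makes the key independent of dict insertion order.
        serialized = json.dumps(config, sort_keys=True, default=str)
        return (name, f"{fs_cls.__module__}.{fs_cls.__qualname__}", serialized, fs_path)

    @classmethod
    def get(cls, name: str, fs_cls: type, config: dict, fs_path: str) -> FakeFS:
        key = cls.make_key(name, fs_cls, config, fs_path)
        old_key = cls._NAME_TO_KEY.get(name)
        if old_key is not None and old_key != key:
            # The remote's settings changed: close and drop the stale instance.
            stale = cls._CACHE.pop(old_key, None)
            if stale is not None:
                stale.close()
        fs = cls._CACHE.get(key)
        if fs is None:
            fs = fs_cls(**config)
            cls._CACHE[key] = fs
        cls._NAME_TO_KEY[name] = key
        return fs
```

With this sketch, two calls such as `FSCache.get("origin", FakeFS, {"host": "example.com"}, "/data")` return the same object, so repeated .dir lookups would share one client instead of opening a new connection each time. Whether sharing the instance is enough to share the underlying pool is the open question raised in the description.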

Comment on lines +159 to +161
fs_config = dict(config)
fs = cls(**fs_config)
runtime_config = {**fs_config, "tmp_dir": self.repo.site_cache_dir}

Copilot AI Oct 10, 2025

The separation of fs_config and runtime_config creates confusion about which configuration is used where. Consider using more descriptive variable names like filesystem_config and remote_config to clarify their purposes.

Copilot uses AI. Check for mistakes.

config: dict,
fs_path: str,
) -> tuple[str, ...]:
serialized_config = json.dumps(config, sort_keys=True, default=str)

Copilot AI Oct 10, 2025

Using default=str in json.dumps may produce inconsistent serialization for complex objects. Objects of the same type could serialize differently depending on their string representation, leading to cache misses for equivalent configurations.
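
A small sketch of the concern, assuming a config value without a custom `__str__` or `__repr__`. `Timeout`, `serialize`, and `stable_default` are hypothetical names for illustration and do not come from the PR.

```python
import json


class Timeout:
    """Hypothetical config value relying on the default object repr."""

    def __init__(self, seconds: int) -> None:
        self.seconds = seconds


def serialize(config: dict) -> str:
    # Same call shape as in the diff: fall back to str() for unknown types.
    return json.dumps(config, sort_keys=True, default=str)


# Keep both objects alive so they cannot share a memory address.
t1, t2 = Timeout(30), Timeout(30)

# The default repr embeds the object's address, so equivalent values
# produce different strings and therefore different cache keys.
first = serialize({"timeout": t1})
second = serialize({"timeout": t2})
assert first != second


def stable_default(obj: object) -> dict:
    # Assumption: the object's attributes fully describe it.
    # vars() raises TypeError for objects without __dict__, which surfaces
    # unsupported types instead of silently producing an unstable key.
    return {"__type__": type(obj).__qualname__, **vars(obj)}


stable_first = json.dumps({"timeout": t1}, sort_keys=True, default=stable_default)
stable_second = json.dumps({"timeout": t2}, sort_keys=True, default=stable_default)
assert stable_first == stable_second
```

If the config only ever holds plain strings, numbers, lists, and dicts, `default=str` is never invoked and the issue does not arise; the sketch matters only when richer objects end up in the config.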

Comment on lines +24 to +25
_CACHE: ClassVar[dict[tuple[str, ...], "FileSystem"]] = {}
_NAME_TO_KEY: ClassVar[dict[str, tuple[str, ...]]] = {}

Copilot AI Oct 10, 2025

Class-level caches without size limits or cleanup mechanisms can lead to memory leaks in long-running processes. Consider implementing cache size limits or periodic cleanup.
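
One possible shape for a size limit, as a rough sketch rather than a proposal for the actual patch: a least-recently-used cache that closes an instance when it is evicted. `BoundedFSCache`, `get_or_create`, and `max_size` are hypothetical names, and the sketch assumes an evicted filesystem is no longer in use by any caller.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class BoundedFSCache:
    """Hypothetical LRU cache that closes evicted filesystem objects."""

    def __init__(self, max_size: int = 8) -> None:
        self._max_size = max_size
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._items)

    def get_or_create(self, key: Hashable, factory: Callable[[], Any]) -> Any:
        if key in self._items:
            # Mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        fs = factory()
        self._items[key] = fs
        while len(self._items) > self._max_size:
            # Evict the least recently used entry and release its resources.
            _, evicted = self._items.popitem(last=False)
            self._close(evicted)
        return fs

    def clear(self) -> None:
        # Explicit cleanup hook, e.g. at the end of a command.
        while self._items:
            _, fs = self._items.popitem(last=False)
            self._close(fs)

    @staticmethod
    def _close(fs: Any) -> None:
        # Assumption: not every filesystem object exposes close().
        close = getattr(fs, "close", None)
        if callable(close):
            close()
```

For a short-lived CLI invocation an explicit `clear()` at exit may be sufficient; the size limit is mainly relevant for long-running processes that touch many distinct remotes.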
