Skip to content

Conversation

@ankitaluthra1
Copy link
Collaborator

@ankitaluthra1 ankitaluthra1 commented Nov 2, 2025

This pull request introduces base skeleton for adding different types of storages and would be behind experimental flag until full support for new storage types is added.

Key Changes:

  • GCSFileSystemAdapter: Added gcsfs/gcsfs_adapter.py to subclass GCSFileSystem. This adapter dynamically handles different bucket types using gcs storage layout api, directing operations to the appropriate file implementation.
  • gRPC Integration: Integrated the asynchronous gRPC client within the adapter and ZonalFile.
  • ZonalFile: Created gcsfs/zonal_file.py as a subclass of GCSFile. This class is specifically designed for interacting with GCS Zonal Buckets. The existing GCSFile from gcsfs.core continues to be used for all non-zonal bucket operations within the adapter, ensuring backward compatibility for standard bucket types.
  • Experimental Flag: The new Zonal Bucket/HNS features will only be enabled via an experimental_zb_hns_support flag passed to the GCSFileSystem constructor. The existing gcsfs.core.GCSFileSystem remains the foundation. When the experimental_zb_hns_support flag is not passed or is False, an instance of GCSFileSystem is created as usual, retaining all original functionality and read/write methods.
  • Utilities: Added gcsfs/zb_hns_utils.py for helper functions for gcsfs_adapter and zonal file.
  • Testing: Unit tests are added corresponding to newly added files.

ankitaluthra1 and others added 30 commits October 1, 2025 06:27
Extend gcsfs to create new filesystem. async download
Merged GCSFileSystemAdapter and ZonalFile POC
Rename gcshnsfilesystem to GCSFileSystemAdapter
Feat: Implement gRPC Read Path for Zonal Bucket
…t being used.

Updated zonal tests to work with both real and fake gcs
Added more tests for zonal path reads
ankitaluthra1 and others added 11 commits November 2, 2025 11:54
Feat: Zonal Read optimizations and test suite
* Add mrd closing logic in ZonalFile and GCSFSAdapter
Add exception handling in GCSFile fetch_range

* Fix gcs_adapter tests to use zonal mocks
add test to validate mrd stream is closed with GCSFile closing

* FIx: use patch for auth in gcs_adapter fixture for fake gcs only
Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preliminary comments, not yet having looked through the main code of "adapter"

def zonal_mocks():
"""A factory fixture for mocking Zonal bucket functionality."""

@contextlib.contextmanager
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make a context in a fixture, which already acts like a context?

Copy link
Contributor

@suni72 suni72 Nov 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The inner function allows us to pass different file_data parameters to the mock for each specific call which fetches the correct rvalue for assertion. Also, Some tests has write operations. Since Zonal currently supports reads only, the context manager helps to apply the zonal mocks only during the read portion of the test. If the mocks were active for the entire test duration, the write operations would fail. This seemed the best way to achieve both of the requirements without duplicating mock setup code across every test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, following up on this. The context manager lets us cleanly scope the mocks to the read phase of the test. Does this approach seem reasonable, or should we switch to adding separate mocks for the tests with write?

@martindurant
Copy link
Member

What I am suggesting, is that import of the adapter modure should call register_implementation with the new class, replacing GCSFileSystem in the registry.

The environment variable would be used like:

if os.getenv(..):
    import gcsfs.adapter

in __init__.

For the sake of tests, you could either make a fixture that ensures the right class is registered for each test function/module, or make a separate CI run with the environment variable set.

…Test Mocking (#8)

* Add mrd closing logic in ZonalFile and GCSFSAdapter
Add exception handling in GCSFile fetch_range

* Fix gcs_adapter tests to use zonal mocks
add test to validate mrd stream is closed with GCSFile closing

* FIx: use patch for auth in gcs_adapter fixture for fake gcs only

* Change imports to absolute imports

* Rename GcsFileSystemAdapter to ExtendedGcsFileSystem

* Mock _sync_get_bucket_type call to return UNKNOWN while opening a file in write mode

* Updated ExtendedGcsFileSystem class description
Separated cleanup logic from extended_gcsfs fixture for better readability
@ankitaluthra1 ankitaluthra1 changed the title Feat: Introduce GCSFileSystemAdapter for Zonal Bucket gRPC Read Path Feat: Introduce ExtendedGcsFileSystem for Zonal Bucket gRPC Read Path Nov 12, 2025
* registers ExtendedGCSFileSystem when experimental variable is set instead of __new__ hack

* fixes registry race condition

* removes unnecessary registry

* adds separate ci tests

* adds test_init.py

* fixes pre-commit
ankitaluthra1 and others added 3 commits November 14, 2025 16:51
removed redundant environment variable setting
removed try block for creating TEST_BUCKET to expose errors in test setup
Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Final review before merging this. Most question - you can mark comments in the code, if you intend to come back and change things in a future PR.

One question: can bucket types change? Do we need an explicit way to clear the bucket type cache?

)
# Fallback to core GCSFileSystem, do not register here

__all__ = ["GCSFileSystem", "GCSMap"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be important, but GCSMap will still refer to the original GCSFileSystem regardless of the branch taken above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently zonal feature is defaulted to false, so this shouldn't cause any issue. I've added a todo to handle this use case in future PRs once we have more clarity on this

Comment on lines +20 to +22
self.mrd = asyn.sync(
self.gcsfs.loop, self._init_mrd, self.bucket, self.key, self.generation
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to defer this until we're actually writing something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mrd is short for Multi Range Downloader which uses gRPC to download objects and it cannot be used for writing data. mrd is initialized (underlying stream is created and opened) during ZonalFile.open() so that the mrd is ready to use for reading after opening ZonalFile.

Comment on lines +52 to +54
if self.mrd:
asyn.sync(self.gcsfs.loop, self.mrd.close)
super().close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't super().close() potentially cause a write? So the .mrd should be closed second, not first. Maybe this is not important when we are only implementing read mode.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the previous comment, mrd is used for reads/ downloads only and not for writing. Hence, mrd can be closed first without any issue.

ankitaluthra1 and others added 2 commits November 18, 2025 12:08
* remove redundant create mrd method from zb_hns_utils

* Added mrd to _cat_file method signature and added description

* Optimized _process_limits_to_offset_and_length method to only call info when size is None.

* Add todo to update GCSMap to use correct implementation
@martindurant
Copy link
Member

Please let me know when you are done with my comments, and I shall merge.

@ankitaluthra1
Copy link
Collaborator Author

Please let me know when you are done with my comments, and I shall merge.

Done with comments, PTAL

@martindurant martindurant merged commit 47c2d76 into fsspec:main Nov 18, 2025
6 checks passed
@martindurant
Copy link
Member

@ankitaluthra1 - this PR is causing the CI in fsspec to fail ( fsspec/filesystem_spec#1944 ). It seems extended (mrd) tests are being run in combination with the original GCSFileSystem. Can you please investigate?

@ankitaluthra1
Copy link
Collaborator Author

ankitaluthra1 commented Nov 20, 2025

@ankitaluthra1 - this PR is causing the CI in fsspec to fail ( fsspec/filesystem_spec#1944 ). It seems extended (mrd) tests are being run in combination with the original GCSFileSystem. Can you please investigate?

Added the fix, PTAL - #712

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants