Conversation

@stevetru1

This PR contains:

  • [x] New features
  • [ ] Changes to dev-tools e.g. CI config / GitHub tooling
  • [x] Docs
  • [ ] Bug fixes
  • [ ] Code refactor

What is the current behavior? (You can also link to an open issue here)

The current system supports local files and S3 (with some limitations).

What is the new behavior?

This adds support for multiple Azure file system connection points, and also adds checks for edge cases of remote file systems generally.

Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)

This should not break anything, but confirmation from someone with a robust existing S3 setup would be welcome.

Other information:

Some of the edge-case checks are currently flagged to apply only to Azure, but other remote filesystems would likely benefit from them as well for resiliency. These areas are marked with TODOs for consideration.

@stevetru1 (Author)

Note: while doing some regression testing I found something odd in different versions of the Azure abfs Python integrations: multiple Azure tenant assignments break the full implicit credential flow. I'll do more debugging to determine whether the problem is in this code or a bug in the API. Authentication still works with SAS tokens, keys, etc.

Added troubleshooting tips for the Azure log errors that occur when the implicit flow cannot select a credential. Adding logic to correct for this would be complicated and is largely unnecessary given the available fallback methods.
@stevetru1 (Author)


After diagnosing this and testing a number of fixes, I don't think it's worth addressing here. The failure occurs when users have multiple tenants or other settings that create multiple potential default credentials. Falling back to SAS is the best option, and a note to that effect has been added to the docs.

@jjallaire (Collaborator) left a comment

Thanks! A few changes requested.


# Attempt to remove the test blob. Some Azure credentials (e.g. SAS without 'd' delete
# permission or managed identity with only Data Writer) can create but not delete. Treat
# that as writeable (we'll leave behind a tiny marker file that can be GC'd later).
@jjallaire (Collaborator):

But we'll keep leaving these behind every time we call is_writeable(). How about using the same UUID every time, so that in the worst case there is only one file left in the bucket?
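
A minimal sketch of that idea, assuming an fsspec-style filesystem object; the function name, fixed marker filename, and error handling are illustrative rather than the PR's actual code:

```python
# Sketch only: reuse one deterministic marker name so repeated is_writeable()
# calls leave at most a single file behind in the bucket.
WRITE_TEST_MARKER = ".inspect-write-test"  # hypothetical fixed name instead of a fresh uuid


def is_writeable(fs, log_dir: str) -> bool:
    marker = f"{log_dir.rstrip('/')}/{WRITE_TEST_MARKER}"
    try:
        with fs.open(marker, "wb") as f:
            f.write(b"")
        try:
            # some Azure credentials can create but not delete; still treat as writeable
            fs.rm(marker)
        except Exception:
            pass  # at worst one marker remains, and it is overwritten on the next check
        return True
    except Exception:
        return False
```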

# note: S3 doesn't give you a directory modification time
if "mtime" not in file.keys() and file["type"] == "file":
    file["mtime"] = self.fs.created(file).timestamp()
if "mtime" not in file.keys() and file["type"] == "file" and hasattr(self.fs, "created"):
@jjallaire (Collaborator):

Could you add a brief comment here explaining why we do the created check, and why we ignore the error below?
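
For illustration, the requested comment plus guard might read roughly as follows; the try/except is an assumption about the "error below" being ignored:

```python
# S3 doesn't report a modification time for directories, and not every fsspec
# backend implements created() (the base class raises NotImplementedError), so
# guard with hasattr and swallow errors below: a missing mtime is preferable to
# failing the whole listing.
if "mtime" not in file.keys() and file["type"] == "file" and hasattr(self.fs, "created"):
    try:
        file["mtime"] = self.fs.created(file).timestamp()
    except Exception:
        pass
```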

)

options = deepcopy(DEFAULT_FS_OPTIONS.get(scheme, {}))
# Azure Blob / DataLake (adlfs) credential injection (lazy so dependency is optional)
@jjallaire (Collaborator):

Could you break this into a helper function that lives alongside DEFAULT_FS_OPTIONS? In fact, I think we should just expose a default_fs_options(scheme) function that does the lookup in the map and then falls back to calling the Azure function as required.
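
A sketch of the suggested shape; azure_credential_options and the scheme tuple below are assumptions, not existing code:

```python
from copy import deepcopy

AZURE_SCHEMES = ("az", "abfs", "abfss", "adl")  # assumption about which schemes apply


def default_fs_options(scheme: str) -> dict:
    """Look up default fsspec options for a scheme, falling back to the Azure
    helper so the adlfs dependency stays optional and lazily imported."""
    options = deepcopy(DEFAULT_FS_OPTIONS.get(scheme, {}))
    if scheme in AZURE_SCHEMES:
        # azure_credential_options() is a hypothetical helper that would hold
        # the credential-injection logic currently inlined here
        options.update(azure_credential_options())
    return options
```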

if not fs.exists(log_dir):
    fs.mkdir(log_dir, True)
log_dir = fs.info(log_dir).name
scheme = (log_dir.split("://", 1)[0].lower() if "://" in log_dir else "")
@jjallaire (Collaborator):

You can use scheme = urlparse(log_dir).scheme for this
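
For reference, using the standard library as suggested:

```python
from urllib.parse import urlparse

# urlparse lowercases the scheme and returns "" for plain local paths,
# matching the manual split above
scheme = urlparse(log_dir).scheme
```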

fs.mkdir(log_dir, True)
log_dir = fs.info(log_dir).name
scheme = (log_dir.split("://", 1)[0].lower() if "://" in log_dir else "")
try:
@jjallaire (Collaborator):

This should all be in a helper function so that reading it doesn't distract from the main flow
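
For example, the setup quoted above could move into something like this (the helper name is illustrative; the body is just the existing lines):

```python
from urllib.parse import urlparse


def ensure_log_dir(fs, log_dir: str) -> tuple[str, str]:
    """Create the log directory if needed and return its resolved path and scheme."""
    if not fs.exists(log_dir):
        fs.mkdir(log_dir, True)
    log_dir = fs.info(log_dir).name
    scheme = urlparse(log_dir).scheme
    return log_dir, scheme
```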


def log_listing_response(logs: list[EvalLogInfo], log_dir: str) -> web.Response:
    # Cleanly handle path names, TODO: add support for other cloud providers with multiple access points
    def normalize_name(name: str) -> str:
@jjallaire (Collaborator):

Let's make this a helper function alongside the other Azure-specific stuff you factor out from above. Generally, I think we should have a _view/azure.py source file where all of this lives.
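
A possible shape for that module; the prefixes normalize_name strips below are a guess at the current behavior, not taken from the PR:

```python
# _view/azure.py (sketch)
AZURE_PREFIXES = ("az://", "abfs://", "abfss://")  # assumption about which prefixes apply


def normalize_name(name: str) -> str:
    """Strip an Azure scheme/access-point prefix so listing paths compare consistently."""
    for prefix in AZURE_PREFIXES:
        if name.startswith(prefix):
            return name[len(prefix):]
    return name
```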

if fs.is_async():
    async with async_fileystem(log_dir, fs_options=fs_options) as async_fs:
        if await async_fs._exists(log_dir):
        # Attempt existence check with robust handling for Azure-style auth issues.
@jjallaire (Collaborator):

Same as above: this should be a helper in the azure.py file.

files: list[dict[str, Any]] = []
async for _, _, filenames in async_fs._walk(log_dir, detail=True):
    files.extend(filenames.values())
try:
@jjallaire (Collaborator):

I think maybe we want to actually detect that we are using a provider that we know can't handle detail, and enter that code path explicitly rather than relying on a TypeError. Alternatively, I'd be fine with just interrogating the _walk method to see if it takes a detail parameter.
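
The second option could look roughly like this (the helper name is illustrative):

```python
import inspect


def walk_accepts_detail(async_fs) -> bool:
    """Return True if the filesystem's _walk method takes a detail parameter."""
    try:
        return "detail" in inspect.signature(async_fs._walk).parameters
    except (TypeError, ValueError):
        return False  # signature not introspectable; assume detail is unsupported
```

Callers could then branch on this up front instead of catching the error raised by the detail=True call.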

@stevetru1 (Author)

Ack on the requests. I'll be out for a week or so, so apologies in advance for the delay before I can get to these.

@mgtk77 (Contributor) commented Dec 4, 2025

I opened a follow-up PR that addresses the review comments:
#2866

My PR is based on the work in this PR; I addressed the open comments from the original PR and changed a few small parts.
A colleague of mine talked with @stevetru1, who said he no longer needs the changes in his PR and has no time to address the comments. Since my team needs this, I'm trying to complete the work.
Credit-wise, it should be co-authored.
