Adding support for Azure remote file hosting #2411
Conversation
Note: I did some regression testing and found something odd with different versions of the Azure abfs Python integrations, related to multiple Azure tenant assignments, that makes the full implicit flow break. I'll be doing some more debugging to test whether it's in this code or a bug in the API. This still works with SAS tokens, keys, etc.
Added troubleshooting tips for the Azure log errors that occur when the implicit flow does not know how to select a credential. Adding the logic to correct for this would be complicated and largely unnecessary given the available fallback methods.
After diagnosing and testing a bunch of fixes, this isn't worth addressing here in my opinion. The failure occurs when users have multiple tenants or other settings that create multiple potential default credentials. Falling back to SAS is the best option, and a note has been added to the docs to that effect.
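For reference, a minimal sketch of the SAS fallback using fsspec/adlfs; the account name, token, and container path below are placeholders and not code from this PR:

```python
# Illustrative only: explicit SAS-token credentials for an Azure filesystem,
# bypassing the ambiguous implicit/default credential flow. Placeholder values.
import fsspec

fs = fsspec.filesystem(
    "az",                         # handled by adlfs when it is installed
    account_name="myaccount",     # placeholder storage account
    sas_token="?sv=...&sig=...",  # placeholder SAS token
)
print(fs.ls("az://my-container/logs"))  # hypothetical container path
```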
jjallaire left a comment:
Thanks! A few changes requested.
# Attempt to remove the test blob. Some Azure credentials (e.g. SAS without 'd' delete
# permission or managed identity with only Data Writer) can create but not delete. Treat
# that as writeable (we'll leave behind a tiny marker file that can be GC'd later).
But we'll keep leaving these behind over and over again every time we call is_writeable(). How about if we use the same uuid every time so worst case there is one file left in the bucket?
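A minimal sketch of that suggestion, assuming an fsspec-style filesystem handle; the marker name and function shape are illustrative, not the PR's code:

```python
# Sketch only: reuse one stable marker path so repeated writeability checks
# leave at most a single file behind when the credential can create but not delete.
WRITE_MARKER = ".inspect-write-check"  # hypothetical fixed name instead of a fresh uuid

def is_writeable(fs, log_dir: str) -> bool:
    marker = f"{log_dir.rstrip('/')}/{WRITE_MARKER}"
    try:
        fs.pipe_file(marker, b"")   # attempt the write
    except Exception:
        return False
    try:
        fs.rm(marker)               # best-effort cleanup; some Azure credentials
    except Exception:               # (e.g. SAS without delete) can't remove it
        pass
    return True
```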
# note: S3 doesn't give you a directory modification time
if "mtime" not in file.keys() and file["type"] == "file":
    file["mtime"] = self.fs.created(file).timestamp()
if "mtime" not in file.keys() and file["type"] == "file" and hasattr(self.fs, "created"):
Could you add a brief comment here explaining why we do the created check and why we ignore the error below
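One way the requested comment (and the guarded call) might read; the wording and the broad exception handler below are paraphrased from the diff above rather than taken from the PR:

```python
# Not every fsspec backend exposes created(), and S3 has no directory mtimes at
# all, so only attempt the lookup when the backend supports it...
if "mtime" not in file.keys() and file["type"] == "file" and hasattr(self.fs, "created"):
    try:
        file["mtime"] = self.fs.created(file).timestamp()
    except Exception:
        # ...and ignore per-entry failures (some Azure backends can still raise
        # for individual blobs); a missing mtime is preferable to a hard error.
        pass
```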
)
options = deepcopy(DEFAULT_FS_OPTIONS.get(scheme, {}))
# Azure Blob / DataLake (adlfs) credential injection (lazy so dependency is optional)
Could you break this into a helper function that exists alongside the DEFAULT_FS_OPTIONS. In fact, I think we should just expose a default_fs_options(scheme) function that does the lookup in the map and then falls back to calling the azure function as required.
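A rough sketch of that shape; `DEFAULT_FS_OPTIONS` is from the PR, while `default_fs_options` and `azure_default_fs_options` are names assumed here for illustration:

```python
from copy import deepcopy

DEFAULT_FS_OPTIONS: dict[str, dict] = {}  # populated with the real per-scheme defaults in the PR

def azure_default_fs_options() -> dict:
    # placeholder for the lazy adlfs credential injection factored out of the main flow
    return {}

def default_fs_options(scheme: str) -> dict:
    options = deepcopy(DEFAULT_FS_OPTIONS.get(scheme, {}))
    if scheme in ("az", "abfs", "abfss", "adl"):
        options = {**options, **azure_default_fs_options()}
    return options
```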
if not fs.exists(log_dir):
    fs.mkdir(log_dir, True)
log_dir = fs.info(log_dir).name
scheme = (log_dir.split("://", 1)[0].lower() if "://" in log_dir else "")
You can use scheme = urlparse(log_dir).scheme for this
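For illustration, with the standard library only:

```python
from urllib.parse import urlparse

urlparse("az://my-container/logs").scheme   # -> "az"
urlparse("/local/path/to/logs").scheme      # -> "" for plain local paths
```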
fs.mkdir(log_dir, True)
log_dir = fs.info(log_dir).name
scheme = (log_dir.split("://", 1)[0].lower() if "://" in log_dir else "")
try:
This should all be in a helper function so that reading it doesn't distract from the main flow
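One possible shape for such a helper, combining the suggestions above; the function name and return contract are assumptions, not code from the PR:

```python
from urllib.parse import urlparse

def resolve_log_dir(fs, log_dir: str) -> tuple[str, str]:
    """Ensure the log directory exists and return (normalized path, scheme)."""
    if not fs.exists(log_dir):
        fs.mkdir(log_dir, True)
    log_dir = fs.info(log_dir).name
    return log_dir, urlparse(log_dir).scheme
```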
def log_listing_response(logs: list[EvalLogInfo], log_dir: str) -> web.Response:
    # Cleanly handle path names, TODO: add support for other cloud providers with multiple access points
    def normalize_name(name: str) -> str:
Let's make this a helper function alongside the other azure-specific stuff you factor out from above. Generally I think we should have a _view/azure.py source file where all of this lives.
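A sketch of what the factored-out helper might look like in a `_view/azure.py` module; the endpoint handling below is an assumption based on Azure exposing both blob and dfs access points for the same account, not the PR's implementation:

```python
from urllib.parse import urlparse

def normalize_azure_name(name: str) -> str:
    """Collapse equivalent Azure access points so log names compare consistently."""
    parsed = urlparse(name)
    if parsed.scheme in ("az", "abfs", "abfss"):
        # "abfss://container@account.dfs.core.windows.net/path" and
        # "az://container/path" can refer to the same blob; keep container + path
        container = parsed.netloc.split("@", 1)[0]
        return f"{container}{parsed.path}"
    return name
```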
if fs.is_async():
    async with async_fileystem(log_dir, fs_options=fs_options) as async_fs:
        if await async_fs._exists(log_dir):
            # Attempt existence check with robust handling for Azure-style auth issues.
Same as above. helper in azure.py file
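Roughly, such a helper in `azure.py` could wrap the existence check and translate ambiguous-credential failures into a clearer message; the name and error text are illustrative only:

```python
async def azure_exists(async_fs, log_dir: str) -> bool:
    try:
        return await async_fs._exists(log_dir)
    except Exception as ex:
        # Azure default-credential chains can fail here when multiple tenants
        # yield ambiguous credentials; point users at the documented fallbacks.
        raise RuntimeError(
            f"Unable to check {log_dir}. If you are using implicit Azure "
            "credentials, consider a SAS token or account key instead."
        ) from ex
```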
files: list[dict[str, Any]] = []
async for _, _, filenames in async_fs._walk(log_dir, detail=True):
    files.extend(filenames.values())
try:
I think maybe we want to actually detect that we are using a provider that we know can't handle detail and just enter that codepath explicitly rather than relying on a type error. Alternatively, I'd be fine just interrogating the _walk method to see if it takes a detail parameter.
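A sketch of the second option (interrogating `_walk`); `inspect.signature` is standard library, and the helper name here is made up:

```python
import inspect

def walk_supports_detail(walk_method) -> bool:
    try:
        return "detail" in inspect.signature(walk_method).parameters
    except (TypeError, ValueError):
        # some callables (e.g. C-implemented ones) don't expose a signature
        return False
```

The listing code would then pass `detail=True` only when this returns true and fall back to per-file `info()` calls otherwise.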
Ack on requests - I'll be out for a week or so, so apologies in advance for the delay before I can get to these.
I opened a follow-up PR with the comments handled: my PR is based on the work in this PR. I addressed the open comments in the original PR and changed a few small parts.
This PR contains:
What is the current behavior? (You can also link to an open issue here)
The current system supports local files and the use of S3 (with some limitations).
What is the new behavior?
This adds support for multiple Azure file system connection points, and also adds checks for remote file system edge cases more generally.
Does this PR introduce a breaking change? (What changes might users need to make in their application due to this PR?)
This should not break anything, but confirmation from someone with a robust existing S3 setup would be welcome.
Other information:
I left flags for some of the edge case checks limited to Azure only, but they would likely benefit from being applied to other remote filesystems as well for resiliency. These areas are marked with TODOs for consideration.