-
Notifications
You must be signed in to change notification settings - Fork 120
Whitelisting Onelake API & Workspace PL FQDNs #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @SmritiAgrawal04
This PR needs some tests to show it working I think
crepererum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do somewhat agree with @alamb. I know it's close to impossible to write an integration test for this, but maybe we can at least have a unit test for parse_url?
src/azure/builder.rs
Outdated
| let first_label = host.split('.').next().unwrap_or_default(); | ||
| self.account_name = Some(validate(first_label)?); | ||
|
|
||
| let container = parsed.path_segments().unwrap().next().expect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: onelake uses workspace terminology
src/azure/builder.rs
Outdated
| // Regex to match WS-PL FQDN: "{workspaceid}.z??.dfs.fabric.microsoft.com" | ||
| // workspaceid = 32 hex chars, z?? = z + first two chars of workspaceid | ||
| lazy_static::lazy_static! { | ||
| static ref WS_PL_REGEX: Regex = Regex::new(r"^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(dfs|blob)\.fabric\.microsoft\.com$").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add support for .onelake.fabric.microsoft.com also
src/azure/builder.rs
Outdated
| let xy = captures.name("xy").unwrap().as_str(); | ||
|
|
||
| // Validate z?? matches first 2 chars of workspaceid | ||
| if &workspaceid[0..2] != xy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this validation
src/azure/builder.rs
Outdated
| } | ||
|
|
||
| // Otherwise, check Fabric global / Onelake API FQDN | ||
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are checking for global endpoint, should these not be with not like (!host.ends_with(DFS_FABRIC_SUFFIX) && !host.ends_with(BLOB_FABRIC_SUFFIX))
src/azure/builder.rs
Outdated
| if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) { | ||
| let labels: Vec<&str> = host.split('.').collect(); | ||
| let account_name = if labels.len() >= 2 && labels[0].contains("api") && labels[1] == "onelake" { | ||
| format!("{}-{}", labels[0], labels[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are referring workspace id as account_name, then this will not work for non pl scenario. In those case it will give account_name as "westus-api-onelake"
|
Hi @alamb & @crepererum, I added the unit tests as suggested. I request to review the PR please. |
src/azure/builder.rs
Outdated
| let mut builder = MicrosoftAzureBuilder::new(); | ||
| builder | ||
| .parse_url("https://account.blob.fabric.microsoft.com/container") | ||
| .parse_url("https://account.blob.fabric.microsoft.com/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why did this test case change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By mistake. Reverted.
| } | ||
| Some((a, "dfs.fabric.microsoft.com")) | Some((a, "blob.fabric.microsoft.com")) => { | ||
| self.account_name = Some(validate(a)?); | ||
| // Attempt to infer the container name from the URL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this comment? It seems helpful
| } | ||
| "https" => { | ||
| // Regex to match WS-PL FQDN: | ||
| // "{workspaceid}.z??.(onelake|dfs|blob).fabric.microsoft.com" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please also add an example URL for each of the APIs you are adding support for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added for WS-PL DFS/ Blob endpoints. We are waiting for PM to confirm on ABFSS & WS-PL onelake domains.
| self.container_name = Some(validate(parsed.username())?); | ||
| self.account_name = Some(validate(a)?); | ||
| self.use_fabric_endpoint = true.into(); | ||
| } else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft doc so we can cite it in code/tests?
I don't see it in https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still outstanding I think. Is this a vaild url?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @SmritiAgrawal04 -- i left some more questions
BTW I asked codex for a review of this PR like this
please review this PR (checked out in this directory): https://github.com/apache/arrow-rs-object-store/pull/552
Specifically ensure that the format of the URLs supported matches publically published microsoft urls and provide documentation references
And this was the output which I found helpful
───────────────────────────────────────────────────────────────────────────────────────────────
• Findings
- High: -api.onelake.fabric.microsoft.com is accepted for both abfs(s) and https, but I can’t find this FQDN in
the public OneLake URL docs. The published endpoints are onelake.dfs.fabric.microsoft.com and workspace-scoped
https://{workspaceId}.z{xy}.dfs.fabric.microsoft.com/... (see https://learn.microsoft.com/en-us/fabric/onelake/
onelake-access-api). If -api.onelake is internal or not publicly documented, this should be gated or removed. - Medium: The WS‑PL regex no longer verifies that z{xy} matches the first two hex chars of the workspace id (src/
azure/builder.rs:692-708). The published WS‑PL format is https://
{workspaceId}.z{first2}.dfs.fabric.microsoft.com (same doc). Without the check, non‑published FQDNs are
accepted. - Medium: WS‑PL regex allows .onelake.fabric.microsoft.com (src/azure/builder.rs:692-699). The public docs list
dfs (and blob endpoints for blob access), but I don’t see onelake as a WS‑PL subdomain. Please confirm with a
Microsoft reference or remove to match published formats. - Low: WS‑PL regex only allows lowercase hex ([0-9a-f]{32}) (src/azure/builder.rs:697). If Microsoft treats
workspace IDs as case‑insensitive GUIDs in URLs, this will reject valid uppercase forms; consider documenting
that only lowercase is supported or make the regex case‑insensitive.
Doc references
- OneLake access API URL formats: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
- Private Link / workspace-specific OneLake endpoints (if this PR targets WS‑PL): please confirm the exact doc
section that defines the workspaceId.z{xy}.dfs.fabric.microsoft.com pattern and whether any onelake WS‑PL host
is documented (I could not find it in public docs).
Questions / assumptions
- Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft
doc so we can cite it in code/tests? - Should WS‑PL accept only dfs/blob subdomains, or is .onelake.fabric.microsoft.com explicitly published?
|
FYI #604 may be related. |
Hi @alamb, I have addressed all comments. About, finding #1 & #3, we plan to add it to the public documentation, the PR for which is already out. I request to approve these changes meanwhile. Thanks |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your work on this @SmritiAgrawal04
I am sorry for the length of time but our review bandwdith is very limited here
Part of the reason it has taken so long to review is there is no ticket or description of what issue this is fixing, nor good examples of what new features you have added
In order to proceed, please:
- Fill out the PR description with a clear description of what problem you are solving. This shoud, at a minimum,, provide URLs that can't be parsed with the current crate, and examples of the expected output
- Provide a link to the documentation of the format of onelake URLs so that we can verify the URLs should be parsed
- Add a link to the docs / references in the code so that future readers can refer back to them
| self.container_name = Some(validate(parsed.username())?); | ||
| self.account_name = Some(validate(a)?); | ||
| self.use_fabric_endpoint = true.into(); | ||
| } else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is still outstanding I think. Is this a vaild url?
| itertools = "0.14.0" | ||
| parking_lot = { version = "0.12" } | ||
| percent-encoding = "2.1" | ||
| regex = "1.11.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is only used for azure, in order to keep the dependency chain minimal, can we please make this an optional dependency, following for example base64?
And then activate like this?
azure = ["cloud", "httparse"]
| builder | ||
| .parse_url("https://Ab000000000000000000000000000000.zAb.dfs.fabric.microsoft.com/") | ||
| .unwrap(); | ||
| assert_eq!(builder.account_name, Some("ab000000000000000000000000000000.zab".to_string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the account name really supposed to have the ab000000000000000000000000000000 in it? It seems confusing that the container name is also ab000000000000000000000000000000
|
|
||
| let ws_pl_regex = WS_PL_REGEX.get_or_init(|| { | ||
| Regex::new( | ||
| r"(?i)^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(onelake|dfs|blob)\.fabric\.microsoft\.com$" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the rationale for using regexp math here rather than just hostname splitting? The examples / tests you have added don't seem to test for invalid URLs / that don't follow a simple .... format
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?