Skip to content

Conversation

@SmritiAgrawal04
Copy link

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @SmritiAgrawal04

This PR needs some tests to show it working I think

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do somewhat agree with @alamb. I know it's close to impossible to write an integration test for this, but maybe we can at least have a unit test for parse_url?

@tustvold tustvold marked this pull request as draft December 13, 2025 13:22
let first_label = host.split('.').next().unwrap_or_default();
self.account_name = Some(validate(first_label)?);

let container = parsed.path_segments().unwrap().next().expect(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: onelake uses workspace terminology

// Regex to match WS-PL FQDN: "{workspaceid}.z??.dfs.fabric.microsoft.com"
// workspaceid = 32 hex chars, z?? = z + first two chars of workspaceid
lazy_static::lazy_static! {
static ref WS_PL_REGEX: Regex = Regex::new(r"^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(dfs|blob)\.fabric\.microsoft\.com$").unwrap();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add support for .onelake.fabric.microsoft.com also

let xy = captures.name("xy").unwrap().as_str();

// Validate z?? matches first 2 chars of workspaceid
if &workspaceid[0..2] != xy {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this validation

}

// Otherwise, check Fabric global / Onelake API FQDN
if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are checking for global endpoint, should these not be with not like (!host.ends_with(DFS_FABRIC_SUFFIX) && !host.ends_with(BLOB_FABRIC_SUFFIX))

if host.ends_with(DFS_FABRIC_SUFFIX) || host.ends_with(BLOB_FABRIC_SUFFIX) {
let labels: Vec<&str> = host.split('.').collect();
let account_name = if labels.len() >= 2 && labels[0].contains("api") && labels[1] == "onelake" {
format!("{}-{}", labels[0], labels[1])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are referring workspace id as account_name, then this will not work for non pl scenario. In those case it will give account_name as "westus-api-onelake"

@SmritiAgrawal04 SmritiAgrawal04 marked this pull request as ready for review January 9, 2026 05:03
@SmritiAgrawal04
Copy link
Author

Hi @alamb & @crepererum,

I added the unit tests as suggested. I request to review the PR please.

let mut builder = MicrosoftAzureBuilder::new();
builder
.parse_url("https://account.blob.fabric.microsoft.com/container")
.parse_url("https://account.blob.fabric.microsoft.com/")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this test case change?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By mistake. Reverted.

}
Some((a, "dfs.fabric.microsoft.com")) | Some((a, "blob.fabric.microsoft.com")) => {
self.account_name = Some(validate(a)?);
// Attempt to infer the container name from the URL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this comment? It seems helpful

}
"https" => {
// Regex to match WS-PL FQDN:
// "{workspaceid}.z??.(onelake|dfs|blob).fabric.microsoft.com"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please also add an example URL for each of the APIs you are adding support for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for WS-PL DFS/ Blob endpoints. We are waiting for PM to confirm on ABFSS & WS-PL onelake domains.

self.container_name = Some(validate(parsed.username())?);
self.account_name = Some(validate(a)?);
self.use_fabric_endpoint = true.into();
} else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft doc so we can cite it in code/tests?

I don't see it in https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still outstanding I think. Is this a vaild url?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SmritiAgrawal04 -- i left some more questions

BTW I asked codex for a review of this PR like this

please review this PR (checked out in this directory): https://github.com/apache/arrow-rs-object-store/pull/552

  Specifically ensure that the format of the URLs supported matches publically published microsoft urls and provide documentation references

And this was the output which I found helpful

───────────────────────────────────────────────────────────────────────────────────────────────

• Findings

  • High: -api.onelake.fabric.microsoft.com is accepted for both abfs(s) and https, but I can’t find this FQDN in
    the public OneLake URL docs. The published endpoints are onelake.dfs.fabric.microsoft.com and workspace-scoped
    https://{workspaceId}.z{xy}.dfs.fabric.microsoft.com/... (see https://learn.microsoft.com/en-us/fabric/onelake/
    onelake-access-api). If -api.onelake is internal or not publicly documented, this should be gated or removed.
  • Medium: The WS‑PL regex no longer verifies that z{xy} matches the first two hex chars of the workspace id (src/
    azure/builder.rs:692-708). The published WS‑PL format is https://
    {workspaceId}.z{first2}.dfs.fabric.microsoft.com (same doc). Without the check, non‑published FQDNs are
    accepted.
  • Medium: WS‑PL regex allows .onelake.fabric.microsoft.com (src/azure/builder.rs:692-699). The public docs list
    dfs (and blob endpoints for blob access), but I don’t see onelake as a WS‑PL subdomain. Please confirm with a
    Microsoft reference or remove to match published formats.
  • Low: WS‑PL regex only allows lowercase hex ([0-9a-f]{32}) (src/azure/builder.rs:697). If Microsoft treats
    workspace IDs as case‑insensitive GUIDs in URLs, this will reject valid uppercase forms; consider documenting
    that only lowercase is supported or make the regex case‑insensitive.

Doc references

  • OneLake access API URL formats: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
  • Private Link / workspace-specific OneLake endpoints (if this PR targets WS‑PL): please confirm the exact doc
    section that defines the workspaceId.z{xy}.dfs.fabric.microsoft.com pattern and whether any onelake WS‑PL host
    is documented (I could not find it in public docs).

Questions / assumptions

  • Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft
    doc so we can cite it in code/tests?
  • Should WS‑PL accept only dfs/blob subdomains, or is .onelake.fabric.microsoft.com explicitly published?

@tustvold
Copy link
Contributor

FYI #604 may be related.

@SmritiAgrawal04
Copy link
Author

Thanks @SmritiAgrawal04 -- i left some more questions

BTW I asked codex for a review of this PR like this

please review this PR (checked out in this directory): https://github.com/apache/arrow-rs-object-store/pull/552

  Specifically ensure that the format of the URLs supported matches publically published microsoft urls and provide documentation references

And this was the output which I found helpful

───────────────────────────────────────────────────────────────────────────────────────────────

• Findings

  • High: -api.onelake.fabric.microsoft.com is accepted for both abfs(s) and https, but I can’t find this FQDN in
    the public OneLake URL docs. The published endpoints are onelake.dfs.fabric.microsoft.com and workspace-scoped
    https://{workspaceId}.z{xy}.dfs.fabric.microsoft.com/... (see https://learn.microsoft.com/en-us/fabric/onelake/
    onelake-access-api). If -api.onelake is internal or not publicly documented, this should be gated or removed.
  • Medium: The WS‑PL regex no longer verifies that z{xy} matches the first two hex chars of the workspace id (src/
    azure/builder.rs:692-708). The published WS‑PL format is https://
    {workspaceId}.z{first2}.dfs.fabric.microsoft.com (same doc). Without the check, non‑published FQDNs are
    accepted.
  • Medium: WS‑PL regex allows .onelake.fabric.microsoft.com (src/azure/builder.rs:692-699). The public docs list
    dfs (and blob endpoints for blob access), but I don’t see onelake as a WS‑PL subdomain. Please confirm with a
    Microsoft reference or remove to match published formats.
  • Low: WS‑PL regex only allows lowercase hex ([0-9a-f]{32}) (src/azure/builder.rs:697). If Microsoft treats
    workspace IDs as case‑insensitive GUIDs in URLs, this will reject valid uppercase forms; consider documenting
    that only lowercase is supported or make the regex case‑insensitive.

Doc references

  • OneLake access API URL formats: https://learn.microsoft.com/en-us/fabric/onelake/onelake-access-api
  • Private Link / workspace-specific OneLake endpoints (if this PR targets WS‑PL): please confirm the exact doc
    section that defines the workspaceId.z{xy}.dfs.fabric.microsoft.com pattern and whether any onelake WS‑PL host
    is documented (I could not find it in public docs).

Questions / assumptions

  • Is *-api.onelake.fabric.microsoft.com a publicly documented endpoint? If yes, can you point to the Microsoft
    doc so we can cite it in code/tests?
  • Should WS‑PL accept only dfs/blob subdomains, or is .onelake.fabric.microsoft.com explicitly published?

Hi @alamb,

I have addressed all comments. About, finding #1 & #3, we plan to add it to the public documentation, the PR for which is already out. I request to approve these changes meanwhile. Thanks

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work on this @SmritiAgrawal04

I am sorry for the length of time but our review bandwdith is very limited here

Part of the reason it has taken so long to review is there is no ticket or description of what issue this is fixing, nor good examples of what new features you have added

In order to proceed, please:

  1. Fill out the PR description with a clear description of what problem you are solving. This shoud, at a minimum,, provide URLs that can't be parsed with the current crate, and examples of the expected output
  2. Provide a link to the documentation of the format of onelake URLs so that we can verify the URLs should be parsed
  3. Add a link to the docs / references in the code so that future readers can refer back to them

self.container_name = Some(validate(parsed.username())?);
self.account_name = Some(validate(a)?);
self.use_fabric_endpoint = true.into();
} else if let Some(a) = host.strip_suffix("-api.onelake.fabric.microsoft.com") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still outstanding I think. Is this a vaild url?

itertools = "0.14.0"
parking_lot = { version = "0.12" }
percent-encoding = "2.1"
regex = "1.11.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used for azure, in order to keep the dependency chain minimal, can we please make this an optional dependency, following for example base64?

And then activate like this?

azure = ["cloud", "httparse"]

builder
.parse_url("https://Ab000000000000000000000000000000.zAb.dfs.fabric.microsoft.com/")
.unwrap();
assert_eq!(builder.account_name, Some("ab000000000000000000000000000000.zab".to_string()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the account name really supposed to have the ab000000000000000000000000000000 in it? It seems confusing that the container name is also ab000000000000000000000000000000


let ws_pl_regex = WS_PL_REGEX.get_or_init(|| {
Regex::new(
r"(?i)^(?P<workspaceid>[0-9a-f]{32})\.z(?P<xy>[0-9a-f]{2})\.(onelake|dfs|blob)\.fabric\.microsoft\.com$"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the rationale for using regexp math here rather than just hostname splitting? The examples / tests you have added don't seem to test for invalid URLs / that don't follow a simple .... format

@alamb
Copy link
Contributor

alamb commented Jan 17, 2026

I have addressed all comments. About, finding #1 & #3, we plan to add it to the public documentation, the PR for which is already out. I request to approve these changes meanwhile. Thanks

Some comments remain unaddressed. In terms of documentation, please include it in this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants