Skip to content

GH-47560: [C++] Fix host handling for default HDFS URI#47458

Merged
pitrou merged 4 commits intoapache:mainfrom
dsevilla:patch-3
Sep 23, 2025
Merged

GH-47560: [C++] Fix host handling for default HDFS URI#47458
pitrou merged 4 commits intoapache:mainfrom
dsevilla:patch-3

Conversation

@dsevilla
Copy link
Copy Markdown
Contributor

@dsevilla dsevilla commented Aug 29, 2025

Rationale for this change

In #25324 a fix is introduced for the python HadoopFileSystem, but it does not work if you use from_uri(), as it is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled as in the python case, as the whole "hdfs://default" is passed to the underlying hdfs library, that expect "default" to search in $HADOOP_CONF_DIR/core-site.xml.

What changes are included in this PR?

Handle the HadoopFileSystem.from_uri() (or FileSystem.from_uri() when using hdfs://default:xxx) special HDFS URIs.

Are these changes tested?

There are no specific tests for this feature, but existing HDFS CI jobs pass.

Are there any user-facing changes?

Not exactly, but the documentation is honored for the from_uri() case.

…d via `from_uri()`

In apache#25324 a fix is introduced for the python HadoopFileSystem, but it does not work if you use `from_uri()`, as it is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled as in the python case, as the whole "hdfs://default" is passed to the underlying hdfs library, that expect "default" to search in $HADOOP_CONF_DIR/core-site.xml.
@github-actions
Copy link
Copy Markdown

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@dsevilla
Copy link
Copy Markdown
Contributor Author

Any work on this? It is not working the "default" value of HDFS host if used in from_uri().

@dsevilla
Copy link
Copy Markdown
Contributor Author

I added a new issue. Now this pull request fixes #47560.

@dsevilla dsevilla changed the title Fix host handling for default HDFS URI (fix for #25324 when used via from_uri()) Fix host handling for default HDFS URI (fix for #25324 and #47560 when used via from_uri()) Sep 14, 2025
@AlenkaF
Copy link
Copy Markdown
Member

AlenkaF commented Sep 18, 2025

Thank you for proposing a fix!
In principle the change looks ok to me. The comment did confuse me a bit (Special case host = "default" or "hdfs://default" as stated by #25324.). If I understand correctly this change is only dealing with host being default and not hdfs://default also?

I am not sure how to go about testing this. cc @pitrou

@AlenkaF AlenkaF changed the title Fix host handling for default HDFS URI (fix for #25324 and #47560 when used via from_uri()) [C++] Fix host handling for default HDFS URI Sep 18, 2025
@dsevilla
Copy link
Copy Markdown
Contributor Author

dsevilla commented Sep 18, 2025

Hello!

Thank you for proposing a fix! In principle the change looks ok to me. The comment did confuse me a bit (Special case host = "default" or "hdfs://default" as stated by #25324.). If I understand correctly this change is only dealing with host being default and not hdfs://default also?

I think the logic goes like this:

  • If the URI is hdfs://default:0/..., then, uri.host() is "default", and uri.scheme() is hdfs.
  • If the URI is default:0/..., then again uri.host() is still "default", but uri.scheme() is empty.

Then, in either case, I leave the variable host (that will hold the host plus optional scheme and will be passed down to the library) as just "default". This way libhdfs will see just default:0/....

If the host is not "default" in any case, the old behaviour (scheme://host:port) is maintained.

I am not sure how to go about testing this. cc @pitrou

I am not sure, either. I will investigate, but I remember a test for HadoopFileSystem using "default" as a host (that works as per the other previous patch). It can be added there. I'll search for it ASAP.

I'll look also why the CI is failing, because this change is so small that it is unlikely it is failing because of it.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Sep 18, 2025

I'm not sure we need to test this specifically. OTOH, the hdfs Crossbow-based CI jobs should not regress.

@dsevilla
Copy link
Copy Markdown
Contributor Author

@pitrou I've synched the branch just in case that precise moment the tests were failing. We'll see because as I said, the changes are so small that should not affect any CI build.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Sep 23, 2025

@github-actions crossbow submit hdfs

@github-actions
Copy link
Copy Markdown

Revision: 4561d70

Submitted crossbow builds: ursacomputing/crossbow @ actions-2f87af1fda

Task Status
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions

@dsevilla
Copy link
Copy Markdown
Contributor Author

OK, the tests pass all except for a timeout in the azure connection (lasted for 56 minutes).

@pitrou pitrou changed the title [C++] Fix host handling for default HDFS URI GH-47560: [C++] Fix host handling for default HDFS URI Sep 23, 2025
@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #47560 has been automatically assigned in GitHub to PR creator.

Add clarification comment on what "default" host means for libhdfs.

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 23, 2025
@dsevilla
Copy link
Copy Markdown
Contributor Author

Ouch. Comment length exceeded that expected by the linter :)

@pitrou
Copy link
Copy Markdown
Member

pitrou commented Sep 23, 2025

CI failures are unrelated, will merge.

@pitrou pitrou merged commit 7129321 into apache:main Sep 23, 2025
39 of 41 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Sep 23, 2025
@dsevilla dsevilla deleted the patch-3 branch September 23, 2025 22:45
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7129321.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Oct 15, 2025
…47458)

### Rationale for this change

In apache#25324 a fix is introduced for the python HadoopFileSystem, but it does not work if you use `from_uri()`, as it is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled as in the python case, as the whole "hdfs://default" is passed to the underlying hdfs library, that expect "default" to search in `$HADOOP_CONF_DIR/core-site.xml`.

### What changes are included in this PR?

Handle the `HadoopFileSystem.from_uri()` (or `FileSystem.from_uri()` when using `hdfs://default:xxx`) special HDFS URIs.

### Are these changes tested?

There are no specific tests for this feature, but existing HDFS CI jobs pass.

### Are there any user-facing changes?

Not exactly, but the documentation is honored for the `from_uri()` case.

* GitHub Issue: apache#47560

Lead-authored-by: Diego Sevilla Ruiz <dsevilla@um.es>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants