GH-47560: [C++] Fix host handling for default HDFS URI #47458
pitrou merged 4 commits into apache:main
…d via `from_uri()` In apache#25324 a fix was introduced for the Python HadoopFileSystem, but it does not work if you use `from_uri()`, because the URI is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled there as it is in the Python case: the whole "hdfs://default" is passed to the underlying HDFS library, which expects "default" in order to search in $HADOOP_CONF_DIR/core-site.xml.
|
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
|
Any work on this? The "default" value of the HDFS host does not work if used in
|
I added a new issue. Now this pull request fixes #47560. |
|
Thank you for proposing a fix! I am not sure how to go about testing this. cc @pitrou |
Hello!
I think the logic goes like this:
Then, in either case, I leave the variable
If the host is not "default", the old behaviour (scheme://host:port) is maintained.
I am not sure, either. I will investigate, but I remember a test for
I'll also look into why the CI is failing, because this change is so small that it is unlikely to be the cause.
|
I'm not sure we need to test this specifically. OTOH, the hdfs Crossbow-based CI jobs should not regress. |
|
@pitrou I've synced the branch, just in case the tests happened to be failing at that precise moment. We'll see; as I said, the changes are so small that they should not affect any CI build.
|
@github-actions crossbow submit hdfs |
|
Revision: 4561d70 Submitted crossbow builds: ursacomputing/crossbow @ actions-2f87af1fda
|
|
OK, all the tests pass except for a timeout in the Azure connection (it lasted 56 minutes).
|
|
Add clarification comment on what "default" host means for libhdfs. Co-authored-by: Antoine Pitrou <pitrou@free.fr>
|
Ouch. Comment length exceeded that expected by the linter :) |
|
CI failures are unrelated, will merge. |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 7129321. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
…47458)

### Rationale for this change

In apache#25324 a fix was introduced for the Python HadoopFileSystem, but it does not work if you use `from_uri()`, because the URI is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled there as it is in the Python case: the whole "hdfs://default" is passed to the underlying HDFS library, which expects "default" in order to search in `$HADOOP_CONF_DIR/core-site.xml`.

### What changes are included in this PR?

Handle the special HDFS URIs in `HadoopFileSystem.from_uri()` (or `FileSystem.from_uri()` when using `hdfs://default:xxx`).

### Are these changes tested?

There are no specific tests for this feature, but existing HDFS CI jobs pass.

### Are there any user-facing changes?

Not exactly, but the documented behaviour is now honored for the `from_uri()` case.

* GitHub Issue: apache#47560

Lead-authored-by: Diego Sevilla Ruiz <dsevilla@um.es>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change

In #25324 a fix was introduced for the Python HadoopFileSystem, but it does not work if you use `from_uri()`, because the URI is passed to the underlying C++ implementation of the options parsing. The "default" case is not handled there as it is in the Python case: the whole "hdfs://default" is passed to the underlying HDFS library, which expects "default" in order to search in `$HADOOP_CONF_DIR/core-site.xml`.

What changes are included in this PR?

Handle the special HDFS URIs in `HadoopFileSystem.from_uri()` (or `FileSystem.from_uri()` when using `hdfs://default:xxx`).

Are these changes tested?

There are no specific tests for this feature, but existing HDFS CI jobs pass.

Are there any user-facing changes?

Not exactly, but the documented behaviour is now honored for the `from_uri()` case.
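
For readers who want to see the host handling from the rationale in concrete terms, here is a minimal Python sketch. It is not the C++ code from this PR: the helper name `hdfs_endpoint_from_uri` is hypothetical, and returning port `0` for "no port given" is an assumption made only for illustration.

```python
from typing import Tuple
from urllib.parse import urlsplit


def hdfs_endpoint_from_uri(uri: str) -> Tuple[str, int]:
    """Hypothetical helper: compute the (host, port) pair to hand to the HDFS library.

    Illustrative only; this is not the actual Arrow implementation.
    """
    parts = urlsplit(uri)
    host = parts.hostname or ""
    # Assumption: 0 stands for "no port specified in the URI".
    port = parts.port or 0

    if host == "default":
        # Pass the bare "default" host, not "hdfs://default", so the HDFS
        # library can resolve it from $HADOOP_CONF_DIR/core-site.xml.
        return "default", port

    # Any other host keeps the old behaviour: scheme://host plus the port.
    return f"{parts.scheme}://{host}", port


if __name__ == "__main__":
    for example in ("hdfs://default", "hdfs://default:8020", "hdfs://namenode:8020/data"):
        print(example, "->", hdfs_endpoint_from_uri(example))
```

The only special case is the literal host "default"; every other URI goes through the unchanged scheme://host:port path, which is why the existing HDFS CI jobs are expected not to regress.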