Skip to content

Conversation

liangyu-1
Copy link
Contributor

…on because of some concurrent execution.

Description of PR

As Described in MAPREDUCE-7508

When Spark scans for files, it uses the FileInputFormat.getSplits() method to split the file. The first step in getSplits is to retrieve the file's length. If the file length is not zero, the next step is to get the block locations array for that file. However, if the two upstream programs rapidly create and write to the same file (i.e., the file is overwritten or appended to almost simultaneously), a race condition may occur:

The file's length is already non-zero,

but calling getFileBlockLocations() returns an empty array because the file is being overwritten or is not yet fully written.

When this happens, subsequent logic in getSplits (such as accessing the last element of the block locations array) will throw an ArrayIndexOutOfBoundsException because the block locations array is unexpectedly empty.

How was this patch tested?

I rebuild the project and ran on our cluster, spark did not throw Execptions.

For code changes:

If Array blkLocations is empty, it will continue to next iteration, so that it will now find the the last blockLocation of this file.

  • Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@yangjiandan
Copy link
Contributor

yangjiandan commented Aug 8, 2025

LGTM!
The submitted patch clearly addresses a rare but potentially hazardous issue where FileInputFormat could throw an ArrayIndexOutOfBoundsException under concurrent execution. It's a focused and well-implemented fix and seems necessary to improve the stability of the HDFS component. @slfan1989 Could you help take a look at this PR as well?

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 44m 46s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 39s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 43s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 34s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 26s trunk passed
+1 💚 shadedclient 41m 19s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 35s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 31s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 31s the patch passed
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 20s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 23s the patch passed
+1 💚 shadedclient 41m 10s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 50s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
148m 39s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/artifact/out/Dockerfile
GITHUB PR #7859
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 0d28eceec116 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / d2ba001
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/testReport/
Max. process+thread count 1078 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@@ -369,6 +369,9 @@ public InputSplit[] getSplits(JobConf job, int numSplits)
} else {
blkLocations = fs.getFileBlockLocations(file, 0, length);
}
if (blkLocations.length == 0){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contributions. Just wonder why it will meet length != 0 but blkLocations.length == 0, some corner case that lead inconsistent metadata of NameNode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Hexiaoqiao thanks for review.

It is a rare corner case, only happens when using spark streaming monitors a path where the upstream system sometimes starts two identical tasks that attempt to create and write to the same HDFS file simultaneously. This can lead to conflicts where a file is created and written to twice in quick succession.

When Spark scans for files, it uses the FileInputFormat.getSplits() method to split the file. The first step in getSplits is to retrieve the file's length. If the file length is not zero, the next step is to get the block locations array for that file. However, if the two upstream programs rapidly create and write to the same file (i.e., the file is overwritten or appended to almost simultaneously), a race condition may occur:

The file's length is already non-zero, but calling getFileBlockLocations() returns an empty array because the file is being overwritten or is not yet fully written.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Got it. Make sense to me. +1 to check in.
BTW, the root cause here is invoke listStatus to get file status and invoke another interface getFileBlockLocations to get block location, but file has changed between this steps, right? If it is true, is it proper to use blkLocations only as condition at L364 rather than file.length which could be not the correct result here? Thanks again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Got it. Make sense to me. +1 to check in. BTW, the root cause here is invoke listStatus to get file status and invoke another interface getFileBlockLocations to get block location, but file has changed between this steps, right? If it is true, is it proper to use blkLocations only as condition at L364 rather than file.length which could be not the correct result here? Thanks again.

Thanks for your review. It is a good solution to use blkLocations only as condition at L364, if blkLocations array is empty, file.length is also empty, this will not effect the later opretions.

@liangyu-1 liangyu-1 requested a review from Hexiaoqiao August 22, 2025 09:54
@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 21m 18s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 43m 31s trunk passed
+1 💚 compile 0m 44s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 38s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 44s trunk passed
+1 💚 mvnsite 0m 44s trunk passed
+1 💚 javadoc 0m 34s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 24s trunk passed
+1 💚 shadedclient 41m 59s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 35s the patch passed
+1 💚 compile 0m 31s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 31s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 31s /results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: The patch generated 3 new + 73 unchanged - 3 fixed = 76 total (was 76)
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 20s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 22s the patch passed
+1 💚 shadedclient 41m 0s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 53s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
168m 18s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/3/artifact/out/Dockerfile
GITHUB PR #7859
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux db1e18bd8a74 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 75ca4e1
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/3/testReport/
Max. process+thread count 1076 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/3/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@Hexiaoqiao
Copy link
Contributor

LGTM. Please try to run CI again and better to get happy result. Thanks @liangyu-1

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 21m 3s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 22s trunk passed
+1 💚 compile 0m 45s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 37s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 43s trunk passed
+1 💚 mvnsite 0m 46s trunk passed
+1 💚 javadoc 0m 33s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 26s trunk passed
+1 💚 shadedclient 41m 13s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 31s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 35s the patch passed
+1 💚 compile 0m 30s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 30s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 31s /results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: The patch generated 3 new + 73 unchanged - 3 fixed = 76 total (was 76)
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 20s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 21s the patch passed
+1 💚 shadedclient 41m 21s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 50s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
170m 26s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/4/artifact/out/Dockerfile
GITHUB PR #7859
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 8b1e9b121e90 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2d67f87
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/4/testReport/
Max. process+thread count 1063 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/4/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus
Copy link

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 50s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 5s trunk passed
+1 💚 compile 0m 43s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 compile 0m 39s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 checkstyle 0m 44s trunk passed
+1 💚 mvnsite 0m 45s trunk passed
+1 💚 javadoc 0m 34s trunk passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 28s trunk passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 24s trunk passed
+1 💚 shadedclient 41m 12s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 32s the patch passed
+1 💚 compile 0m 35s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javac 0m 35s the patch passed
+1 💚 compile 0m 30s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 javac 0m 30s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 32s /results-checkstyle-hadoop-mapreduce-project_hadoop-mapreduce-client_hadoop-mapreduce-client-core.txt hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core: The patch generated 3 new + 73 unchanged - 3 fixed = 76 total (was 76)
+1 💚 mvnsite 0m 34s the patch passed
+1 💚 javadoc 0m 21s the patch passed with JDK Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04
+1 💚 javadoc 0m 20s the patch passed with JDK Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
+1 💚 spotbugs 1m 25s the patch passed
+1 💚 shadedclient 41m 5s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 8m 56s hadoop-mapreduce-client-core in the patch passed.
+1 💚 asflicense 0m 36s The patch does not generate ASF License warnings.
149m 55s
Subsystem Report/Notes
Docker ClientAPI=1.51 ServerAPI=1.51 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/5/artifact/out/Dockerfile
GITHUB PR #7859
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 79afc0f8917a 5.15.0-144-generic #157-Ubuntu SMP Mon Jun 16 07:33:10 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 49dc7dd
Default Java Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.27+6-post-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_452-8u452-gaus1-0ubuntu120.04-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/5/testReport/
Max. process+thread count 1107 (vs. ulimit of 5500)
modules C: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core U: hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7859/5/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@liangyu-1
Copy link
Contributor Author

@Hexiaoqiao
hi, the failure occurred because I didn't add new UT for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants