YARN-11844: Support configuration of retry policy on GPU discovery #7857
Conversation
🎊 +1 overall
This message was automatically generated.
Closes apache#7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]>
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
final String msg = "Failed to execute GPU device detection script";
// The default max errors is 10. Verify that it keeps going for an 11th try.
for (int i = 0; i < 11; ++i) {
I changed this 11 to 15 and the test still doesn't fail for me; can you check once?
This test is covering the case where you disable the max errors by setting a negative value. To make this clearer, I dialed it up to 20 attempts, and I also added another test that sets the configuration to 11 and confirms it tries exactly 11 times.
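The semantics being tested can be sketched as follows. This is a hypothetical model of the retry cap, not the actual NodeManager code; the class and method names are invented for illustration. A negative max-errors value disables the cap entirely, while a non-negative value allows exactly that many failed attempts:

```java
// Hypothetical sketch of the retry cap under discussion (not the real
// NodeManager implementation): a negative configured max-errors value
// means "no limit", so discovery keeps being attempted on failure.
public class GpuDiscoveryRetrySketch {
    private final int maxErrors; // e.g. from the discovery-max-errors property
    private int errorCount = 0;

    GpuDiscoveryRetrySketch(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    /** Returns true if another discovery attempt is allowed. */
    boolean shouldAttempt() {
        // A negative configured value disables the error cap.
        return maxErrors < 0 || errorCount < maxErrors;
    }

    void recordError() {
        errorCount++;
    }

    public static void main(String[] args) {
        // With maxErrors = 11, exactly 11 failed attempts are allowed.
        GpuDiscoveryRetrySketch capped = new GpuDiscoveryRetrySketch(11);
        int attempts = 0;
        while (capped.shouldAttempt() && attempts < 100) {
            attempts++;
            capped.recordError();
        }
        System.out.println("capped attempts: " + attempts); // prints 11

        // With a negative value, only the test's own safety bound stops it.
        GpuDiscoveryRetrySketch uncapped = new GpuDiscoveryRetrySketch(-1);
        attempts = 0;
        while (uncapped.shouldAttempt() && attempts < 20) {
            attempts++;
            uncapped.recordError();
        }
        System.out.println("uncapped attempts: " + attempts); // prints 20
    }
}
```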
yarn.nodemanager.resource-plugins.gpu.discovery-max-errors.
</description>
<name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
<value>10000ms</value>
Any reason for not using 10s?
Sounds good, updated.
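The reason `10s` works as well as `10000ms` is that YARN duration properties accept time-unit suffixes, which Hadoop resolves via `Configuration.getTimeDuration`. A minimal standalone sketch of that kind of suffix handling (the helper below is illustrative only, not the actual Hadoop parsing code):

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch: "10s" and "10000ms" denote the same timeout once
// unit suffixes are resolved. This mimics, but does not reproduce, the
// parsing done by Hadoop's Configuration.getTimeDuration.
public class DurationParseSketch {
    static long parseMillis(String value) {
        String v = value.trim();
        if (v.endsWith("ms")) {
            return Long.parseLong(v.substring(0, v.length() - 2));
        } else if (v.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(
                Long.parseLong(v.substring(0, v.length() - 1)));
        }
        // Bare number: assume milliseconds for this sketch.
        return Long.parseLong(v);
    }

    public static void main(String[] args) {
        System.out.println(parseMillis("10000ms")); // prints 10000
        System.out.println(parseMillis("10s"));     // prints 10000
    }
}
```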
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
🎊 +1 overall
This message was automatically generated.
@ayushtkn , thanks for the review! All comments addressed, and +1 from Yetus. How does this look now?
LGTM
Closes #7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]> Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit 0f34922)
Closes #7857 Co-authored-by: Jayadeep Jayaraman <[email protected]> Reviewed-by: Ashutosh Gupta <[email protected]> Signed-off-by: Ayush Saxena <[email protected]> (cherry picked from commit 0f34922) (cherry picked from commit f2b69ba)
I committed to trunk and merged to branch-3.4 and branch-3.3 with some minor conflicts resolved. @ayushtkn and @hotcodemacha , thank you for the reviews.
Description of PR
The NodeManager invokes an external binary (e.g. `nvidia-smi`) to discover attached GPUs. Currently, there is a hard-coded 10-second timeout on execution of this binary and a hard-coded max error count of 10, beyond which the NodeManager stops attempting discovery. This change provides new configuration properties to control both the timeout and the max error count, which is useful in environments where there may be a delay in binding the GPU to the host. Defaults for the new properties are chosen to maintain the existing behavior.

Special thanks to @jjayadeep06 for co-authoring this patch.
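Based on the property names shown in the diff above, the new settings might be configured in yarn-site.xml roughly like this. This is a sketch: the default values shown (10s timeout, 10 max errors) come from the existing hard-coded behavior described above.

```xml
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.discovery-timeout</name>
  <value>10s</value>
</property>
<property>
  <name>yarn.nodemanager.resource-plugins.gpu.discovery-max-errors</name>
  <!-- A negative value disables the error cap entirely. -->
  <value>10</value>
</property>
```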
How was this patch tested?
New unit test and manual testing.
For code changes:
If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?