Contributor

@ldematte ldematte commented Oct 31, 2025

This PR attempts to definitively fix the Windows services tests.

We used procrun as a service control tool; however, procrun as a service suffers from a race condition, and procrun as a tool seems to suffer from it too.
Furthermore, the implementation of stop in procrun is, at the very least, "strange": it seems to wait for the service to stop, but it is not guaranteed to wait until the end, and it can return an error.
In general, this is true for most service control tools: they return while the service is still in the STOP_PENDING state. We actually have to "busy wait" for the service to be really STOPPED.

This PR adds the busy wait and slightly adjusts how we stop the service. It also moves to the standard Windows sc.exe tool for service control operations (start, stop and delete) in tests.
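
For readers who want a concrete picture of the busy wait, here is a minimal sketch. It is not the code from this PR: the class `ServiceStopWaiter` and its methods are hypothetical names, and it assumes that `sc.exe query <service>` prints an English `STATE : <number>  <NAME>` line. The query is passed in as a `Supplier` so the polling logic can be exercised without a Windows machine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PR: polls a service until it reports STOPPED.
final class ServiceStopWaiter {

    // Assumed shape of the STATE line in `sc.exe query` output, e.g. "STATE : 3  STOP_PENDING".
    private static final Pattern STATE_LINE = Pattern.compile("STATE\\s*:\\s*\\d+\\s+(\\w+)");

    // Extracts the state name (STOPPED, STOP_PENDING, RUNNING, ...) if a STATE line is present.
    static Optional<String> parseState(String scQueryOutput) {
        Matcher m = STATE_LINE.matcher(scQueryOutput);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Runs `sc.exe query <serviceId>` and returns its combined output (Windows only).
    static String runScQuery(String serviceId) {
        try {
            Process p = new ProcessBuilder("sc.exe", "query", serviceId).redirectErrorStream(true).start();
            String out = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
            p.waitFor();
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Polls until the reported state is STOPPED; returns false on timeout or interruption.
    static boolean waitForStopped(Supplier<String> query, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (parseState(query.get()).filter("STOPPED"::equals).isPresent()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

On Windows, a caller would issue the stop request first and then call something like `ServiceStopWaiter.waitForStopped(() -> ServiceStopWaiter.runScQuery(id), timeout, Duration.ofMillis(500))`, treating a `false` result as a test failure.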

Fixes: #113177
Fixes: #113160
Fixes: #113219
Fixes: #113313

@ldematte ldematte added >test Issues or PRs that are addressing/adding tests :Core/Infra/CLI CLI utilities, scripts, and infrastructure test-windows Trigger CI checks on Windows labels Oct 31, 2025
@elasticsearchmachine elasticsearchmachine added v9.3.0 Team:Core/Infra Meta label for core/infra team labels Oct 31, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

assertThat(result.stdout(), containsString("Status : " + status));
}

private void waitForStop(String id, Duration timeout) {
Member

Should we consider replacing the service control calls in the windows service cli with sc.exe calls (and this wait in stop)? Then these changes could benefit users too, not just tests.

Contributor Author

I raised this question on the Slack thread too :)
I think this is a good idea, but I'd prefer to be sure.
My proposal is to make these changes in tests, then let them run for a week or so. If the issue is definitely fixed, we can either use sc.exe in most of the ProcrunCommands, or consider replacing them with direct calls to Win32 functions (should be rather easy, and would make the code for the wait a lot nicer -- no external script needed).

@ldematte
Contributor Author

This is the race condition in procrun:

In a normal execution, the thread serving the stop request and the thread doing/waiting for the JNI "stop" operation generate a reasonable sequence of SetServiceStatus calls:

[2025-10-31 14:14:16] [debug] ( javajni.c:1169) [ 3748] apxJavaWait -> WaitForSingleObject (0x0000000000000274, 60000 milliseconds) (-1=INFINITE)...
[2025-10-31 14:14:26] [debug] ( javajni.c:1094) [ 3244] Java worker thread finished org/elasticsearch/launcher/CliToolLauncher:close with status = 0
[2025-10-31 14:14:26] [debug] ( prunsrv.c:1184) [ 1036] reportServiceStatusE: dwCurrentState = 3 (SERVICE_STOP_PENDING), dwWin32ExitCode = 0, dwWaitHint = 60000 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:14:26] [debug] ( javajni.c:1172) [ 3748] apxJavaWait <- WaitForSingleObject (0x0000000000000274, 60000 milliseconds) = 0
[2025-10-31 14:14:26] [debug] ( prunsrv.c:1347) [ 3748] Java JNI stop worker finished.
[2025-10-31 14:14:26] [debug] ( prunsrv.c:1443) [ 3748] Waited 10563 timeout (60000)
[2025-10-31 14:14:26] [debug] ( prunsrv.c:1184) [ 3748] reportServiceStatusE: dwCurrentState = 3 (SERVICE_STOP_PENDING), dwWin32ExitCode = 0, dwWaitHint = 49437 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:14:26] [debug] ( prunsrv.c:2020) [ 1036] JVM destroyed.
[2025-10-31 14:14:26] [debug] ( prunsrv.c:1184) [ 1036] reportServiceStatusE: dwCurrentState = 1 (SERVICE_STOPPED), dwWin32ExitCode = 0, dwWaitHint = 0 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:14:26] [info]  ( prunsrv.c:2060) [11432] Run service finished.
[2025-10-31 14:14:26] [info]  ( prunsrv.c:1482) [ 3748] Service stop thread completed.
[2025-10-31 14:14:26] [info]  ( prunsrv.c:2320) [11432] Apache Commons Daemon procrun finished.

In many executions, however, the thread handling the stop request overlaps with the other threads and tries to set the status "back":

[2025-10-31 14:14:57] [debug] ( javajni.c:1169) [10748] apxJavaWait -> WaitForSingleObject (0x00000000000002AC, 60000 milliseconds) (-1=INFINITE)...
[2025-10-31 14:15:07] [debug] ( javajni.c:1094) [12720] Java worker thread finished org/elasticsearch/launcher/CliToolLauncher:close with status = 0
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1184) [ 2832] reportServiceStatusE: dwCurrentState = 3 (SERVICE_STOP_PENDING), dwWin32ExitCode = 0, dwWaitHint = 60000 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:15:07] [debug] ( javajni.c:1172) [10748] apxJavaWait <- WaitForSingleObject (0x00000000000002AC, 60000 milliseconds) = 0
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1347) [10748] Java JNI stop worker finished.
[2025-10-31 14:15:07] [debug] ( prunsrv.c:2020) [ 2832] JVM destroyed.
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1184) [ 2832] reportServiceStatusE: dwCurrentState = 1 (SERVICE_STOPPED), dwWin32ExitCode = 0, dwWaitHint = 0 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1443) [10748] Waited 10735 timeout (60000)
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1184) [10748] reportServiceStatusE: dwCurrentState = 3 (SERVICE_STOP_PENDING), dwWin32ExitCode = 0, dwWaitHint = 49265 milliseconds, dwServiceSpecificExitCode = 0.
[2025-10-31 14:15:07] [info]  ( prunsrv.c:2060) [ 4192] Run service finished.
[2025-10-31 14:15:07] [error] ( prunsrv.c:1207) [10748] Failed to set service status.
[2025-10-31 14:15:07] [error] ( prunsrv.c:1207) [10748] The handle is invalid.
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1461) [10748] Waiting for worker to die naturally...
[2025-10-31 14:15:07] [debug] ( prunsrv.c:1472) [10748] Worker finished gracefully in 0 milliseconds.
[2025-10-31 14:15:07] [info]  ( prunsrv.c:1482) [10748] Service stop thread completed.
[2025-10-31 14:15:07] [info]  ( prunsrv.c:2320) [ 4192] Apache Commons Daemon procrun finished.

Notice the SERVICE_STOP_PENDING after SERVICE_STOPPED.

This does not seem to have a visible effect: Windows recognizes that the service has stopped and invalidates the handle to it. However, if it turns out that it does have bad effects, we can fix it (feel free to ping me).
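
To make the interleaving easier to reason about, here is a toy Java model of one possible guard. It is only an illustration under stated assumptions: procrun itself is C, `ServiceStatusReporter` is a hypothetical name, and nothing here reflects how procrun is or would actually be patched. The idea is simply that once STOPPED has been reported, any later status report is dropped instead of being sent.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (hypothetical, not procrun code): serializes status reports and
// ignores anything that arrives after STOPPED has been reported.
final class ServiceStatusReporter {

    enum State { RUNNING, STOP_PENDING, STOPPED }

    private final List<State> reported = new ArrayList<>();
    private boolean stopped;

    // Returns true if the status was recorded, false if it was dropped as a late update.
    synchronized boolean report(State state) {
        if (stopped) {
            return false; // STOPPED is terminal; a late STOP_PENDING must not follow it
        }
        reported.add(state);
        if (state == State.STOPPED) {
            stopped = true;
        }
        return true;
    }

    // Snapshot of the statuses that were actually recorded, in order.
    synchronized List<State> reported() {
        return List.copyOf(reported);
    }
}
```

Replaying the second log against this model, the sequence STOP_PENDING, STOPPED, STOP_PENDING would record only the first two reports, and the third call would return `false` rather than attempting to update the status through an invalidated handle.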

Member

@rjernst rjernst left a comment

LGTM

@ldematte ldematte enabled auto-merge (squash) November 3, 2025 10:50
@ldematte ldematte merged commit 45e38c1 into elastic:main Nov 4, 2025
40 checks passed
