Fix Windows Threading Issues #385

Merged
aniketmaurya merged 33 commits into Lightning-AI:main from FrsECM:bugfix/windows_multiple_workers
Apr 30, 2025
@FrsECM
Contributor

@FrsECM FrsECM commented Dec 4, 2024

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)

⚠️ How does this PR impact the user? ⚠️
As a user, I need to serve a model with multiple workers per device on a Windows machine.

What does this PR do?

This PR proposes a fix for the bug reported in issue #384.

Uvicorn reference: Source

For example, with this simple server:

import litserve as ls


class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Stand-in "model": squares its input.
        self.model1 = lambda x: x**2

    def decode_request(self, request):
        # Extract the input value from the request payload.
        return request["input"]

    def predict(self, x):
        squared = self.model1(x)
        output = squared
        return {"output": output}

    def encode_response(self, output):
        return {"output": output}


# (STEP 2) - START THE SERVER
if __name__ == "__main__":
    # scale with advanced features (batching, GPUs, etc...)
    # workers_per_device=2 is the setting that triggers the error on Windows.
    server = ls.LitServer(SimpleLitAPI(), accelerator="auto", max_batch_size=1, workers_per_device=2)
    server.run(port=8000)

Below is the server output before and after the PR:

Before the PR
(litserve_win) PS > python .\dummy_server.py
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\site-packages\litserve\server.py:475: UserWarning: Windows does not support forking. Using threads api_server_worker_type will be set to 'thread'
  warnings.warn(
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
Swagger UI is available at http://0.0.0.0:8000/docs
INFO:     Started server process [35312]
INFO:     Started server process [35312]
INFO:     Waiting for application startup.
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Application startup complete.
Accept failed on a socket
socket: <asyncio.TransportSocket fd=608, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('0.0.0.0', 8000)>
Traceback (most recent call last):
  File "C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\asyncio\proactor_events.py", line 841, in loop
    f = self._proactor.accept(sock)
  File "C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\asyncio\windows_events.py", line 563, in accept
    self._register_with_iocp(listener)
  File "C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\asyncio\windows_events.py", line 732, in _register_with_iocp      
    _overlapped.CreateIoCompletionPort(obj.fileno(), self._iocp, 0, 0)
OSError: [WinError 87] The parameter is incorrect
Task exception was never retrieved
future: <Task finished name='Task-7' coro=<IocpProactor.accept.<locals>.accept_coro() done, defined at C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\asyncio\windows_events.py:577> exception=OSError(22, 'The I/O operation has been aborted because of either a thread exit or an application request', None, 995, None)>
Traceback (most recent call last):
  File "C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\asyncio\windows_events.py", line 580, in accept_coro
    await future
OSError: [WinError 995] The I/O operation has been aborted because of either a thread exit or an application request
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
Setup complete for worker 1.
Setup complete for worker 0.
After the PR
(litserve_win) PS > python .\dummy_server.py
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
C:\Users\F296849\AppData\Local\miniforge3\envs\iris2_win\lib\site-packages\litserve\server.py:475: UserWarning: Windows does not support forking. Using threads api_server_worker_type will be set to 'thread'
  warnings.warn(
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
Swagger UI is available at http://0.0.0.0:8000/docs
INFO:     Started server process [24816]
INFO:     Waiting for application startup.
INFO:     Started server process [24816]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Application startup complete.
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
Setup complete for worker 0.
uvloop is not installed. Falling back to the default asyncio event loop. Please install uvloop for better performance using `pip install uvloop`.
uvloop is not installed. Falling back to the default asyncio event loop.
Setup complete for worker 1.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov

codecov bot commented Dec 4, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 10 lines in your changes missing coverage. Please review.

Project coverage is 88%. Comparing base (7e01984) to head (2daba46).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #385   +/-   ##
===================================
- Coverage    89%    88%   -0%     
===================================
  Files        37     37           
  Lines      2164   2184   +20     
===================================
+ Hits       1918   1928   +10     
- Misses      246    256   +10     
@aniketmaurya
Collaborator

hi @FrsECM, thank you so much for the PR! Could you also add additional information about why setting config.workers fixes this issue for reference?

@FrsECM
Contributor Author

FrsECM commented Dec 4, 2024

> hi @FrsECM, thank you so much for the PR! Could you also add additional information about why setting config.workers fixes this issue for reference?

Sure.
It makes sure the code goes through this portion of uvicorn:
Kludex/uvicorn@858f1c5

By default, the number of workers is set to 1, and uvicorn crashes on Windows when we have multiple workers per device.
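
For reference, here is a minimal standard-library sketch of what sharing one listening socket between several servers can look like. It rests on an assumption that is not confirmed in this thread: that the linked uvicorn code path gives each server its own handle to the listening socket on Windows once more than one worker is configured. `clone_listener` is a hypothetical helper name, not a uvicorn or LitServe API. `socket.share` and `socket.fromshare` are Windows-only standard-library calls, so the sketch falls back to `dup()` on other platforms.

```python
import os
import socket
import sys


def clone_listener(sock: socket.socket) -> socket.socket:
    """Return an independent handle to the same listening socket (hypothetical helper)."""
    if sys.platform == "win32":
        # Windows-only: serialise the socket for this process, then rebuild it.
        return socket.fromshare(sock.share(os.getpid()))
    # Elsewhere, duplicating the descriptor is enough for this illustration.
    return sock.dup()
```

Each server thread would then accept connections on its own clone instead of on one shared socket object.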

@FrsECM
Contributor Author

FrsECM commented Dec 5, 2024

I have to investigate a little more on another issue that causes [WinError 10022].

It seems that in some cases (not completely reproducible), the socket is not ready to listen when we start the uvicorn servers.
A workaround is to force listening before setting up the uvicorn servers. I don't know why it behaves like that.
Let me know if you have any ideas.
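
The workaround described above, putting the socket into the listening state before any server is started on it, can be sketched with the standard library alone. `make_listening_socket` and its parameters are hypothetical names used for illustration; how LitServe actually creates and hands over its socket is not shown in this thread. WinError 10022 is the Windows "invalid argument" socket error, which is consistent with accepting on a socket that is not yet listening, although the root cause was left open here.

```python
import socket


def make_listening_socket(host="127.0.0.1", port=0, backlog=128):
    """Bind a TCP socket and start listening before any server uses it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))  # port=0 lets the OS pick a free port
        sock.listen(backlog)     # explicit listen(): accept() is valid from here on
    except OSError:
        sock.close()
        raise
    return sock
```

The returned socket can then be passed to each server, so none of them depends on another component having called `listen()` first.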

@FrsECM
Contributor Author

FrsECM commented Dec 6, 2024

@aniketmaurya did you have a look at the PR?

It seems to fix my problem on Windows.

It doesn't fix issue #372, but it makes it possible to run multiple workers on Windows.

Collaborator

@aniketmaurya aniketmaurya left a comment

LGTM! thanks for creating the fix 🚀

@FrsECM
Contributor Author

FrsECM commented Dec 8, 2024

@aniketmaurya I have an idea for #372.

I discovered that on Windows, unlike on Linux, the inference workers are stopped first, before uvicorn's.

A fix is to join on the inference workers on Windows and then cleanly signal uvicorn so that its threads end properly.
To do this, I need to keep the uvicorn servers in a class variable, because currently the method _start_server() returns only the workers.

I renamed some variables to be a little more explicit about their content.
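
The shutdown order described above can be sketched as follows. This is an illustration under stated assumptions, not the code from this PR: `StubServer` and `ServerManager` are hypothetical names, and the `should_exit` flag only mimics the kind of exit flag a server object can expose. The point is that the manager keeps a reference to every server, not just to its thread, so it can signal each one after the inference workers have finished.

```python
import threading
import time


class StubServer:
    """Stand-in for a server object with an exit flag (hypothetical)."""

    def __init__(self):
        self.should_exit = False

    def run(self):
        # Poll the flag, as a serving loop would between requests.
        while not self.should_exit:
            time.sleep(0.01)


class ServerManager:
    """Keeps the servers themselves, so they can be signalled at shutdown."""

    def __init__(self):
        self.servers = {}  # maps each server to the thread running it

    def start(self, count):
        for _ in range(count):
            server = StubServer()
            thread = threading.Thread(target=server.run, daemon=True)
            thread.start()
            self.servers[server] = thread

    def shutdown(self, inference_workers, timeout=5.0):
        # 1. Wait for the inference workers, which stop first.
        for worker in inference_workers:
            worker.join(timeout)
        # 2. Ask every server to leave its loop.
        for server in self.servers:
            server.should_exit = True
        # 3. Wait for the server threads to finish.
        for thread in self.servers.values():
            thread.join(timeout)
```

Returning only the threads from the start-up step would make step 2 impossible, which is why the servers are stored on the manager.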

@FrsECM FrsECM changed the title from "Fix bug on windows with uvicorn when multiple workers." to "Fix Windows Threading Issues" Dec 8, 2024
@aniketmaurya aniketmaurya self-requested a review December 8, 2024 23:39
@aniketmaurya
Collaborator

@FrsECM it seems like the Windows tests are stuck and have timed out.

@FrsECM
Contributor Author

FrsECM commented Dec 12, 2024 via email

@FrsECM
Contributor Author

FrsECM commented Dec 13, 2024

@aniketmaurya I did a first try and suspected httpx>=0.28.0, but I was not up to date.

On Python 3.10.15, I have no issues on my computer; all the tests run.

(litserve) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve\lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.10.15, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.6.2.post1, asyncio-0.25.0, cov-6.0.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 169 items

tests\e2e\test_e2e.py .............                                                                                                                    [  7%]
tests\test_auth.py ....                                                                                                                                [ 10%]
tests\test_batch.py ...........                                                                                                                        [ 16%]
tests\test_callbacks.py ....                                                                                                                           [ 18%]
tests\test_cli.py ..                                                                                                                                   [ 20%] 
tests\test_compression.py .                                                                                                                            [ 20%]
tests\test_connector.py s..ssssssss.                                                                                                                   [ 27%] 
tests\test_docker_builder.py ..                                                                                                                        [ 28%]
tests\test_examples.py ..........                                                                                                                      [ 34%]
tests\test_form.py ...                                                                                                                                 [ 36%]
tests\test_lit_server.py .........sssss..ss...........                                                                                                 [ 53%]
tests\test_litapi.py ..................                                                                                                                [ 64%]
tests\test_logger.py ........                                                                                                                          [ 69%]
tests\test_logging.py ...                                                                                                                              [ 71%] 
tests\test_loops.py .............                                                                                                                      [ 78%]
tests\test_middlewares.py ...                                                                                                                          [ 80%]
tests\test_pydantic.py .                                                                                                                               [ 81%]
tests\test_readme.py s                                                                                                                                 [ 81%] 
tests\test_schema.py ..                                                                                                                                [ 82%]
tests\test_simple.py .......                                                                                                                           [ 86%]
tests\test_specs.py ..................                                                                                                                 [ 97%]
tests\test_torch.py .s                                                                                                                                 [ 98%]
tests\test_utils.py ..                                                                                                                                 [100%]
================================================= 151 passed, 18 skipped, 50 warnings in 204.37s (0:03:24) ==================================================

On Python 3.11.11, I have no issues on my computer either; all the tests run.

(litserve3.11) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve3.11\Lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.7.0, asyncio-0.25.0, cov-6.0.0, retry-1.6.3
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 169 items

tests\e2e\test_e2e.py .............                                                                                                                    [  7%]
tests\test_auth.py ....                                                                                                                                [ 10%]
tests\test_batch.py ...........                                                                                                                        [ 16%]
tests\test_callbacks.py ....                                                                                                                           [ 18%]
tests\test_cli.py ..                                                                                                                                   [ 20%]
tests\test_compression.py .                                                                                                                            [ 20%]
tests\test_connector.py s..ssssssss.                                                                                                                   [ 27%] 
tests\test_docker_builder.py ..                                                                                                                        [ 28%] 
tests\test_examples.py ..........                                                                                                                      [ 34%]
tests\test_form.py ...                                                                                                                                 [ 36%]
tests\test_lit_server.py .........sssss..ss...........                                                                                                 [ 53%]
tests\test_litapi.py ..................                                                                                                                [ 64%]
tests\test_logger.py ........                                                                                                                          [ 69%]
tests\test_logging.py ...                                                                                                                              [ 71%] 
tests\test_loops.py .............                                                                                                                      [ 78%]
tests\test_middlewares.py ...                                                                                                                          [ 80%]
tests\test_pydantic.py .                                                                                                                               [ 81%]
tests\test_readme.py s                                                                                                                                 [ 81%] 
tests\test_schema.py ..                                                                                                                                [ 82%]
tests\test_simple.py .......                                                                                                                           [ 86%]
tests\test_specs.py ..................                                                                                                                 [ 97%]
tests\test_torch.py .s                                                                                                                                 [ 98%]
tests\test_utils.py ..                                                                                                                                 [100%] 
================================================= 151 passed, 18 skipped, 48 warnings in 205.77s (0:03:25) ==================================================

@FrsECM
Contributor Author

FrsECM commented Dec 13, 2024

@aniketmaurya I tried increasing the timeout. Using threads instead of processes, together with the absence of uvloop on Windows, makes the tests slower. It may also depend on the runner.
But I need your approval to rerun the CI.

@FrsECM FrsECM requested a review from Borda December 13, 2024 09:48
@aniketmaurya
Collaborator

Hi @FrsECM, Happy New Year! How is it going here? Please let me know if you need any help.

@FrsECM
Contributor Author

FrsECM commented Jan 6, 2025 via email

@RaiaN

RaiaN commented Jan 23, 2025

This is still not fixed in latest litServe python package. Please fix it? It is such a trivial thing to fix for any dev. I have 100% reproducible example.

@FrsECM FrsECM requested a review from tchaton as a code owner January 24, 2025 07:47
@FrsECM
Contributor Author

FrsECM commented Jan 24, 2025

> This is still not fixed in latest litServe python package. Please fix it? It is such a trivial thing to fix for any dev. I have 100% reproducible example.

The fix is implemented... I have just been asked to check the CI, which was already crashing before I started implementing this PR, and I have no control over it. I suspect something with the VMs.

I hope to get some support from the maintainers on that.

@FrsECM
Contributor Author

FrsECM commented Feb 5, 2025

> This is still not fixed in latest litServe python package. Please fix it? It is such a trivial thing to fix for any dev. I have 100% reproducible example.

> The fix is implemented... I have just been asked to check the CI, which was already crashing before I started implementing this PR, and I have no control over it. I suspect something with the VMs.
>
> I hope to get some support from the maintainers on that.

@aniketmaurya, @Borda could you have a look at the stuck CI/CD?
As I said, the issue was not caused by a commit from my PR; it started before my first commit.

2bbed42

For me, the fix is to:

  • fix the CI to test on Python 3.10 like before
  • change the timeout

Thanks,

@FrsECM
Contributor Author

FrsECM commented Feb 28, 2025

@bhimrazy, is there a way to allow individual contributors to run pending checks manually?
Thanks a lot.

@bhimrazy
Collaborator

> @bhimrazy, is it possible to have a way to allow individual contributors to run pending checks manually? Thanks a lot.

Hi @FrsECM, I'm not sure about the access permissions on the CI, but I'll look into this issue over this weekend. I may need to test it in a separate PR since I don't have edit access to PRs either.

Thanks for your patience! 😊

@Borda Borda mentioned this pull request Mar 4, 2025
@bhimrazy
Collaborator

bhimrazy commented Mar 31, 2025

> > @bhimrazy, is it possible to have a way to allow individual contributors to run pending checks manually? Thanks a lot.
>
> Hi @FrsECM, I'm not sure about the access permissions on the CI, but I'll look into this issue over this weekend. I may need to test it in a separate PR since I don't have edit access to PRs either.
>
> Thanks for your patience! 😊

Sorry, guys! I did start working on this issue but had to pause since it requires a windows device, and mine is currently under repair due to some part replacements. Hoping to get back to this issue soon!

Apologies again, for the late update—I thought I had already mentioned it (my bad)!

That said, if this is a priority and someone else wants to take it up, please feel free to go ahead.

@FrsECM
Contributor Author

FrsECM commented Apr 8, 2025

I just merged the latest main branch. Can someone run the pipelines?
Thanks

@FrsECM
Contributor Author

FrsECM commented Apr 29, 2025

I created a Windows VM to understand the issue a little better...

I propose a new way to handle KeyboardInterrupt on Windows that should have less impact on the tests.
Instead of joining on the different processes, which can lock up the tests on Windows, I now handle the KeyboardInterrupt exception in the loops:

import os
import signal


# Record the main (server) process PID when the loop is created:
class LitLoop(_BaseLoop):
    def __init__(self):
        self._context = {}
        self.server_pid = os.getpid()

    # .... (rest of the class elided)

    # Ask the OS to terminate the server process on keyboard interruption:
    def kill(self):
        # self._lock is assumed to be defined in the elided part of the class.
        with self._lock:
            try:
                print(f"Stop Server Requested - Kill parent pid [{self.server_pid}] from [{os.getpid()}]")
                os.kill(self.server_pid, signal.SIGTERM)
            except PermissionError:
                # Access denied because the PID was already killed.
                return

Because on Windows and macOS the Ctrl+C is caught by the inference worker, we can just add an exception handler for that case:

        ...
        except KeyboardInterrupt:
            print(f"Keyboard Interruption - Kill server [{self.server_pid}]")
            self.kill()
            return
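
The two fragments above can be combined into a small self-contained sketch. It is an illustration, not the code merged in this PR: `ServerStopper` and `worker_loop` are hypothetical names, the lock from the original snippet is left out for brevity, and the `kill` callable is injectable only so that the behavior can be exercised without actually terminating a process.

```python
import os
import signal


class ServerStopper:
    """Remembers the server PID and asks the OS to terminate it (hypothetical)."""

    def __init__(self, kill=os.kill):
        # Recorded where the object is created, i.e. in the server process.
        self.server_pid = os.getpid()
        self._kill = kill

    def kill(self):
        try:
            self._kill(self.server_pid, signal.SIGTERM)
        except PermissionError:
            # Access denied: the process was already killed.
            return False
        return True


def worker_loop(step, stopper):
    """Run `step` repeatedly; on Ctrl+C, stop the server and exit the loop."""
    try:
        while True:
            step()
    except KeyboardInterrupt:
        stopper.kill()
        return
```

With the default `kill=os.kill`, a Ctrl+C received by a worker ends with a termination request sent to the recorded server PID.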

Killing the server with Ctrl+C now works on macOS and Windows:
image

Could you please run the CI/CD to check whether it has a positive effect on the tests?
@Borda, @justusschock, @aniketmaurya

Thanks!

Collaborator

@aniketmaurya aniketmaurya left a comment

@FrsECM LGTM! Thank you for updating the PR and your patience on this.

@aniketmaurya aniketmaurya linked an issue Apr 30, 2025 that may be closed by this pull request
@aniketmaurya
Collaborator

all good @FrsECM! Merging this after the CI turns green 🚀

@aniketmaurya
Collaborator

merging this since tested and worked on GPUs

image

@aniketmaurya aniketmaurya merged commit c7d8d2f into Lightning-AI:main Apr 30, 2025
20 of 21 checks passed
@aniketmaurya aniketmaurya linked an issue Apr 30, 2025 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

[Windows] OSError when multiple workers
[Windows] Server Hangs while closing

6 participants