Fix Windows Threading Issues #385
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #385 +/- ##
===================================
- Coverage 89% 88% -0%
===================================
Files 37 37
Lines 2164 2184 +20
===================================
+ Hits 1918 1928 +10
- Misses 246 256 +10 🚀 New features to boost your workflow:
|
|
hi @FrsECM, thank you so much for the PR! Could you also add additional information about why setting |
Sure. By default, the number of worker is set to 1 and it crashes when we have multiples worker per devices on windows. |
|
I have to investigate a little more on another issue that cause [WinError 10022]. It seems that in some case (not completely reproductible), the socket is not ready to listen when we start uvicorn servers. |
|
@aniketmaurya did you have a look on the PR ? It seems to fix my problem on windows. It doesn't fix issue number #372 but it allows to make multiple worker on windows. |
aniketmaurya
left a comment
There was a problem hiding this comment.
LGTM! thanks for creating the fix 🚀
|
@aniketmaurya i got an idea for #372. Unlike on Linux, i discovered that inference workers were stopped first instead of uvicorn's. A fix is to join on inference worker on windows and to cleanly give a signal to uvicorn in order to end threads properly. I renamed some variables to be a little more explicit about their content. |
|
@FrsECM it seems like the Windows tests are stuck and have timed out. |
|
Hum. Curious. On my windows machine I don’t have issues.
I don’t know well the test stack.
I'll try to have a look tomorrow.
Envoyé à partir de Outlook pour iOS<https://aka.ms/o0ukef>
…________________________________
De : Aniket Maurya ***@***.***>
Envoyé : Thursday, December 12, 2024 8:09:16 PM
À : Lightning-AI/LitServe ***@***.***>
Cc : François Ponchon ***@***.***>; Mention ***@***.***>
Objet : Re: [Lightning-AI/LitServe] Fix Windows Threading Issues (PR #385)
@FrsECM<https://github.com/FrsECM> it seems like the Windows tests are stuck and have timed out.
—
Reply to this email directly, view it on GitHub<#385 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AGG5F7H2TWURUOQHJ6OVXRT2FHNNZAVCNFSM6AAAAABTBG2HB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKMZZHAYDQMJZHA>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
@aniketmaurya i did a first try and suspected httpx>=0.28.0. But i was not up to date. On python 3.10.15, i have no issues on my computer, every tests are running. (litserve) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve\lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.10.15, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.6.2.post1, asyncio-0.25.0, cov-6.0.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 169 items
tests\e2e\test_e2e.py ............. [ 7%]
tests\test_auth.py .... [ 10%]
tests\test_batch.py ........... [ 16%]
tests\test_callbacks.py .... [ 18%]
tests\test_cli.py .. [ 20%]
tests\test_compression.py . [ 20%]
tests\test_connector.py s..ssssssss. [ 27%]
tests\test_docker_builder.py .. [ 28%]
tests\test_examples.py .......... [ 34%]
tests\test_form.py ... [ 36%]
tests\test_lit_server.py .........sssss..ss........... [ 53%]
tests\test_litapi.py .................. [ 64%]
tests\test_logger.py ........ [ 69%]
tests\test_logging.py ... [ 71%]
tests\test_loops.py ............. [ 78%]
tests\test_middlewares.py ... [ 80%]
tests\test_pydantic.py . [ 81%]
tests\test_readme.py s [ 81%]
tests\test_schema.py .. [ 82%]
tests\test_simple.py ....... [ 86%]
tests\test_specs.py .................. [ 97%]
tests\test_torch.py .s [ 98%]
tests\test_utils.py .. [100%]
================================================= 151 passed, 18 skipped, 50 warnings in 204.37s (0:03:24) ==================================================On python 3.11.11, i have no issues on my computer, every tests are running. (litserve3.11) PS C:\BUSCODE\packages\LitServe> python -m pytest
C:\Users\F296849\AppData\Local\miniforge3\envs\litserve3.11\Lib\site-packages\pytest_asyncio\plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
==================================================================== test session starts ====================================================================
platform win32 -- Python 3.11.11, pytest-8.3.4, pluggy-1.5.0
rootdir: C:\BUSCODE\packages\LitServe
configfile: pytest.ini
plugins: anyio-4.7.0, asyncio-0.25.0, cov-6.0.0, retry-1.6.3
asyncio: mode=Mode.STRICT, asyncio_default_fixture_loop_scope=None
collected 169 items
tests\e2e\test_e2e.py ............. [ 7%]
tests\test_auth.py .... [ 10%]
tests\test_batch.py ........... [ 16%]
tests\test_callbacks.py .... [ 18%]
tests\test_cli.py .. [ 20%]
tests\test_compression.py . [ 20%]
tests\test_connector.py s..ssssssss. [ 27%]
tests\test_docker_builder.py .. [ 28%]
tests\test_examples.py .......... [ 34%]
tests\test_form.py ... [ 36%]
tests\test_lit_server.py .........sssss..ss........... [ 53%]
tests\test_litapi.py .................. [ 64%]
tests\test_logger.py ........ [ 69%]
tests\test_logging.py ... [ 71%]
tests\test_loops.py ............. [ 78%]
tests\test_middlewares.py ... [ 80%]
tests\test_pydantic.py . [ 81%]
tests\test_readme.py s [ 81%]
tests\test_schema.py .. [ 82%]
tests\test_simple.py ....... [ 86%]
tests\test_specs.py .................. [ 97%]
tests\test_torch.py .s [ 98%]
tests\test_utils.py .. [100%]
================================================= 151 passed, 18 skipped, 48 warnings in 205.77s (0:03:25) ================================================== |
|
@aniketmaurya I tried to increase the timeout. The usage of threading instead of processes and the absence of uvloop on windows makes it slower. It may also depend on the runner. |
|
hi @FrsECM, Happy New Year! How is it going here? Please let me know if you need any help? |
|
Hi !
Happy new year !
Unfortunately I’m unable to reproduce what happens on the stuck CICD locally.
And I don’t have ressource to do this locally. (I can only use docker through wsl (eg. linux) with my company’s computer.
For me the fix would be to increase the timeout since the issue was present before my proposal to fix multi worker problems.
I suspect a problem with virtualization since my computer is not a VM.
Envoyé à partir de Outlook pour iOS<https://aka.ms/o0ukef>
…________________________________
De : Aniket Maurya ***@***.***>
Envoyé : Monday, January 6, 2025 12:06:13 AM
À : Lightning-AI/LitServe ***@***.***>
Cc : François Ponchon ***@***.***>; Mention ***@***.***>
Objet : Re: [Lightning-AI/LitServe] Fix Windows Threading Issues (PR #385)
hi @FrsECM<https://github.com/FrsECM>, Happy New Year! How is it going here? Please let me know if you need any help?
—
Reply to this email directly, view it on GitHub<#385 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AGG5F7AL7MTOLQDFOMNAV4T2JG3GLAVCNFSM6AAAAABTBG2HB6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDKNZRG44DCNZSHE>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
|
This is still not fixed in latest litServe python package. Please fix it? It is such a trivial thing to fix for any dev. I have 100% reproducible example. |
The fix is implemented... Just i've been asked to check the CI that have been crashed before i started the implementation of the PR and i don't have hands on this. I suspect something with VMs. Hope i'll get some support from maintainer about that. |
@aniketmaurya, @Borda could you have a look on the CICD stuck ? For me the fix is to :
Thanks, |
|
@bhimrazy, is it possible to have a way to allow individual contributors to run pending checks manually ? |
Hi @FrsECM , I'm not sure about the access permissions on the CI, but I'll look into this issue over this weekend. I may need to test it in a separate PR since I don't have edit access to PRs either. Thanks for your patience! 😊 |
Sorry, guys! I did start working on this issue but had to pause since it requires a windows device, and mine is currently under repair due to some part replacements. Hoping to get back to this issue soon! Apologies again, for the late update—I thought I had already mentioned it (my bad)! That said, if this is a priority and someone else wants to take it up, please feel free to go ahead. |
|
Just merged with last main branch. Can someone run pipelines ? |
for more information, see https://pre-commit.ci
…FrsECM/LitServe into bugfix/windows_multiple_workers
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
|
I created a windows VM in order to understand a little more the issue... I proposed a new way to handle KeyboardInterrupt on Windows that could have less impacts on tests. # Catch the main pid and :
class LitLoop(_BaseLoop):
def __init__(self):
self._context = {}
self.server_pid = os.getpid()
....
# Kindly ask os to remove the server pid in case of keyboard interuption :
def kill(self):
with self._lock:
try:
print(f'Stop Server Requested - Kill parent pid [{self._server_pid}] from [{os.getpid()}]')
os.kill(self._server_pid,signal.SIGTERM)
except PermissionError:
# Access Denied because pid already killed...
return
Because on windows and macos the Ctrl+C is catched by Inference worker, we can just add an exception to handle the case : ...
except KeyboardInterrupt:
print(f"Keyboard Interruption - Kill server [{self.server_pid}]")
self.kill()
returnIt works on macos and windows to kill server with Ctrl+C : Could you please try to run the CICD to check if it positively affected tests ? Thanks ! |
aniketmaurya
left a comment
There was a problem hiding this comment.
@FrsECM LGTM! Thank you for updating the PR and your patience on this.
|
all good @FrsECM! Merging this after the CI turns green 🚀 |


Before submitting
As a user, i need to serve a model with multiple worker per device on a windows machine
What does this PR do?
This PR propose a fix to bug from Issue #384.
Uvicorn reference: Source
For example, with this simple server :
Bellow, you'll find the difference between before and after the PR :
Before the PR
After the PR
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃