@mhartmay
Contributor

@mhartmay mhartmay commented Dec 4, 2025

The current implementation can cause an infinite loop, leading to a process that hangs and consumes 100% CPU. This occurs because the EOF condition is not handled properly, resulting in repeated select(...) and read(...) calls.

The fix is to properly handle the EOF condition and break out of the loop when it occurs.

Fixes: #1348
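
A minimal sketch of the EOF handling described above, written long-hand rather than in the compressed one-line style of the first stage. The helper name `read_exact` is hypothetical and not part of Mitogen, and a pipe stands in for fd 0.

```python
import os
import select


def read_exact(fd, n):
    """Read up to n bytes from fd, stopping early if EOF is reached."""
    data = b''
    while len(data) < n:
        # Block until fd is readable; EOF also counts as readable.
        select.select([fd], [], [])
        chunk = os.read(fd, n - len(data))
        if not chunk:
            # An empty result means EOF: stop instead of spinning.
            break
        data += chunk
    return data


r, w = os.pipe()
os.write(w, b'abc')
os.close(w)                # the writer goes away before 10 bytes arrive
result = read_exact(r, 10)
os.close(r)
print(result)
```

Without the `if not chunk` check, the loop would call `select` and `os.read` again and again once the writer has closed, which matches the 100% CPU symptom reported in the issue.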

@mhartmay
Contributor Author

mhartmay commented Dec 4, 2025

Note: I did not test whether this actually fixes the problem since I did not reproduce the problem, but it's very likely as I had similar issues in the past.

@mhartmay
Contributor Author

mhartmay commented Dec 4, 2025

Question: Should we raise an error if not all data could be read?

Update: What speaks against raising an error is that it would differ from the behavior of the "original code" (before #1307).

The original code reads:

        fp=os.fdopen(0,'rb')
        C=zlib.decompress(fp.read(PREAMBLE_COMPRESSED_LEN))
        fp.close()

fdopen returns a BufferedReader, and the documentation for its read() method says:

Fewer bytes may be returned than requested. [1]

So even with the old version it was possible to read less data than expected without an exception being raised, so we should probably keep that behavior for now.

[1] https://docs.python.org/3/library/io.html#io.BufferedIOBase.read
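
To illustrate the point, the following sketch shows a buffered read returning fewer bytes than requested without raising. A pipe is used in place of fd 0, and the length 10 is an arbitrary stand-in for PREAMBLE_COMPRESSED_LEN chosen for this example.

```python
import os

r, w = os.pipe()
os.write(w, b'abc')
os.close(w)                  # writer closes after only 3 bytes

fp = os.fdopen(r, 'rb')      # buffered reader, as in the original code
data = fp.read(10)           # ask for 10 bytes, EOF arrives first
fp.close()

print(len(data))             # short read, no exception raised
```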

@moreati
Member

moreati commented Dec 5, 2025

> Question: Should we raise an error if not all data could be read?

The call to zlib.decompress() raises an exception if the data is truncated.
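
A small sketch of that behavior, using a made-up payload rather than the real preamble: a truncated zlib stream is rejected with `zlib.error`, so a short read does not pass silently once decompression runs.

```python
import zlib

original = b'print("hello")' * 20
payload = zlib.compress(original)
truncated = payload[:len(payload) // 2]   # simulate a short read

try:
    zlib.decompress(truncated)
except zlib.error as exc:
    error = exc
    print('truncated input rejected:', exc)
else:
    error = None
```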

-C=''.encode()
-while int(sys.argv[3])-len(C)and select.select([0],[],[]):C+=os.read(0,int(sys.argv[3])-len(C))
+n=int(sys.argv[3]);C=''.encode();V='V'
+while n>len(C) and V:select.select([0],[],[]);V=os.read(0,n-len(C));C+=V
Member

fd 0 may be in non-blocking mode, therefore a zero-length read does not necessarily indicate EOF

Contributor Author

Wouldn't we get a BlockingIOError [1] in that case? At least the documentation of os.read [2] says:

Return a bytestring containing the bytes read. If the end of the file referred to by fd has been reached, an empty bytes object is returned.

[1] https://docs.python.org/3/library/exceptions.html#BlockingIOError
[2] https://docs.python.org/3/library/os.html#os.read

Contributor Author

And we use select.select(...) in order to avoid that situation, no? :)
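
The distinction under discussion can be checked with a short sketch. It assumes Python 3 on a Unix-like system (`os.set_blocking` needs Python 3.5+) and uses a pipe instead of fd 0; behavior on Python 2 is a separate question.

```python
import os

r, w = os.pipe()
os.set_blocking(r, False)    # put the read end in non-blocking mode

# Writer still open, nothing written: "no data yet" is not EOF.
try:
    os.read(r, 10)
    outcome = 'returned'
except BlockingIOError:
    outcome = 'BlockingIOError'

# Writer closed: now the read reports EOF as an empty bytes object.
os.close(w)
at_eof = os.read(r, 10)
os.close(r)

print(outcome, repr(at_eof))
```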

Member

I should have said None. On Python 3, f.read(n)

may either raise BlockingIOError or return None if no data is available. io implementations return None.
-- https://docs.python.org/3/library/io.html#io.BufferedIOBase.read

Contributor Author

@mhartmay mhartmay Dec 5, 2025

Yes, but we use os.read(...) not the BufferedIO read (e.g. open(...).read()).

Member

Hmm, I need to refresh my memory of #1306 and #1307.

Contributor Author

@mhartmay mhartmay Dec 5, 2025

And instead of the SIGALRM trick we could use the timeout parameter of select [1] and check whether the FD is ready to read... but that results in a little more code.

Currently, we know that we can read data (even if it's the EOF) from one FD when the select call returns as we only pass one FD to the select function.

[1] https://docs.python.org/3/library/select.html#select.select
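
A sketch of the select-with-timeout idea, assuming a Unix-like system. The helper `wait_readable` and the 0.05 second timeout are hypothetical choices for the example, not the values used in this PR.

```python
import os
import select


def wait_readable(fd, timeout):
    """Return True if fd became readable (data or EOF) within timeout seconds."""
    readable, _, _ = select.select([fd], [], [], timeout)
    return bool(readable)


r, w = os.pipe()
idle = wait_readable(r, 0.05)       # nothing written yet: select times out
os.write(w, b'x')
has_data = wait_readable(r, 0.05)   # data pending: readable
os.read(r, 1)
os.close(w)
at_eof = wait_readable(r, 0.05)     # EOF also makes the fd readable
os.close(r)

print(idle, has_data, at_eof)
```

An empty result from `select` therefore signals a timeout, while a ready descriptor followed by an empty `os.read` signals EOF.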

Contributor Author

@mhartmay mhartmay Dec 5, 2025

> Hmm, I need to refresh my memory of #1306 and #1307

fp.read() <--- that's the bufferedIO read != os.read(..)

and the original code used BufferedIO as os.fdopen is more or less an alias for open(...) [1] :)
        fp=os.fdopen(0,'rb')
        C=zlib.decompress(fp.read(PREAMBLE_COMPRESSED_LEN))

[1] https://docs.python.org/3/library/os.html#os.fdopen

@mhartmay mhartmay requested a review from moreati December 5, 2025 14:42
@mhartmay
Contributor Author

mhartmay commented Dec 5, 2025

@moreati I've added timeout handling, but I'm not sure which kind of exception should be raised on timeout (currently I raise Exception('TODO')). I would also rather use a large timeout, as we do not know how fast the systems are.

Edit: Maybe even make the timeout value configurable.

@mhartmay mhartmay force-pushed the possible-fix branch 2 times, most recently from 62dff70 to ae77274 Compare December 8, 2025 08:45
@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

> @moreati I've added timeout handling, but I'm not sure which kind of exception should be raised on timeout (currently I raise Exception('TODO')). I would also rather use a large timeout, as we do not know how fast the systems are.
>
> Edit: Maybe even make the timeout value configurable.

I've added some preliminary steps to make it configurable and added a test for the timeout situation.

@moreati
Member

moreati commented Dec 8, 2025

By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal as possible.

To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.
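
As a rough illustration of what "size after encoding and compression" means, the sketch below raw-deflates and base64-encodes a source string, the inverse of the `zlib.decompress(binascii.a2b_base64(...), -15)` call in the bootstrap command quoted in a traceback later in this thread. The helper `encode` and the sample source are hypothetical; preamble_size.py remains the authoritative tool and may measure things differently.

```python
import binascii
import zlib


def encode(source):
    """Raw-deflate and base64-encode source bytes."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    compressed = compressor.compress(source) + compressor.flush()
    return binascii.b2a_base64(compressed).rstrip(b'\n')


sample = b'while n>len(C) and V:select.select([0],[],[]);V=os.read(0,n-len(C));C+=V\n'
encoded = encode(sample)
print(len(sample), len(encoded))

# Round trip, using the same decoding steps as the bootstrap command.
decoded = zlib.decompress(binascii.a2b_base64(encoded), -15)
```

Comparing `len(encoded)` before and after a change gives a feel for why even a few extra statements in the first stage show up in the SSH command size.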

I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstances and across supported Python versions.

@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

Yep, right now I'm experimenting, especially with regard to the tests, because it would be great to have minimal tests for blocking and non-blocking STDIO. So the timeout test should be useful in all cases, even for your SIGALRM solution :)

@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

> By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal as possible.

Fine with me :) Reason why I'm thinking about making the timeout configurable is that the halting problem is undecidable and the timeout depends on the system/environment.

> To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.

Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?

> I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstances and across supported Python versions.

Good!

@moreati
Member

moreati commented Dec 8, 2025

> Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?

One before and after for a complete PR will be fine. SSH command size is what I minimize.

@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

@moreati I've now added two test cases, one that causes a timeout and one that causes the EOF situation similar to the original bug.

@mhartmay mhartmay marked this pull request as draft December 8, 2025 10:01
@mhartmay mhartmay force-pushed the possible-fix branch 5 times, most recently from f5eddd8 to 2dc337d Compare December 8, 2025 11:55
@moreati
Member

moreati commented Dec 8, 2025

Based on moreati/nonblock_lab@c4938b9

  • other end of a fifo/pipe is closed (never opened)
    • Python 2.x and Python 3.x <x>.read() both return an empty byte string
  • other end is open, but empty
    • Python 2.x <x>.read() raises an EAGAIN/EWOULDBLOCK derived exception
    • Python 3.x fobj.read() returns None
    • Python 3.x os.read() raises BlockingIOError
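
The file-object findings for Python 3 can be reproduced with a sketch like the one below. It is not the code from nonblock_lab; it assumes Python 3.5+ on a Unix-like system, uses an anonymous pipe instead of a fifo, and does not cover the Python 2.x cases.

```python
import os

# Case 1: the other end is open but has written nothing.
r, w = os.pipe()
os.set_blocking(r, False)
fobj = os.fdopen(r, 'rb')
open_but_empty = fobj.read(10)   # None on Python 3: no data, but not EOF

# Case 2: the other end is closed.
os.close(w)
writer_closed = fobj.read(10)    # b'': EOF
fobj.close()

print(repr(open_but_empty), repr(writer_closed))
```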

@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

@moreati
I've now added one test case that runs the full stage code (tests.first_stage_test.CommandLineTest.test_stage).

I'm going to add multiple other test cases, e.g.
stdin is a blocking FD, stdin is non-blocking, different pipe sizes, and maybe even passing a FD reading from a file.

e.g.

with open("input.file", "rb") as f: 
    proc = subprocess.Popen(
        ...
        stdin=f,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()

Do you have other ideas? Not sure if the file case is even needed...

Edit: To me it seems like a good idea to add more test cases to first_stage_test.py as we have everything under control that we want to test and it's easier to debug than a larger stack.
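
One of the scenarios listed above, stdin reaching EOF too early, could be sketched as follows. The child script is a hypothetical stand-in for the first stage (it only counts the bytes it receives) and is not the code under test in first_stage_test.py; the `timeout` argument needs Python 3.3+.

```python
import subprocess
import sys

# Hypothetical stand-in for the first stage: read up to 100 bytes from
# stdin, stop at EOF, and report how many bytes arrived.
CHILD = (
    "import os,sys\n"
    "n=100;C=b'';V=b'V'\n"
    "while n>len(C) and V:V=os.read(0,n-len(C));C+=V\n"
    "sys.stdout.write(str(len(C)))\n"
)

proc = subprocess.Popen(
    [sys.executable, '-c', CHILD],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
# communicate() writes the input and then closes stdin, so the child
# sees EOF after 5 bytes instead of the 100 it asked for.
out, err = proc.communicate(b'hello', timeout=10)

print(proc.returncode, out)
```

The timeout on `communicate()` turns a regression back into an infinite loop into a test failure instead of a hung test run.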

@mhartmay mhartmay force-pushed the possible-fix branch 2 times, most recently from 7f3d46b to da7b213 Compare December 8, 2025 16:06
@mhartmay
Contributor Author

mhartmay commented Dec 8, 2025

@moreati
Is there a difference between running first_stage_test via ./run_tests.sh and running it manually via the Python unittest runner?

That's the command line I'm using during development:

$ PYTHONPATH=$(pwd)/tests:$PYTHONPATH python -m unittest -v tests.first_stage_test.CommandLineTest
test_eof_too_early (tests.first_stage_test.CommandLineTest.test_eof_too_early)
The boot command should write an ECO marker to stdout, read the ... ok
test_stage (tests.first_stage_test.CommandLineTest.test_stage)
Test that first stage works ... ok
test_stdin_blocking (tests.first_stage_test.CommandLineTest.test_stdin_blocking)
Test that first stage works with blocking STDIN ... ok
test_stdin_non_blocking (tests.first_stage_test.CommandLineTest.test_stdin_non_blocking)
Test that first stage works with non-blocking STDIN ... ok
test_timeout_error (tests.first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the ... ok
test_valid_syntax (tests.first_stage_test.CommandLineTest.test_valid_syntax)
Test valid syntax ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.789s

OK

@mhartmay mhartmay force-pushed the possible-fix branch 2 times, most recently from dfb4447 to 7bcdd55 Compare December 8, 2025 16:37
@mhartmay
Contributor Author

mhartmay commented Dec 9, 2025

@moreati
Two questions:

  1. Is it okay, if I pick your commit as fix for the timeout situation in this PR (or how should we proceed)?
  2. Can we revert the .ci/mitogen_tests.py changes from commit [1], as we now have minimal test cases that provide coverage of the non-blocking FD case?

IMO, as soon as the test cases are stable and the small coding style nits are resolved, it might be a good idea to merge the tests together with the EOF fix (but without my timeout fix/changes). What do you think?

[1] 85d6046

@moreati
Member

moreati commented Dec 9, 2025

I'm still getting my head around things, so I'm not sure how to proceed yet.

  1. Currently I don't think my SIGALRM attempt is the right fix, particularly given feedback from rda0 that it left idle (orphaned?) processes: hanging process with 100% CPU since 0.3.28 #1348 (comment)
  2. Possibly, but at the moment it's providing coverage in the Ansible tests as well, and that's where the original error report came from.

I think your PR, or something evolved from it, is the best direction - particularly your tests.

@mhartmay
Contributor Author

mhartmay commented Dec 9, 2025

@moreati
I couldn’t come up with a non-racy solution, so I implemented custom dummy connectors that use blocking and non-blocking I/O for STDIN/STDOUT.

There is still one unstable test left:

FAIL: test_timeout_error (first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/mitogen/mitogen/tests/first_stage_test.py", line 249, in test_timeout_error
    self.assertIn(
    ~~~~~~~~~~~~~^
        b("TimeoutError"),
        ^^^^^^^^^^^^^^^^^^
        stderr,
        ^^^^^^^
    )
    ^
AssertionError: b'TimeoutError' not found in b'Traceback (most recent call last):\n  File "<string>", line 1, in <module>\n    import sys;sys.path=[p for p in sys.path if p];import binascii,os,select,zlib;exec(zlib.decompress(binascii.a2b_base64(sys.argv[1]),-15))\n                                                                                  ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "<string>", line 27, in <module>\nzlib.error: Error -5 while decompressing data: incomplete or truncated stream\n'

For some reason I do not get error -5 locally. I can change the expectation, but I would like to be able to understand and reproduce the issue on my local system.

@mhartmay mhartmay force-pushed the possible-fix branch 4 times, most recently from ce370f3 to efd09f8 Compare December 9, 2025 15:27
@moreati
Member

moreati commented Dec 9, 2025

I've begun trying variations of your tests in #1392, I'm finishing for the day now but tomorrow or later this week would you like to do a call (e.g. Jitsi, Zoom, Teams, ...) to compare notes and look at failing tests?

Signed-off-by: Marc Hartmayer <[email protected]>
This makes it easier to add more tests and the test description is now used by
the test runner.

Signed-off-by: Marc Hartmayer <[email protected]>
@mhartmay
Contributor Author

mhartmay commented Dec 10, 2025

> I've begun trying variations of your tests in #1392, I'm finishing for the day now but tomorrow or later this week would you like to do a call (e.g. Jitsi, Zoom, Teams, ...) to compare notes and look at failing tests?

Sure, just send me an email.

I've added more tests and improved the tests, so most scenarios should now be covered.

 + test_stdin_non_blocking
 + test_stdin_blocking

Signed-off-by: Marc Hartmayer <[email protected]>
The current implementation can cause an infinite loop, leading to a process that
hangs and consumes 100% CPU. This occurs because the EOF condition is not
handled properly, resulting in repeated select(...) and read(...) calls.

The fix is to properly handle the EOF condition and break out of the loop when
it occurs.

-SSH command size: 822
+SSH command size: 838

-mitogen.parent        98746  96.4KiB  51215 50.0KiB 51.9%  12922 12.6KiB 13.1%
+mitogen.parent        98827  96.5KiB  51219 50.0KiB 51.8%  12942 12.6KiB 13.1%

Fixes: mitogen-hq#1348
Signed-off-by: Marc Hartmayer <[email protected]>
Do not wait/block forever for data to be read.

Add a test for this.

The test can be run using the following command:

PYTHONPATH=$(pwd)/tests:$PYTHONPATH python -m unittest -v tests.first_stage_test

-SSH command size: 838
+SSH command size: 894

                         Original           Minimized           Compressed
-mitogen.parent        98827  96.5KiB  51219 50.0KiB 51.8%  12942 12.6KiB 13.1%
+mitogen.parent        99034  96.7KiB  51295 50.1KiB 51.8%  12970 12.7KiB 13.1%

Signed-off-by: Marc Hartmayer <[email protected]>
-SSH command size: 894
+SSH command size: 905

                         Original           Minimized           Compressed
-mitogen.parent        99034  96.7KiB  51295 50.1KiB 51.8%  12970 12.7KiB 13.1%
+mitogen.parent        99212  96.9KiB  51385 50.2KiB 51.8%  12999 12.7KiB 13.1%

Signed-off-by: Marc Hartmayer <[email protected]>
Bail out if STDIN or STDOUT is closed/not available as it is used for the
communication with the parent process.

Signed-off-by: Marc Hartmayer <[email protected]>
If STDERR is not available, ignore the OSError since it's a non-critical error.

Note: This change is not strictly necessary: the exception message would be
      printed to stderr, which is already closed, and the exit status of the
      forked child process is not checked yet.

Signed-off-by: Marc Hartmayer <[email protected]>
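
The last two commit messages above describe checking whether the standard descriptors are usable. A minimal sketch of that idea follows; the helpers `fd_is_open` and `warn` are hypothetical names for this example, and the actual commits may implement the check differently.

```python
import os


def fd_is_open(fd):
    """Return True if fd refers to an open file descriptor."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def warn(message):
    """Write bytes to stderr, ignoring the error if stderr is unavailable."""
    try:
        os.write(2, message)
    except OSError:
        pass   # non-critical: there is nowhere to report it anyway


r, w = os.pipe()
open_before = fd_is_open(w)
os.close(w)
open_after = fd_is_open(w)   # fstat on a closed fd raises OSError
os.close(r)

print(open_before, open_after)
```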
@mhartmay
Contributor Author

mhartmay commented Dec 10, 2025

@moreati I've polished the commits/tests a little so they should be stable on all environments we support (I did not test with Windows Subsystem for Linux (WSL)).

The whole series results in the following size changes:

-SSH command size: 822
+SSH command size: 925

                         Original           Minimized           Compressed
-mitogen.parent        98746  96.4KiB  51215 50.0KiB 51.9%  12922 12.6KiB 13.1%
+mitogen.parent        99311  97.0KiB  51441 50.2KiB 51.8%  13009 12.7KiB 13.1%

Update:
Can you please push your commit 733f4bc as it makes sense anyway. When it's pushed I can add some comments to the first stage.

Covered test scenarios:

@moreati
Member

moreati commented Dec 10, 2025

> Can you please push your commit 733f4bc as it makes sense anyway. When it's pushed I can add some comments to the first stage.

Done in #1393

@moreati
Member

moreati commented Dec 10, 2025

Thanks for all this. When you're ready I'll do a full review.

Development

Successfully merging this pull request may close these issues.

hanging process with 100% CPU since 0.3.28

2 participants