moreati commented Oct 30, 2025

Since using select.select() in the first stage (to handle an obscure corner case where stdin appears to be non-blocking) there has been a report of first stage processes running for ever in an infinite loop - reading 0 bytes from stdin.
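
To illustrate the failure mode, here is a minimal sketch of a select.select() based read loop. It is not the first stage code: `read_exact` and its parameters are hypothetical names chosen for this example. It assumes the stream has reached EOF, in which case select() reports the descriptor as readable and os.read() returns b''. A loop that only counts bytes would then never finish.

```python
import os
import select


def read_exact(fd, size, timeout=1.0):
    """Read exactly `size` bytes from `fd`, or raise if the stream ends early.

    Hypothetical helper for illustration; not Mitogen's first stage code.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        # At EOF select() reports the descriptor as readable ...
        readable, _, _ = select.select([fd], [], [], timeout)
        if not readable:
            raise TimeoutError('no data within %.1f seconds' % timeout)
        chunk = os.read(fd, remaining)
        # ... and os.read() returns b''. Without this check the loop would
        # spin indefinitely, reading 0 bytes on every pass.
        if not chunk:
            raise EOFError('expected %d more bytes' % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)
```

Treating an empty read as an error is only one possible guard. It does nothing for stalls that have a different cause.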

This attempts to do an end run around that problem by aborting if the bootstrap takes longer than a few seconds for any reason. Existing retry logic should deal with it as before.

refs #1306, #1307, #1348
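
The idea can be sketched in isolation, assuming a POSIX system and that no SIGALRM handler is installed, so the default action (terminate the process) applies. `CHILD` and `run_stalled_child` are hypothetical names for this example, and the 1 second alarm is only for convenience. The child blocks on a stdin pipe that never delivers data or EOF, and the alarm ends it regardless.

```python
import signal
import subprocess
import sys

# Child program: arm a 1 second alarm, then block reading stdin.
# No SIGALRM handler is installed, so the default action (terminate) applies.
CHILD = 'import signal, sys; signal.alarm(1); sys.stdin.read()'


def run_stalled_child():
    """Start the child with a stdin pipe that never delivers data or EOF."""
    proc = subprocess.Popen([sys.executable, '-c', CHILD],
                            stdin=subprocess.PIPE)
    try:
        # Generous upper bound; the alarm should fire long before this.
        return proc.wait(timeout=30)
    finally:
        proc.stdin.close()


if __name__ == '__main__':
    returncode = run_stalled_child()
    # subprocess reports "killed by signal N" as a return code of -N.
    print(returncode == -signal.SIGALRM)
```

Because the kernel delivers the signal, the timeout does not depend on which call the process is stuck in or on why it is stuck.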

moreati changed the title from "Investigate hanging first stage causes and fixes" to "mitogen: Prevent hung bootstrap processes, add 5 second timeout to first stage" Nov 5, 2025
moreati marked this pull request as ready for review November 5, 2025 20:15
moreati commented Nov 7, 2025

An experiment to test my belief about how this patch will behave. The following code is fork_it.py, a simplified version of the first stage:

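#!/usr/bin/env python3

import os
import signal
import sys

# Pipe that carries the preamble from the fork child to the fork parent.
R, W = os.pipe()

if child_pid := os.fork():
    # Fork parent: keep the original stdin on fd 100, then make the pipe's
    # read end the new stdin before becoming a fresh interpreter.
    os.dup2(0, 100)
    os.dup2(R, 0)
    os.close(R)
    os.close(W)
    sys.stdout.write(f'parent: {os.getpid()}, child: {child_pid}\n')
    os.execl(sys.executable, sys.executable)
else:
    # Fork child: the default SIGALRM action kills this process if the read
    # below has not finished within 2 seconds.
    signal.alarm(2)
    sys.stdin.read()  # Contents are intentionally discarded.
    with os.fdopen(W, 'wb', 0) as f:
        f.write(b'print("Hello world!")\n')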

The simplifications include

  • The fork child sends a hard coded preamble (b'print("Hello world!")\n')
  • The fork child discards whatever it reads from stdin; this is intentional.
  • stdin can be safely assumed to be blocking
  • There is no compression, or stage 0.
  • There is only a single pipe, since there's no need to cache a copy of the preamble
  • For convenience the SIGALRM timeout is only 2 seconds.

What I expect:

  • When no data is provided to stdin

    • the call to sys.stdin.read() blocks
    • eventually the fork child is killed by SIGALRM
    • when the child terminates, the Python interpreter in the parent will receive EOF and also terminate
    • the simulated preamble will never be received, so won't execute or print anything
    • no lingering processes should be left
  • When data is provided

    • Hello world! will be printed
    • both parent and child will exit gracefully
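
The EOF step in the first case can be checked on its own. The sketch below assumes a POSIX system, and `reader_sees_eof_after_alarm` is a hypothetical name. It forks a writer that stalls until SIGALRM kills it, while the parent blocks reading the other end of the pipe. As in fork_it.py, the parent has to close its own copy of the write end, otherwise EOF would never arrive.

```python
import os
import signal


def reader_sees_eof_after_alarm(seconds=1):
    """Fork a writer that stalls; return (data_read, terminating_signal)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: holds the only remaining write end, then stalls.
        os.close(r)
        # Restore the default action in case the caller installed a handler.
        signal.signal(signal.SIGALRM, signal.SIG_DFL)
        signal.alarm(seconds)
        signal.pause()
        os._exit(1)  # Not reached; SIGALRM terminates the child first.
    # Parent: close its copy of the write end so EOF can be observed.
    os.close(w)
    data = os.read(r, 1)  # Blocks until the child dies, then returns b''.
    os.close(r)
    _, status = os.waitpid(pid, 0)
    sig = os.WTERMSIG(status) if os.WIFSIGNALED(status) else None
    return data, sig


if __name__ == '__main__':
    data, sig = reader_sees_eof_after_alarm()
    print(data == b'', sig == signal.SIGALRM)
```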

Results appear to match this

mitogen git:(master) ✗ python3 fork_it.py
parent: 11482, child: 11483
mitogen git:(master) ✗ ps -p 11482,11483
  PID TTY           TIME CMD
mitogen git:(master) ✗ python3 fork_it.py <<< foo
parent: 11725, child: 11726
Hello world!

moreati marked this pull request as draft November 27, 2025 10:01
Since using select.select() in the first stage (to handle an obscure corner
case where stdin appears to be non-blocking) there has been a report of first
stage processes running for ever in an infinite loop - reading 0 bytes from
stdin.

This attempts to do an end run around that problem by aborting if the
bootstrap takes longer than a few seconds for *any* reason. Existing retry
logic should deal with it as before.

5 seconds is a best guess at a suitable timeout.