Skip to content

Conversation

@Forgetyk
Copy link

Problem

When Mitogen is used with become: sudo, the first-stage bootstrap
sometimes receives EOF on stdin before the compressed preamble arrives.
Under Python 3.10 + Ansible 2.12 the sequence

MITO000
Traceback … TypeError: a bytes-like object is required, not 'NoneType'

is reproducible on every run.

  • _first_stage() calls fp.read(PREAMBLE_COMPRESSED_LEN)
    → returns None instead of bytes.
  • zlib.decompress(None) raises TypeError; controller reports
    “EOF on stream”.

Root cause: after writing the preamble the parent closes the write end of the
pipe immediately; with SSH compression (ssh -C) the tail of the deflate
stream can be lost, so the child sees premature EOF.


Fix

Read the preamble reliably:

remaining = PREAMBLE_COMPRESSED_LEN
chunks = []
deadline = time.time() + 10
while remaining:
    chunk = fp.read(remaining)
    if chunk:
        chunks.append(chunk)
        remaining -= len(chunk)
        continue
    if time.time() > deadline:
        raise RuntimeError('early EOF while reading preamble')
    time.sleep(0.02)
C = zlib.decompress(b''.join(chunks))
  • loops until all bytes are received (or timeout hits);
  • eliminates the None case, therefore no TypeError;
  • zero functional impact on normal fast paths.

Forgetyk added 2 commits July 10, 2025 15:49
If sudo child reads stdin faster than the parent writes, fp.read()
may return None, triggering TypeError in zlib.decompress().  Loop
until the full preamble arrives or a 10-second timeout expires.
parent: avoid early-EOF in sudo bootstrap, wait until full preamble is read
@Forgetyk Forgetyk closed this Jul 10, 2025
@moreati
Copy link
Member

moreati commented Jul 10, 2025

There may be a bug here. A minmal reproducer would be needed to confim it. The proposed fix is incorrect, as shown by the failing test run.

Problems with the patch (non-exhaustive)

  • Using import time at module level doesn't make time available in the first stage.
  • b'' is not available in Python < 2.6

Potential problems

  • I'm unsure if mitogen.parent.Connection.get_boot_command() can handle multiple levels of indentation.

Notes to self

  • os.fdopen(0, 'wb', 0) returns an _io.FileIO object (on Python 3.9)
  • io.FileIO is unbuffered, inherits methods from io.RawIOBase
  • io.RawIO.read(n)
    • Performs exactly one system call
    • Does not guarantee n bytes will be read
    • Returns None if fp is non-block, and no bytes are available.

@moreati moreati reopened this Jul 10, 2025
@moreati
Copy link
Member

moreati commented Jul 10, 2025

Scratch those notes to self.

  • _first_stage() uses os.fdopen(0,'rb'), not os.fdopen(0,'rb',0)
  • os.fdopen(0,'rb') returns an _io.BufferedReader
  • io.BufferedIOBase.read(n)
    • May perform multiple system calls
    • Also does not guarantee n bytes

@Forgetyk
Copy link
Author

I’ve pushed the latest amendments so that the CI suite now passes.
The remaining failures we saw (e.g. container-image pulls and legacy apt repository errors) were unrelated to the _first_stage() logic itself; they were caused by external factors (private GHCR images and Debian 10 archive URLs).

Please let me know if you’d like the fix split into smaller commits or if anything needs further clarification.

Thanks for the review!

@moreati
Copy link
Member

moreati commented Jul 14, 2025

There may be a bug here. A minmal reproducer would be needed to confim it.

I still haven't seen direct evidence that there is a bug. I'm making that a requirement - a reproducer that can form the basis of a regression test - before we move onto the design/finer points of any fix.

I'm looking at the Debian problem in CI.

@Forgetyk
Copy link
Author

Attached playbook startup log with error ansible_log.txt

@moreati
Copy link
Member

moreati commented Jul 14, 2025

Attached playbook startup log with error ansible_log.txt

As I said, I need a reproducer, a minimal playbook or Python script that exhibits the error. Something that anyone can run, and be adapted to become part of the test suite.

A log sample alone doesn't fufill that requirement.

@moreati
Copy link
Member

moreati commented Jul 16, 2025

Notes to self

Open questions

  1. Does fp.read(PREAMBLE_COMPRESSED_LEN) returning None indicate stdin has become non-blocking?
  2. If so, how did that occur?
  3. Are there other mechanisms that can cause it return None?

@moreati
Copy link
Member

moreati commented Jul 16, 2025

@Forgetyk to help with ruling out a few possibilities

  1. What version of Python is your target (iac-test-2) running?
  2. What is the output of uname -a on your target?

Alternatively, what is the output of ruinning a playbook like

- hosts: iac-test-2
  become: true
  gather_facts: true
  strategy: linear
  tasks:
    - {debug: {var: ansible_facts.architecture}}
    - {debug: {var: ansible_facts.distribution}}
    - {debug: {var: ansible_facts.distribution_major_version}}
    - {debug: {var: ansible_facts.distribution_version}}
    - {debug: {var: ansible_facts.os_family}}
    - {debug: {var: ansible_facts.python}}

@Forgetyk
Copy link
Author

It is more difficult there the occurrence of this problem. In production it appears on all servers, but not on new virtual machines. Now I'm trying to determine exactly what is causing the problem, maybe it's a combination of the ssh settings of the version itself and the use of ssh parameters.

As soon as I get the exact result, I will report back.

@moreati
Copy link
Member

moreati commented Jul 17, 2025

@Forgetyk to help with ruling out a few possibilities
1. What version of Python is your target (iac-test-2) running?
2. What is the output of uname -a on your target?

Please could you answer these for a host that has shown the failure. Even if it's not still doing so.

@Forgetyk
Copy link
Author

Was able to reproduce the problem and find the original source


Environment

Component Version
OS Linux 5247089-it57222 5.15.0-142-generic #152-Ubuntu SMP Mon May 19 10:54:31 UTC 2025 x86_64 x86_64 x86_64 x86_64 GNU/Linux
Python  3.10.12
sudo  1.9.9
OpenSSH  OpenSSH_8.9p1 Ubuntu-3ubuntu0.10,
ansible-core  2.12.10

All tests were run with strategy = mitogen_linear and pipelining = true.


Reproducer

  1. Target host – enable sudo I/O logging:

    Defaults log_output
    Defaults!/usr/bin/sudoreplay !log_output
    Defaults!/usr/local/bin/sudoreplay !log_output
    Defaults!/sbin/reboot !log_output
    Defaults loglinelen=0
    Defaults logfile=/var/log/sudo.log
    #Defaults:ansible !log_output, !log_input
    root    ALL=(ALL:ALL) ALL
    %admin ALL=(ALL) ALL
    %sudo   ALL=(ALL:ALL) ALL
    @includedir /etc/sudoers.d
    
  2. Target host – ssh settings:

     Protocol 2
     SyslogFacility AUTH
     LogLevel INFO
     LoginGraceTime 60
     PermitRootLogin no
     PermitEmptyPasswords no
     IgnoreRhosts yes
     HostbasedAuthentication no
     X11Forwarding no
     ClientAliveInterval 900
     ClientAliveCountMax 0
     AllowTcpForwarding no
     ChallengeResponseAuthentication no
     UsePAM yes
     Subsystem    sftp    /usr/lib/openssh/sftp-server
     PrintMotd no
     ListenAddress 0.0.0.
    
  3. Controller – standard Mitogen config (only the interesting bits):

    # ansible.cfg
    [defaults]
    strategy_plugins = /…/ansible_mitogen/plugins/strategy
    strategy         = mitogen_linear
    [ssh_connection]
    pipelining = true
  4. Inventory

    [target]
    iac-test-2 ansible_user=ansible ansible_python_interpreter=/usr/bin/python3
  5. Playbook

       become: yes
       gather_facts: no
       hosts:
     	- iac-test-2
       roles:
     	- some_role
       vars:
     	some_var: "{{ var }}"
  6. Result

    BlockingIOError: [Errno 35] Resource temporarily unavailable
    mitogen.core.py", line 2247, in write
    

    The crash appears within a few hundred ms, long before any module code is executed.


What is happening

  • sudo log_output/log_input spawns a pseudo‑terminal in front of the command it runs and copies all I/O through that PTY.
    The PTY’s kernel buffer is only ~64 KiB.
  • Mitogen opens both ends of its pipes O_NONBLOCK and uses an event loop.
  • During the bootstrap it pushes ~35 KiB of compressed preamble in one go; a couple of other writes follow immediately.
    When the PTY buffer fills up, write() returns EAGAINBlockingIOError, which is not retried and kills the stream.

Plain Ansible (no Mitogen) performs the same copy through sudo, but its os.write() is blocking, so it simply waits and survives.


Work‑around that mask the issue

Defaults:ansible !log_output, !log_input might be a solution for other similar problems, but it is forbidden within our production regulations.

Because disabling I/O logging is off‑limits for compliance, we need Mitogen to gracefully retry EAGAIN (or temporarily switch to blocking writes).

@moreati
Copy link
Member

moreati commented Jul 21, 2025

Thank you thats great dtective/dignostic work. I've raised #1306 based on it. I'm not sure if the ultimate should be in _first_stage() elsewhere, or both. I'll try to recreate the error locally, so we can have a test case.

@moreati
Copy link
Member

moreati commented Jul 21, 2025

Component Version
OS Linux 5247089-it57222 5.15.0-142-generic # 152-Ubuntu SMP Mon May 19 10:54:31 UTC 2025 x86_64 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.12
sudo 1.9.9
OpenSSH OpenSSH_8.9p1 Ubuntu-3ubuntu0.10
ansible-core 2.12.10

@Forgetyk could you confirm

  1. Is this the same on the Ansible controller and the targets? Or do the differ? In particular do the Python versions differ?

Edit: following is notes to myself:

This is Ubuntu 22.04 (Jammy), going by the kernel, sudo and OpenSSH version. The Python version also tallies with recent updates. At time of writing Python 3.10.18 is the latest 3.10 release https://docs.python.org/release/3.10.18/whatsnew/changelog.html

@Forgetyk
Copy link
Author

All specified versions are the same on the controller and target hosts

@moreati
Copy link
Member

moreati commented Jul 22, 2025

I'm still trying to reproduce this under #1306

  1. What version of Mitogen are you using?
  2. If it is < 0.3.24 please could you try Mitogen 0.3.24 or 0.3.25b1?

@Forgetyk
Copy link
Author

I have used versions 0.3.24 and 0.3.25a3. on both versions the problem is reproducible. I can test 0.3.25b1 if needed.

@moreati
Copy link
Member

moreati commented Jul 22, 2025

I can test 0.3.25b1 if needed.

No need. There was a moderate chance 0.3.24 contained releavant changes, a slim chance one of the alphas did. There's no chance the change(s) in b1 is/are relevant. Thanks for confirming.

@moreati
Copy link
Member

moreati commented Aug 15, 2025

@Forgetyk I've create #1307 with an alternative fix, and regression tests. I've you get a chance please could you try it. Thanks, and sorry for the wait. Alex

@moreati
Copy link
Member

moreati commented Aug 20, 2025

Superceded by #1307

@moreati moreati closed this Aug 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants