parent: avoid early-EOF in sudo bootstrap, wait until full preamble is read #1299
If sudo child reads stdin faster than the parent writes, fp.read() may return None, triggering TypeError in zlib.decompress(). Loop until the full preamble arrives or a 10-second timeout expires.
|
There may be a bug here. A minimal reproducer would be needed to confirm it. The proposed fix is incorrect, as shown by the failing test run. Problems with the patch (non-exhaustive)
Potential problems
Notes to self
|
|
Scratch those notes to self.
|
|
I’ve pushed the latest amendments so that the CI suite now passes. Please let me know if you’d like the fix split into smaller commits or if anything needs further clarification. Thanks for the review! |
I still haven't seen direct evidence that there is a bug. I'm making that a requirement (a reproducer that can form the basis of a regression test) before we move on to the design or finer points of any fix. I'm looking at the Debian problem in CI. |
|
Attached playbook startup log with error ansible_log.txt |
As I said, I need a reproducer: a minimal playbook or Python script that exhibits the error. Something that anyone can run, and that can be adapted to become part of the test suite. A log sample alone doesn't fulfill that requirement. |
|
Notes to self
Open questions
|
|
@Forgetyk to help with ruling out a few possibilities
Alternatively, what is the output of running a playbook like
- hosts: iac-test-2
  become: true
  gather_facts: true
  strategy: linear
  tasks:
    - {debug: {var: ansible_facts.architecture}}
    - {debug: {var: ansible_facts.distribution}}
    - {debug: {var: ansible_facts.distribution_major_version}}
    - {debug: {var: ansible_facts.distribution_version}}
    - {debug: {var: ansible_facts.os_family}}
    - {debug: {var: ansible_facts.python}} |
|
The occurrence of this problem is more difficult to pin down. In production it appears on all servers, but not on new virtual machines. I'm now trying to determine exactly what causes it; it may be a combination of the SSH version itself and the SSH parameters in use. As soon as I have an exact result, I will report back. |
Please could you answer these for a host that has shown the failure. Even if it's not still doing so. |
|
Was able to reproduce the problem and find the original source Environment
All tests were run with Reproducer
What is happening
Plain Ansible (no Mitogen) performs the same copy through sudo, but its Work‑around that mask the issue
Because disabling I/O logging is off‑limits for compliance, we need Mitogen to gracefully retry |
|
Thank you, that's great detective/diagnostic work. I've raised #1306 based on it. I'm not sure if the ultimate should be in |
@Forgetyk could you confirm
Edit: following is notes to myself: This is Ubuntu 22.04 (Jammy), going by the kernel, sudo and OpenSSH version. The Python version also tallies with recent updates. At time of writing Python 3.10.18 is the latest 3.10 release https://docs.python.org/release/3.10.18/whatsnew/changelog.html |
|
All specified versions are the same on the controller and target hosts |
|
I'm still trying to reproduce this under #1306
|
|
I have used versions 0.3.24 and 0.3.25a3; the problem is reproducible on both. I can test 0.3.25b1 if needed. |
No need. There was a moderate chance 0.3.24 contained relevant changes, a slim chance one of the alphas did. There's no chance the change(s) in b1 is/are relevant. Thanks for confirming. |
|
Superseded by #1307 |
Problem
When Mitogen is used with become: sudo, the first-stage bootstrap
sometimes receives EOF on stdin before the compressed preamble arrives.
Under Python 3.10 + Ansible 2.12 the sequence
is reproducible on every run.
_first_stage() calls fp.read(PREAMBLE_COMPRESSED_LEN) → returns None instead of bytes.
zlib.decompress(None) raises TypeError; controller reports “EOF on stream”.
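The failure mode described above can be sketched in isolation. This is a minimal illustration, not Mitogen's bootstrap code: it assumes the descriptor is in non-blocking mode and opened unbuffered, which this thread does not establish for the sudo child's stdin. Under that assumption a raw read with no data available returns None, and passing that to zlib.decompress() raises TypeError.

```python
import os
import zlib

# A pipe with nothing written yet stands in for the child's stdin.
read_fd, write_fd = os.pipe()

# Assumption for illustration: the read end is non-blocking.
os.set_blocking(read_fd, False)

# Unbuffered binary mode yields a raw FileIO object.
fp = os.fdopen(read_fd, 'rb', 0)

# No data is available, so the raw read returns None rather than bytes.
data = fp.read(16)

try:
    zlib.decompress(data)
except TypeError as exc:
    error = exc
else:
    error = None

os.close(write_fd)
fp.close()

print(repr(data), type(error).__name__)
```

A blocking descriptor would instead wait for data, or return b'' at a real EOF, so the None result on its own does not prove the writer closed the pipe.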
Root cause: after writing the preamble the parent closes the write end of the
pipe immediately; with SSH compression (ssh -C) the tail of the deflate
stream can be lost, so the child sees premature EOF.
Fix
Read the preamble reliably:
None case, therefore no TypeError;
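A rough sketch of the "loop until the full preamble arrives or a timeout expires" idea, written for Python 3 only. It is not the code from this PR and not Mitogen's first stage (which has to run on much older interpreters); the helper name read_exactly and the demo payload are hypothetical, and the 10-second default simply mirrors the timeout mentioned in the description.

```python
import os
import select
import threading
import time
import zlib


def read_exactly(fd, size, timeout=10.0):
    """Read exactly `size` bytes from `fd`, or raise on EOF or timeout.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    chunks = []
    remaining = size
    while remaining > 0:
        wait = deadline - time.monotonic()
        if wait <= 0:
            raise TimeoutError('only %d of %d bytes arrived'
                               % (size - remaining, size))
        # Sleep until the descriptor is readable or the deadline passes.
        readable, _, _ = select.select([fd], [], [], wait)
        if not readable:
            continue  # the deadline check above ends the loop
        try:
            chunk = os.read(fd, remaining)
        except BlockingIOError:
            continue  # spurious wakeup on a non-blocking descriptor
        if not chunk:
            # An empty read is a genuine EOF: the writer closed the pipe.
            raise EOFError('EOF after %d of %d bytes'
                           % (size - remaining, size))
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)


# Demo: the writer delivers the compressed payload in two parts.
read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)
payload = zlib.compress(b'preamble ' * 100)
os.write(write_fd, payload[:5])
late = threading.Timer(0.05, os.write, args=(write_fd, payload[5:]))
late.start()
received = read_exactly(read_fd, len(payload), timeout=5.0)
late.join()
os.close(write_fd)
os.close(read_fd)
print(zlib.decompress(received) == b'preamble ' * 100)
```

The sketch keeps "no data yet" (wait and retry) separate from "writer closed the pipe" (fail immediately), so a real early EOF is still reported rather than hidden behind the timeout.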