
Conversation

Member

@QiWang19 QiWang19 commented Nov 24, 2025

WIP PR: cancelable reader and timeout goroutine. We'd like to package a binary with these changes for the customer to test.

@github-actions github-actions bot added the image Related to "image" package label Nov 24, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 24, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6533

Contributor

@mtrmac mtrmac left a comment


(For reference, compare the earlier #421.)

Testing and research are always good to do.

Do note that we know a bit more than at the time when #421 was filed: there seems to be something wrong with tracking the state of “the write end of the --command-fd pipe”. That pipe is not directly driven by caller-provided inputs to GPGME; so if the hypothesis is correct, and if the state tracking causes incorrectly reused / closed file descriptors, that would still affect, and probably break, CRI-O (compare “the write ends of all of these pipes are owned by conmon” in Jira).

// Import the keys with a 60s timeout to avoid hanging indefinitely. See issues.redhat.com/browse/OCPBUGS-57893
mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(data)
if err != nil {
	return sarRejected, nil, err
}
Contributor


I really quite strongly don’t want this complexity in core policy enforcement code.

Also, I think the core complexity of the cleanup attempt here comes from separating the new…Mechanism initialization and the defer mech.Close() into different goroutines. I think running all of PolicyContext.IsRunningImageAllowed within a goroutine with a timeout would automatically handle cleanup (… or it would not, and would leak a C thread plus a goroutine). In what respect is this approach better?


This is really orthogonal to the other part of the PR: if the approach of terminating GPGME via a failing callback works, moving the code into a concurrent goroutine does not need to exist (and it continues to have the issues of #421).

Member Author

@QiWang19 QiWang19 Nov 25, 2025


I agree about the complexity, and the hang is still in the C library. For debugging, I'll drop the select waiting on the timeout from the core policy code so we can debug the cancelable reader.

Thanks for the note.

@mtrmac mtrmac marked this pull request as draft November 25, 2025 03:33
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 25, 2025
// FIXME: move this to per-context initialization
mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(data)
// Pass the context so the cancelable reader can detect cancellation and potentially abort GPGME operations
mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(ctx, data)
Member Author


Suggested change
-	mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(ctx, data)
+	importCtx, cancel := context.WithTimeout(ctx, importKeyTimeOut)
+	defer cancel()
+	mech, trustedIdentities, err := newEphemeralGPGSigningMechanism(importCtx, data)

@mtrmac Updated the code. Do we still need to pass a timeout context down for the cancellable reader to work? If we don't hardcode a timeout here, will the caller cancel the context?

Contributor


I’m afraid I can’t parse the question. Are you asking whether the CRI-O server already has a timeout / request cancellation logic of some kind (maybe triggering when/if the Kubelet client gives up)? That would be a CRI-O question.

And if CRI-O did pass to c/image/copy.Image a context that would in some situations be canceled, yes, I think such a context should be forwarded all the way to this code.


I would, by default, expect the timeout, if any, to be set at a pretty high level; I was thinking one for the whole RPC operation served by CRI-O … except that that doesn’t work well for pulls, which, on slow links, may be making steady progress over many minutes, so there is no good timeout that can be set for the whole pull by default.

So, I guess, a timeout in the caller of IsRunningImageAllowed in c/image would be at least better than here in prSignedBy; IsRunningImageAllowed is not really constant-time, but it is fairly bounded, and it would mean less new code in c/image/signature.

[There’s also CRI-O’s PullProgressTimeout which, last time I looked at it, made unwarranted assumptions about how c/image reports progress, and thus was ~unusable. But if it, hypothetically, did not trigger on false positives where c/image does not report progress but may spend a lot of time, it would trigger cancellation of the context and that should be visible in the GPGME mechanism.]

podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Nov 25, 2025
Comment on lines +54 to +62
// Check if context is cancelled before each read
if err := r.ctx.Err(); err != nil {
return 0, err
}
n, err := r.reader.Read(p)
// Check again after read in case cancellation happened during the read
if err == nil && r.ctx.Err() != nil {
return n, r.ctx.Err()
}
Member


Does this even work/address the problem in any case? The issue is that the Read is blocking, so checking the context before/after will not solve anything if the bug is that read() blocks forever.

I mean, sure, it does work if the problem is that it keeps reading indefinitely over and over in some loop and we want to time that out, but per #421 I got the impression that is not the case.

If we use a real pipe as the reader, Go exposes SetDeadline(), so that could be used to limit the actual time the read would block.

Contributor


GPGME is a C library running gnupg as a subprocess, and communicating with it using a ~poll loop — and something goes wrong and hangs there. (My limited investigation suggests that this centers on a control FD which is not visible in the GPGME API at all, but that might be wrong or a consequence of some earlier failure.)

Replacing one of the data sources with something that can indicate failure is hypothetically a way to interrupt that poll loop — if the code has a reason to execute one more iteration of the poll. There’s not much reason to think that the code will actually have a chance to poll again, but, also, that’s the ~only cancellation mechanism we have. That would absolutely not be a fix but it might be a pragmatic workaround if it happened to be proven to affect things.

The issue is that the Read is blocking, so checking the context before/after will not solve anything if the bug is that read() blocks forever.

In this PR, the cancellation is on a source that is passed as a byte array, so it should never block. OTOH that also makes it fairly unlikely that the cancellation opportunity would manifest.

Member


Ack, then I don't think the context check on Read can help to work around it.

