
Conversation

cgwalters
Collaborator

The original idea here was that the proxy centralizes verification of things like digests. However, in practice this makes reading seriously awkward; ref
bootc-dev/containers-image-proxy-rs#79

Also, I have a project to implement a registry frontend to containers-storage:, and a core problem with GetBlob right now is that it requires the blob size up front even though the underlying Go logic doesn't.

Moving to a "raw" interface solves that.
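For orientation, a rough sketch of the client-side shape this points at (together with the final commit message below, which describes returning a data fd plus an error fd). The trait and method names here are hypothetical, not the actual containers-image-proxy-rs API:

use tokio::io::AsyncRead;

/// Hypothetical client-side shapes, purely to illustrate the difference this
/// PR is after; these are not the real containers-image-proxy-rs signatures.
#[allow(async_fn_in_trait)]
pub trait BlobClient {
    /// Today's model: the blob size must be known up front and the proxy
    /// verifies the digest; errors surface via a separate FinishPipe call.
    async fn get_blob(
        &self,
        digest: &str,
        size: u64,
    ) -> std::io::Result<Box<dyn AsyncRead + Unpin + Send>>;

    /// "Raw" variant: no size required and no proxy-side digest verification;
    /// the caller gets the data stream plus an error stream and does its own
    /// checking.
    async fn get_raw_blob(
        &self,
        digest: &str,
    ) -> std::io::Result<(
        Box<dyn AsyncRead + Unpin + Send>,
        Box<dyn AsyncRead + Unpin + Send>,
    )>;
}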

@cgwalters
Collaborator Author

Draft since this works well for me, but needs tests

Contributor

@mtrmac mtrmac left a comment


Thanks!

  • The part where the size is not required LGTM — I don’t remember how it happened that the size is ~never used in c/image (note to self, apart from blobCacheSource). We should note that in the c/image documentation to guide possible future API redesigns.
  • I don’t understand how moving the digest verification significantly helps.
    • If nothing else, after removing the error reporting mechanism a consumer won’t be able to differentiate between a (retryable) network error and a digest validation error. Admittedly, currently the proxy error handling mechanism doesn’t allow cleanly differentiating that anyway, but this removes the option of improving that.
    • I don’t understand how this helps at the Rust client side. Assuming the digest verification moves to the Rust side (*shrug*), there is something like AsyncRead which returns async StreamOfBytesOrError providing a digest-validated byte stream. It seems to me that the Rust client side can trivially call FinishPipe at the place where it would do the digest verification, what’s the difficulty?! Some kind of Rust-related reference counting making it hard to hold on to a proxy handle at that time? (A sketch of this pattern follows the list.)
    • And if FinishPipe remained, I don’t see much of a reason to prefer the digest verification happening at one side or the other, except perhaps that sha512 digests are coming, and that will be transparently handled with the Go code as is.
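To make that suggestion concrete, here is a minimal sketch of the pattern being described: hash while reading, then perform the FinishPipe call and the digest comparison at the same point. The names (read_and_verify, finish_pipe) and the sha2/hex/anyhow crates are assumptions rather than the actual client API, and this assumes the data stream delivers EOF on its own, which (as the discussion below notes) is not how today's GetBlob behaves:

use sha2::{Digest, Sha256};
use tokio::io::{AsyncRead, AsyncReadExt};

/// Hypothetical sketch only: `finish_pipe` stands in for whatever the proxy
/// client exposes, and the digest handling assumes sha256 plus the hex crate.
async fn read_and_verify(
    mut data: impl AsyncRead + Unpin,
    expected: &str, // e.g. "sha256:..."
    finish_pipe: impl std::future::Future<Output = anyhow::Result<()>>,
) -> anyhow::Result<Vec<u8>> {
    let mut hasher = Sha256::new();
    let mut buf = Vec::new();
    let mut chunk = [0u8; 64 * 1024];
    loop {
        let n = data.read(&mut chunk).await?;
        if n == 0 {
            break; // assumes the stream ends on its own
        }
        hasher.update(&chunk[..n]);
        buf.extend_from_slice(&chunk[..n]);
    }
    // "The place where it would do the digest verification": finish the
    // server-side bookkeeping, then compare digests locally.
    finish_pipe.await?;
    let got = format!("sha256:{}", hex::encode(hasher.finalize()));
    anyhow::ensure!(got == expected, "digest mismatch: {got} != {expected}");
    Ok(buf)
}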

@cgwalters
Collaborator Author

Forgot to mention in this PR, but this is a dependency of https://github.com/cgwalters/cstor-dist, which you may also be interested in.

@cgwalters
Collaborator Author

It seems to me that the Rust client side can trivially call FinishPipe at the place where it would do the digest verification,

Sorry, the issue with the more substantive discussion is bootc-dev/containers-image-proxy-rs#80 which TL;DR is that it's a problem that FinishPipe blocks the control channel.

Which we could definitely also fix without GetBlobRaw indeed.

If nothing else, after removing the error reporting mechanism a consumer won’t be able to differentiate between a (retryable) network error and a digest validation error.

Yes it'd be really nice to handle that indeed. I guess a possible design here would be GetBlobFramed which returns

#[derive(Serialize)]
enum Message<'a> {
  Data(&'a [u8]),
  Error(...),
}

or so? (Framed in something like bincode, since JSON is pretty suboptimal for binary data.)
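For concreteness, one possible hand-rolled layout (purely illustrative; a real implementation would more likely let bincode or similar define the encoding, and the enum shape here is an assumption): a one-byte tag plus a little-endian length prefix, so data chunks and a trailing error can share a single pipe.

/// Illustrative frame layout only: 1-byte tag, 4-byte little-endian length,
/// then the payload. A bincode-defined encoding would look different.
enum Frame<'a> {
    Data(&'a [u8]),
    Error(&'a str),
}

fn encode_frame(frame: &Frame<'_>) -> Vec<u8> {
    let (tag, payload): (u8, &[u8]) = match frame {
        Frame::Data(d) => (0, *d),
        Frame::Error(e) => (1, e.as_bytes()),
    };
    let mut out = Vec::with_capacity(5 + payload.len());
    out.push(tag);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    out
}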

I guess this whole thing quickly goes back to less-ad-hoc IPC frameworks.

@mtrmac
Contributor

mtrmac commented May 13, 2025

containers/containers-image-proxy-rs#80 which TL;DR is that it's a problem that FinishPipe blocks the control channel.

Do I understand it correctly that there is thread A consuming the blob, and thread B calling FinishPipe before thread A is done reading? Yes, that would stall for a potentially very long time, but, also, if that is the design, why doesn’t A call FinishPipe itself? Having B receive the digest verification / network error and then somehow communicate it back to A looks needlessly complicated to me.

After A finishes consuming data, FinishPipe should be a metadata operation. Sure, there is some context switching and it’s probably a few $something thousand instructions, but it should be trivial compared to any network or disk cost.

(Even with a design where the proxy is all “owned” by a single proxy client thread, making it undesirable for thread A to make requests directly, having A communicate to the proxy client and block on a reply might not be that much worse? It does serialize everything at the proxy client, but that would be ~equivalent in effect to the serialization that happens in the proxy server anyway.)


There is a potentially real issue here in that FinishPipe serializes against some other thread C’s GetBlob initiating a download, i.e. network round-trips required to set up TLS and do a GET request, before the body starts arriving. That would be some delay but, on the whole, I would also expect it to not be significant — this is not interactive UI with a 50 ms deadline, and the critical path will probably include this thread C with its slow GET as well as consuming all of that blob afterwards: before C gets to FinishPipe, A will have ample opportunity to perform its own FinishPipe, so A is not the slowest part.


Yes it'd be really nice to handle that indeed. I guess a possible design here would be GetBlobFramed

Yes.

I guess this whole thing quickly goes back to less-ad-hoc IPC frameworks.

And that too.

@cgwalters
Collaborator Author

It seems to me that the Rust client side can trivially call FinishPipe at the place where it would do the digest verification, what’s the difficulty?! Some kind of Rust-related reference counting making it hard to hold on to a proxy handle at that time?

Yeah I think you're right. We could probably just automatically handle this in the rust client code. So it's not a strong argument for GetBlobRaw in and of itself.

And if FinishPipe remained, I don’t see much of a reason to prefer the digest verification happening at one side or the other,

Note that for the cstor-dist use case we're going:
containers-storage: -> cstor-dist (registry) -> containers-storage: | ostree | composefs-rs

And it'd be silly to have cstor-dist also do checksum verification because we know the client is going to do it. So probably regardless IMO some sort of GetBlobRaw makes sense right?

except perhaps that sha512 digests are coming, and that will be transparently handled with the Go code as is.

I think any nontrivial consumer of the proxy is going to be parsing the manifest digests and doing things with them. The ostree and composefs-rs projects definitely are and so there's no magic adaptation.

@cgwalters
Collaborator Author

Anyways though, the other option is to just cut the size requirement for GetBlob, that would be a pretty straightforward patch too. I may look at that.

@mtrmac
Contributor

mtrmac commented May 13, 2025

And it'd be silly to have cstor-dist also do checksum verification because we know the client is going to do it. So probably regardless IMO some sort of GetBlobRaw makes sense right?

Yes, that’s a good enough reason.


Anyways though, the other option is to just cut the size requirement for GetBlob, that would be a pretty straightforward patch too. I may look at that.

Yes, allowing -1 through should work fine.

@mtrmac
Contributor

mtrmac commented May 13, 2025

And it'd be silly to have cstor-dist also do checksum verification because we know the client is going to do it. So probably regardless IMO some sort of GetBlobRaw makes sense right?

Yes, that’s a good enough reason.

… to allow the client to opt out of digest verification, somehow. I’m not too convinced about removing the error handling entirely.

@cgwalters
Collaborator Author

cgwalters commented May 13, 2025

I’m not too convinced about removing the error handling entirely.

But in the raw case the error handling is literally just whether or not the proxy successfully wrote to the pipe, right? Which, most often, is actually an EPIPE because the client closed the pipe after hitting some other error, but that error then boomerangs back to the client! We have some hacks for this here: https://github.com/bootc-dev/bootc/blob/c1d67aa07c3408694b46b74deb7be10587b70fe7/ostree-ext/src/container/unencapsulate.rs#L151

The only way to fix that sanely, I think, is to have a new API creatively called GetBlob2. So, putting several things together:

  • Add GetBlob2 DIGEST [verify: bool]

It returns...umm, I am vacillating on what the format should be. I tried hard so far to only make this thing about JSON and pipe() but it's tricky to keep doing it. I guess one option is that it returns two pipe read fds, one which returns JSON only in the case of error, and the other which is the raw data stream? The other option is BSON or again going into a bigger RPC framework like Cap'n Proto or whatever but for now I think my preference is continuing to avoid that.

So what gets sent over the error fd would be at most one instance of:

struct Error {
  code: String,
  message: String
}

But we'd need to bikeshed how code appears, I am actually unfamiliar with the best practices here. Error mapping cross-language IPC is in my experience a hard problem. One possibility is to squash everything into one of say four cases:

  • network
  • EPIPE
  • checksum
  • other

(And yes we should probably emit an error if we get EPIPE, but the client can know to safely ignore it)
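A rough sketch of how the Rust side might model that payload, assuming serde/serde_json and the four-way split above; the field and variant spellings are guesses, not a settled wire format.

use serde::Deserialize;

/// Assumed shape of the JSON sent over the error fd; this just mirrors the
/// four-way split sketched above and is not a settled format.
#[derive(Debug, Deserialize)]
#[serde(rename_all = "lowercase")]
enum ErrorCode {
    Network,
    Epipe,
    Checksum,
    Other,
}

#[derive(Debug, Deserialize)]
struct ProxyError {
    code: ErrorCode,
    message: String,
}

fn parse_proxy_error(buf: &[u8]) -> serde_json::Result<ProxyError> {
    serde_json::from_slice(buf)
}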

@mtrmac
Contributor

mtrmac commented May 14, 2025

Just a note on this part for now:

I’m not too convinced about removing the error handling entirely.

But in the raw case the error handling is literally just whether or not the proxy successfully wrote to the pipe, right?

On an error reading from the registry, it’s the full information the Go server side has: “connection reset / unexpected EOF / Transfer-Encoding: Chunked framing invalid / timed out / completely unexpected EPERM / whatever”. Maybe the ~expected network errors should be handled / retried automatically. I think it would be very valuable for the 100% unexpected errors to be propagated to the Rust client and ultimately to a human. For that, we need, at the very least, to transfer an error string.

EPIPE is an interesting case that needs special handling, sure, but I don’t think it’s the only one.


It returns...umm, I am vacillating on what the format should be. I tried hard so far to only make this thing about JSON and pipe() but it's tricky to keep doing it. I guess one option is that it returns two pipe read fds, one which returns JSON only in the case of error, and the other which is the raw data stream?

Note to self: Return here. We’ve had GetBlobAt returning a separate data stream and an error stream, and it was very inconvenient to consume — but that might not be applicable here.


struct Error {
  code: String,
  message: String
}

But we'd need to bikeshed how code appears, I am actually unfamiliar with the best practices here.

Error mapping cross-language IPC is in my experience a hard problem.

It’s hard, in the corner cases, even within a single-language process :)

There’s ample space to over-engineer. I think the above is a good start, and we can always add code-specific fields (turning this into an ~tagged union).

The extreme is to adopt something like https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties , where the error can have an open-ended list of non-exclusive bools (“network-related”? “retryable”? “timeout”? All 3 could be true simultaneously). I think that’s overkill.


I think it would be convenient to use the same error mechanism for the current reply. Not blocking, but it might simplify consumers.

@mtrmac
Contributor

mtrmac commented May 14, 2025

It returns...umm, I am vacillating on what the format should be. I tried hard so far to only make this thing about JSON and pipe() but it's tricky to keep doing it. I guess one option is that it returns two pipe read fds, one which returns JSON only in the case of error, and the other which is the raw data stream?

Note to self: Return here. We’ve had GetBlobAt returning a separate data stream and an error stream, and it was very inconvenient to consume — but that might not be applicable here.

Ref. containers/storage#2162 : If there is a dataFD and errorFD, and the producer writes an error to errorFD and closes both, the consumer must, after seeing EOF on dataFD, still read errorFD, because the kernel does not guarantee a poll or similar will see the error incoming on errorFD before the EOF on dataFD (and user-space poll / event loop abstractions probably don’t guarantee ordering either).

A single pipe is easier to consume and reason about.

@mtrmac
Contributor

mtrmac commented May 14, 2025

One possible way to think about this is that consuming two pipes requires the caller to always read the error pipe until EOF; that’s probably not any easier than, in the current API, calling FinishPipe after reaching EOF on the data stream.

@cgwalters
Collaborator Author

Ref. containers/storage#2162 : If there is a dataFD and errorFD, and the producer writes an error to errorFD and closes both, the consumer must, after seeing EOF on dataFD, still read errorFD, because the kernel does not guarantee a poll or similar will see the error incoming on errorFD before the EOF on dataFD (and user-space poll / event loop abstractions probably don’t guarantee ordering either).

Right, but that's easy? At least in Rust I'd map this as two futures which both need to be joined, not a select as happened in that c/storage code originally. This is a little tangential but probably interesting nonetheless: tokio::select! in Rust has a giant trap and always needs careful auditing, and people talk about "cancel safety" because of this. (TL;DR: Go uses stackful coroutines; Rust's futures are stackless and precisely sized, but the lack of a stack means you can get burned by cancellation.)
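As an example, a minimal sketch of that join-both-futures shape, assuming tokio and anyhow and treating the two fds as generic async readers (the surrounding proxy calls are not shown):

use tokio::io::{AsyncRead, AsyncReadExt};

/// Drain both the data stream and the error stream, joining the two futures
/// rather than select!-ing between them, so a late-arriving error on errorFD
/// is never lost after EOF on dataFD. Illustrative only.
async fn consume_blob(
    mut data: impl AsyncRead + Unpin,
    mut err: impl AsyncRead + Unpin,
) -> anyhow::Result<Vec<u8>> {
    let data_fut = async {
        let mut buf = Vec::new();
        data.read_to_end(&mut buf).await?;
        Ok::<_, std::io::Error>(buf)
    };
    let err_fut = async {
        let mut buf = Vec::new();
        err.read_to_end(&mut buf).await?;
        Ok::<_, std::io::Error>(buf)
    };
    // Both futures are polled to completion; neither is cancelled.
    let (data_res, err_res) = tokio::join!(data_fut, err_fut);
    let errbuf = err_res?;
    if !errbuf.is_empty() {
        // The error channel carried a serialized error; surface it.
        anyhow::bail!("remote error: {}", String::from_utf8_lossy(&errbuf));
    }
    Ok(data_res?)
}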

One possible way to think about this is that consuming two pipes requires the caller to always read the error pipe until EOF; that’s probably not any easier than, in the current API, calling FinishPipe after reaching EOF on the data stream.

Mmmmm...except the problem again with today's API is that the client doesn't get EOF until it calls FinishPipe. The rationale for that is it forces callers to invoke FinishPipe (to check for errors), and the workaround we did for this is basically "read only up until the known blob size" which is viable but definitely not ergonomic or obvious at all for callers.
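Roughly, that workaround looks like the following (illustrative only: known_size would come from the manifest descriptor, and the follow-up FinishPipe call to surface errors is not shown):

use tokio::io::{AsyncRead, AsyncReadExt};

/// Limit the reader to the blob size known from the manifest descriptor so we
/// reach "EOF" without waiting on FinishPipe; the caller still has to invoke
/// FinishPipe afterwards to surface any error.
async fn read_known_size(
    data: impl AsyncRead + Unpin,
    known_size: u64,
) -> std::io::Result<Vec<u8>> {
    let mut limited = data.take(known_size);
    let mut buf = Vec::with_capacity(known_size as usize);
    limited.read_to_end(&mut buf).await?;
    Ok(buf)
}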

@mtrmac
Contributor

mtrmac commented May 14, 2025

Mmmmm...except the problem again with today's API is that the client doesn't get EOF until it calls FinishPipe.

Oops, I completely missed that when discussing this PR. That makes many of the things I suggested above (thinking they were simple) rather less practical.

the workaround we did for this is basically "read only up until the known blob size" which is viable

[GetBlob is allowed to return -1 as blob size, and that happens sometimes.]

… does error reporting work at all if there is a network error in the middle of a blob? The goroutine in GetBlob will see an error, and terminate, but the write end of the pipe will be kept open. The client will be stuck in read on a pipe, never see an EOF, and it will never have a reason to call FinishPipe.

Am I missing something again? If it does fail this way, then yes, we definitely need a new GetBlob (and everything else that uses pipes??).

@cgwalters
Collaborator Author

[GetBlob is allowed to return -1 as blob size, and that happens sometimes.]

That doesn't matter though because we always know the expected size from the manifest descriptor and that's what clients should use, right?

… does error reporting work at all if there is a network error in the middle of a blob? The goroutine in GetBlob will see an error, and terminate, but the write end of the pipe will be kept open. The client will be stuck in read on a pipe, never see an EOF, and it will never have a reason to call FinishPipe.

Well, the way this works right now is that the calling process in Rust creates two futures, one doing the read/processing and one calling FinishPipe. So in the case of a network error we'll get an error from the latter. But that brings back the concurrency problem that Allison was talking about, where any FinishPipe blocks the metadata stream, and so... yeah, in practice it's a huge pain to use this for concurrent fetches, even though it was designed to support them.

It'd be quite ugly, but I suspect a viable hack for the status quo would be to only queue the FinishPipe if we stop getting data from the pipe for some timeout before reaching the expected size.
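A sketch of what that hack could look like (hypothetical helper; the stall duration and the actual FinishPipe call are left to the caller):

use std::time::Duration;
use tokio::io::{AsyncRead, AsyncReadExt};
use tokio::time::timeout;

/// Read the blob in chunks; if no data arrives within `stall` before we reach
/// the expected size, give up on the data stream and return None so the
/// caller can issue FinishPipe to collect the real error. Illustrative only.
async fn read_with_stall_detection(
    mut data: impl AsyncRead + Unpin,
    expected_size: u64,
    stall: Duration,
) -> std::io::Result<Option<Vec<u8>>> {
    let mut buf = Vec::with_capacity(expected_size as usize);
    let mut chunk = [0u8; 64 * 1024];
    while (buf.len() as u64) < expected_size {
        match timeout(stall, data.read(&mut chunk)).await {
            Err(_) => return Ok(None), // stalled: caller should call FinishPipe now
            Ok(Ok(0)) => break,        // unexpected EOF before expected_size
            Ok(Ok(n)) => buf.extend_from_slice(&chunk[..n]),
            Ok(Err(e)) => return Err(e),
        }
    }
    Ok(Some(buf))
}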

@mtrmac
Contributor

mtrmac commented May 14, 2025

we always know the expected size from the manifest descriptor

schema1 manifests don’t have sizes. (OTOH at least Docker has been working for years to disable schema1 manifests, e.g. distribution/distribution only accepts them with an opt-in.)

@cgwalters
Collaborator Author

schema1 manifests don’t have sizes.

We hard error on that case today already.

I honestly would like to add an Open method that hard-requires OCI; for the use cases I care about in bootc, there's simply no reason to continue supporting v2s2.

@cgwalters
Collaborator Author

OK, I reworked this per the discussion and also updated the example consumer in bootc-dev/containers-image-proxy-rs#83. So far I've just verified that the example client there works; this could still use tests etc.


Tests failed. @containers/packit-build please check.

Contributor

@mtrmac mtrmac left a comment


Thanks!

ACK overall. (Note also the lint failures.)

@cgwalters changed the title from "proxy: Add GetBlobRaw" to "proxy: Add GetRawBlob" on May 16, 2025
@cgwalters marked this pull request as ready for review on May 16, 2025 at 12:16
@cgwalters force-pushed the get-raw-blob branch 2 times, most recently from b9267e8 to 4e797b4 on May 16, 2025 at 12:19

Ephemeral COPR build failed. @containers/packit-build please check.

@cgwalters force-pushed the get-raw-blob branch 2 times, most recently from 0916ddf to 62782cf on May 16, 2025 at 17:53
@cgwalters
Collaborator Author

(It'd sure be nice if GitHub had a "commit as fixup" button)

cgwalters added 2 commits May 16, 2025 14:14
The original idea here was that the proxy centralizes
verification of things like digests. However, in practice
this makes reading seriously awkward; ref
bootc-dev/containers-image-proxy-rs#79
(Basically `FinishPipe` blocks the metadata channel)

Also, I have a project to implement a registry frontend to
`containers-storage:` and a core problem with `GetBlob` right
now is it *requires* the blob size up front even though the
underlying Go logic doesn't.

Moving to a "raw" interface solves that too. In this new
raw API, we return two file descriptors, one for the data
and one for the error channel, which contains a JSON
serialization of an error.

For the error type we reuse the existing "is error retryable"
check and expose that back to the client.

We also (backwards compatibly) add this new error code
for the existing APIs.

Signed-off-by: Colin Walters <[email protected]>
Pre-existing problem noticed in review.

Signed-off-by: Colin Walters <[email protected]>
Contributor

@mtrmac mtrmac left a comment


Thanks!

@mtrmac merged commit 39445ea into containers:main on May 16, 2025
29 checks passed