-
Notifications
You must be signed in to change notification settings - Fork 2.9k
[WIP] Fix race condition in concurrent artifact additions #27574
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rhatdan The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@rhatdan you know you cannot do this in podman, it has to be done in common |
|
[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore. |
|
This is going to badly hurt usability - artifact addition on large artifacts can be slow, and now it's going to block all other artifact operations. The locking @Luap99 is doing for pulls could potentially be adapted to improve matters, but the amount of effort he had to put in to get that working was substantial, and this is a different enough codepath that I don't think much would carry over from that work. |
|
Not that we shouldn't merge this, once it's migrated to container-libs... Just that it's going to suck once we do. Not sure how I feel about the test, honestly. Is the race 100% consistent on main? I don't like deterministic tests, if it doesn't fail with 100% consistency, we can break without realizing it, and it just shows up as a flake after the fact. |
Yeah this is going to suck. IMO the design to use the oci layout as backing store was not to great as we funnel all updates thought the index.json writes where in many cases a more parallel approach could have been possible I suppose. The content addressed sha based file store does make sense for me overall but having the metdata in this one file is causing to much contention and since we just slapped locking on top without fine grade locking around index.json we are in this situation now. Like in many other parts of our codebase when doing state updates once must take the lock, then read the state from disk, then update state and write it back to disk and only then unlock. So what we did wrong here is that the state was no loaded back from disk as such we wrote an older in memory kept copy back to disk, so the actual fix should be to re-read the store which is more or less what my unlocked layer add work in storages comes down to. we do have a some parallel testing for a few things, i.e. this one podman/test/system/070-build.bats Line 247 in 3922526
I don't mind such tests as long as they actually somewhat reliable reproduce the current issue and it help to safe guard again other bugs in the future. If this actually do flakes we know the fix didn't work and can always disable the test until we can fix the code. |
Cursor, did it, I will take a look and move it to container-libs |
This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages. The root cause was in the artifact store's Add() method, which would: 1. Acquire lock 2. Read OCI layout index 3. Create ImageDestination (which snapshots the index) 4. RELEASE lock (optimization for blob copying) 5. Copy blobs (while unlocked) 6. Reacquire lock 7. Commit changes (write new index) When two concurrent additions happened: - Process A: Lock → Read index → Create dest A → Unlock → Copy blobs - Process B: Lock → Read index (no artifact A!) → Create dest B → Unlock - Process A: Lock → Commit (write index with A) - Process B: Lock → Commit (write index with B, OVERWRITING A) The fix keeps the lock held for the entire operation. While this reduces concurrency for blob copying, it prevents the index file corruption that caused artifacts to be lost. Changes: - Remove lock release/reacquire around blob copying in store.Add() - Simplify lock management (no more conditional unlock) - Add e2e test for concurrent artifact additions - Add standalone test script to verify the fix Fixes: containers#27569 Generated-with: Cursor AI Signed-off-by: Daniel J Walsh <[email protected]>
This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages.
The root cause was in the artifact store's Add() method, which would:
When two concurrent additions happened:
The fix keeps the lock held for the entire operation. While this reduces concurrency for blob copying, it prevents the index file corruption that caused artifacts to be lost.
Changes:
Fixes: #27569
Generated-with: Cursor AI
Checklist
Ensure you have completed the following checklist for your pull request to be reviewed:
commits. (
git commit -s). (If needed, usegit commit -s --amend). The author email must matchthe sign-off email address. See CONTRIBUTING.md
for more information.
Fixes: #00000in commit message (if applicable)make validatepr(format/lint checks)Noneif no user-facing changes)Does this PR introduce a user-facing change?