[WIP] Fix race condition in concurrent artifact additions #27574

rhatdan · 2025-11-20T18:38:08Z

This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages.

The root cause was in the artifact store's Add() method, which would:

1. Acquire lock
2. Read OCI layout index
3. Create ImageDestination (which snapshots the index)
4. RELEASE lock (optimization for blob copying)
5. Copy blobs (while unlocked)
6. Reacquire lock
7. Commit changes (write new index)

When two concurrent additions happened:

- Process A: Lock → Read index → Create dest A → Unlock → Copy blobs
- Process B: Lock → Read index (no artifact A!) → Create dest B → Unlock
- Process A: Lock → Commit (write index with A)
- Process B: Lock → Commit (write index with B, OVERWRITING A)

The fix keeps the lock held for the entire operation. While this reduces concurrency for blob copying, it prevents the index file corruption that caused artifacts to be lost.

Changes:

- Remove lock release/reacquire around blob copying in store.Add()
- Simplify lock management (no more conditional unlock)
- Add e2e test for concurrent artifact additions
- Add standalone test script to verify the fix

Fixes: #27569

Generated-with: Cursor AI

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

- Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all commits (git commit -s). If needed, use git commit -s --amend. The author email must match the sign-off email address. See CONTRIBUTING.md for more information.
- Referenced issues using Fixes: #00000 in commit message (if applicable)
- Tests have been added/updated (or no tests are needed)
- Documentation has been updated (or no documentation changes are needed)
- All commits pass make validatepr (format/lint checks)
- Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

podman artifact add can now run without race conditions on adds.

openshift-ci · 2025-11-20T18:38:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rhatdan
Once this PR has been reviewed and has the lgtm label, please assign mtrmac for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

baude · 2025-11-20T18:42:25Z

@rhatdan you know you cannot do this in podman, it has to be done in common

packit-as-a-service · 2025-11-20T19:18:15Z

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

mheon · 2025-11-20T23:03:38Z

This is going to badly hurt usability - artifact addition on large artifacts can be slow, and now it's going to block all other artifact operations. The locking @Luap99 is doing for pulls could potentially be adapted to improve matters, but the amount of effort he had to put in to get that working was substantial, and this is a different enough codepath that I don't think much would carry over from that work.

mheon · 2025-11-20T23:05:33Z

Not that we shouldn't merge this, once it's migrated to container-libs... Just that it's going to suck once we do.

Not sure how I feel about the test, honestly. Is the race 100% consistent on main? I don't like non-deterministic tests: if it doesn't fail with 100% consistency, we can break things without realizing it, and it just shows up as a flake after the fact.

Luap99 · 2025-11-21T10:56:03Z

> This is going to badly hurt usability - artifact addition on large artifacts can be slow, and now it's going to block all other artifact operations. The locking @Luap99 is doing for pulls could potentially be adapted to improve matters, but the amount of effort he had to put in to get that working was substantial, and this is a different enough codepath that I don't think much would carry over from that work.

Yeah, this is going to suck. IMO the design to use the OCI layout as the backing store was not too great, as we funnel all updates through the index.json writes, where in many cases a more parallel approach could have been possible, I suppose. The content-addressed, SHA-based file store does make sense to me overall, but having the metadata in this one file is causing too much contention, and since we just slapped locking on top without fine-grained locking around index.json, we are in this situation now.

Like in many other parts of our codebase, when doing state updates one must take the lock, then read the state from disk, then update the state and write it back to disk, and only then unlock.

So what we did wrong here is that the state was not loaded back from disk; as such, we wrote an older in-memory copy back to disk. The actual fix should be to re-read the store, which is more or less what my unlocked layer add work in storage comes down to.

We do have some parallel testing for a few things, e.g. this one:

podman/test/system/070-build.bats

Line 247 in 3922526

@test "podman parallel build should not race" {

I don't mind such tests as long as they somewhat reliably reproduce the current issue and help to safeguard against other bugs in the future. If this actually does flake, we know the fix didn't work and can always disable the test until we can fix the code.

rhatdan · 2025-11-21T14:48:08Z

> @rhatdan you know you cannot do this in podman, it has to be done in common

Cursor did it. I will take a look and move it to container-libs.

This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages.

The root cause was in the artifact store's Add() method, which would:

1. Acquire lock
2. Read OCI layout index
3. Create ImageDestination (which snapshots the index)
4. RELEASE lock (optimization for blob copying)
5. Copy blobs (while unlocked)
6. Reacquire lock
7. Commit changes (write new index)

When two concurrent additions happened:

- Process A: Lock → Read index → Create dest A → Unlock → Copy blobs
- Process B: Lock → Read index (no artifact A!) → Create dest B → Unlock
- Process A: Lock → Commit (write index with A)
- Process B: Lock → Commit (write index with B, OVERWRITING A)

The fix keeps the lock held for the entire operation. While this reduces concurrency for blob copying, it prevents the index file corruption that caused artifacts to be lost.

Changes:

- Remove lock release/reacquire around blob copying in store.Add()
- Simplify lock management (no more conditional unlock)
- Add e2e test for concurrent artifact additions
- Add standalone test script to verify the fix

Fixes: containers#27569

Generated-with: Cursor AI
Signed-off-by: Daniel J Walsh <[email protected]>

openshift-ci bot added the release-note label Nov 20, 2025

rhatdan force-pushed the artifact branch from 83c253e to 2247980 Compare November 21, 2025 15:06

rhatdan changed the title ~~Fix race condition in concurrent artifact additions~~ [WIP] Fix race condition in concurrent artifact additions Nov 21, 2025

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2025

Luap99 mentioned this pull request Nov 22, 2025

Fix race condition in concurrent artifact additions containers/container-libs#483

Open
