Skip to content

Conversation

@rhatdan
Copy link
Member

@rhatdan rhatdan commented Nov 20, 2025

This fixes a race condition where concurrent 'podman artifact add' commands for different artifacts would result in only one artifact being created, without any error messages.

The root cause was in the artifact store's Add() method, which would:

  1. Acquire lock
  2. Read OCI layout index
  3. Create ImageDestination (which snapshots the index)
  4. RELEASE lock (optimization for blob copying)
  5. Copy blobs (while unlocked)
  6. Reacquire lock
  7. Commit changes (write new index)

When two concurrent additions happened:

  • Process A: Lock → Read index → Create dest A → Unlock → Copy blobs
  • Process B: Lock → Read index (no artifact A!) → Create dest B → Unlock
  • Process A: Lock → Commit (write index with A)
  • Process B: Lock → Commit (write index with B, OVERWRITING A)

The fix keeps the lock held for the entire operation. While this reduces concurrency for blob copying, it prevents the index file corruption that caused artifacts to be lost.

Changes:

  • Remove lock release/reacquire around blob copying in store.Add()
  • Simplify lock management (no more conditional unlock)
  • Add e2e test for concurrent artifact additions
  • Add standalone test script to verify the fix

Fixes: #27569

Generated-with: Cursor AI

Checklist

Ensure you have completed the following checklist for your pull request to be reviewed:

  • Certify you wrote the patch or otherwise have the right to pass it on as an open-source patch by signing all
    commits. (git commit -s). (If needed, use git commit -s --amend). The author email must match
    the sign-off email address. See CONTRIBUTING.md
    for more information.
  • Referenced issues using Fixes: #00000 in commit message (if applicable)
  • Tests have been added/updated (or no tests are needed)
  • Documentation has been updated (or no documentation changes are needed)
  • All commits pass make validatepr (format/lint checks)
  • Release note entered in the section below (or None if no user-facing changes)

Does this PR introduce a user-facing change?

podman artifact add can now run without race conditions on adds.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rhatdan
Once this PR has been reviewed and has the lgtm label, please assign mtrmac for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@baude
Copy link
Member

baude commented Nov 20, 2025

@rhatdan you know you cannot do this in podman, it has to be done in common

@packit-as-a-service
Copy link

[NON-BLOCKING] Packit jobs failed. @containers/packit-build please check. Everyone else, feel free to ignore.

@mheon
Copy link
Member

mheon commented Nov 20, 2025

This is going to badly hurt usability - artifact addition on large artifacts can be slow, and now it's going to block all other artifact operations. The locking @Luap99 is doing for pulls could potentially be adapted to improve matters, but the amount of effort he had to put in to get that working was substantial, and this is a different enough codepath that I don't think much would carry over from that work.

@mheon
Copy link
Member

mheon commented Nov 20, 2025

Not that we shouldn't merge this, once it's migrated to container-libs... Just that it's going to suck once we do.

Not sure how I feel about the test, honestly. Is the race 100% consistent on main? I don't like deterministic tests, if it doesn't fail with 100% consistency, we can break without realizing it, and it just shows up as a flake after the fact.

@Luap99
Copy link
Member

Luap99 commented Nov 21, 2025

This is going to badly hurt usability - artifact addition on large artifacts can be slow, and now it's going to block all other artifact operations. The locking @Luap99 is doing for pulls could potentially be adapted to improve matters, but the amount of effort he had to put in to get that working was substantial, and this is a different enough codepath that I don't think much would carry over from that work.

Yeah this is going to suck. IMO the design to use the oci layout as backing store was not to great as we funnel all updates thought the index.json writes where in many cases a more parallel approach could have been possible I suppose. The content addressed sha based file store does make sense for me overall but having the metdata in this one file is causing to much contention and since we just slapped locking on top without fine grade locking around index.json we are in this situation now.

Like in many other parts of our codebase when doing state updates once must take the lock, then read the state from disk, then update state and write it back to disk and only then unlock.

So what we did wrong here is that the state was no loaded back from disk as such we wrote an older in memory kept copy back to disk, so the actual fix should be to re-read the store which is more or less what my unlocked layer add work in storages comes down to.


we do have a some parallel testing for a few things, i.e. this one

@test "podman parallel build should not race" {

I don't mind such tests as long as they actually somewhat reliable reproduce the current issue and it help to safe guard again other bugs in the future. If this actually do flakes we know the fix didn't work and can always disable the test until we can fix the code.

@rhatdan
Copy link
Member Author

rhatdan commented Nov 21, 2025

@rhatdan you know you cannot do this in podman, it has to be done in common

Cursor, did it, I will take a look and move it to container-libs

This fixes a race condition where concurrent 'podman artifact add'
commands for different artifacts would result in only one artifact
being created, without any error messages.

The root cause was in the artifact store's Add() method, which
would:
1. Acquire lock
2. Read OCI layout index
3. Create ImageDestination (which snapshots the index)
4. RELEASE lock (optimization for blob copying)
5. Copy blobs (while unlocked)
6. Reacquire lock
7. Commit changes (write new index)

When two concurrent additions happened:
- Process A: Lock → Read index → Create dest A → Unlock → Copy blobs
- Process B: Lock → Read index (no artifact A!) → Create dest B → Unlock
- Process A: Lock → Commit (write index with A)
- Process B: Lock → Commit (write index with B, OVERWRITING A)

The fix keeps the lock held for the entire operation. While this
reduces concurrency for blob copying, it prevents the index file
corruption that caused artifacts to be lost.

Changes:
- Remove lock release/reacquire around blob copying in store.Add()
- Simplify lock management (no more conditional unlock)
- Add e2e test for concurrent artifact additions
- Add standalone test script to verify the fix

Fixes: containers#27569

Generated-with: Cursor AI
Signed-off-by: Daniel J Walsh <[email protected]>
@rhatdan rhatdan changed the title Fix race condition in concurrent artifact additions [WIP] Fix race condition in concurrent artifact additions Nov 21, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Running two podman artifact adds at the same time, fails to create both artifacts with no errors.

4 participants