Skip to content

refactor: no regexp-based Reference::try_from#322

Open
DavSanchez wants to merge 1 commit intoyouki-dev:mainfrom
DavSanchez:refactor/no-regexp-reference-from-str
Open

refactor: no regexp-based Reference::try_from#322
DavSanchez wants to merge 1 commit intoyouki-dev:mainfrom
DavSanchez:refactor/no-regexp-reference-from-str

Conversation

@DavSanchez
Copy link
Copy Markdown
Contributor

@DavSanchez DavSanchez commented Mar 6, 2026

What type of PR is this?

/kind other

What this PR does / why we need it:

During execution of a program that has oci-spec as a dependency and performs Reference::from_str from different threads (we found this when calling that function from only one thread at a time of a multi-threaded program) we noticed, besides the initial hit when the regexp gets stored in the OnceLock, regular increases of around 6 MB of RSS that were not freed even after the thread was joined.

I have set up a reproducer at https://github.com/DavSanchez/oci-parse-threaded-repro that shows the behaviour. It only spawns threads one at a time, calls Reference::from_str on a string that switches back and forth between 2 alternatives (but check the README for reports of using more than 2 possible strings), sleeps, calls ps and prints to stdout.

It also contains some example logs of running the reproducer using this crate as-is and with the fork. Also included DHAT reports for both and a way to generate your own via a feature using the dhat crate. See the darwin-reports and linux-reports directories to find these reports.

This is an example of one of these reports comparing the RSS of the current implementation vs the fork, on an aarch64-linux VM.

cycle rss (KB) upstream rss (KB) fork
0 15092 512
1 15220 512
2 20980 512
3 21236 512
4 26868 512
5 27124 512
6 32884 512
7 33140 512
8 38900 512
9 38900 512

*same values after this for 80 threads, but increasing the number of OCI refs to parse increases the cap of the current implementation (e.g. to 75+ MB for 8 refs) while keeping flat on the forked version.

We thought that for our use case these allocated amounts were too high and tried with an approach that hopefully covers your expected usage but does not use the regexp, allocating a fraction of the amount. It only replaces the Reference::try_from implementation for &str and adds a couple helper functions. It passes all existing tests in the module which have not been altered.

As an additional datapoint, our program got its reported RSS cut by half when switching to use the forked repo, having performed a single Reference::from_str on each of the runs:

image

Also took the chance to address some of the FIXMEs in the tests.

Let us know your thoughts!

Which issue(s) this PR fixes:

None

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

@DavSanchez DavSanchez force-pushed the refactor/no-regexp-reference-from-str branch from 3446c6c to 94e1bd0 Compare March 7, 2026 00:39
Signed-off-by: David Sánchez <davidslt+git@pm.me>
@DavSanchez DavSanchez force-pushed the refactor/no-regexp-reference-from-str branch from 94e1bd0 to 67fd213 Compare March 7, 2026 00:39
const DEFAULT_TAG: &str = "latest";
/// REFERENCE_REGEXP is the full supported format of a reference. The regexp
/// is anchored and has capturing groups for name, tag, and digest components.
const REFERENCE_REGEXP: &str = r"^((?:(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9])(?:(?:\.(?:[a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9-]*[a-zA-Z0-9]))+)?(?::[0-9]+)?/)?[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?(?:(?:/[a-z0-9]+(?:(?:(?:[._]|__|[-]*)[a-z0-9]+)+)?)+)?)(?::([\w][\w.-]{0,127}))?(?:@([A-Za-z][A-Za-z0-9]*(?:[-_+.][A-Za-z][A-Za-z0-9]*)*[:][[:xdigit:]]{32,}))?$";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would clarify to have in the PR description the memory usage from the reproducer before and after your contribution!

Copy link
Copy Markdown
Contributor Author

@DavSanchez DavSanchez Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a table showing some quick numbers. More details can be checked on the reproducer repo!

tag,
digest,

// Extract the digest (`@<algo>:<hex>`).
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oci-spec-rs should really try to be closer to the upstream implementation at https://github.com/distribution/reference/blob/main/regexp.go.
Which this doesn't seem to be doing.
See my issue #329

Copy link
Copy Markdown
Contributor Author

@DavSanchez DavSanchez Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, in all fairness I didn't consider the Go implementation as canonical when tackling this. I considered only this crate.

I can try to cover your #329 use case with this approach if not using regexes is still an option considered valid, could check the spec in depth to try and ensure all possible values beyond the tests and my team's use cases are properly covered. If by following the Go implementation means relying on the same mechanism as it I guess the issue might have a chance to alleviate by switching to another regex implementation. It seems to stem from rust-lang/regex#1214 (comment) so a different one might just not show this behavior. In that case I can close this one, our program uses a version of this approach and just avoids calling Reference::from_str, so we are not that impacted for now, and are not making changes unless we discover parsing differences from upstream with use.

Let me know what you'd prefer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello ! My opinion isn't more valid than yours (and you actually opened a PR contrary to me !).
What I would do is check that your impl is compatible with the original Go one and put the reference commit somewhere so we can track drift caused by spec updates.

Maybe we can get a comment from a maintainer to approve this approach ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since each language has different characteristics, the implementations do not need to be identical. While we should refer to existing implementations and ensure the same behavior, I believe it is better to prioritize higher performance as long as there are sufficient test cases.
@saschagrunert I'd like to hear your opinion.

@utam0k utam0k requested review from saschagrunert and utam0k April 4, 2026 07:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants