Specify a loose formatted form? #494

matt-phylum · 2025-03-06T17:05:46Z

matt-phylum
Mar 6, 2025

Compared to the URL spec, PURL is both more strict and less specified. PURL has a concept of a canonical format where the scheme and type are lowercase, the namespace and name are normalized according to type-specific rules, the qualifier keys are lowercased, empty qualifiers are removed, qualifiers are sorted, . and .. segments are removed from subpaths, and exact sets of characters are percent encoded.

This canonical form is nice because it means software can use PURLs as unique identifiers without understanding how to parse them.
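To make two of the canonicalization steps listed above concrete, here is a minimal sketch of the qualifier and subpath rules only. The function names are hypothetical, type-specific name normalization and percent-encoding are left out, and dropping empty subpath segments is an assumption on my part rather than something stated above.

```python
def canonical_qualifiers(qualifiers: dict[str, str]) -> list[tuple[str, str]]:
    """Lowercase the keys, drop empty values, and sort by key."""
    lowered = {key.lower(): value for key, value in qualifiers.items()}
    return sorted((key, value) for key, value in lowered.items() if value != "")


def canonical_subpath(subpath: str) -> str:
    """Remove '.' and '..' segments (they are dropped, not resolved).

    Assumption: empty segments, e.g. from a trailing '/', are dropped too.
    """
    segments = [s for s in subpath.split("/") if s not in ("", ".", "..")]
    return "/".join(segments)


print(canonical_qualifiers({"Repository_URL": "x", "arch": "", "Checksum": "y"}))
print(canonical_subpath("./src/../lib/"))
```

Two formatters that both apply these rules produce byte-identical output for these components, which is what makes plain string comparison work.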

However, I would be surprised if any two PURL implementations behaved the same in all cases. Most implementations make at least one mistake or intentional deviation from the spec, usually around percent encoding. This means in practice if you want to use PURLs as unique identifiers you do need to parse them at some point and convert them back to strings using a single implementation.

This has been coming up recently with the work to make the spec use RFC 2119 language. The spec is being updated to say that formatters MUST produce the canonical output (this was always the case, just written with a lowercase "must"), but parsers are expected to accept formats that formatters are forbidden from producing. In effect, the spec describes two different PURLs: the PURLs that parsers are allowed to read and the PURLs that formatters are allowed to write.

It seems like it would make sense to instead document PURLs and canonical PURLs separately.

Benefits:

- This would make it easier to write a PURL library because it wouldn't matter if the most convenient URL encoding function available to you encodes more characters than necessary as long as you encode the minimum set of characters.
- Implementations wouldn't need to implement (potentially incorrect) name normalization rules for every package type in the PURL spec and keep on top of adding new types. Implementations would support the rules they support (or provide a way for the user to customize the behavior) and pass everything else through without normalization. (see Concerns with type-specific component value transformations #38)
- It would avoid confusion in the spec about whether a PURL is allowed to be a certain way during reading vs during writing.
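As a small illustration of the first benefit, the sketch below uses Python's `urllib.parse` to show that an encoder which escapes more than the minimum still round-trips to the same decoded value, even though the encoded strings differ. The sample name and the choice of `:` as the "unnecessarily encoded" character are only for illustration and are not taken from the spec.

```python
from urllib.parse import quote, unquote

name = "a:b c"  # illustrative value, not a real package name

minimal = quote(name, safe=":")  # leaves ':' alone -> "a:b%20c"
eager = quote(name, safe="")     # also escapes ':'  -> "a%3Ab%20c"

# The encoded strings differ, so comparing them as text fails...
print(minimal == eager)
# ...but a parser that decodes before comparing sees the same name.
print(unquote(minimal) == unquote(eager))
```

A loose form would accept both spellings on input; only the canonical form would have to pick one.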

But it makes things more complicated for people who are using PURLs as unique identifiers and need to compare PURLs for equality.

dwalluck · 2025-03-13T14:03:42Z

dwalluck
Mar 13, 2025

Almost all of the problems could be solved by simply following RFC 3986.

The encoding would be solved because RFC 3986 defines the exact minimum that has to be encoded in each part.

For decoding, it's difficult if not impossible to use an existing decoder because if you let it handle the decoding and encoding, then the whole path splitting/joining thing can't be done.

But, I am not even sure if any of that would be necessary if the spec would just follow the existing Path Segment Normalization rules prior to decoding the entire path.

Case-insensitivity of the type could even come for free if the type were treated like a host (since both scheme and host are supposed to be case-insensitive). That would require URIs of the form "pkg://type/", but the existing pkg:type already breaks the spec anyway and should at least be "pkg:/type" if it is not using an authority.
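For what it's worth, this is how one general-purpose URL parser, Python's `urllib.parse.urlsplit`, treats the two shapes. It is only a sketch of that one library's behavior, not a statement about what PURL parsers do, and the package name is an arbitrary example.

```python
from urllib.parse import urlsplit

with_authority = urlsplit("pkg://NPM/left-pad")
without_authority = urlsplit("pkg:NPM/left-pad")

# With "//", the type lands in the authority and .hostname lowercases it.
print(with_authority.hostname, with_authority.path)

# Without "//", there is no authority; the type stays in the path, case intact.
print(without_authority.hostname, without_authority.path)
```

So the case folding only comes "for free" in the `pkg://type/` shape; in the `pkg:type/` shape the PURL library has to lowercase the type itself.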

0 replies

matt-phylum · 2025-03-13T16:57:31Z

matt-phylum
Mar 13, 2025
Author

RFC 3986 isn't simple and does not define the formatting of a PURL. If you map the PURL components to their RFC 3986 components and add the additional rules that PURL requires, you can derive the encoding rules in this comment, but the rules differ for namespace, name, qualifier value, and subpath. Most standard library percent-encoding routines support only one set of characters, which is similar to, but not always exactly, the complement of the unreserved set in either RFC 3986 or RFC 2396. That means they encode more characters than RFC 3986 requires and fewer characters than PURL requires (e.g. @).
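A rough sketch of why one fixed character set does not fit every component, using Python's `urllib.parse.quote`. The helper names are hypothetical and the `safe` sets are chosen only to show the `/` problem; they are not the character sets the PURL spec requires.

```python
from urllib.parse import quote


def encode_namespace(namespace: str) -> str:
    """'/' separates namespace segments, so encode each segment on its own."""
    return "/".join(quote(segment, safe="") for segment in namespace.split("/"))


def encode_name(name: str) -> str:
    """A '/' inside a name is data, not a separator, so it must be escaped."""
    return quote(name, safe="")


# The default routine (safe="/") would leave the '/' in a name unescaped.
print(quote("a/b"))         # a/b
print(encode_name("a/b"))   # a%2Fb
print(encode_namespace("github.com/example"))  # github.com/example
```

The same input character needs different treatment depending on which component it sits in, which is why a single call to the standard library's encoder is not enough.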

0 replies

pombredanne · 2025-06-26T17:27:09Z

pombredanne
Jun 26, 2025
Maintainer

@matt-phylum I think we want to avoid having a loose formatting form and instead promote a single form, the canonical form, plus some guidance for parsers to recover non-canonical PURLs when possible. I think I learned from you in other comments that a single form would be better?
Would you agree?

1 reply

matt-phylum Jun 26, 2025
Author

I don't think a single form is feasible. Humans are going to type PURLs, and they're inevitably going to make mistakes with escaping or normalization. Even with the PURL test suite, PURL library developers have created several nearly identical canonical formats. I still think it's important for a PURL parser to have some level of forgiveness for minor nonconformities and have defined, consistent behavior for when that happens. For example, it is invalid to write a PURL pkg://generic/example, but all parsers must successfully parse this as pkg:generic/example instead of any other result. I've seen problems where PURL libraries will implement unnecessary validation to reject an unambiguous PURL because of a misinterpretation of the "MUST"s in the spec. I quickly looked at the MUSTs again today and they seem like they might be better now.
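A minimal sketch of that kind of forgiveness, covering only the scheme and type: extra slashes after `pkg:` are discarded so that both spellings give the same result. The function name is hypothetical and this is nowhere near a complete parser.

```python
def split_type(purl: str) -> tuple[str, str]:
    """Return (type, remainder), tolerating 'pkg://type/...' on input."""
    scheme, colon, rest = purl.partition(":")
    if not colon or scheme.lower() != "pkg":
        raise ValueError(f"not a PURL: {purl!r}")
    rest = rest.lstrip("/")  # forgive 'pkg://' and 'pkg:/'
    type_, _, remainder = rest.partition("/")
    if not type_:
        raise ValueError(f"missing type: {purl!r}")
    return type_.lower(), remainder


print(split_type("pkg://generic/example"))
print(split_type("pkg:generic/example"))
```

Both calls return the same pair, so the non-canonical input is accepted with one defined meaning instead of being rejected.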

Of particular concern to me are the rules in PURL-TYPES.rst. There are a lot of them, the list changes over time, some of them are incorrect wrt the package ecosystem they represent, and some of them are annoying to implement (eg PyPI). If an implementation receives a PURL pkg:pypi/ruamel.yaml, I think there are two appropriate behaviors: either the implementation does not understand PyPI and it uses the name ruamel.yaml as provided, or the implementation does understand PyPI and it normalizes the name to ruamel-yaml (#262). For the implementation to return a parse error because the name was not normalized (which PURL-TYPES.rst says must be done) would lead to greater interoperability problems where the PURL is understandable by some software but completely rejected by others, possibly wrongly, because of incorrectly implemented validation.
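The two behaviors can be sketched as a table of per-type normalizers with passthrough for anything not in the table. The names here are hypothetical, and the PyPI rule shown is the PEP 503 style normalization (runs of `-`, `_`, `.` become a single `-`, then lowercase), which is an assumption about what an implementation would choose and may not match PURL-TYPES.rst exactly.

```python
import re
from typing import Callable


def normalize_pypi(name: str) -> str:
    """PEP 503 style normalization (assumed rule, see note above)."""
    return re.sub(r"[-_.]+", "-", name).lower()


# Types this implementation understands; everything else passes through.
NORMALIZERS: dict[str, Callable[[str], str]] = {"pypi": normalize_pypi}


def normalize_name(type_: str, name: str,
                   normalizers: dict[str, Callable[[str], str]] = NORMALIZERS) -> str:
    normalizer = normalizers.get(type_.lower())
    return normalizer(name) if normalizer else name


print(normalize_name("pypi", "ruamel.yaml"))      # understands PyPI
print(normalize_name("pypi", "ruamel.yaml", {}))  # does not understand PyPI
```

Neither call raises: the name is either normalized or kept as provided, never rejected for not being normalized already.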

However, this leads to a frustrating problem. If software successfully reads an unambiguous but non-canonical PURL and PURLs are being used as unique identifiers (ignoring the problems with different levels of optional qualifier specification), there is no one right answer. If the software is a component in a larger system, canonicalizing the PURL would break the link between information about the library in that software and information about that library in the other software. If the software is maintaining a database of libraries indexed by PURL, not canonicalizing the PURL would break the link between information about that library when referenced by this non-canonical PURL vs when referenced by other PURLs. One solution to this problem and the optional qualifier problem would be for all software that uses PURLs as identifiers to have its own PURL canonicalization and comparison routines, but this may be too much to expect when dealing with tools for doing simple SBOM transformations.
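One way to picture the "own canonicalization and comparison routines" option is an index that compares on a canonical key but never rewrites the strings it was given. This is only a sketch: `PurlIndex` is a hypothetical name, and the canonicalizer is supplied by the caller (the toy one in the usage lines only strips extra slashes after `pkg:`).

```python
from collections import defaultdict
from typing import Callable


class PurlIndex:
    """Group PURL strings by canonical key while keeping each original spelling."""

    def __init__(self, canonicalize: Callable[[str], str]) -> None:
        self._canonicalize = canonicalize
        self._by_key: dict[str, set[str]] = defaultdict(set)

    def add(self, purl: str) -> str:
        key = self._canonicalize(purl)
        self._by_key[key].add(purl)  # the original string is preserved
        return key

    def same_package(self, a: str, b: str) -> bool:
        return self._canonicalize(a) == self._canonicalize(b)

    def spellings(self, purl: str) -> set[str]:
        return set(self._by_key.get(self._canonicalize(purl), set()))


# Toy canonicalizer for demonstration only: drop slashes after the scheme.
def toy(purl: str) -> str:
    return "pkg:" + purl.partition(":")[2].lstrip("/")


index = PurlIndex(toy)
index.add("pkg://generic/example")
index.add("pkg:generic/example")
print(index.spellings("pkg:generic/example"))
```

Lookups go through the canonical key, so the database link holds, while the spelling received from the other software is still available when talking back to it.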

pombredanne · 2025-06-26T17:27:39Z

pombredanne
Jun 26, 2025
Maintainer

I am moving to a discussion also.

0 replies
