Specify a loose formatted form? #494
Replies: 4 comments 1 reply
-
Almost all of the problems could be solved by simply following RFC 3986. The encoding would be solved because RFC 3986 defines the exact minimum that has to be encoded in each part. For decoding, it's difficult, if not impossible, to use an existing decoder: if you let it handle both the decoding and the encoding, the path splitting/joining can't be done. But I am not even sure any of that would be necessary if the spec just followed the existing Path Segment Normalization rules prior to decoding the entire path. Type being case-insensitive can even be gotten for free if it's treated like a
-
RFC 3986 isn't simple and does not define the formatting of a PURL. If you map the PURL components to their RFC 3986 components and add the additional rules that PURL requires, you can derive the encoding rules in this comment. However, the rules differ for namespace, name, qualifier value, and subpath. Most standard library percent-encoding routines support only one set of characters, which is similar to, but not always exactly, the complement of the unreserved set in either RFC 3986 or RFC 2396. As a result, they encode more characters than RFC 3986 requires and fewer characters than PURL requires (e.g. @).
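To illustrate what per-component encoding rules look like in practice, here is a minimal sketch using Python's `urllib.parse.quote`, whose `safe` argument lists extra characters to leave unencoded. The component names and the safe sets in `SAFE_BY_COMPONENT` are placeholders chosen for illustration; they are assumptions, not the sets defined by the PURL spec.

```python
from urllib.parse import quote

# Hypothetical per-component "safe" sets (characters left unencoded in
# addition to letters, digits, and "_.-~"). Illustrative only; these are
# NOT the character sets from the PURL spec.
SAFE_BY_COMPONENT = {
    "name": "",
    "qualifier_value": ":",
    "subpath": ":/",
}


def encode_component(kind: str, value: str) -> str:
    """Percent-encode `value` using the safe set assumed for `kind`."""
    return quote(value, safe=SAFE_BY_COMPONENT[kind])


print(encode_component("name", "a@b:c"))             # a%40b%3Ac
print(encode_component("qualifier_value", "a@b:c"))  # a%40b:c
print(encode_component("subpath", "x/y@z"))          # x/y%40z
```

The point of the sketch is only that a single encoding routine with one fixed character set cannot cover all components; each component needs its own set.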
-
@matt-phylum I think we want to avoid having a loose formatting form, and instead promote a single form, the canonical form, plus some guidance for parsers to recover non-canonical PURLs when possible. I think I learned from you in other comments that a single form would be better?
-
I am moving to a discussion also.
-
Compared to the URL spec, PURL is both more strict and less specified. PURL has a concept of a canonical format where the scheme and type are lowercase, the namespace and name are normalized according to type-specific rules, the qualifier keys are lowercased, empty qualifiers are removed, qualifiers are sorted, "." and ".." segments are removed from subpaths, and exact sets of characters are percent encoded. This canonical form is nice because it means software can use PURLs as unique identifiers without understanding how to parse them.
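A rough sketch of two of these canonicalization steps (qualifiers and subpaths) might look like the following. The function names are hypothetical, and dropping empty subpath segments is an assumption added here so that leading and trailing slashes do not produce empty segments; type-specific namespace/name rules and percent encoding are not covered.

```python
def canonical_qualifiers(qualifiers: dict) -> list:
    """Lowercase keys, drop qualifiers with empty values, sort by key."""
    lowered = {key.lower(): value for key, value in qualifiers.items() if value}
    return sorted(lowered.items())


def canonical_subpath(subpath: str) -> str:
    """Remove "." and ".." segments (and, by assumption, empty segments).

    Note that the segments are removed, not resolved: "b/../c" becomes "b/c".
    """
    segments = [s for s in subpath.split("/") if s not in ("", ".", "..")]
    return "/".join(segments)


print(canonical_qualifiers({"Arch": "x86", "b": "", "a": "1"}))
# [('a', '1'), ('arch', 'x86')]
print(canonical_subpath("/a/./b/../c/"))
# a/b/c
```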
However, I would be surprised if any two PURL implementations behaved the same in all cases. Most implementations make at least one mistake or intentional deviation from the spec, usually around percent encoding. This means in practice if you want to use PURLs as unique identifiers you do need to parse them at some point and convert them back to strings using a single implementation.
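As a small illustration of why raw string comparison is not enough, the sketch below normalizes only the scheme and type (both lowercase in the canonical form) before comparing. The helper names are hypothetical, and a real comparison would need a full parse and re-format with one implementation, including the type-specific rules this sketch ignores.

```python
def normalize_scheme_and_type(purl: str) -> str:
    """Lowercase the scheme and type; leave everything else untouched."""
    scheme, colon, rest = purl.partition(":")
    if not colon:
        raise ValueError("not a PURL: missing ':' after the scheme")
    type_, slash, remainder = rest.partition("/")
    return f"{scheme.lower()}:{type_.lower()}{slash}{remainder}"


def same_identifier(a: str, b: str) -> bool:
    """Compare two PURL strings after the partial normalization above."""
    return normalize_scheme_and_type(a) == normalize_scheme_and_type(b)


a = "PKG:NPM/foo@1.0.0"
b = "pkg:npm/foo@1.0.0"
print(a == b)                 # False: the raw strings differ
print(same_identifier(a, b))  # True: equal once scheme and type are lowercased
```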
This has been coming up recently with the work to make the spec use RFC 2119 language. The spec is being updated to say that formatters MUST produce the canonical output, which was always the requirement but was previously written with a lowercase "must". Parsers, however, are expected to accept formats that formatters are forbidden from producing. In effect, the spec describes two different PURLs: the PURLs that parsers are allowed to read and the PURLs that formatters are allowed to write.
It seems like it would make sense to instead document PURLs and canonical PURLs separately.
Benefits:
But it makes things more complicated for people who are using PURLs as unique identifiers and need to compare PURLs for equality.