karenetheridge
Member

UTF-8 is not a character set; it is an encoding. The character set we are using is Unicode (the full range of integers from \x00 to \x10FFFF), so revert to using the correct terminology.

ref.: https://www.rfc-editor.org/rfc/rfc6570#section-2.1 uses "any Unicode character except..."

(We could encode the path template and server url template ABNFs into the schema as regexes, but I'm content to leave that to post-3.2.)

  • no schema changes are needed for this pull request

@karenetheridge karenetheridge requested review from a team as code owners September 16, 2025 18:40
@karenetheridge karenetheridge added this to the v3.2.0 milestone Sep 16, 2025
@ralfhandl
Contributor

ralfhandl commented Sep 16, 2025

I think UTF-8 is correct here. The ABNF describes octet sequences in a JSON or YAML text.

And https://www.rfc-editor.org/rfc/rfc6570#section-2.1 also mentions UTF-8:

sequence of pct-encoded triplets corresponding to that character's encoding in UTF-8
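
As a minimal sketch of what that quoted sentence describes, the snippet below percent-encodes a single character by first taking its UTF-8 octets. The helper name `pct_encode_char` is hypothetical and is not taken from RFC 6570 or from the OpenAPI specification; it only illustrates the "character to UTF-8 octets to triplets" step.

```python
def pct_encode_char(ch: str) -> str:
    """Percent-encode one character via its UTF-8 octets (hypothetical helper)."""
    # Each UTF-8 octet becomes one "%XX" triplet.
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))


euro = "\u20ac"  # EURO SIGN, a single Unicode character
triplets = pct_encode_char(euro)
print(triplets)  # three triplets, one per UTF-8 octet
```

Here one character yields three triplets because its UTF-8 encoding is three octets long.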

@karenetheridge
Member Author

"The ABNF describes octet sequences in a JSON or YAML text."

No, it doesn't. Are we looking at the same section? In https://handrews.github.io/renderings/oas/deploy-preview/oas.html#path-templating and https://handrews.github.io/renderings/oas/deploy-preview/oas.html#server-variable-object, the ABNF shows sequences from %x00 to %x10FFFF. That's not UTF-8 octets -- that's Unicode character sequences. These are decoded characters, not a series of UTF-8 encoded octets.

Further, RFC6570 is clear in several sections that all expansion is done in terms of Unicode codepoints (https://www.rfc-editor.org/rfc/rfc6570#section-1.6, https://www.rfc-editor.org/rfc/rfc6570#section-2, https://www.rfc-editor.org/rfc/rfc6570#section-3.2.1 etc).

The pct-encoded rule in the ABNF describes sequences that are UTF-8-encoded and then percent-encoded, but the rest of the ABNF does not. But my edit is not in the pct-encoded section -- it is in the section where Unicode codepoints are described.
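
The distinction between a decoded character and its UTF-8 octets can be sketched as follows. This is only an illustration under stated assumptions: the helper `in_abnf_range` and the constant `MAX_CODE_POINT` are hypothetical names, and the range checked is the %x00 to %x10FFFF span mentioned above, not a full implementation of the ABNF.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the range discussed above


def in_abnf_range(ch: str) -> bool:
    """Check a decoded character's code point against %x00-10FFFF (hypothetical helper)."""
    return 0x00 <= ord(ch) <= MAX_CODE_POINT


face = "\U0001F600"            # one Unicode character
code_point = ord(face)         # a single integer, the code point
octets = face.encode("utf-8")  # the UTF-8 encoding: several octets, each at most 0xFF

print(hex(code_point), len(face), len(octets), in_abnf_range(face))
```

A value such as 0x1F600 cannot be a single octet, which is why a range reaching %x10FFFF reads as code points rather than UTF-8 bytes.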

@ralfhandl ralfhandl requested a review from a team September 17, 2025 18:57
@karenetheridge
Member Author

karenetheridge commented Sep 17, 2025

A final thought: the ABNF in RFC6570 (and also section 1.5) is definitely confusing, and some of it came as a surprise to me -- it's saying that if there is some percent encoding in there, it could be a mix of UTF-8 encoded octets and some Unicode codepoints -- and the way for a reader to tell the difference is to count the number of hexadecimal characters after the percent sign! This is really bizarre and I can't imagine why someone would want to mix encoded and unencoded characters together. Hopefully in the wild we'd only see one or the other, but to handle both in the same string and have to decipher which is which is just gross. It's certainly not something I'm doing right now in my implementation, so I'm going to have to take a look and see if my web framework's URI class handles all this natively or not (I treat the server url prefix as a URL, and un-encode the { and } characters after it has mistakenly translated them to %7B and %7D, so as to use it as a URI base for the path template string).

(thank you for coming to my TED Talk)

edit: and I'm a bit wrong here. Where the ABNF is saying %xA0-D7FF etc, it is not saying that this is a percent-encoded Unicode character - it's just the literal Unicode character itself (I was confused by the ABNF using a percent sign itself to represent a literal -- but this was added in this amendment to the ABNF format here: https://www.rfc-editor.org/rfc/rfc5234#section-3.4). And when we see percent encoding, the character is UTF-8-encoded first, so we have a sequence of hex triplets. But they're still mixing Unicode chars with percent-encoded UTF-8 bytes, which is still gross (but is the thing to do to encode non-ASCII characters in the path part of URIs).
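
A small sketch of the two cases described in the edit above, using Python's standard `urllib.parse.unquote`. The sample string and the helper `in_first_ucschar_range` are made up for illustration; the helper checks only the single %xA0-D7FF range quoted above, not the complete rule from the RFC.

```python
from urllib.parse import unquote


def in_first_ucschar_range(ch: str) -> bool:
    """ABNF %xA0-D7FF is a range of code point values, not percent-encoding."""
    return 0xA0 <= ord(ch) <= 0xD7FF


# One string mixing a literal Unicode character with pct-encoded UTF-8 octets.
mixed = "caf\u00e9/na%C3%AFve"

# unquote() leaves literal characters alone and decodes "%XX" runs as UTF-8.
decoded = unquote(mixed)

print(decoded, in_first_ucschar_range("\u00e9"))
```

Both forms end up as the same kind of decoded character once the percent-encoded octets have been interpreted as UTF-8.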

@karenetheridge
Member Author

..and I see https://www.rfc-editor.org/errata/eid6937, well done @baywet :)

Contributor

@mikekistler mikekistler left a comment

Looks good! 👍

@mikekistler mikekistler merged commit 16b60f8 into OAI:v3.2-dev Sep 18, 2025
2 checks passed
@karenetheridge karenetheridge deleted the ether/v3.2-ABNF-amendments branch September 18, 2025 16:14