fix character terminology in ABNF comments #4965

karenetheridge · 2025-09-16T18:40:17Z

UTF-8 is not a character set; it is an encoding. The character set we are using is Unicode (the full range of code points from U+0000 to U+10FFFF), so revert to using the correct terminology.

ref.: https://www.rfc-editor.org/rfc/rfc6570#section-2.1 uses "any Unicode character except..."

(We could encode the path template and server url template ABNFs into the schema as regexes, but I'm content to leave that to post-3.2.)

No schema changes are needed for this pull request.


ralfhandl · 2025-09-16T19:01:32Z

I think UTF-8 is correct here. The ABNF describes octet sequences in a JSON or YAML text.

And https://www.rfc-editor.org/rfc/rfc6570#section-2.1 also mentions UTF-8:

"sequence of pct-encoded triplets corresponding to that character's encoding in UTF-8"

karenetheridge · 2025-09-17T17:16:47Z

"The ABNF describes octet sequences in a JSON or YAML text."

No, it doesn't. Are we looking at the same section? In https://handrews.github.io/renderings/oas/deploy-preview/oas.html#path-templating and https://handrews.github.io/renderings/oas/deploy-preview/oas.html#server-variable-object, the ABNF shows sequences from %x00 to %x10FFFF. That's not UTF-8 octets -- that's Unicode character sequences. These are decoded characters, not a series of UTF-8 encoded octets.

Further, RFC6570 is clear in several sections that all expansion is done in terms of Unicode codepoints (https://www.rfc-editor.org/rfc/rfc6570#section-1.6, https://www.rfc-editor.org/rfc/rfc6570#section-2, https://www.rfc-editor.org/rfc/rfc6570#section-3.2.1 etc).

The pct-encoded rule in the ABNF describes sequences that are UTF-8-encoded and then percent-encoded, but the rest of the ABNF does not. My edit is not in the pct-encoded section -- it is in the section where Unicode codepoints are described.
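To make the distinction concrete, here is a minimal Python sketch. The sample character is an arbitrary choice for illustration and does not come from the spec; the point is only that one code point (what a range like %x00 to %x10FFFF matches) is a different thing from the octets of its UTF-8 encoding.

```python
# Illustrative only: "é" is an arbitrary sample character, not taken from the spec.
ch = "\u00e9"  # é, a single Unicode character

code_point = ord(ch)               # the Unicode code point: one integer
utf8_octets = ch.encode("utf-8")   # the UTF-8 encoding of that code point: two octets

print(f"U+{code_point:04X}")       # U+00E9
print(utf8_octets.hex(" "))        # c3 a9
```

An ABNF range over code points matches the single value 0xE9 here, whereas a grammar over UTF-8 octets would have to match the two bytes C3 A9.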

karenetheridge · 2025-09-17T19:30:50Z

A final thought: the ABNF in RFC 6570 (and also section 1.5) is definitely confusing, and some of it came as a surprise to me -- it's saying that if there is some percent encoding in there, it could be a mix of UTF-8 encoded octets and some Unicode codepoints -- and the way for a reader to tell the difference is to count the number of hexadecimal characters after the percent sign! This is really bizarre and I can't imagine why someone would want to mix encoded and unencoded characters together. Hopefully in the wild we'd only see one or the other, but having to handle both in the same string and decipher which is which is just gross. It's certainly not something I'm doing right now in my implementation, so I'm going to have to take a look and see if my web framework's URI class handles all this natively or not. (I treat the server url prefix as a URL, and un-encode the { and } characters after the URI class has mistakenly translated them to %7B and %7D, so as to use it as a URI base for the path template string.)
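For illustration, the brace workaround could look something like the following in Python. This is only a sketch under assumptions: `urllib.parse.quote` stands in for whatever the URI class does, and the template URL is hypothetical.

```python
from urllib.parse import quote

# Hypothetical server URL template; not taken from any real API description.
template = "https://example.com/{version}/pets"

# A generic URL encoder treats { and } as characters that must be escaped.
escaped = quote(template, safe=":/")

# Undo only the brace escapes so the template variables are usable again.
restored = escaped.replace("%7B", "{").replace("%7D", "}")

print(escaped)   # https://example.com/%7Bversion%7D/pets
print(restored)  # https://example.com/{version}/pets
```

Note that a blanket replace like this would also turn a %7B or %7D that was already percent-encoded in the input into a literal brace, so it is only safe when the input is known not to contain those.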

(thank you for coming to my TED Talk)

edit: and I'm a bit wrong here. Where the ABNF is saying %xA0-D7FF etc., it is not saying that this is a percent-encoded Unicode character -- it's just the literal Unicode character itself. (I was confused by the ABNF itself using a percent sign to write a literal value, but this was added in this amendment to the ABNF format: https://www.rfc-editor.org/rfc/rfc5234#section-3.4.) And when we do see percent encoding, the character is UTF-8-encoded first, so we have a sequence of pct-encoded triplets. But they're still mixing Unicode characters with percent-encoded UTF-8 bytes, which is still gross (but is the thing to do to encode non-ASCII characters in the path part of URIs).
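A small Python sketch of that mix, using the standard library's `urllib.parse` (the path strings are made up for illustration): a literal non-ASCII character and the percent-encoded UTF-8 octets of the same character can sit side by side in one string, and decoding yields the same character for both.

```python
from urllib.parse import quote, unquote

# Illustrative paths only; the values are made up.
literal = "/caf\u00e9"             # literal Unicode character é in the path
encoded = quote(literal)           # é is UTF-8-encoded, then each octet is percent-encoded

mixed = "/caf\u00e9/caf%C3%A9"     # literal é and pct-encoded é in the same string
decoded = unquote(mixed)           # both forms come out as the same character

print(encoded)   # /caf%C3%A9
print(decoded)   # /café/café
```

Each percent sign is followed by exactly two hexadecimal digits; the two triplets %C3 and %A9 together carry the single character U+00E9.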

karenetheridge · 2025-09-17T19:47:07Z

..and I see https://www.rfc-editor.org/errata/eid6937, well done @baywet :)

mikekistler

Looks good! 👍

karenetheridge requested review from a team as code owners September 16, 2025 18:40

karenetheridge added this to the v3.2.0 milestone Sep 16, 2025

ralfhandl approved these changes Sep 17, 2025


ralfhandl requested a review from a team September 17, 2025 18:57

mikekistler approved these changes Sep 18, 2025


mikekistler merged commit 16b60f8 into OAI:v3.2-dev Sep 18, 2025
2 checks passed

karenetheridge deleted the ether/v3.2-ABNF-amendments branch September 18, 2025 16:14
