fix character terminology in ABNF comments #4965
UTF-8 is not a character set; it is an encoding. The character set we are using is Unicode (the full range of integers from \x00 to \x10FFFF), so revert to using the correct terminology. ref.: https://www.rfc-editor.org/rfc/rfc6570#section-2.1 uses "any Unicode character except..."
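To make the distinction concrete, here is a minimal Python sketch (the sample string is an arbitrary choice for illustration, not something taken from the spec). It lists the same text once as Unicode code points, which are integers up to 0x10FFFF, and once as UTF-8 octets, which are bytes up to 0xFF.

```python
# Illustration only: the sample string is an arbitrary choice.
text = "é€😀"

# Unicode code points: one integer per character, in the range 0x0 to 0x10FFFF.
code_points = [ord(ch) for ch in text]

# UTF-8 octets: the same characters serialized as bytes, each in 0x00 to 0xFF.
octets = list(text.encode("utf-8"))

print([f"U+{cp:04X}" for cp in code_points])  # three code points
print([f"{b:02X}" for b in octets])           # nine octets (2 + 3 + 4)
```

An ABNF range that reaches `%x10FFFF` can only be describing values of the first kind, since no single octet exceeds `%xFF`.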
I think UTF-8 is correct here. The ABNF describes octet sequences in a JSON or YAML text. And https://www.rfc-editor.org/rfc/rfc6570#section-2.1 also mentions UTF-8:
No, it doesn't. Are we looking at the same section? In https://handrews.github.io/renderings/oas/deploy-preview/oas.html#path-templating and https://handrews.github.io/renderings/oas/deploy-preview/oas.html#server-variable-object, the ABNF shows sequences from %x00 to %x10FFFF. That's not UTF-8 octets -- that's Unicode character sequences. These are decoded characters, not a series of UTF-8 encoded octets. Further, RFC6570 is clear in several sections that all expansion is done in terms of Unicode codepoints (https://www.rfc-editor.org/rfc/rfc6570#section-1.6, https://www.rfc-editor.org/rfc/rfc6570#section-2, https://www.rfc-editor.org/rfc/rfc6570#section-3.2.1 etc). The |
A final thought: the ABNF in RFC6570 (and also section 1.5) is definitely confusing, and some of it came as a surprise to me -- it's saying that if there is some percent encoding in there, it could be a mix of UTF-8 encoded octets and some Unicode codepoints -- and the way for a reader to tell the difference is to count the number of hexadecimal characters after the percent sign! This is really bizarre and I can't imagine why someone would want to mix encoded and unencoded characters together. Hopefully in the wild we'd only see one or the other, but to handle both in the same string and have to decipher which is which is just gross. It's certainly not something I'm doing right now in my implementation, so I'm going to have to take a look and see if my web framework's URI class handles all this natively or not (I treat the server url prefix as a URL, and un-encode the (thank you for coming to my TED Talk) edit: and I'm a bit wrong here. Where the ABNF is saying |
..and I see https://www.rfc-editor.org/errata/eid6937, well done @baywet :) |
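As a small illustration of percent-encoding in general (not of any implementation mentioned in this thread), the sketch below uses Python's standard `urllib.parse`. A percent-encoded triplet is `%` followed by exactly two hex digits and stands for one octet; a non-ASCII character is encoded as one triplet per UTF-8 octet. The sample strings are arbitrary.

```python
from urllib.parse import quote, unquote

# Encoding: "é" (U+00E9) becomes two triplets, one per UTF-8 octet.
encoded = quote("é", safe="")

# Decoding: the triplets are read as UTF-8 octets and turned back into text.
decoded = unquote("caf%C3%A9")

# A string that mixes a literal non-ASCII character with encoded triplets
# decodes without any special handling.
mixed = unquote("café/%C3%A9")

print(encoded, decoded, mixed)
```

Whether a given framework's URI class behaves the same way is something to check against that framework's documentation.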
Looks good! 👍
(We could encode the path template and server url template ABNFs into the schema as regexes, but I'm content to leave that to post-3.2.)
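A rough sketch of what such a regex might look like, in Python. Everything here is an assumption for illustration: it assumes a variable-name rule of the form `1*( %x00-7A / %x7C / %x7E-10FFFF )` (one or more code points other than `{` and `}`), it covers only the `{name}` expression and not the full path or server URL grammar, and the names `NAME`, `EXPRESSION`, and `variable_names` are hypothetical. The actual ABNF in the spec should be treated as the source of truth.

```python
import re

# Hypothetical pattern for a template variable name: one or more code points
# other than "{" (U+007B) and "}" (U+007D). The ranges mirror the assumed
# ABNF rule 1*( %x00-7A / %x7C / %x7E-10FFFF ).
NAME = r"[\x00-\x7A\x7C\x7E-\U0010FFFF]+"

# Hypothetical, simplified expression rule: "{" name "}".
EXPRESSION = re.compile(r"\{(" + NAME + r")\}")


def variable_names(template):
    """Return the names of all {name} expressions found in the template."""
    return EXPRESSION.findall(template)


print(variable_names("/pets/{petId}/photos/{日本語}"))
```

Because the character class is written over code points, the pattern is applied to decoded text (a Python `str`), not to a UTF-8 byte string.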