ericprud edited this page Mar 15, 2013 · 5 revisions

Lexing UTF-8

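The Turtle production http://www.w3.org/TR/turtle/#grammar-production-PN_CHARS_BASE is defined in terms of Unicode characters. A tool such as http://www.w3.org/2005/03/23-lex-U converts it mechanically into byte-level patterns for a lexer that reads UTF-8. Each row below gives the code point range, its first and last UTF-8 byte sequences in hex, and the resulting byte pattern: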

AZ [A-Z] A-Z [A-Z]
az [a-z] a-z |[a-z]
ÀÖ [#x00C0-#x00D6] c380-c396 |\xC3[\x80-\x96]
Øö [#x00D8-#x00F6] c398-c3b6 |\xC3[\x98-\xB6]
ø [#x00F8-#x02FF] c3b8-cbbf |\xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]
[#x0370-#x037D] cdb0-cdbd |\xCD[\xB0-\xBD]
[#x037F-#x1FFF] cdbf-e1bfbf |\xCD\xBF|[\xCE-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|\xE1[\x80-\xBF][\x80-\xBF]
[#x200C-#x200D] e2808c-e2808d |\xE2\x80[\x8C-\x8D]
[#x2070-#x218F] e281b0-e2868f |\xE2(\x81[\xB0-\xBF]|[\x82-\x85][\x80-\xBF]|\x86[\x80-\x8F])
[#x2C00-#x2FEF] e2b080-e2bfaf |\xE2([\xB0-\xBE][\x80-\xBF]|\xBF[\x80-\xAF])
[#x3001-#xD7FF] e38081-ed9fbf |\xE3(\x80[\x81-\xBF]|[\x81-\xBF][\x80-\xBF])|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]
[#xF900-#xFDCF] efa480-efb78f |\xEF([\xA4-\xB6][\x80-\xBF]|\xB7[\x80-\x8F])
[#xFDF0-#xFFFD] efb7b0-efbfbd |\xEF(\xB7[\xB0-\xBF]|[\xB8-\xBE][\x80-\xBF]|\xBF[\x80-\xBD])
[#x10000-#xEFFFF] f0908080-f3afbfbf |\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]
|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]
|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]
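
One way to sanity-check a table like this is to encode every code point to UTF-8 and compare the byte pattern's verdict with the code point ranges. The sketch below does that in Python. The names `PN_CHARS_BASE_RANGES`, `UTF8_PATTERN`, `in_ranges` and `mismatches` are illustrative and come from neither the Turtle grammar nor the lex-U tool. The [#x3001-#xD7FF] row is transcribed with lead bytes E3, E4-EC and ED only, since a broader lead-byte class such as E1-EC would also accept U+2000, which is outside PN_CHARS_BASE.

```python
import re

# Inclusive code point ranges of PN_CHARS_BASE, copied from the table above.
PN_CHARS_BASE_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Byte-level alternatives, one literal per table row (assumed transcription).
UTF8_PATTERN = re.compile(
    rb"[A-Z]"
    rb"|[a-z]"
    rb"|\xC3[\x80-\x96]"
    rb"|\xC3[\x98-\xB6]"
    rb"|\xC3[\xB8-\xBF]|[\xC4-\xCB][\x80-\xBF]"
    rb"|\xCD[\xB0-\xBD]"
    rb"|\xCD\xBF|[\xCE-\xDF][\x80-\xBF]"
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]|\xE1[\x80-\xBF][\x80-\xBF]"
    rb"|\xE2\x80[\x8C-\x8D]"
    rb"|\xE2(\x81[\xB0-\xBF]|[\x82-\x85][\x80-\xBF]|\x86[\x80-\x8F])"
    rb"|\xE2([\xB0-\xBE][\x80-\xBF]|\xBF[\x80-\xAF])"
    rb"|\xE3(\x80[\x81-\xBF]|[\x81-\xBF][\x80-\xBF])"
    rb"|[\xE4-\xEC][\x80-\xBF][\x80-\xBF]|\xED[\x80-\x9F][\x80-\xBF]"
    rb"|\xEF([\xA4-\xB6][\x80-\xBF]|\xB7[\x80-\x8F])"
    rb"|\xEF(\xB7[\xB0-\xBF]|[\xB8-\xBE][\x80-\xBF]|\xBF[\x80-\xBD])"
    rb"|\xF0[\x90-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|[\xF1-\xF2][\x80-\xBF][\x80-\xBF][\x80-\xBF]"
    rb"|\xF3[\x80-\xAF][\x80-\xBF][\x80-\xBF]"
)


def in_ranges(cp):
    """True if the code point falls in one of the listed ranges."""
    return any(lo <= cp <= hi for lo, hi in PN_CHARS_BASE_RANGES)


def mismatches():
    """Return every code point where the byte pattern and the ranges disagree."""
    bad = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates have no UTF-8 encoding
        encoded = chr(cp).encode("utf-8")
        matched = UTF8_PATTERN.fullmatch(encoded) is not None
        if matched != in_ranges(cp):
            bad.append(cp)
    return bad


if __name__ == "__main__":
    bad = mismatches()
    print("disagreements:", [hex(cp) for cp in bad[:10]], "total", len(bad))
```

Running the script walks the whole Unicode code space once, so it takes a few seconds. An empty list from `mismatches()` means the byte patterns and the code point ranges accept exactly the same characters.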
