@@ -1000,6 +1000,23 @@ byte sequence. The byte swapped version of this character (``0xFFFE``) is an
 illegal character that may not appear in a Unicode text. So when the
 first character in a ``UTF-16`` or ``UTF-32`` byte sequence
 appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+
+.. note::
+
+   **Python UTF-16 and UTF-32 codec behavior**
+
+   Python's ``UTF-16`` and ``UTF-32`` codecs (when used without an explicit
+   byte order suffix like ``-BE`` or ``-LE``) assume the platform's native
+   byte order when decoding data that has no BOM. This differs from the
+   Unicode Standard, which specifies that the UTF-16 and UTF-32 encoding
+   schemes are interpreted as big-endian when no BOM is present and no
+   higher-level protocol specifies the byte order.
+
+   This behavior was chosen for practical compatibility, as it avoids
+   byte swapping on the most common platforms, but developers should be aware
+   of this difference when exchanging data with systems that strictly follow
+   the Unicode Standard.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
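
The byte-order behavior described in the added note can be checked with a short sketch. It is illustrative only and not part of the patch: the sample string and variable names are assumptions, and it relies solely on the built-in ``str.encode``/``bytes.decode`` methods and the standard ``codecs`` BOM constants.

```python
import codecs
import sys

text = "hi"

# Codecs with an explicit byte-order suffix never write a BOM.
be = text.encode("utf-16-be")  # b'\x00h\x00i'
le = text.encode("utf-16-le")  # b'h\x00i\x00'

# Pick the BOM-less encodings that match / do not match this platform.
native = le if sys.byteorder == "little" else be
non_native = be if sys.byteorder == "little" else le

# Plain "utf-16" writes a BOM in native byte order when encoding.
with_bom = text.encode("utf-16")
assert with_bom == codecs.BOM_UTF16 + native

# A BOM, when present, decides the byte order on any platform.
assert (codecs.BOM_UTF16_BE + be).decode("utf-16") == text
assert (codecs.BOM_UTF16_LE + le).decode("utf-16") == text

# Without a BOM, plain "utf-16" assumes native byte order when decoding.
assert native.decode("utf-16") == text

# BOM-less data in the other byte order is decoded without an error,
# but every code unit is byte-swapped (0x0068 -> 0x6800, 0x0069 -> 0x6900).
swapped = non_native.decode("utf-16")
assert swapped == "\u6800\u6900"

# Naming the byte order explicitly is the portable way to read BOM-less data.
assert be.decode("utf-16-be") == text
```

When exchanging BOM-less data with a system that follows the Unicode Standard's big-endian default, decoding with ``utf-16-be`` (or ``utf-32-be``) avoids depending on the platform's byte order.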