Commit 920de7c

gh-128571: Document UTF-16/32 native byte order (#139974)
Closes #128571

Co-authored-by: Stan Ulbrych <[email protected]>
1 parent d86ad87 commit 920de7c

1 file changed: +16 -11 lines changed

Doc/library/codecs.rst

Lines changed: 16 additions & 11 deletions
@@ -989,17 +989,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
989989
code point, is to store each code point as four consecutive bytes. There are two
990990
possibilities: store the bytes in big endian or in little endian order. These
991991
two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
992-
disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
993-
will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
994-
problem: bytes will always be in natural endianness. When these bytes are read
995-
by a CPU with a different endianness, then bytes have to be swapped though. To
996-
be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
997-
there's the so called BOM ("Byte Order Mark"). This is the Unicode character
998-
``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
999-
byte sequence. The byte swapped version of this character (``0xFFFE``) is an
1000-
illegal character that may not appear in a Unicode text. So when the
1001-
first character in a ``UTF-16`` or ``UTF-32`` byte sequence
1002-
appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
992+
disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
993+
machine you will always have to swap bytes on encoding and decoding.
994+
Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
995+
platform's native byte order when no BOM is present.
996+
Python follows prevailing platform
997+
practice, so native-endian data round-trips without redundant byte swapping,
998+
even though the Unicode Standard defaults to big-endian when the byte order is
999+
unspecified. When these bytes are read by a CPU with a different endianness,
1000+
the bytes have to be swapped. To be able to detect the endianness of a
1001+
``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
1002+
This is the Unicode character ``U+FEFF``. This character can be prepended to every
1003+
``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
1004+
(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
1005+
When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
1006+
``U+FFFE``, the bytes have to be swapped on decoding.
1007+
10031008
Unfortunately the character ``U+FEFF`` had a second purpose as
10041009
a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
10051010
a word to be split. It can e.g. be used to give hints to a ligature algorithm.
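
The behaviour described in the added lines can be checked with a short sketch. It is an illustration, not part of the commit: the variable names (`native_codec`, `native_bom`, and so on) are chosen here for the example, and it assumes a CPython build where the generic ``UTF-32`` codec behaves as the changed paragraph documents.

```python
import codecs
import sys

text = "A"

# The explicit-endianness codecs never write a BOM.
assert text.encode("utf-32-be") == b"\x00\x00\x00A"
assert text.encode("utf-32-le") == b"A\x00\x00\x00"

# Pick the codec name and BOM that match this machine's byte order.
# (These two names are illustrative helpers, not part of the codecs API.)
if sys.byteorder == "little":
    native_codec, native_bom = "utf-32-le", codecs.BOM_UTF32_LE
else:
    native_codec, native_bom = "utf-32-be", codecs.BOM_UTF32_BE

# Encoding with the generic codec writes a BOM, then native-order data.
encoded = text.encode("utf-32")
assert encoded == native_bom + text.encode(native_codec)

# Decoding with a BOM: the BOM decides the byte order on any platform.
from_be = (codecs.BOM_UTF32_BE + b"\x00\x00\x00A").decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + b"A\x00\x00\x00").decode("utf-32")

# Decoding without a BOM: the platform's native byte order is assumed.
from_native = text.encode(native_codec).decode("utf-32")

print(from_be, from_le, from_native)
```

Data that may cross machines with different byte orders is safer with either a BOM or an explicit ``UTF-32-LE``/``UTF-32-BE`` codec, since BOM-less input to the generic codec is interpreted in whatever order the reading machine uses.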
