@@ -978,17 +978,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
978978code point, is to store each code point as four consecutive bytes. There are two
979979possibilities: store the bytes in big endian or in little endian order. These
980980two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
981- disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
982- will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
983- problem: bytes will always be in natural endianness. When these bytes are read
984- by a CPU with a different endianness, then bytes have to be swapped though. To
985- be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
986- there's the so called BOM ("Byte Order Mark"). This is the Unicode character
987- ``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
988- byte sequence. The byte swapped version of this character (``0xFFFE``) is an
989- illegal character that may not appear in a Unicode text. So when the
990- first character in a ``UTF-16`` or ``UTF-32`` byte sequence
991- appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
981+ disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
982+ machine you will always have to swap bytes on encoding and decoding.
983+ Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem: they encode in the
984+ platform's native byte order and assume native byte order when decoding input
985+ that has no BOM. This follows prevailing platform practice, so
986+ native-endian data round-trips without redundant byte swapping, even though the
987+ Unicode Standard defaults to big-endian when the byte order is unspecified.
988+ When these bytes are read by a CPU with a different endianness, however, they
989+ have to be swapped. To be able to detect the endianness of a
990+ ``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
991+ This is the Unicode character ``U+FEFF``. This character can be prepended to every
992+ ``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
993+ (``0xFFFE``) is an illegal character that may not appear in a Unicode text.
994+ When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
995+ ``U+FFFE``, the bytes have to be swapped on decoding.
996+
992997Unfortunately the character ``U+FEFF`` had a second purpose as
993998a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
994999a word to be split. It can e.g. be used to give hints to a ligature algorithm.
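
The behaviour described in this hunk can be checked with a short script. This is a minimal sketch, not part of the patch: it assumes CPython's standard `codecs` module and the stateless `bytes.decode()` path, and the variable names (`native_bom`, `foreign`, and so on) are made up for illustration.

```python
import codecs
import sys

text = "A"  # U+0041
little = sys.byteorder == "little"

# Explicit byte orders: four bytes per code point, no BOM is written.
be = text.encode("utf-32-be")  # most significant byte first
le = text.encode("utf-32-le")  # least significant byte first

# The generic codec writes a BOM, then the data in native byte order.
native_bom = codecs.BOM_UTF32_LE if little else codecs.BOM_UTF32_BE
generic = text.encode("utf-32")

# Input whose BOM is in the opposite byte order is swapped on decoding.
foreign_bom = codecs.BOM_UTF16_BE if little else codecs.BOM_UTF16_LE
foreign_codec = "utf-16-be" if little else "utf-16-le"
foreign = foreign_bom + text.encode(foreign_codec)
decoded_foreign = foreign.decode("utf-16")

# Input without a BOM: the stateless decoder assumes native byte order.
native_codec = "utf-16-le" if little else "utf-16-be"
decoded_no_bom = text.encode(native_codec).decode("utf-16")

print(be, le, generic, decoded_foreign, decoded_no_bom)
```

Only the stateless decode call is exercised here; the no-BOM case in particular should not be read as a statement about the incremental or stream decoders.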