@@ -989,17 +989,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
989 989 code point, is to store each code point as four consecutive bytes. There are two
990 990 possibilities: store the bytes in big endian or in little endian order. These
991 991 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
992- disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
993- will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
994- problem: bytes will always be in natural endianness. When these bytes are read
995- by a CPU with a different endianness, then bytes have to be swapped though. To
996- be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
997- there's the so called BOM ("Byte Order Mark"). This is the Unicode character
998- ``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
999- byte sequence. The byte swapped version of this character (``0xFFFE``) is an
1000- illegal character that may not appear in a Unicode text. So when the
1001- first character in a ``UTF-16`` or ``UTF-32`` byte sequence
1002- appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
992+ disadvantage is that if, for example, you use ``UTF-32-BE`` on a little-endian
993+ machine you will always have to swap bytes on encoding and decoding.
994+ Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem: they encode in
995+ the platform's native byte order and, when decoding input that has no BOM,
996+ assume native byte order as well. This follows prevailing platform
997+ practice, so native-endian data round-trips without redundant byte swapping,
998+ even though the Unicode Standard defaults to big-endian when the byte order is
999+ unspecified. When these bytes are read by a CPU with a different endianness,
1000+ the bytes have to be swapped. To make it possible to detect the endianness of a
1001+ ``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
1002+ This is the Unicode character ``U+FEFF``. It can be prepended to every
1003+ ``UTF-16`` or ``UTF-32`` byte sequence. The byte-swapped version of this character
1004+ (``U+FFFE``) is an illegal character that may not appear in a Unicode text.
1005+ When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence appears to
1006+ be ``U+FFFE``, the bytes have to be swapped on decoding.
1007+ 
1003 1008 Unfortunately the character ``U+FEFF`` had a second purpose as
1004 1009 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
1005 1010 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
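
The BOM behaviour described in this hunk can be checked with a short sketch. It assumes CPython's standard codecs and uses only the standard-library `codecs` and `sys` modules. The variable names are illustrative and not part of the patch.

```python
import codecs
import sys

text = "A"

# The endian-specific codecs write the code units in a fixed order, without a BOM.
be = text.encode("utf-32-be")
le = text.encode("utf-32-le")

# The generic codec prepends a BOM and uses the platform's native byte order.
if sys.byteorder == "little":
    native, native_bom = le, codecs.BOM_UTF32_LE
else:
    native, native_bom = be, codecs.BOM_UTF32_BE
generic = text.encode("utf-32")

# On decoding, a leading BOM selects the byte order and is then dropped.
from_be = (codecs.BOM_UTF32_BE + be).decode("utf-32")
from_le = (codecs.BOM_UTF32_LE + le).decode("utf-32")

# Without a BOM, the generic codec assumes native byte order.
from_native = native.decode("utf-32")

# The endian-specific codecs do not strip U+FEFF; it stays in the result
# as a ZERO WIDTH NO-BREAK SPACE.
kept = (codecs.BOM_UTF32_BE + be).decode("utf-32-be")
```

Here `from_be`, `from_le` and `from_native` all compare equal to `text`, while `kept` still starts with `"\ufeff"`. The same pattern applies to the ``UTF-16`` codecs with `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE`.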