Commit 7b6fb71

miss-islington, Prhmma, and StanFromIreland authored
[3.13] pythongh-128571: Document UTF-16/32 native byte order (pythonGH-139974) (python#140308)
Closes pythonGH-128571
(cherry picked from commit 920de7c)

Co-authored-by: Parham MohammadAlizadeh <[email protected]>
Co-authored-by: Stan Ulbrych <[email protected]>
1 parent 762fbdb commit 7b6fb71

1 file changed

+16
-11
lines changed

Doc/library/codecs.rst

Lines changed: 16 additions & 11 deletions
@@ -978,17 +978,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
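
The behaviour described by the added lines can be checked with a short script. The following is a minimal sketch and not part of the commit. The sample string and the variable names (`text`, `native_payload`, `swapped_payload`) are chosen here for illustration, and only the standard `codecs` and `sys` modules are used. It shows that the generic ``UTF-16`` codec writes a native-order BOM when encoding, assumes native order when decoding BOM-less bytes, and honours an explicit BOM in either order.

```python
import codecs
import sys

text = "A"  # illustrative sample; any string would do

# Encoding with the generic codec prepends a BOM in the platform's native order.
# codecs.BOM_UTF16 is the native-order alias of BOM_UTF16_LE / BOM_UTF16_BE.
encoded = text.encode("utf-16")
starts_with_native_bom = encoded.startswith(codecs.BOM_UTF16)

# The explicit-endian codecs never write a BOM.
be = text.encode("utf-16-be")  # b'\x00A'
le = text.encode("utf-16-le")  # b'A\x00'

# Pick the BOM-less payload that matches (or opposes) this machine's byte order.
if sys.byteorder == "little":
    native_payload, swapped_payload = le, be
else:
    native_payload, swapped_payload = be, le

# Without a BOM, the generic codec assumes native byte order.
decoded_native = native_payload.decode("utf-16")

# Opposite-order bytes without a BOM are silently misread as another code point.
decoded_swapped = swapped_payload.decode("utf-16")

# With a BOM, the generic codec follows it regardless of the platform.
decoded_be_bom = (codecs.BOM_UTF16_BE + be).decode("utf-16")
decoded_le_bom = (codecs.BOM_UTF16_LE + le).decode("utf-16")

print(sys.byteorder, encoded, repr(decoded_swapped))
```

The same pattern should apply to ``UTF-32`` with `codecs.BOM_UTF32`, `codecs.BOM_UTF32_BE`, and `codecs.BOM_UTF32_LE`. The misread in `decoded_swapped` is why data exchanged between machines is usually written either with a BOM or with an explicit ``-BE``/``-LE`` codec.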
