
Commit 7be317f

miss-islington, Prhmma, and StanFromIreland authored
[3.14] pythongh-128571: Document UTF-16/32 native byte order (pythonGH-139974) (python#140309)
Closes pythonGH-128571 (cherry picked from commit 920de7c) Co-authored-by: Parham MohammadAlizadeh <[email protected]> Co-authored-by: Stan Ulbrych <[email protected]>
1 parent 1d11627 commit 7be317f


1 file changed: +16 -11 lines changed


Doc/library/codecs.rst

Lines changed: 16 additions & 11 deletions
@@ -982,17 +982,22 @@ defined in Unicode. A simple and straightforward way that can store each Unicode
 code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
-disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-will always have to swap bytes on encoding and decoding. ``UTF-32`` avoids this
-problem: bytes will always be in natural endianness. When these bytes are read
-by a CPU with a different endianness, then bytes have to be swapped though. To
-be able to detect the endianness of a ``UTF-16`` or ``UTF-32`` byte sequence,
-there's the so called BOM ("Byte Order Mark"). This is the Unicode character
-``U+FEFF``. This character can be prepended to every ``UTF-16`` or ``UTF-32``
-byte sequence. The byte swapped version of this character (``0xFFFE``) is an
-illegal character that may not appear in a Unicode text. So when the
-first character in a ``UTF-16`` or ``UTF-32`` byte sequence
-appears to be a ``U+FFFE`` the bytes have to be swapped on decoding.
+disadvantage is that if, for example, you use ``UTF-32-BE`` on a little endian
+machine you will always have to swap bytes on encoding and decoding.
+Python's ``UTF-16`` and ``UTF-32`` codecs avoid this problem by using the
+platform's native byte order when no BOM is present.
+Python follows prevailing platform
+practice, so native-endian data round-trips without redundant byte swapping,
+even though the Unicode Standard defaults to big-endian when the byte order is
+unspecified. When these bytes are read by a CPU with a different endianness,
+the bytes have to be swapped. To be able to detect the endianness of a
+``UTF-16`` or ``UTF-32`` byte sequence, a BOM ("Byte Order Mark") is used.
+This is the Unicode character ``U+FEFF``. This character can be prepended to every
+``UTF-16`` or ``UTF-32`` byte sequence. The byte swapped version of this character
+(``0xFFFE``) is an illegal character that may not appear in a Unicode text.
+When the first character of a ``UTF-16`` or ``UTF-32`` byte sequence is
+``U+FFFE``, the bytes have to be swapped on decoding.
+
 Unfortunately the character ``U+FEFF`` had a second purpose as
 a ``ZERO WIDTH NO-BREAK SPACE``: a character that has no width and doesn't allow
 a word to be split. It can e.g. be used to give hints to a ligature algorithm.
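
The behaviour the reworded paragraph describes can be checked with a short script. This is only an illustrative sketch and not part of the commit: the helper ``native_codec_name`` is a hypothetical name introduced here, and the checks assume CPython's standard ``codecs`` and ``sys`` modules.

```python
import codecs
import sys


def native_codec_name(bits: int) -> str:
    """Hypothetical helper: the explicit-endian codec matching this machine."""
    suffix = "le" if sys.byteorder == "little" else "be"
    return f"utf-{bits}-{suffix}"


text = "A"

# The explicit-endian codecs never write a BOM.
assert text.encode("utf-32-be") == b"\x00\x00\x00A"
assert text.encode("utf-32-le") == b"A\x00\x00\x00"

for bits, bom_le, bom_be in (
    (16, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
    (32, codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE),
):
    generic = f"utf-{bits}"
    native = native_codec_name(bits)
    native_bom = bom_le if sys.byteorder == "little" else bom_be

    # Encoding with the generic codec: BOM first, then native-order data.
    assert text.encode(generic) == native_bom + text.encode(native)

    # Decoding with a BOM: the BOM selects the byte order and is dropped.
    assert (bom_le + text.encode(f"{generic}-le")).decode(generic) == text
    assert (bom_be + text.encode(f"{generic}-be")).decode(generic) == text

    # Decoding without a BOM: native byte order is assumed.
    assert text.encode(native).decode(generic) == text
```

If every assertion passes, the generic codecs behave as the new wording says on the machine running the script, whichever endianness it has.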
