Commit ab396c3

perlunicode: Add discussion about malformations
1 parent b539310 commit ab396c3

1 file changed

+151
-28
lines changed

pod/perlunicode.pod

Lines changed: 151 additions & 28 deletions
@@ -1817,50 +1817,173 @@ through C<0x10FFFF>.)
18171817

18181818
=head2 Security Implications of Unicode
18191819

1820-
First, read
1821-
L<Unicode Security Considerations|https://www.unicode.org/reports/tr36>.
1820+
The security implications of Unicode are quite complicated, as you might
1821+
expect from it trying to handle all the world's scripts. An
1822+
introduction is in
1823+
L<Unicode Security Methods|https://www.unicode.org/reports/tr39>.
18221824

1823-
Also, note the following:
1825+
Here are a few examples of pitfalls
18241826

18251827
=over 4
18261828

1827-
=item *
1828-
1829-
Malformed UTF-8
1830-
1831-
UTF-8 is very structured, so many combinations of bytes are invalid. In
1832-
the past, Perl tried to soldier on and make some sense of invalid
1833-
combinations, but this can lead to security holes, so now, if the Perl
1834-
core needs to process an invalid combination, it will either raise a
1835-
fatal error, or will replace those bytes by the sequence that forms the
1836-
Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it.
1837-
1838-
Every code point can be represented by more than one possible
1839-
syntactically valid UTF-8 sequence. Early on, both Unicode and Perl
1840-
considered any of these to be valid, but now, all sequences longer
1841-
than the shortest possible one are considered to be malformed.
1829+
=item Confusables
1830+
1831+
Many characters in Unicode look similar enough to other characters that
1832+
they could be easily confused with each other. This is true even within
1833+
the same script, for example English, where the digit C<0> and a capital
1834+
letter C<O> may look like each other. But people who use that script
1835+
know to look out for that.
1836+
1837+
In Unicode, a digit in one script may be confusable with a digit having
1838+
a different numeric value in another. A malicious website could use
1839+
this to make it appear that the price of something is less than what
1840+
you actually get charged for. (You can use L<perlre/Script Runs> to
1841+
make sure such digits are not being inter-mixed.)
1842+
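A minimal sketch of the digit check described above, assuming Perl 5.32 or later (where C<(*script_run:...)> is no longer experimental). The helper name C<digits_are_one_script> and the sample prices are hypothetical, chosen only for illustration. BENGALI DIGIT FOUR (U+09EA) is used because it can resemble an ASCII C<8> in some fonts.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $price consists of decimal digits
# that all come from the same set of ten consecutive digits.
sub digits_are_one_script {
    my ($price) = @_;
    return $price =~ /\A(*script_run:\d+)\z/ ? 1 : 0;
}

my $ascii_price   = "18";                     # both digits ASCII
my $mixed_price   = "1\N{U+09EA}";            # ASCII 1, then BENGALI DIGIT FOUR
my $bengali_price = "\N{U+09E7}\N{U+09EA}";   # both digits Bengali

for my $price ($ascii_price, $mixed_price, $bengali_price) {
    # Print only the verdict, to avoid emitting wide characters.
    print digits_are_one_script($price) ? "single script\n" : "mixed digits\n";
}
```

The all-Bengali price passes, which is intended: the check only rejects digits from different scripts appearing in the same number.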
1843+
This is a general problem with internet addresses. The people who give out
1844+
domain names need to be careful to not give out ones that spoof other ones
1845+
(examples in L<perlre/Script Runs>).
1846+
1847+
And computer program identifier names can be such that they look like
1848+
something they're not, and hence could fool a code reviewer, for
1849+
example. Script runs on the individual identifiers can catch many of
1850+
these, but not all. All the letters in the ASCII word C<scope> have
1851+
look-alikes in Cyrillic, though those do not form a real word. Using
1852+
those Cyrillic letters in that order would almost certainly be an
1853+
attempt at spoofing.
1854+
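The limitation just mentioned can be sketched as follows, again assuming Perl 5.32 or later. The helper names C<is_script_run> and C<has_cyrillic> are hypothetical. A script run check rejects an identifier that mixes Latin and Cyrillic letters, but accepts one spelled entirely with Cyrillic look-alikes, so a second test (here, simply looking for any Cyrillic letter) is needed to catch the latter.

```perl
use strict;
use warnings;

my $latin    = "scope";
my $mixed    = "sc\N{U+043E}pe";    # CYRILLIC SMALL LETTER O replaces Latin o
# All five letters replaced by Cyrillic look-alikes
my $cyrillic = "\N{U+0455}\N{U+0441}\N{U+043E}\N{U+0440}\N{U+0435}";

# Hypothetical helpers, for illustration only
sub is_script_run { return $_[0] =~ /\A(*script_run:\w+)\z/ ? 1 : 0 }
sub has_cyrillic  { return $_[0] =~ /\p{Script=Cyrillic}/   ? 1 : 0 }

for my $name ($latin, $mixed, $cyrillic) {
    printf "script run: %d, contains Cyrillic: %d\n",
        is_script_run($name), has_cyrillic($name);
}
```

Whether rejecting every Cyrillic identifier is appropriate depends on the code base; the second check is only one possible policy.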
1855+
=item Malformed text
1856+
X<REPLACEMENT CHARACTER>
1857+
1858+
Successful attacks have been made against websites and databases by
1859+
passing strings to them that aren't actually legal: the receiver
1860+
fails to realize this and performs an action it otherwise wouldn't,
1861+
based on what it thinks the input meant. Such strings are said to be
1862+
"malformed" or "ill-formed".
1863+
1864+
Vast sums of money have been lost to such attacks. It became important
1865+
to not fall for them, which involves detecting malformed text and taking
1866+
appropriate action (or inaction).
1867+
1868+
The Unicode REPLACEMENT CHARACTER (U+FFFD) is crucial to detecting and
1869+
handling these. It has no purpose other than to indicate it is a
1870+
substitute for something else. It is generally displayed as a white
1871+
question mark on a dark background that is shaped like a diamond (a
1872+
rectangle rotated 45 degrees).
1873+
1874+
When a malformed string is encountered, the code processing it should
1875+
substitute the REPLACEMENT CHARACTER for it. There are now strict rules
1876+
as to what parts get replaced. You should never try to infer what the
1877+
replaced part was meant to be. To do so could be falling into an
1878+
attacker's trap.
1879+
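As an illustration of substitution from pure Perl, here is a minimal sketch using the L<Encode> module. The input bytes are made up for the example. With its default C<CHECK> value, C<decode> replaces a malformed sequence with the REPLACEMENT CHARACTER; with C<FB_CROAK> it dies instead, which suits callers that prefer to reject the input outright.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The byte 0xFF can never appear in well-formed UTF-8.
my $bytes = "abc\xFFdef";

# Default CHECK: the malformed byte becomes U+FFFD.
my $text = decode('UTF-8', $bytes);
printf "fourth character is U+%04X\n", ord substr($text, 3, 1);

# FB_CROAK: decoding dies instead.  LEAVE_SRC keeps $bytes unmodified.
my $accepted = eval {
    decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "strict decode ", ($accepted ? "succeeded" : "was rejected"), "\n";
```

Neither branch tries to guess what the bad byte was meant to be.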
1880+
Many of the attack vectors were not originally envisioned by Unicode's
1881+
creators, nor by its implementers, such as Perl. Rules about what were
1882+
acceptable strings were originally laxer, and were tightened as the school of hard
1883+
knocks dictated.
1884+
1885+
Unfortunately, the Perl interpreter's methods of working with Unicode
1886+
strings were developed when we too were naive about the possibility of
1887+
attacks. Because of concerns about breaking existing code and continued
1888+
naivety about the consequences, there has been resistance to changing
1889+
it, and so our implementation has lagged behind Unicode's requirements
1890+
and recommendations. But, over the years, various improvements have
1891+
been added to minimize the issues.
1892+
1893+
Therefore, it is important to use the latest version of the Perl
1894+
interpreter when working with Unicode. Not until Perl v5.44 is it fully
1895+
hardened against known attack vectors. And who knows what new ideas
1896+
clever attackers may come up with in the future that we will have to
1897+
change to counter. (Although no new ones have become known recently
1898+
that the Unicode Standard wasn't prepared for.) And CPAN modules can
1899+
easily lag behind the interpreter itself.
1900+
1901+
Note that finding a REPLACEMENT CHARACTER in your string doesn't
1902+
necessarily mean there is an attack. It is a perfectly legal input
1903+
character, for whatever reason.
1904+
1905+
One of those reasons is when there is an encoding for which there are
1906+
Unicode equivalents for most, but not all, characters in it. You just
1907+
use the REPLACEMENT CHARACTER for the missing ones. As long as most of
1908+
the text is translatable, the results could be intelligible to a human
1909+
reader.
1910+
1911+
This, in fact, was one of the main reasons (besides malformation
1912+
substitution) for the creation of the REPLACEMENT CHARACTER in the very
1913+
first version of the Unicode Standard (often abbreviated TUS). Back
1914+
then, many characters were missing that have since been added (the first
1915+
release had roughly 7,000 characters; Unicode 16.0 has nearly 155,000).
1916+
1917+
And, transmission errors where bits get dropped or disk sector failures
1918+
can also cause malformations. The REPLACEMENT CHARACTER gets
1919+
substituted, and the results may be legible to a human as long as the
1920+
error rate is low enough.
1921+
1922+
Now that Unicode is far more complete, and the odds of finding a
1923+
character that Unicode doesn't know about are far lower, the primary use
1924+
of the REPLACEMENT CHARACTER is to substitute for malformations in
1925+
strings.
1926+
1927+
When you are programming in pure Perl, you end up relying on the
1928+
underlying interpreter and modules to handle these kinds of nuances.
1929+
You are responsible, however, for knowing the encoding(s) needed for
1930+
your program to interact with the outside world. For example, a common
1931+
encoding for files is UTF-8. You could use the following to read one:
1932+
1933+
    use PerlIO::encoding;
1934+
    my $path = "path-to-UTF-8-file";
1935+
    open my $fh, "<:encoding(UTF-8)", $path
1936+
        or die "Couldn't open $path: $!";
1937+
1938+
This, behind-the-scenes, uses the L<Encode> module to translate the
1939+
contents of C<$path> to something Perl can understand. L<C<Encode>>
1940+
knows how to handle a wide variety of encodings. Use this paradigm as
1941+
well to output to a file:
1942+
1943+
    use PerlIO::encoding;
1944+
    my $out_path = "path-to-UTF-8-file";
1945+
    open my $fh, ">:encoding(UTF-8)", $out_path
1946+
        or die "Couldn't open $out_path: $!";
1947+
1948+
(You can also use L<perlfunc/C<binmode>> to change the encoding of an
1949+
already-open file.)
1950+
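A minimal sketch of the C<binmode> approach, assuming a scratch file created with L<File::Temp> (the file and its contents are purely illustrative). The same file is read back twice, once as raw bytes and once decoded, to show what the layer changes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# tempfile() returns an already-open handle; add the encoding afterwards.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)') or die "binmode failed: $!";
print {$out} "caf\x{E9} \x{263A}";    # 6 characters
close $out or die "Couldn't close $path: $!";

# Read the file back as raw bytes ...
open my $in, '<', $path or die "Couldn't open $path: $!";
binmode($in, ':raw');
my $raw = do { local $/; <$in> };
close $in;

# ... and again, decoded, by changing the layer on the open handle.
open $in, '<', $path or die "Couldn't open $path: $!";
binmode($in, ':encoding(UTF-8)') or die "binmode failed: $!";
my $decoded = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters after decoding\n",
    length $raw, length $decoded;
```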
1951+
(There are fewer options for specifying the encoding of arguments passed
1952+
to your Perl program or to interact with environment variables. See
1953+
L<perlrun/PERL_UNICODE>.)
1954+
1955+
Skip to the end of this item unless you are writing in XS.
1956+
1957+
When writing in XS, and manipulating Unicode strings, you need to know
1958+
more about the internals. L<perlapi/Unicode Support> lists the
1959+
available API elements for working with it. UTF-8 is how Unicode
1960+
strings are currently stored internally by the interpreter. You B<need>
1961+
to be using functions in the C<utf8_to_uv> family when parsing a string
1962+
encoded in UTF-8. This avoids the pitfalls of earlier API functions,
1963+
whose names contained C<to_uvchr> instead of plain C<to_uv>.
1964+
1965+
=item Illegal code points
18421966

18431967
Unicode considers many code points to be illegal, or to be avoided.
1844-
Perl generally accepts them, once they have passed through any input
1968+
Perl generally accepts them anyway, once they have passed through any input
18451969
filters that may try to exclude them. These have been discussed above
18461970
(see "Surrogates" under UTF-16 in L</Unicode Encodings>,
18471971
L</Noncharacter code points>, and L</Beyond Unicode code points>).
18481972
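A pure-Perl sketch of how such code points might be screened after input. The function C<code_point_problem> is hypothetical, not part of any Perl API; it only illustrates that Perl will hold these code points in a string, so it is up to the program to reject them if that matters.

```perl
use strict;
use warnings;

# Hypothetical classifier: returns a label for a problematic code point,
# or the empty string if the code point is unremarkable.
sub code_point_problem {
    my ($cp) = @_;
    return 'above-Unicode' if $cp > 0x10FFFF;
    return 'surrogate'     if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'noncharacter'  if chr($cp) =~ /\p{Noncharacter_Code_Point}/;
    return '';
}

# Perl itself stores a surrogate in a string without complaint.
my $surrogate = chr(0xD800);
printf "length %d, ord U+%04X\n", length $surrogate, ord $surrogate;

for my $cp (0x41, 0xD800, 0xFFFE, 0x110000) {
    printf "U+%04X: %s\n", $cp, code_point_problem($cp) || 'ok';
}
```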

1849-
=item *
1973+
If you are writing in XS, the L<perlapi/utf8_to_uv> family of functions
1974+
has ones that can exclude common varieties of these. In particular,
1975+
C<strict_utf8_to_uv> excludes all but the most restrictive set defined
1976+
by TUS.
1977+
1978+
=back
1979+
1980+
=head2 Regular Expressions
18501981

18511982
Regular expression pattern matching may surprise you if you're not
18521983
accustomed to Unicode. Starting in Perl 5.14, several pattern
18531984
modifiers are available to control this, called the character set
18541985
modifiers. Details are given in L<perlre/Character set modifiers>.
18551986
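A short sketch of two of the character set modifiers, C</u> and C</a>. The sample characters are chosen for illustration: ARABIC-INDIC DIGIT THREE (U+0663) is a decimal digit under Unicode rules but not an ASCII one, and LATIN SMALL LETTER E WITH ACUTE (U+00E9) is a word character under Unicode rules but not under ASCII rules.

```perl
use strict;
use warnings;

my $arabic_three = "\N{U+0663}";    # ARABIC-INDIC DIGIT THREE
my $e_acute      = "\xE9";          # LATIN SMALL LETTER E WITH ACUTE

# /u applies Unicode rules; /a restricts \d, \w, etc. to ASCII.
my $digit_u = $arabic_three =~ /\A\d\z/u ? 1 : 0;
my $digit_a = $arabic_three =~ /\A\d\z/a ? 1 : 0;
my $word_u  = $e_acute      =~ /\A\w\z/u ? 1 : 0;
my $word_a  = $e_acute      =~ /\A\w\z/a ? 1 : 0;

print "\\d under /u: $digit_u, under /a: $digit_a\n";
print "\\w under /u: $word_u, under /a: $word_a\n";
```

Using C</a> is a simple way to keep C<\d> from matching digits of other scripts when only ASCII digits are expected.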

1856-
=back
1857-
1858-
As discussed elsewhere, Perl has one foot (two hooves?) planted in
1859-
each of two worlds: the old world of ASCII and single-byte locales, and
1860-
the new world of Unicode, upgrading when necessary.
1861-
If your legacy code does not explicitly use Unicode, no automatic
1862-
switch-over to Unicode should happen.
1863-
18641987
=head2 Unicode in Perl on EBCDIC
18651988

18661989
Unicode is supported on EBCDIC platforms. See L<perlebcdic>.
