perlunicode: Add discussion about malformations #23553
base: blead
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1817,50 +1817,157 @@ through C<0x10FFFF>.) | |
|
||
=head2 Security Implications of Unicode | ||
|
||
First, read | ||
L<Unicode Security Considerations|https://www.unicode.org/reports/tr36>. | ||
The security implications of Unicode are quite complicated, as you might | ||
expect from it trying to handle all the world's scripts. An | ||
introduction is in | ||
L<Unicode Security Methods|https://www.unicode.org/reports/tr39>. | ||
|
||
Also, note the following: | ||
Here are a few examples of pitfalls: | ||
|
||
=over 4 | ||
|
||
=item * | ||
=item Confusables | ||
|
||
Many characters in Unicode look similar enough to other characters that | ||
they could be easily confused with each other. This is true even within | ||
the same script, for example English, where the digit C<0> and a capital | ||
letter C<O> may look like each other. But people who use that script | ||
know to look out for that. | ||
|
||
In Unicode, a digit in one script may be confusable with a digit having | ||
a different value in another. A malicious website could use this to | ||
make it appear that the price of something is less than it actually is. | ||
(You can use L<perlre/Script Runs> to make sure such digits are not | ||
being inter-mixed.) | ||
|
||
This is a general problem with internet addresses. The people who give out | ||
domain names need to be careful to not give out ones that spoof other ones | ||
(examples in L<perlre/Script Runs>). | ||
|
||
And computer program identifier names can be such that they look like | ||
something they're not, and hence could fool a code reviewer, for | ||
example. Script runs on the individual identifiers can catch many of | ||
these, but not all. All the letters in the ASCII word C<scope> have | ||
look-alikes in Cyrillic, though those do not form a real word. Using | ||
those Cyrillic letters in that order would almost certainly be an | ||
attempt at spoofing. | ||
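As a rough illustration of the script-run check described above, here is a minimal sketch. The helper name C<is_single_script_run> and the sample strings are invented for this example. It assumes Perl 5.32 or later, where C<(*script_run:...)> is no longer experimental (the construct first appeared in 5.28).

```perl
use v5.32;      # also turns on strict
use warnings;

# Hypothetical helper: true if $str is made up only of word
# characters that together form a single script run.
sub is_single_script_run {
    my ($str) = @_;
    return $str =~ /\A(*script_run:\w+)\z/ ? 1 : 0;
}

my @samples = (
    # label            string
    [ 'ascii digits',  "100" ],
    # ASCII 1, DEVANAGARI DIGIT ONE (U+0967), ASCII 0
    [ 'mixed digits',  "1\x{0967}0" ],
    # Latin "sc", CYRILLIC SMALL LETTER O (U+043E), Latin "pe"
    [ 'mixed word',    "sc\x{043E}pe" ],
    # Cyrillic look-alikes for every letter of "scope"
    [ 'all cyrillic',  "\x{0455}\x{0441}\x{043E}\x{0440}\x{0435}" ],
);

for my $sample (@samples) {
    my ($label, $str) = @$sample;
    printf "%-13s %s\n", $label,
        is_single_script_run($str) ? "single script run" : "not a script run";
}
```

The last sample shows the limitation noted in the text: a string written entirely in Cyrillic look-alikes is a legitimate script run, so this check alone does not flag it.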
|
||
=item Malformed text | ||
X<REPLACEMENT CHARACTER> | ||
|
||
Successful attacks have been made against websites and databases by | ||
passing strings to them that aren't actually legal. The receiver | ||
fails to realize this, and performs an action it otherwise wouldn't, | ||
based on what it thinks the input meant. Such strings are said to be | ||
"malformed" or "ill-formed". | ||
|
||
Vast sums of money have been lost to such attacks. It became important | ||
to not fall for them, which involves detecting malformed text and taking | ||
appropriate action (or inaction). | ||
|
||
The Unicode REPLACEMENT CHARACTER (U+FFFD) is crucial to detecting and | ||
handling these. It has no purpose other than to indicate it is a | ||
substitute for something else. It is generally displayed as a white | ||
question mark on a dark background that is shaped like a diamond (a | ||
rectangle rotated 45 degrees). | ||
|
||
When a malformed string is encountered, the code processing it should | ||
substitute the REPLACEMENT CHARACTER for it. There are now strict rules | ||
as to what parts get replaced. You should never try to infer what the | ||
replaced part was meant to be. To do so could be falling into an | ||
attacker's trap. | ||
|
||
Many of the attack vectors were not originally envisioned by Unicode's | ||
creators, nor by its implementers, such as Perl. Rules about what were | ||
acceptable strings were originally laxer, tightened as the school of hard | ||
knocks dictated. | ||
|
||
Unfortunately, the Perl interpreter's methods of working with Unicode | ||
strings were developed when we too were naive about the possibility of | ||
attacks. Because of concerns about breaking existing code and continued | ||
naivety about the consequences, there has been resistance to changing | ||
it, and so our implementation has lagged behind Unicode's requirements | ||
and recommendations. But, over the years, various improvements have | ||
been added to minimize the issues. | ||
|
||
Therefore, it is important to use the latest version of the Perl | ||
interpreter when working with Unicode. Not until Perl v5.44 is it fully | ||
hardened against known attack vectors. And who knows what new ideas | ||
clever attackers may come up with in the future, that we will have to
Comment on lines +1894 to +1896
Will any of these be added to ppport.h?
I'm working on that now |
||
change to counter. (Although no new ones have become known recently | ||
that the Unicode Standard wasn't prepared for.) And CPAN modules can | ||
easily lag behind the interpreter itself. | ||
|
||
Everything above is perfect, dont change it. |
||
If you are coding in pure Perl, you are pretty much at the mercy of how | ||
the underlying tools handle Unicode. UTF-8 is how Unicode strings are | ||
stored internally by the interpreter. And on most platforms, most | ||
inputs, such as files, will also be in UTF-8. You should be reading | ||
those files via | ||
|
||
binmode $fh, ":encoding(UTF-8)"; | ||
|
||
Use of any other method can lead to attacks. On some platforms, files | ||
encoded in UTF-16 prevail. Use | ||
|
||
binmode $fh, ":encoding(UTF-16)"; | ||
|
||
on these. See L<Encode> for more information. | ||
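For bytes that are already in memory rather than coming through a filehandle layer, the same idea can be sketched with C<Encode::decode>. This is only an illustration: the helper names C<decode_or_reject> and C<decode_with_replacement> and the sample byte strings are invented here. It shows the two usual choices for malformed input, which are to refuse it outright or to substitute the REPLACEMENT CHARACTER.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strict decode. Returns the decoded text, or
# undef if the bytes are not well-formed UTF-8. LEAVE_SRC keeps
# Encode from modifying the caller's byte string.
sub decode_or_reject {
    my ($bytes) = @_;
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

# Hypothetical helper: lenient decode. With no CHECK argument,
# Encode substitutes U+FFFD for each malformed part.
sub decode_with_replacement {
    my ($bytes) = @_;
    return decode('UTF-8', $bytes);
}

my $well_formed = "caf\xC3\xA9";   # UTF-8 bytes for "cafe" with e-acute
my $overlong    = "a\xC0\xAFb";    # overlong (illegal) encoding of "/"

for my $bytes ($well_formed, $overlong) {
    my $strict  = decode_or_reject($bytes);
    my $lenient = decode_with_replacement($bytes);
    printf "strict: %s, lenient contains U+FFFD: %s\n",
        defined $strict ? "accepted" : "rejected",
        index($lenient, "\x{FFFD}") >= 0 ? "yes" : "no";
}
```

Note that the lenient result never contains a C</>: the overlong bytes are replaced, not interpreted, which is the behavior the text above asks for.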
I highly disagree with this. UTF8 flag on is very rarely seen inside the interp. I've seen 4 SW projects/code bases in my whole life that had all variables in a non latin script. There is absolutely nothing wrong with doing that. Although someones career outlook is likely to be a box on the sidewalk if the leave their country, having full memorization of python/C/Perl langs, in their native non-latin human languages. In addition, severe performance penalties come with turning on the UTF8 flag. The SVt_PV-SVt_PVLV structs are incapable of storing a For GUI strings, yes, you 100% need UTF8 turned on to never slice open an emoji or skin color modifier or country flag emoji. But UTF8 flag on passwords, network addresses, things that will become unprintable fixed length binary byte arrays, like C types short int and long and size_t, Western Europe has a non-violent war going on for the last couple decades for their dots apostraphes and tails characters on 26 letter USA english. 2 code points or 1 code point? Denormalization, overlong. Not to mention yearly Unicode Comission updates that can rewrite fields in your SQL DB/name upper casing/lower casing/alpha sorted CSV files better than any malware software from . Unless its a GUI, or meaning less to transistors varlen human written freeform text. UTF8 logic needs to stay out of that library. Its a byte array, Permutations 0-255, 0-65K, 0-4 billion. that string has no other meaning to SW.
I'm trying to understand your point here. The text does not say anything about the UTF-8 flag. It is just giving the only way that a file encoded as UTF-8 currently can be read and be checked by the system for well-formedness. Fast Boyer Moore is continued to be used for UTF-8. I don't know that the UTF-8 flag is automatically set by Encode if not necessary. I believe it is set only if necessary.
I'm not sure it always sets the flag, but it does set it when unnecessary:
More important to user-facing documentation, Perl (and by extension Encode) makes no guarantees whether the flag will be set when it is not necessary. |
||
|
||
If you are coding in XS, you B<need> to be using functions in the | ||
C<utf8_to_uv> family when parsing a string encoded in UTF-8. This | ||
avoids the pitfalls of earlier APIs. See L<perlapi/utf8_to_uv>. | ||
One of those pitfalls is that instead of a REPLACEMENT CHARACTER, a NUL | ||
can be returned when a malformation is encountered. This could | ||
conceivably lead to trojan strings where the second, trojan, part is | ||
hidden from code that is expecting a NUL-terminated string. | ||
|
||
Maybe this "improvement" is self-destructive and a bad idea to put in this wiki article, but someone could link to the 5-10 CVEs and RT tickets in the last 15 years about data smuggling with a null char in the middle of a byte array aka string. Yes, 90% of current downloaded/indexed CPAN code is using encoding unaware getter setter Perl API methods. 20%-45% of tarballs, a UTF8/L1, along with custom written/malicious/insane/non-production/illogical PP script, will cause some kind of trivial waste of time to fix bug or severe business commercial corporate $ making losses bug. SEGVs, timeouts, 504 gateway timeout, CPU infinite loops. Remote code execution/priv escalation is probably paranoia. but taking down a bunch of rack servers until someone attaches perl5db.pl, yeah, thats $ loss.
I'm trying to impress upon the reader that not upgrading to 5.44 when it comes out has a risk associated with it. Now that I'm looking again at what got in to 5.42, that's probably good enough. So I'm open to what the consensus here becomes |
||
Note that finding a REPLACEMENT CHARACTER in your string doesn't | ||
necessarily mean there is an attack. It is a perfectly legal input | ||
character, for whatever reason. | ||
|
||
I disagree, WinPerl's generic C coding policy is and MS's help docs say finding
What I meant is that TUS does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal |
||
One of those reasons is when there is an encoding for which there are | ||
Unicode equivalents for most, but not all characters in it. You just | ||
use the REPLACEMENT CHARACTER for the missing ones. As long as most of | ||
the text is translatable, the results could be intelligible to a human | ||
reader. | ||
|
||
ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ
I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is if you have valid text in some encoding that you are translating to Unicode, but when you encounter a character that doesn't have a Unicode TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)
My opinion is 0xFFEE and low 7b "?" is not readable with human eyes. Its not an overcompressed jpeg image, or what I did with KOI8 bitwise math logic up above. Its a bad disk sector, you will never know what was behind the square, and no amount of $ to a data recovery firm will get that character back, or give you a good enough guess what it used to be. If its AI generated auto captions, "???" "..." or nowadays AI algos just print |
||
Malformed UTF-8 | ||
This, in fact, was one of the main reasons (besides malformation | ||
substitution) for the creation of the REPLACEMENT CHARACTER in the very | ||
first version of the Unicode Standard (often abbreviated TUS). Back | ||
then, many characters were missing that have since been added (the first | ||
release had 40,000 characters; Unicode 16.0 has nearly 300K). | ||
|
||
UTF-8 is very structured, so many combinations of bytes are invalid. In | ||
the past, Perl tried to soldier on and make some sense of invalid | ||
combinations, but this can lead to security holes, so now, if the Perl | ||
core needs to process an invalid combination, it will either raise a | ||
fatal error, or will replace those bytes by the sequence that forms the | ||
Unicode REPLACEMENT CHARACTER, for which purpose Unicode created it. | ||
And, transmission errors where bits get dropped can cause malformations. | |
The REPLACEMENT CHARACTER gets substituted, and the results may be | ||
legible as long as the error rate is low enough. | ||
|
||
Every code point can be represented by more than one possible | ||
syntactically valid UTF-8 sequence. Early on, both Unicode and Perl | ||
considered any of these to be valid, but now, all sequences longer | ||
than the shortest possible one are considered to be malformed. | ||
Now that Unicode is far more complete, and the odds of finding a | ||
character that Unicode doesn't know about are far lower, the primary use | ||
of the REPLACEMENT CHARACTER is to substitute for malformations in | ||
strings. | ||
|
||
khwilliamson marked this conversation as resolved.
|
||
=item Illegal code points | ||
|
||
Unicode considers many code points to be illegal, or to be avoided. | ||
Perl generally accepts them, once they have passed through any input | ||
Perl generally accepts them anyway, once they have passed through any input | ||
filters that may try to exclude them. These have been discussed above | ||
(see "Surrogates" under UTF-16 in L</Unicode Encodings>, | ||
L</Noncharacter code points>, and L</Beyond Unicode code points>). | ||
|
||
=item * | ||
If you are writing in XS, the L<perlapi/utf8_to_uv> family of functions | ||
has ones that can exclude common varieties of these. In particular, | ||
C<strict_utf8_to_uv> excludes all but the most restrictive set defined | ||
by TUS. | ||
|
||
=back | ||
|
||
=head2 Regular Expressions | ||
|
||
Regular expression pattern matching may surprise you if you're not | ||
accustomed to Unicode. Starting in Perl 5.14, several pattern | ||
modifiers are available to control this, called the character set | ||
modifiers. Details are given in L<perlre/Character set modifiers>. | ||
|
||
=back | ||
|
||
As discussed elsewhere, Perl has one foot (two hooves?) planted in | ||
each of two worlds: the old world of ASCII and single-byte locales, and | ||
the new world of Unicode, upgrading when necessary. | ||
If your legacy code does not explicitly use Unicode, no automatic | ||
switch-over to Unicode should happen. | ||
|
||
This paragraph has got to go,its humor in 2025 and a chuckle, but def not for a college student to read. Latin-1 is a synonym for raw binary/hex dumps. L1 isn't a transport protocol for data exchange anymore. Latin 1 doesn't exist over a copper wire longer than a USB cable. Most IDEs/OSes should probably permanently switch Latin-1 fonts drawing "known" Latin-1 text, with tiny hexadecimal emoji font characters. I've used those tiny hexadecimal font files in the recent past. IDN KOI8R were very good solutions at the Unicode homonym glyph security attacks. When in double, start drawing \xff ascii escape codes to a non-IT medical worker. Atleast then you can record the patients obvious wrong name, but ZERO DATA loss name with a ball point pen. Thats the USA Social Security Admin's official policy BTW, and Chinese has deterministic 2 way reversible latin-izing protocol for the last couple decades, and WWW IDN spec's rules. Thats also WinPerl's official legal policy located at You might wanna expand this article with basic data scrubbing sanitization algorithms, like locks all JSON fields to exactly 1 govt recognized script, no mix and match between chinese and english in a JSON string. Thats a solution I've heard circulate in the perl community a couple times. One Perl company had to pay language consultants to sanitize the official Unicode commission database, since those character property are best effort, and not actually "secure" with real world written text in a newspaper, or a sign on the wall in that country. Letters removed by a Dept of Ed/Ministry of Education in the 1930s-1960s, should never enter a SQL DB field in the 2020s. Something is wrong when the last birth certificate or govt ID ever issued with that letter was 112 years ago and that 112 year old person is registering a new account with you.
That's a problem for the application developer, not the interpreter. Not the Unicode standard. Not the JSON library. If I want to write about this 112 year old person in comments, or in my code, or on a web page, or even with pen and paper it kinda feels like my programming language shouldn't give me a This bunny is existing is not a crime:
|
||
=head2 Unicode in Perl on EBCDIC | ||
|
||
Unicode is supported on EBCDIC platforms. See L<perlebcdic>. | ||
|
the rest of a sentence after a comma being parenthetical seems strange to me.
Agreed; I'll fix before pushing