Commit 3a95532
committed
ext/iconv/iconv.c: make iconv with //IGNORE conform to the docs
Some iconv implementations allow you to append "//IGNORE" to the
target encoding as an indication that iconv() should skip input
sequences that cannot be represented in that encoding. Both GNU iconv
implementations support this, and apparently so does Solaris. The
behavior has even been codified in the recent POSIX 2024 standard, but
not all implementations support it yet; musl, for example, does not.
There are actually two types of "bad sequences" that iconv can
encounter. The iconv() function translates sequences from an input
encoding to an output encoding. All versions of POSIX, as well as the
documentation for the various implementations, are clear that //IGNORE
should only ignore valid input sequences that cannot be represented in
the target encoding. If the input sequences themselves are invalid,
that is an error (EILSEQ), even when //IGNORE is used.
The PHP documentation for iconv is aligned with this. A priori, iconv
"returns the converted string, or false on failure." And,
If the string //IGNORE is appended, characters that cannot be
represented in the target charset are silently discarded. Otherwise,
E_NOTICE is generated and the function will return false.
But again, this should only have an effect on valid input
sequences. Which brings us to the problem. The current behavior of,
$text = "aa\xC3\xC3\xC3\xB8aa";
var_dump(urlencode(iconv("UTF-8", "UTF-8//IGNORE", $text)));
with glibc iconv is to print,
string(10) "aa%C3%B8aa"
even though the input $text is invalid UTF-8. The reason for this goes
back to PHP bug 48147 which was intended to address a bug in glibc's
iconv implementation with //IGNORE. Prior to PHP bug 52211, PHP's
iconv would return part of the string if iconv failed. And the glibc
bug was making it return the wrong part. So, as part of bug 48147, PHP
added an ICONV_BROKEN_IGNORE test to config.m4, and added an internal
workaround for ignoring EILSEQ errors mid-translation.
Unfortunately there are some problems with this test and the workaround:
* The test supplies an invalid input sequence to iconv() with
//IGNORE, and looks for an error. As discussed above, this should
always be an error. Any implementation conforming to any version
of POSIX should trigger ICONV_BROKEN_IGNORE. (Tested: glibc and
musl.)
* The internal workaround for ICONV_BROKEN_IGNORE ignores EILSEQ
errors. Again, this is not right, because you will only get EILSEQ
from invalid input sequences (in POSIX conforming implementations)
or when //IGNORE is NOT used (GNU implementations).
* The workaround leaves "//IGNORE" in the target encoding and so
does nothing to aid implementations like musl where //IGNORE
is "broken" because it never worked to begin with.
In short, we're always getting the workaround behavior, and the
workaround behavior runs contrary to both POSIX and the PHP
documentation. Invalid input sequences should never be ignored.
This commit removes the workaround, but will be a breaking change for
anyone using //IGNORE on invalid inputs.1 parent 27a1d69 commit 3a95532
1 file changed
+0
-29
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
415 | 415 | | |
416 | 416 | | |
417 | 417 | | |
418 | | - | |
419 | | - | |
420 | | - | |
421 | | - | |
422 | | - | |
423 | | - | |
424 | | - | |
425 | | - | |
426 | | - | |
427 | | - | |
428 | | - | |
429 | | - | |
430 | | - | |
431 | | - | |
432 | | - | |
433 | | - | |
434 | | - | |
435 | | - | |
436 | 418 | | |
437 | 419 | | |
438 | 420 | | |
| |||
442 | 424 | | |
443 | 425 | | |
444 | 426 | | |
445 | | - | |
446 | 427 | | |
447 | 428 | | |
448 | 429 | | |
| |||
466 | 447 | | |
467 | 448 | | |
468 | 449 | | |
469 | | - | |
470 | | - | |
471 | | - | |
472 | | - | |
473 | | - | |
474 | | - | |
475 | | - | |
476 | | - | |
477 | | - | |
478 | | - | |
479 | 450 | | |
480 | 451 | | |
481 | 452 | | |
| |||
0 commit comments