Commit 3a95532

ext/iconv/iconv.c: make iconv with //IGNORE conform to the docs

Some iconv implementations allow you to append "//IGNORE" to the target
encoding as an indication that iconv() should skip input sequences that
cannot be represented in that encoding. Both GNU iconv implementations
support this, and apparently so does Solaris. The behavior has even
been codified in the recent POSIX 2024 standard, but not all
implementations support it yet; musl, for example, does not.

There are actually two types of "bad sequences" that iconv can
encounter. The iconv() function translates sequences from an input
encoding to an output encoding. All versions of POSIX, as well as the
documentation for the various implementations, are clear that //IGNORE
should only ignore valid input sequences that cannot be represented in
the target encoding. If the input sequences themselves are invalid,
that is an error (EILSEQ), even when //IGNORE is used.

The PHP documentation for iconv is aligned with this. A priori, iconv
"returns the converted string, or false on failure." And,

  If the string //IGNORE is appended, characters that cannot be
  represented in the target charset are silently discarded. Otherwise,
  E_NOTICE is generated and the function will return false.

But again, this should only have an effect on valid input sequences.

Which brings us to the problem. The current behavior of,

  $text = "aa\xC3\xC3\xC3\xB8aa";
  var_dump(urlencode(iconv("UTF-8", "UTF-8//IGNORE", $text)));

with glibc iconv is to print,

  string(10) "aa%C3%B8aa"

even though the input $text is invalid UTF-8. The reason for this goes
back to PHP bug 48147 which was intended to address a bug in glibc's
iconv implementation with //IGNORE. Prior to PHP bug 52211, PHP's iconv
would return part of the string if iconv failed. And the glibc bug was
making it return the wrong part. So, as part of bug 48147, PHP added an
ICONV_BROKEN_IGNORE test to config.m4, and added an internal workaround
for ignoring EILSEQ errors mid-translation.

Unfortunately there are some problems with this test and the
workaround:

  * The test supplies an invalid input sequence to iconv() with
    //IGNORE, and looks for an error. As discussed above, this should
    always be an error. Any implementation conforming to any version
    of POSIX should trigger ICONV_BROKEN_IGNORE. (Tested: glibc and
    musl.)

  * The internal workaround for ICONV_BROKEN_IGNORE ignores EILSEQ
    errors. Again, this is not right, because you will only get EILSEQ
    from invalid input sequences (in POSIX conforming implementations)
    or when //IGNORE is NOT used (GNU implementations).

  * The workaround leaves "//IGNORE" in the target encoding and so
    does nothing to aid implementations like musl where //IGNORE is
    "broken" because it never worked to begin with.

In short, we're always getting the workaround behavior, and the
workaround behavior runs contrary to both POSIX and the PHP
documentation. Invalid input sequences should never be ignored.

This commit removes the workaround, but will be a breaking change for
anyone using //IGNORE on invalid inputs.

1 parent 27a1d69 commit 3a95532
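
The difference between the two behaviors discussed in the commit message can be sketched in plain C. This is a minimal illustration, not PHP's code: `convert_sketch` and its `skip_invalid` flag are hypothetical names, the target charset is passed without any `//IGNORE` suffix so the result does not depend on how a given libc treats that suffix, and only stateless encodings such as UTF-8 are assumed (no final flush call is made).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper: convert in[0..in_len) from charset `from` to
 * charset `to`, writing at most out_cap bytes into out.
 *
 * skip_invalid == 0: stop on the first error, so invalid input yields
 *                    EILSEQ, as POSIX describes.
 * skip_invalid != 0: imitate the removed workaround by dropping one
 *                    input byte after each EILSEQ and carrying on.
 *
 * Returns the number of bytes written, or -1 with errno set.
 */
static long convert_sketch(const char *to, const char *from,
                           const char *in, size_t in_len,
                           char *out, size_t out_cap, int skip_invalid)
{
    assert(to && from && in && out);

    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) {
        return -1;
    }

    char *in_p = (char *)in;   /* iconv() wants char **; input bytes are not modified */
    char *out_p = out;
    size_t in_left = in_len;
    size_t out_left = out_cap;
    long written = -1;

    for (;;) {
        size_t r = iconv(cd, &in_p, &in_left, &out_p, &out_left);
        if (r != (size_t)-1) {
            written = (long)(out_cap - out_left);
            break;
        }
        if (skip_invalid && errno == EILSEQ) {
            if (in_left <= 1) {
                /* Nothing left after the bad byte: treat as finished. */
                written = (long)(out_cap - out_left);
                break;
            }
            in_p++;        /* drop one byte of the invalid sequence */
            in_left--;
            continue;
        }
        break;             /* strict mode, or an error other than EILSEQ */
    }

    int saved_errno = errno;   /* iconv_close() may overwrite errno */
    iconv_close(cd);
    errno = saved_errno;
    return written;
}
```

Called on the commit message's input (`"aa\xC3\xC3\xC3\xB8" "aa"` in C, split so the hex escape does not swallow the following `aa`), strict mode is expected to fail with EILSEQ, while skip mode drops the two stray `\xC3` bytes and keeps `aa`, `\xC3\xB8`, `aa`, which corresponds to the `aa%C3%B8aa` output quoted above.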

1 file changed: +0 -29 lines changed

ext/iconv/iconv.c

Lines changed: 0 additions & 29 deletions
@@ -415,24 +415,6 @@ static php_iconv_err_t _php_iconv_appendc(smart_str *d, const char c, iconv_t cd
 }
 /* }}} */
 
-/* {{{ */
-#ifdef ICONV_BROKEN_IGNORE
-static int _php_check_ignore(const char *charset)
-{
-	size_t clen = strlen(charset);
-	if (clen >= 9 && strcmp("//IGNORE", charset+clen-8) == 0) {
-		return 1;
-	}
-	if (clen >= 19 && strcmp("//IGNORE//TRANSLIT", charset+clen-18) == 0) {
-		return 1;
-	}
-	return 0;
-}
-#else
-#define _php_check_ignore(x) (0)
-#endif
-/* }}} */
-
 /* {{{ php_iconv_string() */
 PHP_ICONV_API php_iconv_err_t php_iconv_string(const char *in_p, size_t in_len, zend_string **out, const char *out_charset, const char *in_charset)
 {
@@ -442,7 +424,6 @@ PHP_ICONV_API php_iconv_err_t php_iconv_string(const char *in_p, size_t in_len,
 	size_t bsz, result = 0;
 	php_iconv_err_t retval = PHP_ICONV_ERR_SUCCESS;
 	zend_string *out_buf;
-	int ignore_ilseq = _php_check_ignore(out_charset);
 
 	*out = NULL;
 
@@ -466,16 +447,6 @@ PHP_ICONV_API php_iconv_err_t php_iconv_string(const char *in_p, size_t in_len,
 		result = iconv(cd, (ICONV_CONST char **) &in_p, &in_left, (char **) &out_p, &out_left);
 		out_size = bsz - out_left;
 		if (result == (size_t)(-1)) {
-			if (ignore_ilseq && errno == EILSEQ) {
-				if (in_left <= 1) {
-					result = 0;
-				} else {
-					errno = 0;
-					in_p++;
-					in_left--;
-					continue;
-				}
-			}
 
 			if (errno == E2BIG && in_left > 0) {
 				/* converted string is longer than out buffer */
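
For reference, the suffix test performed by the removed `_php_check_ignore()` can be tried in isolation. The following is a standalone sketch that mirrors the deleted lines; the name `has_ignore_suffix` is hypothetical and is not part of PHP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical standalone copy of the removed check: returns 1 when the
 * charset string ends in "//IGNORE" or "//IGNORE//TRANSLIT" and has at
 * least one character before that suffix, 0 otherwise.
 */
static int has_ignore_suffix(const char *charset)
{
    assert(charset != NULL);

    size_t clen = strlen(charset);

    /* "//IGNORE" is 8 bytes; requiring clen >= 9 excludes a bare suffix. */
    if (clen >= 9 && strcmp("//IGNORE", charset + clen - 8) == 0) {
        return 1;
    }
    /* "//IGNORE//TRANSLIT" is 18 bytes; the same reasoning gives 19. */
    if (clen >= 19 && strcmp("//IGNORE//TRANSLIT", charset + clen - 18) == 0) {
        return 1;
    }
    return 0;
}
```

Because the comparison uses `strcmp`, it is case-sensitive and only looks at the tail of the string: `"UTF-8//IGNORE"` and `"UTF-8//TRANSLIT//IGNORE"` match, while `"UTF-8"` and a bare `"//IGNORE"` do not. Nothing here strips the suffix, which is consistent with the commit message's point that "//IGNORE" was left in the target encoding passed on to iconv.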
