Commit 5c680be

peff authored and gitster committed
utf8: accept alternate spellings of UTF-8
The iconv implementation on many platforms will accept variants of UTF-8, including "UTF8", "utf-8", and "utf8", but some do not. We make allowances in our code to treat them all identically, but we sometimes hand the string from the user directly to iconv. In this case, the platform iconv may or may not work.

There are really four levels of platform iconv support for these synonyms:

  1. All synonyms understood (e.g., glibc).

  2. Only the official "UTF-8" understood (e.g., Windows).

  3. Official "UTF-8" not understood, but some other synonym understood (it's not known whether such a platform exists).

  4. Neither "UTF-8" nor any synonym understood (e.g., ancient systems, or ones without utf8 support installed).

This patch teaches git to fall back to using the official "UTF-8" spelling when iconv_open fails (and the encoding was one of the synonym spellings). This makes things more convenient to users of type 2 systems, as they can now use any of the synonyms for the log output encoding.

Type 1 systems are not affected, as iconv already works on the first try.

Type 4 systems are not affected, as both attempts already fail.

Type 3 systems will not benefit from the feature, but because we only use "UTF-8" as a fallback, they will not be regressed (i.e., you can continue to use "utf8" if your platform supports it). We could try all the various synonyms, but since such systems are not even known to exist, it's not worth the effort.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 7e20105 commit 5c680be

1 file changed: +18 -2 lines changed

utf8.c

Lines changed: 18 additions & 2 deletions
@@ -480,9 +480,25 @@ char *reencode_string(const char *in, const char *out_encoding, const char *in_e
 
 	if (!in_encoding)
 		return NULL;
+
 	conv = iconv_open(out_encoding, in_encoding);
-	if (conv == (iconv_t) -1)
-		return NULL;
+	if (conv == (iconv_t) -1) {
+		/*
+		 * Some platforms do not have the variously spelled variants of
+		 * UTF-8, so let's fall back to trying the most official
+		 * spelling. We do so only as a fallback in case the platform
+		 * does understand the user's spelling, but not our official
+		 * one.
+		 */
+		if (is_encoding_utf8(in_encoding))
+			in_encoding = "UTF-8";
+		if (is_encoding_utf8(out_encoding))
+			out_encoding = "UTF-8";
+		conv = iconv_open(out_encoding, in_encoding);
+		if (conv == (iconv_t) -1)
+			return NULL;
+	}
+
 	out = reencode_string_iconv(in, strlen(in), conv);
 	iconv_close(conv);
 	return out;
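
For readers who want to try the fallback outside of git, the following is a minimal standalone sketch of the same retry pattern. It is not the code from utf8.c: is_utf8_synonym() is a hypothetical stand-in for git's is_encoding_utf8() and only matches the four spellings named in the commit message, and open_conv_with_fallback() is an illustrative name, not a git function. It assumes a POSIX system that provides iconv_open() and strcasecmp().

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <strings.h>

/*
 * Hypothetical stand-in for git's is_encoding_utf8(): returns nonzero
 * for "UTF-8", "utf-8", "UTF8", and "utf8" (case-insensitively).
 */
static int is_utf8_synonym(const char *name)
{
	assert(name != NULL);
	return !strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8");
}

/*
 * Try the caller's spellings first. Only if that fails, rewrite any
 * UTF-8 synonym to the official "UTF-8" spelling and try once more.
 * Returns (iconv_t) -1 when both attempts fail.
 */
static iconv_t open_conv_with_fallback(const char *to, const char *from)
{
	iconv_t conv = iconv_open(to, from);

	if (conv != (iconv_t) -1)
		return conv;	/* first try worked (a "type 1" platform) */

	if (is_utf8_synonym(from))
		from = "UTF-8";
	if (is_utf8_synonym(to))
		to = "UTF-8";

	return iconv_open(to, from);
}
```

Because the user's spelling is always tried first, a platform that accepts only some non-official synonym keeps working with that synonym. The retry only changes the outcome on platforms that reject the synonym but accept "UTF-8". A caller that gets a valid descriptor back is still responsible for releasing it with iconv_close().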
