Commit c6e4865

larsxschneider authored and gitster committed
utf8: add function to detect a missing UTF-16/32 BOM
If the endianness is not defined in the encoding name, then let's be strict and require a BOM to avoid any encoding confusion. The is_missing_required_utf_bom() function returns true if a required BOM is missing.

The Unicode standard instructs to assume big-endian if there is no BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used in HTML5 recommends assuming little-endian to "deal with deployed content" [3]. Strictly requiring a BOM seems to be the safest option for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
    Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
1 parent 10ecb82 commit c6e4865

2 files changed: +32 -0 lines changed

utf8.c

Lines changed: 13 additions & 0 deletions
@@ -586,6 +586,19 @@ int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
 	);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+	return (
+	   (same_utf_encoding(enc, "UTF-16")) &&
+	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+	) || (
+	   (same_utf_encoding(enc, "UTF-32")) &&
+	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+	);
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
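
The hunk above relies on helpers and BOM tables defined elsewhere in utf8.c that this diff does not show. As a rough, self-contained sketch of the same check, the block below supplies stand-ins for them. The BOM byte values are the standard ones, but `has_bom_prefix()` and `same_utf_encoding()` here are simplified assumptions (the stand-in `same_utf_encoding()` only does an exact string match, and Git's real helper may accept more spellings), and `check_bom_examples()` is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Standard byte order marks for UTF-16 and UTF-32. */
static const char utf16_be_bom[] = { '\xFE', '\xFF' };
static const char utf16_le_bom[] = { '\xFF', '\xFE' };
static const char utf32_be_bom[] = { '\0', '\0', '\xFE', '\xFF' };
static const char utf32_le_bom[] = { '\xFF', '\xFE', '\0', '\0' };

/* Assumed stand-in: true if `data` starts with the given BOM bytes. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && bom && len >= bom_len && !memcmp(data, bom, bom_len);
}

/* Assumed stand-in: exact match only; the real helper may be more lenient. */
static int same_utf_encoding(const char *a, const char *b)
{
	return a && b && !strcmp(a, b);
}

/* Same logic as the function added in the diff above. */
int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
{
	return (
	   (same_utf_encoding(enc, "UTF-16")) &&
	   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
	     has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
	) || (
	   (same_utf_encoding(enc, "UTF-32")) &&
	   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
	     has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
	);
}

/* Hypothetical self-check showing the three interesting cases. */
void check_bom_examples(void)
{
	const char with_bom[] = { '\xFE', '\xFF', '\0', 'A' };
	const char without_bom[] = { '\0', 'A' };

	/* "UTF-16" with a BOM: nothing is missing. */
	assert(!is_missing_required_utf_bom("UTF-16", with_bom, sizeof(with_bom)));
	/* "UTF-16" without a BOM: the required BOM is missing. */
	assert(is_missing_required_utf_bom("UTF-16", without_bom, sizeof(without_bom)));
	/* "UTF-16BE" names its endianness, so no BOM is required. */
	assert(!is_missing_required_utf_bom("UTF-16BE", without_bom, sizeof(without_bom)));
}
```

Only the bare names "UTF-16" and "UTF-32" trigger the requirement; any encoding name that states its endianness, or any other encoding, falls through and returns false.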

utf8.h

Lines changed: 19 additions & 0 deletions
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ *     Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
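
To illustrate why the header comment treats a missing BOM as ambiguous, the sketch below decodes the same two bytes as one UTF-16 code unit under each byte order. The function names `utf16_unit_be()`, `utf16_unit_le()` and `check_endianness_ambiguity()` are hypothetical and not part of Git; this is only a minimal demonstration of the two defaults the comment contrasts.

```c
#include <assert.h>
#include <stdint.h>

/* Read one UTF-16 code unit assuming big-endian byte order. */
uint16_t utf16_unit_be(const unsigned char *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read one UTF-16 code unit assuming little-endian byte order. */
uint16_t utf16_unit_le(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/* Hypothetical self-check: identical bytes, two different code units. */
void check_endianness_ambiguity(void)
{
	const unsigned char bytes[] = { 0x00, 0x41 };

	/* Big-endian reading: U+0041, the letter 'A'. */
	assert(utf16_unit_be(bytes) == 0x0041);
	/* Little-endian reading: U+4100, a different code point. */
	assert(utf16_unit_le(bytes) == 0x4100);
}
```

A reader following the Unicode default and one following the WHATWG default would therefore disagree on such content, which is the confusion that requiring a BOM avoids.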
