Commit 02c2371

reword comment
1 parent 429cb1e commit 02c2371

1 file changed (+19, -15)

firebase-firestore/src/main/java/com/google/firebase/firestore/util/Util.java

Lines changed: 19 additions & 15 deletions
@@ -94,21 +94,25 @@ public static int compareUtf8Strings(String left, String right) {
       return 0;
     }
 
-    // Find the first differing characters in the strings and, if found, use them to determine the
-    // overall comparison result. This simple and efficient formula serendipitously works because
-    // of the properties of UTF-8 and UTF-16 encodings; that is, if both UTF-16 characters are
-    // surrogates or both are non-surrogates then the relative ordering of those individual
-    // characters is the same as the relative ordering of the lexicographical ordering of the UTF-8
-    // encoding of those characters (or character pairs, in the case of surrogate pairs). Also, if
-    // one is a surrogate and the other is not then it is assumed to be the high surrogate of a
-    // surrogate pair (otherwise it would not constitute a valid surrogate pair) and, therefore,
-    // would necessarily be ordered _after_ the non-surrogate because all surrogate pairs represent
-    // characters with code points above 0xFFFF and such characters produce a 4-byte UTF-8 encoding
-    // whose first byte is 11110xxx, and since the other character is a non-surrogate it represents
-    // a character with a code point less than or equal to 0xFFFF and produces a 1-byte, 2-byte, or
-    // 3-byte UTF-8 encoding whose first (or only) byte is 0xxxxxxx, 110xxxxx, or 1110xxxx,
-    // respectively, which is always less than 11110xxx when interpreted as a 2's-complement
-    // unsigned integer.
+    // Find the first differing character (a.k.a. "UTF-16 code unit") in the two strings and,
+    // if found, use that character to determine the relative ordering of the two strings as a
+    // whole. Comparing UTF-16 strings in UTF-8 byte order can be done simply and efficiently by
+    // comparing the UTF-16 code units (chars). This serendipitously works because of the way UTF-8
+    // and UTF-16 happen to represent Unicode code points.
+    //
+    // After finding the first pair of differing characters, there are two cases:
+    //
+    // Case 1: Both characters are non-surrogates (code points less than or equal to 0xFFFF) or
+    // both are surrogates from a surrogate pair (that collectively represent code points greater
+    // than 0xFFFF). In this case their numeric order as UTF-16 code units is the same as the
+    // lexicographical order of their corresponding UTF-8 byte sequences. A direct comparison is
+    // sufficient.
+    //
+    // Case 2: One character is a surrogate and the other is not. In this case the surrogate-
+    // containing string is always ordered after the non-surrogate. This is because surrogates are
+    // used to represent code points greater than 0xFFFF which have 4-byte UTF-8 representations
+    // and are lexicographically greater than the 1, 2, or 3-byte representations of code points
+    // less than or equal to 0xFFFF.
     final int length = Math.min(left.length(), right.length());
     for (int i = 0; i < length; i++) {
       final char leftChar = left.charAt(i);
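
The hunk is cut off before the comparison itself, so the following is only a minimal sketch of how the two cases in the reworded comment could be turned into code. It is not the body of `compareUtf8Strings` in `Util.java`. The class name `Utf8OrderSketch` and both method names are hypothetical, and the sketch assumes well-formed UTF-16 input (no unpaired surrogates). The second method encodes both strings and compares the bytes as unsigned values, purely as a reference to check the shortcut against.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the Firestore SDK.
final class Utf8OrderSketch {
  private Utf8OrderSketch() {}

  /** Orders two well-formed strings by UTF-8 byte order while scanning UTF-16 code units. */
  static int compareUtf8Order(String left, String right) {
    final int length = Math.min(left.length(), right.length());
    for (int i = 0; i < length; i++) {
      final char leftChar = left.charAt(i);
      final char rightChar = right.charAt(i);
      if (leftChar == rightChar) {
        continue;
      }
      final boolean leftIsSurrogate = Character.isSurrogate(leftChar);
      final boolean rightIsSurrogate = Character.isSurrogate(rightChar);
      if (leftIsSurrogate == rightIsSurrogate) {
        // Case 1: both surrogates or both non-surrogates, so code unit order is enough.
        return Character.compare(leftChar, rightChar);
      }
      // Case 2: the surrogate belongs to a code point above U+FFFF (4-byte UTF-8),
      // so the string containing it sorts after the other one.
      return leftIsSurrogate ? 1 : -1;
    }
    // One string is a prefix of the other (or they are equal): the shorter sorts first.
    return Integer.compare(left.length(), right.length());
  }

  /** Reference ordering: encode to UTF-8 and compare the bytes as unsigned values. */
  static int compareEncodedBytes(String left, String right) {
    final byte[] l = left.getBytes(StandardCharsets.UTF_8);
    final byte[] r = right.getBytes(StandardCharsets.UTF_8);
    final int length = Math.min(l.length, r.length);
    for (int i = 0; i < length; i++) {
      final int cmp = Integer.compare(l[i] & 0xFF, r[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(l.length, r.length);
  }
}
```

Case 2 is where plain `String.compareTo` disagrees with UTF-8 order. U+FFFF is the code unit 0xFFFF, which is numerically greater than the high surrogate 0xD83D that begins U+1F600, yet its UTF-8 encoding (EF BF BF) sorts before the 4-byte encoding of U+1F600 (F0 9F 98 80).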
