Commit 4b9334e

Merge pull request #64787 from Azoy/document-scalar-array
[stdlib] Add some more documentation around the scalar arrays
2 parents 33727d7 + 04f6e1a commit 4b9334e

1 file changed: 108 additions, 15 deletions

stdlib/public/stubs/Unicode/UnicodeData.cpp

Lines changed: 108 additions & 15 deletions
@@ -103,56 +103,149 @@ __swift_intptr_t _swift_stdlib_getMphIdx(__swift_uint32_t scalar,
   return resultIdx;
 }
 
+// A scalar bit array is represented using a combination of quick look bit
+// arrays and specific bit arrays expanding these quick look arrays. There are
+// usually a few data structures accompanying these bit arrays like ranks, data
+// indices, and an actual data array.
+//
+// The bit arrays are constructed to look somewhat like the following:
+//
+// [quickLookSize, {uint64 * quickLookSize}, {5 * uint64}, {5 * uint64},
+// {5 * uint64}...]
+//
+// where the number of {5 * uint64} (a specific bit array) is equal to the
+// number of bits turned on within the {uint64 * quickLookSize}. This can be
+// easily calculated using the passed in ranks array, which looks like the
+// following:
+//
+// [{uint16 * quickLookSize}, {5 * uint16}, {5 * uint16}, {5 * uint16}...]
+//
+// which is the same exact scheme as the bit arrays. Ranks contain the number of
+// previously turned on bits according to their respective {}. For instance, each
+// chunk, {5 * uint16}, begins with 0x0 and continuously grows as the number of
+// bits within the chunk turn on. An example sequence of this looks like:
+// [0x0, 0x0, 0x30, 0x70, 0xB0] where the first uint64 obviously doesn't have a
+// previous uint64 to look at, so its rank is 0. The second uint64's rank will
+// be the number of bits turned on in the first uint64, which in this case is
+// also 0. The third uint64's rank is 0x30 meaning there were 48 bits turned on
+// from the first uint64 through the second uint64.
 __swift_intptr_t _swift_stdlib_getScalarBitArrayIdx(__swift_uint32_t scalar,
                                           const __swift_uint64_t *bitArrays,
                                           const __swift_uint16_t *ranks) {
+  // Chunk size is the number of scalars covered by a single bit in our quick
+  // look arrays. Currently, a chunk consists of 272 scalars being represented
+  // in a bit. 0x110000 represents the maximum scalar value that Unicode will
+  // never go over (or at least promised to never go over), 0x10FFFF, plus 1.
+  // There are 64 bit arrays allocated for the quick look search and within
+  // each bit array are 64 allocated bits (8 bytes). Assuming the whole quick
+  // search array is allocated and used, this would mean 512 bytes are used
+  // solely for these arrays.
   auto chunkSize = 0x110000 / 64 / 64;
+
+  // Our base is the specific bit in the context of all of the bit arrays that
+  // holds our scalar. Considering there are 64 bit arrays of 64 bits, that
+  // would mean there are 64 * 64 = 4096 total bits to represent all scalars.
   auto base = scalar / chunkSize;
+
+  // Index is our specific bit array that holds our bit.
   auto idx = base / 64;
+
+  // Chunk bit is the specific bit within the bit array for our scalar.
   auto chunkBit = base % 64;
-
+
+  // At the beginning of our bit arrays is a number indicating the number of
+  // actually implemented quick look bit arrays. We do this to save a little bit
+  // of code size for bit arrays towards the end that usually contain no
+  // properties, thus their bit arrays are most likely 0 or null.
   auto quickLookSize = bitArrays[0];
-
+
   // If our chunk index is larger than the quick look indices, then it means
   // our scalar appears in chunks that are all 0 and trailing.
   if ((__swift_uint64_t) idx > quickLookSize - 1) {
     return std::numeric_limits<__swift_intptr_t>::max();
   }
-
+
+  // Our scalar actually exists in a quick look bit array that was implemented.
   auto quickLook = bitArrays[idx + 1];
-
+
+  // If the quick look array has our chunk bit not set, that means all 272
+  // (chunkSize) of the scalars being represented have no property and ours is
+  // one of them.
   if ((quickLook & ((__swift_uint64_t) 1 << chunkBit)) == 0) {
     return std::numeric_limits<__swift_intptr_t>::max();
   }
-
+
   // Ok, our scalar failed the quick look check. Go lookup our scalar in the
-  // chunk specific bit array.
+  // chunk specific bit array. Ranks keeps track of the number of bits turned on
+  // in the previous bit arrays and is cumulative.
+  //
+  // For example, [1, 3, 10] are bit arrays that have a certain number of bits
+  // turned on. The generated ranks array would look like [0, 1, 3] because
+  // the first value, 1, does not have any previous bit array to look at so its
+  // rank is 0. 3 on the other hand will see its rank value as 1
+  // because the previous value had 1 bit turned on. 10 will see 3 because it is
+  // seeing both 1 and 3's number of turned on bits (3 has 2 bits on and
+  // 1 + 2 = 3).
   auto chunkRank = ranks[idx];
-
+
+  // If our specific bit within the chunk isn't the first bit, then count the
+  // number of bits turned on preceding our chunk bit.
   if (chunkBit != 0) {
     chunkRank += __builtin_popcountll(quickLook << (64 - chunkBit));
   }
-
+
+  // Each bit that is turned on in the quick look arrays is given a bit array
+  // that consists of five 64-bit integers (5 * 64 = 320 which is enough to house
+  // the 272 specific bits dedicated to each scalar within a chunk). Our
+  // specific chunk's array is located at:
+  // 1 (quick look count)
+  // +
+  // quickLookSize (number of actually implemented quick look arrays)
+  // +
+  // chunkRank * 5 (where chunkRank is the total number of bits turned on
+  // before ours and each chunk is given 5 uint64s)
   auto chunkBA = bitArrays + 1 + quickLookSize + (chunkRank * 5);
-
+
+  // Our overall bit represents the bit within 0 - 271 (272 total, our
+  // chunkSize) that houses our scalar.
   auto scalarOverallBit = scalar - (base * chunkSize);
+
+  // And our specific bit here represents the bit that houses our scalar inside
+  // a specific uint64 in our overall bit array.
   auto scalarSpecificBit = scalarOverallBit % 64;
+
+  // Our word here is the index into the chunk's bit array to grab the specific
+  // uint64 that houses a bit representing our scalar.
   auto scalarWord = scalarOverallBit / 64;
-
+
   auto chunkWord = chunkBA[scalarWord];
-
-  // If our scalar specifically is not turned on, then we're done.
+
+  // If our scalar specifically is not turned on within our chunk's bit array,
+  // then we know for sure that our scalar does not exhibit this property.
   if ((chunkWord & ((__swift_uint64_t) 1 << scalarSpecificBit)) == 0) {
     return std::numeric_limits<__swift_intptr_t>::max();
   }
-
+
+  // Otherwise, this scalar does have whatever property this scalar array is
+  // representing. Our ranks also holds bit information for a chunk's bit array,
+  // so each chunk is given 5 uint16 in our ranks to count its own bits.
   auto scalarRank = ranks[quickLookSize + (chunkRank * 5) + scalarWord];
-
+
+  // Again, if our scalar isn't the first bit in its uint64, then count the
+  // number of preceding bits turned on in our uint64.
   if (scalarSpecificBit != 0) {
     scalarRank += __builtin_popcountll(chunkWord << (64 - scalarSpecificBit));
   }
-
+
+  // In our last uint64 in our bit array, there is an index into our data index
+  // array. Because we only need 272 bits for the scalars, any remaining bits
+  // can be used for essentially whatever. 5 * 64 bits = 320 bits and we only
+  // allocate 16 bits in the last uint64 for the remaining scalars
+  // (4 * 64 bits + 16 bits = 272 bits (chunkSize)) leaving us with 48 spare bits.
   auto chunkDataIdx = chunkBA[4] >> 16;
 
+  // Finally, our index (or rather whatever value is stored in our spare bits)
+  // is simply the start of our chunk's index plus the specific rank for our
+  // scalar.
   return chunkDataIdx + scalarRank;
 }
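
The index arithmetic described in the comments above can be traced in isolation. The following is a minimal sketch, not code from this commit: `Position` and `locate` are hypothetical names introduced only for illustration, and the sketch covers just the division and remainder steps, without touching any bit array.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration; not part of the Swift stdlib.
struct Position {
  uint32_t base;        // bit index among all 64 * 64 quick look bits
  uint32_t idx;         // which quick look uint64 holds that bit
  uint32_t chunkBit;    // bit position inside that quick look uint64
  uint32_t overallBit;  // 0 through 271, position inside the chunk
  uint32_t word;        // which of the chunk's 5 uint64s holds the scalar
  uint32_t bit;         // bit position inside that uint64
};

// Hypothetical helper mirroring the arithmetic in the function above.
Position locate(uint32_t scalar) {
  assert(scalar <= 0x10FFFF);  // largest Unicode scalar value
  const uint32_t chunkSize = 0x110000 / 64 / 64;  // 272 scalars per chunk
  Position p;
  p.base = scalar / chunkSize;
  p.idx = p.base / 64;
  p.chunkBit = p.base % 64;
  p.overallBit = scalar - p.base * chunkSize;
  p.word = p.overallBit / 64;
  p.bit = p.overallBit % 64;
  return p;
}
```

For instance, `locate(0x10FFFF)` gives `idx` 63, `chunkBit` 63, `word` 4, and `bit` 15, so the highest scalar lands in the last of the 16 scalar bits in the chunk's fifth uint64, just below the 48 spare bits.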

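
The rank scheme and the shift-then-popcount idiom used in the function can also be illustrated on their own. This is a sketch under stated assumptions: `popcount64`, `bitsBelow`, and `buildRanks` are hypothetical names, the popcount is a portable loop standing in for `__builtin_popcountll`, and how the real tables are generated is not shown in this commit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Portable popcount (clears the lowest set bit each iteration). The function
// in the diff uses __builtin_popcountll instead.
int popcount64(uint64_t x) {
  int n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

// Hypothetical helper: number of set bits strictly below position `bit`.
// Shifting left by (64 - bit) discards bit `bit` and everything above it.
int bitsBelow(uint64_t word, unsigned bit) {
  assert(bit < 64);
  if (bit == 0) return 0;  // avoids a shift by 64, which is undefined
  return popcount64(word << (64 - bit));
}

// Hypothetical helper: each rank is the running count of set bits in all
// earlier words, so the first rank is always 0.
std::vector<uint16_t> buildRanks(const std::vector<uint64_t> &words) {
  std::vector<uint16_t> ranks;
  uint16_t running = 0;
  for (uint64_t w : words) {
    ranks.push_back(running);
    running += static_cast<uint16_t>(popcount64(w));
  }
  return ranks;
}
```

With the words `[1, 3, 10]` from the comment, `buildRanks` produces `[0, 1, 3]`. Adding `bitsBelow(word, bit)` to a word's rank then gives the number of set bits that precede a given bit across the whole sequence, which is how the function combines `ranks[...]` with the popcount.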