|
53 | 53 | // speaking, words. They're just spans of code points that frequently
|
54 | 54 | // occur together. They are ordered shortest to longest.
|
55 | 55 | //
|
| 56 | +// - If the translation uses a lot of code points or widely spaced code points, |
| 57 | +// then the Huffman table entries are UTF-16 code points. But if the translation |
| 58 | +// uses only ASCII 7-bit code points plus a SMALL range of higher code points that |
| 59 | +// still fit in 8 bits, translation_offset and translation_offstart are used to |
| 60 | +// renumber the code points so that they still fit within 8 bits. (It's very beneficial |
| 61 | +// for mchar_t to be 8 bits instead of 16!) |
| 62 | +// |
56 | 63 | // - dictionary entries are non-overlapping, and the _ending_ index of each
|
57 | 64 | // entry is stored in an array. A count of words of each length, from
|
58 | 65 | // minlen to maxlen, is given in the array called wlencount. From
|
59 | 66 | // this small array, the start and end of the N'th word can be
|
60 | 67 | // calculated by an efficient, small loop. (A bit of time is traded
|
61 | 68 | // to reduce the size of this table indicating lengths)
|
62 | 69 | //
|
| 70 | +// - Value 1 ('\1') is used to indicate that a QSTR number follows. The |
| 71 | +// QSTR is encoded as a fixed number of bits (translation_qstr_bits), e.g., |
| 72 | +// 10 bits if the highest core qstr is from 512 to 1023 inclusive. |
| 73 | +// (maketranslationdata uses a simple heuristic where any qstr >= 3 |
| 74 | +// characters long is encoded in this way; this is simple but probably not |
| 75 | +// optimal. In fact, the rule of >= 2 characters is better for SOME languages |
| 76 | +// on SOME boards.) |
| 77 | +// |
63 | 78 | // The "data" / "tail" construct is so that the struct's last member is a
|
64 | 79 | // "flexible array". However, the _only_ member is not permitted to be
|
65 | 80 | // a flexible member, so we have to declare the first byte as a separate
|
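The `wlencount` bullet in the comment above describes recovering the start and end of the N'th dictionary word from per-length counts alone. A minimal sketch of such a loop follows. The table contents, the `MINLEN`/`MAXLEN` constants, and the `word_bounds` helper are hypothetical illustrations, not the generated data or the actual decoder.

```c
/* Hypothetical dictionary: words are packed back to back, ordered
 * shortest to longest: "th" "in" "the" "that" "from". */
enum { MINLEN = 2, MAXLEN = 4 };

static const char words[] = "thinthethatfrom";

/* wlencount[i] = number of words of length MINLEN + i. */
static const unsigned char wlencount[MAXLEN - MINLEN + 1] = {2, 1, 2};

/* Compute the half-open range [*start, *end) of the n-th word (0-based)
 * inside `words`. Returns 0 on success, -1 if n is out of range. */
static int word_bounds(unsigned n, unsigned *start, unsigned *end) {
    unsigned pos = 0; /* offset of the first word of the current length */
    for (unsigned len = MINLEN; len <= MAXLEN; len++) {
        unsigned count = wlencount[len - MINLEN];
        if (n < count) {
            *start = pos + n * len;
            *end = *start + len;
            return 0;
        }
        n -= count;         /* skip every word of this length */
        pos += count * len; /* and the bytes they occupy */
    }
    return -1;
}
```

With this table, `word_bounds(3, &s, &e)` yields the range 7 to 11, which covers `"that"` in `words`. The loop runs at most `MAXLEN - MINLEN + 1` times, which is the time-for-space trade the comment mentions.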
|
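The QSTR bullet says the qstr number is stored in a fixed number of bits, for example 10 bits when the highest core qstr lies between 512 and 1023. The sketch below shows one way that width could be derived and how a fixed-width field could be read back. The helper names `qstr_bits_for` and `read_bits`, and the most-significant-bit-first packing order, are assumptions made for illustration only.

```c
/* Smallest bit width that can represent every value in 0..max_qstr.
 * For max_qstr in 512..1023 this is 10, matching the comment's example. */
static unsigned qstr_bits_for(unsigned max_qstr) {
    unsigned bits = 0;
    while (max_qstr != 0) {
        bits++;
        max_qstr >>= 1;
    }
    return bits;
}

/* Read `nbits` bits from `buf`, starting at bit index *bitpos and
 * advancing it. Bits are assumed to be packed MSB first in each byte. */
static unsigned read_bits(const unsigned char *buf, unsigned *bitpos,
                          unsigned nbits) {
    unsigned value = 0;
    for (unsigned i = 0; i < nbits; i++) {
        unsigned byte = buf[*bitpos >> 3];
        unsigned bit = (byte >> (7u - (*bitpos & 7u))) & 1u;
        value = (value << 1) | bit;
        (*bitpos)++;
    }
    return value;
}
```

A decoder built along these lines would, after decoding the marker value 1, call `read_bits` with the width stored in `translation_qstr_bits` and look up the resulting qstr number.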
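The first added bullet explains that `translation_offset` and `translation_offstart` renumber a small range of higher code points so that every stored character fits in 8 bits. The comment does not spell out the exact rule, so the sketch below assumes one plausible scheme: stored values below `translation_offstart` are unchanged, and values at or above it have `translation_offset` added. The constants (ASCII plus a range starting at U+0410) and the helper names are hypothetical.

```c
#include <stdint.h>

typedef uint8_t mchar_t; /* the 8-bit case the comment calls beneficial */

/* Hypothetical values: stored 0x80.. maps to code points U+0410.. */
static const uint16_t translation_offstart = 0x80;
static const uint16_t translation_offset = 0x0410 - 0x80;

/* Map a stored 8-bit value back to its real code point. */
static uint16_t decode_mchar(mchar_t stored) {
    uint16_t c = stored;
    if (c >= translation_offstart) {
        c += translation_offset;
    }
    return c;
}

/* Map a code point to its stored 8-bit value.
 * Returns 0 on success, -1 if the code point cannot be renumbered
 * into 8 bits under this scheme. */
static int encode_code_point(uint16_t cp, mchar_t *out) {
    if (cp < translation_offstart) {
        *out = (mchar_t)cp; /* plain 7-bit ASCII, stored as is */
        return 0;
    }
    if (cp >= translation_offstart + translation_offset &&
        cp - translation_offset <= 0xFF) {
        *out = (mchar_t)(cp - translation_offset);
        return 0;
    }
    return -1;
}
```

Under these assumed constants, U+0410 is stored as 0x80 and decodes back to U+0410, while a code point such as U+00E9 falls outside both ranges and would force the 16-bit representation.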