add explanation for newer compression features

jepler · jepler · commit 5c23e2820864 · 2023-08-31T13:27:16.000-05:00
diff --git a/supervisor/shared/translate/compressed_string.h b/supervisor/shared/translate/compressed_string.h
@@ -53,13 +53,28 @@
 //   speaking, words.  They're just spans of code points that frequently
 //   occur together.  They are ordered shortest to longest.
 //
+// - If the translation uses a lot of code points or widely spaced code points,
+//   then the huffman table entries are UTF-16 code points. But if the translation
+//   uses only ASCII 7-bit code points plus a SMALL range of higher code points that
+//   still fit in 8 bits, translation_offset and translation_offstart are used to
+//   renumber the code points so that they still fit within 8 bits. (It's very beneficial
+//   for mchar_t to be 8 bits instead of 16!)
+//
 // - dictionary entries are non-overlapping, and the _ending_ index of each
 //   entry is stored in an array.  A count of words of each length, from
 //   minlen to maxlen, is given in the array called wlencount.  From
 //   this small array, the start and end of the N'th word can be
 //   calculated by an efficient, small loop.  (A bit of time is traded
 //   to reduce the size of this table indicating lengths)
 //
+// - Value 1 ('\1') is used to indicate that a QSTR number follows. The
+//   QSTR is encoded as a fixed number of bits (translation_qstr_bits), e.g.,
+//   10 bits if the highest core qstr is from 512 to 1023 inclusive.
+//   (maketranslationdata uses a simple heuristic where any qstr >= 3
+//   characters long is encoded in this way; this is simple but probably not
+//   optimal. In fact, the rule of >= 2 characters is better for SOME languages
+//   on SOME boards.)
+//
 // The "data" / "tail" construct is so that the struct's last member is a
 // "flexible array".  However, the _only_ member is not permitted to be
 // a flexible member, so we have to declare the first byte as a separate
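
Illustrative note (not part of the patch): the three mechanisms described in the comments above can be sketched in standalone C. This is a minimal sketch under stated assumptions, not the code in the repository. The table contents, the helper names (word_span, bits_needed, decode_code_point), and the exact way an offset is applied to stored values are all hypothetical; only the names wlencount, minlen, translation_offset, translation_offstart, and translation_qstr_bits come from the comment text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dictionary: words are concatenated, shortest first.
 * "th", "in" (length 2), "and" (length 3), "this" (length 4). */
enum { MINLEN = 2, NLENGTHS = 3, NWORDS = 4 };
static const char words[] = "thinandthis";
static const uint8_t wlencount[NLENGTHS] = {2, 1, 1};

/* Find the [start, end) span of the n'th word using only the per-length
 * counts, trading a short loop for a smaller table. */
static void word_span(unsigned n, size_t *start, size_t *end) {
    assert(n < NWORDS);
    size_t pos = 0;
    unsigned len = MINLEN;
    for (size_t i = 0; i < NLENGTHS && wlencount[i] <= n; i++, len++) {
        n -= wlencount[i];                  /* skip every word of this length */
        pos += (size_t)len * wlencount[i];
    }
    *start = pos + (size_t)len * n;         /* n is now an index within one length */
    *end = *start + len;
}

/* Number of bits needed to store any value up to max_value, which is how
 * a fixed-width QSTR field could be sized (512..1023 gives 10). */
static unsigned bits_needed(unsigned max_value) {
    unsigned bits = 0;
    while (max_value != 0) {
        bits++;
        max_value >>= 1;
    }
    return bits;
}

/* ASSUMPTION: stored 8-bit values at or above offstart are shifted by
 * offset to recover the real code point; lower values are plain ASCII. */
static uint32_t decode_code_point(uint8_t stored, uint8_t offstart, uint32_t offset) {
    return stored >= offstart ? (uint32_t)stored + offset : (uint32_t)stored;
}
```

With these example tables, word 2 occupies bytes 4 through 6 of words ("and"), and a stored byte 0x80 with offstart 0x80 and offset 0x380 decodes to U+0400.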