Description
I've had an idea for a simple model/encoding for strings, which I'd like to test at some point.
It starts from the observation that any string we encounter more than once may be represented in one of three ways:
- as an index from the start of the file;
- as an index counted back from the latest string;
- if the string is part of the AOT dictionary, as an index in that dictionary.
As discussed, some strings tend to have many instances in a given window, while others don't. I suspect that, by picking the best of these three representations for each string, we'll be able to reduce the encoded size.
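To make the two positional indexes concrete, here is a minimal sketch. It assumes that positions count every string occurrence in the stream (not only distinct strings), and that the "from latest" index is the distance back from the string just before the current one. The function name `positional_indexes` is hypothetical and only serves the illustration.

```rust
use std::collections::HashMap;

/// For each string in `stream`, returns `None` if the string is new, or
/// `Some((from_start, from_current))` if it has been seen before.
///
/// Assumption: indexes count every occurrence in the stream.
/// `from_start` is the position of the first occurrence; `from_current`
/// is 0 when the same string was the one immediately before.
fn positional_indexes(stream: &[&str]) -> Vec<Option<(usize, usize)>> {
    let mut first: HashMap<&str, usize> = HashMap::new();
    let mut latest: HashMap<&str, usize> = HashMap::new();
    let mut out = Vec::with_capacity(stream.len());
    for (pos, s) in stream.iter().copied().enumerate() {
        let entry = match (first.get(s), latest.get(s)) {
            // `l <= pos - 1` always holds here, so this cannot underflow.
            (Some(&f), Some(&l)) => Some((f, pos - 1 - l)),
            _ => None,
        };
        out.push(entry);
        // Only the first occurrence is kept in `first`; `latest` is overwritten.
        first.entry(s).or_insert(pos);
        latest.insert(s, pos);
    }
    out
}
```

Under these assumptions, the stream `a, b, a, c, b, b` gives `None, None, Some((0, 1)), None, Some((1, 2)), Some((1, 0))`: the last `b` is cheap to express from the current position (0), while its index from the start stays at 1.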
We represent this as the following alphabet:

    enum Symbol {
        /// Well-known string, stored in the dictionary we shipped with the encoder/decoder.
        BuiltInDictionary(usize),
        /// A string already referenced in this file, as indexed from the start.
        /// 0 is the first string encountered in the file, 1 the second, ...
        FromStart(usize),
        /// A string already referenced in this file, as indexed from the current position.
        /// 0 is the latest string encountered in the file, 1 the previous, ...
        FromCurrent(usize),
        /// A new string, never before encountered.
        /// Must be followed by a literal string.
        New,
    }

Now, whenever we encounter a string, we record it in the following in-memory tables. When we next encounter the same string, these two tables let us find how to represent it, using either Symbol::FromStart or Symbol::FromCurrent:

    use std::collections::HashMap;
    use std::rc::Rc;

    pub struct State {
        /// First index of a given string. When we use this, we try and keep numbers small.
        first: HashMap<Rc<String>, usize>,
        /// Latest index of a given string.
        latest: HashMap<Rc<String>, usize>,
        // ...
    }

We then add statistics, to find out which representation of a string is best:

    pub struct State {
        // ...
        /// A mapping from `index` to number of times we have used
        /// `Symbol::BuiltInDictionary(index)`.
        frequency_built_in: VecMap<usize>,
        /// A mapping from `index` to number of times we have used
        /// `Symbol::FromStart(index)`.
        frequency_from_start: VecMap<usize>,
        /// A mapping from `index` to number of times we have used
        /// `Symbol::FromCurrent(index)`.
        frequency_from_latest: VecMap<usize>,
    }

With these two pieces of information (first/latest and frequency_*), we can find, for each string, which of the symbols able to represent it has been used most often so far, and pick that one.
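
The selection step could look roughly like the sketch below. It is a reduced, standalone version of the idea, not a tested implementation, and it makes several assumptions that the description above leaves open: keys are plain `String` rather than `Rc<String>`; a single `HashMap<Symbol, usize>` stands in for the three `frequency_*` tables; indexes count every string occurrence; ties go to the candidate considered last (dictionary, then from start, then from current); and the method name `symbol_for` and the `built_in` slice are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Symbol {
    BuiltInDictionary(usize),
    FromStart(usize),
    FromCurrent(usize),
    New,
}

#[derive(Default)]
struct State {
    /// Position of the first occurrence of each string.
    first: HashMap<String, usize>,
    /// Position of the latest occurrence of each string.
    latest: HashMap<String, usize>,
    /// Number of times each symbol has been emitted so far
    /// (assumption: one table instead of three).
    frequency: HashMap<Symbol, usize>,
    /// Number of strings seen so far.
    count: usize,
}

impl State {
    /// Picks the symbol for `s`, then records `s` in the tables.
    fn symbol_for(&mut self, s: &str, built_in: &[&str]) -> Symbol {
        // Collect every symbol that can represent `s` right now.
        let mut candidates = Vec::new();
        if let Some(i) = built_in.iter().position(|d| *d == s) {
            candidates.push(Symbol::BuiltInDictionary(i));
        }
        if let (Some(&f), Some(&l)) = (self.first.get(s), self.latest.get(s)) {
            candidates.push(Symbol::FromStart(f));
            candidates.push(Symbol::FromCurrent(self.count - 1 - l));
        }

        // Keep the candidate emitted most often so far; on a tie,
        // the later candidate replaces the earlier one.
        let mut best: Option<(Symbol, usize)> = None;
        for c in candidates {
            let f = self.frequency.get(&c).copied().unwrap_or(0);
            match best {
                Some((_, best_f)) if best_f > f => {}
                _ => best = Some((c, f)),
            }
        }
        let symbol = best.map_or(Symbol::New, |(c, _)| c);

        // Update statistics and position tables.
        *self.frequency.entry(symbol).or_insert(0) += 1;
        let pos = self.count;
        self.first.entry(s.to_string()).or_insert(pos);
        self.latest.insert(s.to_string(), pos);
        self.count += 1;
        symbol
    }
}
```

With these assumptions, feeding the strings `x, y, x, y, x` through `symbol_for` yields `New, New, FromCurrent(1), FromCurrent(1), FromCurrent(1)`: once the alternating pattern is established, a single symbol covers every repetition, which is the kind of skewed distribution an entropy coder can exploit.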