Skip to content

A model for strings #1

@Yoric

Description

@Yoric

I've had an idea of a simple model/encoding for strings, I'd like to test it at some point.

It starts from the remark that any string we encounter more than once may be represented either:

  • as an index from start;
  • as an index from latest string;
  • if the string is part of the AOT dictionary, as an index in that dictionary.

As discussed, some strings tend to have many instances in a given window, while others don't. I suspect that, by picking the best of these three representation, we'll be able to reduce the size.

We represent this as the following alphabet:

enum Symbol {
  /// Well-known string, stored in the dictionary we shipped with the encoder/decoder.
  BuiltInDictionary(usize),

  /// A string already referenced in this file, as indexed from the start.
  /// 0 is the first string encountered in the file, 1 the second, ...
  FromStart(usize),

  /// A string already referenced in this file, as indexed from the current position.
  /// 0 is the latest string encountered in the file, 1 the previous, ...
  FromCurrent(usize),

  /// A new string, never before encountered.
  /// Must be followed by a literal string.
  New
}

Now, whenever we encounter a string, we add it to the following in-memory tables. Both tables will let us find how to represent a string, when we next encounter it, using either Symbol::FromStart and Symbol::FromCurrent:

pub struct State {
  /// First index of a given string. When we use this, we try and keep numbers small.
  first: HashMap<Rc<String>, usize>,

  /// Latest index of a given string.
  latest: HashMap<Rc<String>, usize>,

  // ...
};

We then add statistics, to find out which is best representation of a string

pub struct State {
  // ...

  /// A mapping from `index` to number of times we have used
  /// `Symbol::BuiltinDictionary(index)`.
  frequency_built_in: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromStart(index)`.
  frequency_from_start: VecMap<usize>,

  /// A mapping from `index` to number of times we have used
  /// `Symbol::FromCurrent(index)`.
  frequency_from_latest: VecMap<usize>,
}

With these two pieces of information (first/latest and frequency_*), we may find, for each string, the most common symbol we may use to represent it.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions