Description of string backreferences does not match Jackson’s implementation. #24

@dan-robertson

The specification describes how to track string backreferences, but I found the description a little confusing. My understanding is that, in pseudocode, it should look like:

function read_key_name(stream, state):
  if stream.peek_byte() == 0x20:
    # 0x20 encodes the empty key; it is never added to the backreference table
    stream.advance(1)
    return ""
  else:
    switch parse_key_token(stream):
      case Backref(i) => return state.key_backreferences[i]
      case String(s) =>
        maybe_add_key_backreference(state, s)
        return s

function maybe_add_key_backreference(state, string):
  if string.length_bytes <= 64:
    if state.next_key_backref == 1024:
      # the table is full: reset it before storing the new key
      state.next_key_backref := 0
      state.key_backreferences.clear()
    state.key_backreferences[state.next_key_backref] := string
    state.next_key_backref := state.next_key_backref + 1
  else:
    # do nothing because the string is not eligible
    return

Indeed, this is how this Rust implementation interpreted the specification.

That is, a backreference of n refers to the nth non-backreference key of <= 64 bytes since the last reset, and the table is reset after every 1024 non-duplicate property keys of length <= 64 bytes.
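To make this reading concrete, here is a minimal runnable sketch of the table in Python. The class name `KeyBackrefTable` and its methods are hypothetical, invented for illustration; the limits of 64 bytes and 1024 entries are the ones from the pseudocode above. It assumes the caller handles the 0x20 empty key itself and only calls `maybe_add` for non-backreference keys.

```python
class KeyBackrefTable:
    """Shared-key table under the reading above (hypothetical helper)."""

    MAX_ENTRIES = 1024   # table is cleared once it holds this many keys
    MAX_KEY_BYTES = 64   # longer keys are treated as not eligible

    def __init__(self):
        self.entries = []

    def maybe_add(self, key: str) -> None:
        # Eligibility is measured in encoded bytes, not characters.
        if len(key.encode("utf-8")) > self.MAX_KEY_BYTES:
            return
        # A full table is reset before the new key is stored at index 0.
        if len(self.entries) == self.MAX_ENTRIES:
            self.entries.clear()
        self.entries.append(key)

    def resolve(self, index: int) -> str:
        # Backreference n is the nth eligible key since the last reset.
        return self.entries[index]


table = KeyBackrefTable()
for key in ["id", "x" * 65, "name"]:
    table.maybe_add(key)
print(table.resolve(1))  # the 65-byte key was skipped, so index 1 is "name"
```

Under this reading the 65-byte key never occupies an index, which is exactly the point where a writer that stores every key would disagree.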

However, if one looks at Jackson’s generator, key backreferences are saved for all non-empty strings, regardless of length.
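The practical consequence is that the two policies assign different indices to the same keys. The sketch below is an illustration only: `build_table` is a hypothetical helper, the "save every non-empty key" policy is modelled from the description in this issue rather than from Jackson’s source, and the input list stands for the non-backreference keys seen in order.

```python
def build_table(keys, max_key_bytes=None, max_entries=1024):
    """Return the backreference table after seeing `keys` in order.

    max_key_bytes=None models "save every non-empty key";
    max_key_bytes=64 models the eligibility reading of the specification.
    """
    table = []
    for key in keys:
        if key == "":
            continue  # empty keys are never stored in either model
        if max_key_bytes is not None and len(key.encode("utf-8")) > max_key_bytes:
            continue  # skipped only under the eligibility reading
        if len(table) == max_entries:
            table.clear()
        table.append(key)
    return table


keys = ["id", "x" * 70, "name"]
spec_reading = build_table(keys, max_key_bytes=64)
save_all = build_table(keys)

# The same backreference index now names two different keys.
print(spec_reading[1] == "name")    # True
print(save_all[1] == "x" * 70)      # True
```

A reader following one policy and a writer following the other would silently resolve backreference 1 to the wrong property name once a key longer than 64 bytes has appeared.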

So I think that, for property names (i.e. keys), there should be no notion of the ‘eligibility’ of keys, except for clarifying the eligibility of empty strings, i.e. which of 0x20 and 0x34 0xfc¹ should be included in the backreference buffer. I didn’t investigate what the behaviour is for shared strings (as opposed to shared property names).

I think it’s probably better to modify the spec to match Jackson, but maybe Jackson should change instead. Either way, I think it would be better if the spec were more precise.

¹ The specification says that 0x34 is followed by 64 or more bytes of string data. However, I think most parsers accept fewer, and indeed encoding fewer than 64 bytes after a 0x34 is the only reasonable way to encode a unicode property name of 58-63 bytes.
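As an illustration of the lenient behaviour described in the footnote, here is a small sketch of a reader that accepts any number of bytes after 0x34, up to the 0xfc terminator. The function name `read_long_key_name` is hypothetical, and the sketch assumes the name is UTF-8 (so 0xfc cannot occur inside it); it is not taken from any particular parser.

```python
LONG_NAME = 0x34       # token that introduces a long property name
END_OF_STRING = 0xFC   # terminator; never a valid byte inside UTF-8 text


def read_long_key_name(data: bytes, pos: int = 0):
    """Return (name, next_pos) without enforcing a 64-byte minimum."""
    if data[pos] != LONG_NAME:
        raise ValueError("expected 0x34 long-name token")
    # bytes.index raises ValueError if the terminator is missing.
    end = data.index(END_OF_STRING, pos + 1)
    name = data[pos + 1:end].decode("utf-8")
    return name, end + 1


# 0x34 0xfc decodes to the empty string, the case raised above.
print(read_long_key_name(bytes([0x34, 0xFC])))

# A 60-byte name, inside the 58-63 byte range mentioned in the footnote.
name = "é" * 30
encoded = bytes([LONG_NAME]) + name.encode("utf-8") + bytes([END_OF_STRING])
decoded, next_pos = read_long_key_name(encoded)
print(decoded == name, next_pos)
```

A strict reader would instead reject both inputs for carrying fewer than 64 bytes, which is why the treatment of these encodings in the backreference buffer needs to be spelled out.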
