chore: fix JSONEntry Eq and add documentation #61

Open · wants to merge 21 commits into base: main
58 changes: 58 additions & 0 deletions README.md
@@ -122,6 +122,64 @@ e.g. to take the existing 1kb JSON parameters, but also support 124-byte keys, use

If you are deriving a key to look up in-circuit and you do not know the maximum length of the key, all query methods have a version with a `_var` suffix (e.g. `JSON::get_string_var`), which accepts the key as a `BoundedVec`.
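For illustration, a hedged sketch of a variable-length-key lookup. Only `get_string_var` taking a `BoundedVec` key is documented above; the import path, parse entry point, and type parameters below are assumptions, so check the library docs for the real names:

```rust
use dep::json_parser::JSON; // hypothetical import path

fn lookup_age(json_bytes: [u8; 1024]) {
    // Assumed parse entry point; the real constructor name and type
    // parameters may differ.
    let json = JSON::parse_json(json_bytes);

    // Build the key at runtime, when its exact length is not known in
    // advance (here capped at 16 bytes).
    let mut key: BoundedVec<u8, 16> = BoundedVec::new();
    let key_bytes = "age".as_bytes();
    for i in 0..3 {
        key.push(key_bytes[i]);
    }

    // The `_var` variant: same lookup, but the key is a BoundedVec.
    let _value = json.get_string_var(key);
}
```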

# Architecture
### Overview
The JSON parser uses 5 steps to efficiently parse and index JSON data:

1. **build_transcript** - Convert raw bytes to a transcript of tokens using state machine defined by by JSON_CAPTURE_TABLE. Categorize each character as string, number, ...
2. **capture_missing_tokens & keyswap** - Fix missing tokens and correctly identify keys. Complete a second scan of the tokens, check for missing tokens (e.g.commas after literals), and for strings that are keys to an object, relabel them as keys,
3. **compute_json_packed** - Pack bytes into Field elements for efficient substring extraction
4. **create_json_entries** - Create structured JSON entries with parent-child relationships
5. **compute_keyhash_and_sort_json_entries** - Sort entries by key hash for efficient lookups
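
A stub sketch of how the five stages feed one another. The placeholder types and no-op bodies stand in for the library's internals; only the function names come from the list above:

```rust
// Placeholder types; the real transcript/entry structures are richer.
struct Transcript {}
struct Entries {}

fn build_transcript(_raw: [u8; 1024]) -> Transcript { Transcript {} }
fn capture_missing_tokens(t: Transcript) -> Transcript { t }
fn keyswap(t: Transcript) -> Transcript { t }
fn compute_json_packed(_raw: [u8; 1024]) -> [Field; 32] { [0; 32] }
fn create_json_entries(_t: Transcript) -> Entries { Entries {} }
fn compute_keyhash_and_sort_json_entries(e: Entries, _packed: [Field; 32]) -> Entries { e }

fn parse_pipeline(raw: [u8; 1024]) -> Entries {
    let transcript = build_transcript(raw);                       // step 1
    let transcript = keyswap(capture_missing_tokens(transcript)); // step 2
    let packed = compute_json_packed(raw);                        // step 3
    let entries = create_json_entries(transcript);                // step 4
    compute_keyhash_and_sort_json_entries(entries, packed)        // step 5
}
```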

### Key Design Patterns
- **Table lookups**: uses many lookup tables to avoid branching logic and reduce circuit size
- **Packing data into Field elements**: combines multiple fields that encode different features into a single Field element for comparison (sketched below)
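
A minimal sketch of the packing idea (slot widths and names invented for illustration): each component occupies its own byte-sized slot inside one Field, so a single equality check covers all components at once.

```rust
// Give each component a disjoint byte-sized slot in the Field.
fn pack(token: u8, context: u8, depth: u8) -> Field {
    (token as Field) * 65536 + (context as Field) * 256 + (depth as Field)
}

fn main() {
    let a = pack(3, 1, 2);
    let b = pack(3, 1, 2);
    // One comparison instead of three separate boolean checks.
    assert(a == b);
}
```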

### Table Generation
The parser uses several lookup tables generated from `src/_table_generation/` (a toy example of consuming such a table follows this list):
- `TOKEN_FLAGS_TABLE`: State transitions for token processing
- `JSON_CAPTURE_TABLE`: Character-by-character parsing rules
- `TOKEN_VALIDATION_TABLE`: JSON grammar validation
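
To show how such tables are consumed, a toy lookup (the table contents here are invented): the token id indexes a precomputed array, so no branch on the token kind is needed.

```rust
// Toy table: whether each of 8 hypothetical token ids is a value token.
global IS_VALUE_TOKEN: [bool; 8] = [false, false, false, false, false, true, true, true];

fn is_value_token(token: u32) -> bool {
    IS_VALUE_TOKEN[token] // one array read replaces an if/else chain
}
```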

### Example walkthrough
Consider the raw JSON text `{"name": "Alice", "age": 30}` and how it is parsed.
First, the parser reads the JSON one character at a time and uses lookup tables to decide what to do with each character. For `{"name": "Alice"}`:
```
Character: { → "Start scanning an object (grammar_capture)"
Character: " → "Start scanning a string"
Character: n → "Continue scanning the string"
Character: a → "Continue scanning the string"
Character: m → "Continue scanning the string"
Character: e → "Continue scanning the string"
Character: " → "End the string"
Character: : → "Key-value separator"
Character: " → "Start scanning a string"
Character: A → "Continue scanning the string"
Character: l → "Continue scanning the string"
Character: i → "Continue scanning the string"
Character: c → "Continue scanning the string"
Character: e → "Continue scanning the string"
Character: " → "End the string"
Character: } → "End the object"
```

The parser builds a list of "tokens", the basic building blocks of the JSON, which for `{"name": "Alice"}` becomes:
1. BEGIN_OBJECT_TOKEN ({)
2. STRING_TOKEN ("name")
3. KEY_SEPARATOR_TOKEN (:)
4. STRING_TOKEN ("Alice")
5. END_OBJECT_TOKEN (})

The parser converts tokens into structured entries with parent-child relationships.
Each entry knows (see the struct sketch after this list):
- what type it is (object, string, number, etc.)
- who its parent is
- how many children it has
- where it is in the original JSON
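
The list above, written out as a struct purely for readability; field names and widths are illustrative, and the library's actual `JSONEntry` layout differs:

```rust
struct Entry {
    entry_type: u8,    // object, array, string, number, literal, ...
    parent_index: u32, // which entry is the enclosing object/array
    num_children: u32, // child count, for objects and arrays
    json_offset: u32,  // where the value starts in the original JSON
    json_length: u32,  // how many bytes the value spans
}
```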

Finally, the parser sorts entries by their key hashes for fast lookups.
Original order: [{"name": "Alice"}, {"age": 30}]
Sorted order: [{"age": 30}, {"name": "Alice"}]
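
A sketch of the sorting step, assuming the stdlib `sort_via` helper and using a 32-bit stand-in for the key hash (the library hashes the full key into a Field):

```rust
struct KeyedEntry {
    key_hash: u32,    // stand-in for the real key hash
    entry_index: u32, // position of the entry before sorting
}

// Order entries by key hash so lookups can search a sorted list.
fn sort_by_key_hash<let N: u32>(entries: [KeyedEntry; N]) -> [KeyedEntry; N] {
    entries.sort_via(|a, b| a.key_hash <= b.key_hash)
}
```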

# Acknowledgements

Many thanks to the authors of the OG Noir JSON library https://github.com/RontoSOFT/noir-json-parser
71 changes: 71 additions & 0 deletions src/_table_generation/table_generation.md
@@ -0,0 +1,71 @@
# Table Generation Documentation

## Overview
The JSON parser uses lookup tables to avoid branching logic and reduce gate count. These tables are generated from `src/_table_generation/make_tables.nr`.

## Generation Process
Tables are generated by simulating all possible input combinations from basic hardcoded tables and recording the expected outputs.
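
A toy version of that process (the mode count, encoding, and classifier are all illustrative): enumerate every (scan_mode, ascii) input with ordinary branchy logic at table-build time, and record the answers in a flat array the circuit can index branch-free.

```rust
global NUM_MODES: u32 = 5; // grammar, string, numeric, literal, error

// Branchy reference logic, evaluated only while building the table.
fn classify(mode: u32, ascii: u32) -> u32 {
    // e.g. a quote (ASCII 34) seen in grammar mode enters string mode
    if (mode == 0) & (ascii == 34) { 1 } else { mode }
}

fn make_table() -> [u32; 640] {
    let mut table: [u32; 640] = [0; 640];
    for mode in 0..NUM_MODES {
        for ascii in 0..128 {
            table[mode * 128 + ascii] = classify(mode, ascii);
        }
    }
    table
}
```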

## TOKEN_FLAGS_TABLE
Maps (token, context) pairs to parsing flags (written out as a struct after this list):
- `create_json_entry`: whether to create a JSON entry for this token; true if the token is a literal, number, string (not a key), or the end of an array/object
- `is_end_of_object_or_array`: whether the token ends an object/array
- `is_start_of_object_or_array`: whether the token starts an object/array
- `new_context`: which context to switch to (object = 0, array = 1)
- `is_key_token`: whether the token is a key
- `is_value_token`: whether the token is a value; true for STRING_TOKEN, NUMERIC_TOKEN, and LITERAL_TOKEN
- `preserve_num_entries`: whether the current token preserves the existing count of entries at the current depth rather than resetting/incrementing it; 1 for tokens like NO_TOKEN, KEY_TOKEN, STRING_TOKEN, NUMERIC_TOKEN, and LITERAL_TOKEN, and 0 for tokens like OBJECT_START_TOKEN, ARRAY_START_TOKEN, OBJECT_END_TOKEN, and ARRAY_END_TOKEN
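
The same flags written out as a struct, purely for readability (the library stores them as table entries rather than this exact struct):

```rust
struct TokenFlags {
    create_json_entry: bool,
    is_end_of_object_or_array: bool,
    is_start_of_object_or_array: bool,
    new_context: u1, // 0 = object, 1 = array
    is_key_token: bool,
    is_value_token: bool,
    preserve_num_entries: u1, // 1 = keep count, 0 = reset/increment
}
```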

## JSON_CAPTURE_TABLE
Maps (escape_flag, scan_mode, ascii) triples to scanning actions (an index-layout sketch follows this list):
- `scan_token`: the next capture mode given the current one; one of grammar_capture (`[`, `{`, `,`, `}`, `]`, `:`), string_capture, literal_capture, numeric_capture, or error_capture. For example, in string capture a `"` switches back to grammar_capture because we are at the end of the string; in numeric capture any character outside 0-9 switches back to grammar_capture because we expect the number has ended
- `push_transcript`: whether to add a token to the transcript. In grammar mode, true for all structural characters `[`, `{`, `,`, `}`, `]`, `:`; in string_capture, true for `"`, which signals the end of the string; in numeric/literal capture, true for space, \t, \n, \r, `"`, and `,`. Note the first scan will not pick up numerics or literals because we don't know when they end, so we rely on the capture_missing_tokens function
- `increase_length`: whether to extend the current token; always false in grammar_capture, true for 0-9 in numeric capture, for every character except `"` in string_capture, and for the letters of true, false, and null in literal_capture
- `is_potential_escape_sequence`: true if the current character is `\` in string_capture mode
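
One way such a three-part key can address a flat table; the mode count and layout below are illustrative:

```rust
global NUM_SCAN_MODES: u32 = 5;

// Flatten (escape_flag, scan_mode, ascii) into a single table index.
fn capture_index(escape_flag: u32, scan_mode: u32, ascii: u32) -> u32 {
    escape_flag * NUM_SCAN_MODES * 128 + scan_mode * 128 + ascii
}
```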

## Other tables
While TOKEN_FLAGS_TABLE and JSON_CAPTURE_TABLE are the more important tables, they are built from foundational hardcoded tables in make_tables_subtables.nr:

GRAMMAR_CAPTURE_TABLE: State transition table for grammar scan mode. Each entry specifies the next scan mode (GRAMMAR_CAPTURE, STRING_CAPTURE, NUMERIC_CAPTURE, LITERAL_CAPTURE, or ERROR_CAPTURE) based on the encountered ASCII character. For example, "f" maps to LITERAL_CAPTURE because it indicates we have begun to scan the literal false.
STRING_CAPTURE_TABLE, NUMERIC_CAPTURE_TABLE, LITERAL_CAPTURE_TABLE: the analogous transition tables for string, numeric, and literal scan modes.

GRAMMAR_CAPTURE_TOKEN: Maps characters in grammar mode to token types. Converts ASCII characters into the appropriate JSON token types for structural elements, values, and literals:
- Structural characters ({, }, [, ], ,, :) → their respective structural tokens
- Quote (") → STRING_TOKEN (start of string)
- Digits (0-9) → NUMERIC_TOKEN (start of number)
- Literal starters (f, t, n) → LITERAL_TOKEN (start of true/false/null)
- Invalid characters → NO_TOKEN or error handling
STRING_CAPTURE_TOKEN, NUMERIC_CAPTURE_TOKEN, LITERAL_CAPTURE_TOKEN: the analogous token maps for the other scan modes.

STRING_CAPTURE_PUSH_TRANSCRIPT: Determines when to add tokens to the transcript while scanning inside a string. Only true for the closing quote ("). This signals the end of the string and triggers token creation. All other characters within the string (letters, numbers, punctuation, spaces) are false because they extend the current string token rather than creating new tokens.

GRAMMAR_CAPTURE_PUSH_TRANSCRIPT: Determines when to add tokens to the transcript while scanning in grammar mode. True for the following characters:
- Comma (,) → true (value separator)
- Colon (:) → true (key-value separator)
- All other characters → false (including digits, quotes, and literal starters)

NUMERIC_CAPTURE_PUSH_TRANSCRIPT: Determines when to add the current numeric token to the transcript while scanning a number. True for the following characters (built as a table in the sketch after this list):
- Whitespace (space, tab, newline, carriage return) → true (end number)
- Quote (") → true (end number, followed by string)
- Comma (,) → true (end number, followed by next value)
- All other characters → false (extend current number or error)
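
As a concrete example, the numeric subtable can be built directly from the bullets above (ASCII codes in comments); this sketch mirrors the description rather than the library's source:

```rust
fn make_numeric_push_table() -> [bool; 128] {
    let mut table: [bool; 128] = [false; 128];
    table[9] = true;  // \t
    table[10] = true; // \n
    table[13] = true; // \r
    table[32] = true; // space
    table[34] = true; // " (number followed by a string)
    table[44] = true; // , (number followed by the next value)
    table
}
```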

LITERAL_CAPTURE_PUSH_TRANSCRIPT: Determines when to add the current literal token (true/false/null) to the transcript while scanning a literal. True for any grammar character: , [ ] { } " space tab newline. (This is only used in the first scan; in the second step, capture_missing_tokens, we will be able to separate the literal and the value separator.)

GRAMMAR_CAPTURE_INCREASE_LENGTH: Determines when to extend the current token length while scanning in grammar mode. True for digits (0-9), which begin a numeric scan, and for the letters of the literals (f, t, n, r, u, e, a, l, s), which begin a literal scan. Structural tokens always have length 1, so their length is not counted; string tokens are expected to begin with a " before any letters appear.

STRING_CAPTURE_INCREASE_LENGTH: Determines when to extend the current string token while scanning inside a string. True for all printable characters except the quote ("), which ends the string.
NUMERIC_CAPTURE_INCREASE_LENGTH: True for 0-9.
LITERAL_CAPTURE_INCREASE_LENGTH: True for t, r, u, e, f, a, l, s, n.

GRAMMAR_CAPTURE_ERROR_FLAG, STRING_CAPTURE_ERROR_FLAG, NUMERIC_CAPTURE_ERROR_FLAG, LITERAL_CAPTURE_ERROR_FLAG: per-mode flags marking characters that are invalid in that scan mode.

PROCESS_RAW_TRANSCRIPT_TABLE: Used to post-process the raw transcript and add missing grammar tokens that were not captured during the initial scan in build_transcript. Input: the encoded_ascii of the last token in each entry (scan mode + ASCII character). Output: `token`, the token type for this entry; `new_grammar`, whether to add a missing grammar token; and `scan_token`, the type of grammar token to add if needed (e.g. END_OBJECT_TOKEN `}` or VALUE_SEPARATOR_TOKEN `,`).
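
A sketch of the encoded input described above, combining the scan mode and the entry's final ASCII byte into one index (the multiplier is illustrative):

```rust
// Flatten (scan_mode, last ascii byte) into one table index.
fn encoded_ascii(scan_mode: u32, ascii: u8) -> u32 {
    scan_mode * 256 + (ascii as u32)
}
```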