If you are deriving a key to look up in-circuit and you do not know the maximum length of the key, all query methods have a version with a `_var` suffix (e.g. `JSON::get_string_var`), which accepts the key as a `BoundedVec`.
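
Below is a minimal usage sketch of a variable-length key lookup. Only `get_string_var` and the `BoundedVec` key come from the description above; the `JSON::parse` constructor, the buffer sizes, and the surrounding function are hypothetical placeholders for however the JSON object is constructed in your circuit.

```rust
// Sketch only: `JSON::parse` and the fixed sizes are hypothetical placeholders;
// the documented behaviour is that `_var` query methods accept the key as a BoundedVec.
fn lookup_with_runtime_key(json_bytes: [u8; 1024], key_bytes: [u8; 64], key_length: u32) {
    let json = JSON::parse(json_bytes); // hypothetical constructor

    // Build the key as a BoundedVec because only an upper bound on its length is known.
    let mut key: BoundedVec<u8, 64> = BoundedVec::new();
    for i in 0..64 {
        if i < key_length {
            key.push(key_bytes[i]);
        }
    }

    let _value = json.get_string_var(key); // the `_var` variant of `JSON::get_string`
}
```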

# Architecture

### Overview

The JSON parser uses 5 steps to parse and index JSON data efficiently:

1. **build_transcript** - Convert the raw bytes into a transcript of tokens using the state machine defined by `JSON_CAPTURE_TABLE`, categorizing each character as part of a string, number, literal, or grammar token.
2. **capture_missing_tokens & keyswap** - Fix missing tokens and correctly identify keys: a second scan of the transcript checks for missing tokens (e.g. commas after literals) and relabels strings that act as object keys as key tokens.
3. **compute_json_packed** - Pack the JSON bytes into Field elements for efficient substring extraction (see the sketch after this list).
4. **create_json_entries** - Create structured JSON entries with parent-child relationships.
5. **compute_keyhash_and_sort_json_entries** - Sort entries by key hash for efficient lookups.
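
To make step 3 concrete, here is a minimal sketch of the packing idea. It is not the library's actual `compute_json_packed`; the 31-bytes-per-Field chunk size and the fixed array sizes are assumptions chosen so that each chunk fits inside a BN254 Field element.

```rust
global NUM_BYTES: u32 = 62;       // example input size: two full chunks
global BYTES_PER_FIELD: u32 = 31; // 31 bytes fit comfortably in a BN254 Field element
global NUM_FIELDS: u32 = 2;       // NUM_BYTES / BYTES_PER_FIELD

// Illustrative packing: each 31-byte window of the input becomes one Field element,
// so substring comparisons can operate on a handful of Fields instead of raw bytes.
fn pack_bytes_into_fields(bytes: [u8; NUM_BYTES]) -> [Field; NUM_FIELDS] {
    let mut packed: [Field; NUM_FIELDS] = [0; NUM_FIELDS];
    for i in 0..NUM_BYTES {
        // Every new byte shifts the current chunk left by 8 bits (big-endian packing).
        packed[i / BYTES_PER_FIELD] = packed[i / BYTES_PER_FIELD] * 256 + (bytes[i] as Field);
    }
    packed
}
```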

### Key Design Patterns

- **Table lookups**: the parser uses many lookup tables instead of branching logic, which reduces circuit size.
- **Packing data into Field elements**: multiple fields that encode different features are combined into a single Field element for comparison (see the sketch below).
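
As an illustration of the packing pattern, the sketch below combines several small values into one Field element using fixed place values, so two packed states can be compared with a single equality check. The field layout here is made up for the example; the table-lookup pattern is illustrated in the Table Generation sketches below.

```rust
// Illustrative only: pack several small features into one Field using fixed place
// values (token in the low byte, context in the next byte, depth above that).
fn pack_state(token: Field, context: Field, depth: Field) -> Field {
    token + context * 256 + depth * 65536
}

#[test]
fn packed_states_compare_in_one_check() {
    let a = pack_state(3, 0, 2);
    let b = pack_state(3, 0, 2);
    // One Field equality replaces three separate comparisons.
    assert(a == b);
}
```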

### Table Generation

The JSON parser uses several lookup tables to avoid branching logic and reduce gate count. These tables are generated from `src/_table_generation/make_tables.nr` and include:

- `TOKEN_FLAGS_TABLE`: state transitions for token processing
- `JSON_CAPTURE_TABLE`: per-character scanning actions for the byte-scanning state machine

### Generation Process

Tables are generated by simulating all possible input combinations from basic hardcoded tables and recording the expected outputs.
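
As a toy illustration of this process (not `make_tables.nr` itself), the sketch below builds a 256-entry classification table by running simple, branch-heavy logic over every possible byte value and recording the result. Per the description above, the real tables are produced the same way from their hardcoded base tables, just over more input combinations.

```rust
// Illustrative only: enumerate every possible input, compute the expected output with
// straightforward branchy logic, and record it in a flat array. The circuit then uses
// the finished array as a pure lookup, with no branching at query time.
fn generate_grammar_table() -> [bool; 256] {
    let mut table: [bool; 256] = [false; 256];
    for i in 0..256 {
        let c = i as u8;
        table[i] =
            (c == 0x7b) | (c == 0x7d) | (c == 0x5b) | (c == 0x5d) | (c == 0x3a) | (c == 0x2c);
    }
    table
}
```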

### TOKEN_FLAGS_TABLE

Maps (token, context) pairs to parsing flags (a lookup sketch follows the list):

- `create_json_entry`: whether to create a JSON entry for this token; true if the token is a literal, number, string (but not a key), or the end of an object/array.
- `is_end_of_object_or_array`: whether the token ends an object/array.
- `is_start_of_object_or_array`: whether the token starts an object/array.
- `new_context`: which context to switch to (0 for object, 1 for array).
- `is_key_token`: whether the token is a key.
- `is_value_token`: whether the token is a value; true for `string_token`, `numeric_token`, and `literal_token`.
- `preserve_num_entries`: whether the current token preserves the existing count of entries at the current depth rather than resetting/incrementing it; 1 for tokens such as `NO_TOKEN`, `KEY_TOKEN`, `STRING_TOKEN`, `NUMERIC_TOKEN`, and `LITERAL_TOKEN`, and 0 for tokens such as `OBJECT_START_TOKEN`, `ARRAY_START_TOKEN`, `OBJECT_END_TOKEN`, and `ARRAY_END_TOKEN`.
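
The sketch below shows one way a (token, context) pair can index a flattened flags table. The layout and the flag struct are illustrative; the real `TOKEN_FLAGS_TABLE` encoding lives in `src/_table_generation/`, and the count of nine token kinds is taken from the names listed above.

```rust
// Illustrative only: one entry per (token, context) pair.
struct TokenFlags {
    create_json_entry: bool,
    is_end_of_object_or_array: bool,
    is_start_of_object_or_array: bool,
    new_context: Field, // 0 = object, 1 = array
    is_key_token: bool,
    is_value_token: bool,
    preserve_num_entries: bool,
}

// 9 token kinds * 2 contexts = 18 entries in this sketch.
fn lookup_token_flags(table: [TokenFlags; 18], token: u32, context: u32) -> TokenFlags {
    // context is 0 (object) or 1 (array), so the pair flattens to token * 2 + context.
    table[token * 2 + context]
}
```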

### JSON_CAPTURE_TABLE

Maps (escape_flag, scan_mode, ascii) tuples to scanning actions (a lookup sketch follows the list):

- `scan_token`: the next capture mode, based on the current capture mode and character; one of grammar_capture (`[`, `{`, `,`, `}`, `]`, `:`), string_capture, literal_capture, numeric_capture, or error_capture. For example, if we are in string_capture and the character is `"`, `scan_token` is set to grammar_capture because the string has ended and we return to grammar scanning. Likewise, if we are in numeric_capture and the character is not `0-9`, we return to grammar scanning because the number has ended.
- `push_transcript`: whether to add a token to the transcript. In grammar_capture this is true for all structural characters (`[`, `{`, `,`, `}`, `]`, `:`). In string_capture it is true for `"`, which signals the end of the string. In numeric_capture and literal_capture it is true for space, `\t`, `\n`, `\r`, `"`, and comma. Note that the first scan will not pick up numerics or literals because we do not know when they end, so we rely on the `capture_missing_tokens` function.
- `increase_length`: whether to extend the current token. Always false in grammar_capture; true for `0-9` in numeric_capture, for every character except `"` in string_capture, and for the letters of `true`, `false`, and `null` in literal_capture.
- `is_potential_escape_sequence`: true if the current character is `\` in string_capture mode.
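
The sketch below shows the shape of the lookup that drives the byte scan: every (escape_flag, scan_mode, ascii) combination maps to one precomputed action, so the scanner never branches on the character value itself. The struct, the mode count, and the flattened index layout are illustrative assumptions; the real `JSON_CAPTURE_TABLE` is defined in `src/_table_generation/`.

```rust
// Illustrative only: one precomputed action per (escape_flag, scan_mode, ascii) input.
struct CaptureAction {
    scan_token: Field,                  // next capture mode
    push_transcript: bool,              // emit a token into the transcript?
    increase_length: bool,              // extend the token currently being captured?
    is_potential_escape_sequence: bool, // saw '\' while inside a string?
}

global NUM_SCAN_MODES: u32 = 5; // grammar, string, literal, numeric, error capture

fn scan_byte(
    table: [CaptureAction; 2560], // 2 escape flags * 5 scan modes * 256 ascii values
    escape_flag: u32,             // 0 or 1
    scan_mode: u32,               // 0..NUM_SCAN_MODES
    ascii: u8
) -> CaptureAction {
    let index = (escape_flag * NUM_SCAN_MODES + scan_mode) * 256 + (ascii as u32);
    table[index]
}
```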