diff --git a/README.md b/README.md
index d9ff043..088402d 100644
--- a/README.md
+++ b/README.md
@@ -122,6 +122,64 @@ e.g. to take the existing 1kb JSON paramters, but also support 124-byte keys, us
 If you are deriving a key to look up in-circuit and you do not know the maximum length of the key, all query methods have a version with a `_var` suffix (e.g. `JSON::get_string_var`), which accepts the key as a `BoundedVec`
 
+# Architecture
+### Overview
+The JSON parser uses 5 steps to efficiently parse and index JSON data:
+
+1. **build_transcript** - Convert the raw bytes into a transcript of tokens using a state machine defined by `JSON_CAPTURE_TABLE`, categorizing each character as part of a string, number, literal or grammar symbol
+2. **capture_missing_tokens & keyswap** - Perform a second pass over the tokens to fix missing tokens (e.g. commas after literals) and relabel strings that act as object keys as key tokens
+3. **compute_json_packed** - Pack the JSON bytes into Field elements for efficient substring extraction
+4. **create_json_entries** - Create structured JSON entries with parent-child relationships
+5. **compute_keyhash_and_sort_json_entries** - Sort entries by key hash for efficient lookups
+
+### Key Design Patterns
+- **Table lookups**: many lookup tables replace branching logic, reducing circuit size
+- **Packing data into Field elements**: multiple values that encode different features are combined into a single Field element so they can be compared at once
+
+### Table Generation
+The parser uses several lookup tables generated from `src/_table_generation/`:
+- `TOKEN_FLAGS_TABLE`: State transitions for token processing
+- `JSON_CAPTURE_TABLE`: Character-by-character parsing rules
+- `TOKEN_VALIDATION_TABLE`: JSON grammar validation
+
+### Example walkthrough
+Consider how the raw JSON text `{"name": "Alice", "age": 30}` is parsed.
+First, the parser reads the JSON one character at a time and uses lookup tables to decide what to do with each character. For the simplified input `{"name": "Alice"}`:
+
+Character: { → "Start scanning an object (grammar_capture)"
+Character: " → "Start scanning a string"
+Character: n → "Continue scanning the string"
+Character: a → "Continue scanning the string"
+Character: m → "Continue scanning the string"
+Character: e → "Continue scanning the string"
+Character: " → "End the string"
+Character: : → "Key-value separator"
+Character: " → "Start scanning a string"
+Character: A → "Continue scanning the string"
+Character: l → "Continue scanning the string"
+Character: i → "Continue scanning the string"
+Character: c → "Continue scanning the string"
+Character: e → "Continue scanning the string"
+Character: " → "End the string"
+Character: } → "End the object"
+
+The parser builds a list of "tokens", the basic building blocks of the JSON, which becomes:
+1. BEGIN_OBJECT_TOKEN ({)
+2. STRING_TOKEN ("name")
+3. KEY_SEPARATOR_TOKEN (:)
+4. STRING_TOKEN ("Alice")
+5. END_OBJECT_TOKEN (})
+
+The parser then converts the tokens into structured entries with parent-child relationships. Each entry knows:
+- what type it is (object, string, number, etc.)
+- who its parent is
+- how many children it has
+- where it sits in the original JSON
+
+Finally, the parser sorts entries by their key hashes for fast lookups. For the full example `{"name": "Alice", "age": 30}`, the "name" and "age" entries may be reordered, e.g.:
+
+Original order: [{"name": "Alice"}, {"age": 30}]
+Sorted order: [{"age": 30}, {"name": "Alice"}]
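+
+The flag-driven, branch-free style described under "Key Design Patterns" is what keeps each of these steps cheap: 0/1 flags read from lookup tables are multiplied into arithmetic expressions instead of being branched on. Below is a minimal, self-contained Noir sketch of the idea (illustrative only, not the library's actual code); it mirrors how the per-depth entry counter `num_entries_at_current_depth` is updated in `json.nr`:
+
+```noir
+// If `preserve` (a 0/1 flag from a lookup table) is 1, keep the running count and add
+// `is_value`; if it is 0, the count resets to `is_value`. No `if` statement is needed,
+// so the circuit contains no branch.
+fn conditional_update(counter: Field, preserve: Field, is_value: Field) -> Field {
+    counter * preserve + is_value
+}
+
+fn main() {
+    // a value token inside an object: preserve = 1, is_value = 1 -> count increments
+    assert(conditional_update(2, 1, 1) == 3);
+    // an object/array start token: preserve = 0, is_value = 0 -> count resets
+    assert(conditional_update(2, 0, 0) == 0);
+}
+```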
+
 # Acknowledgements
 
 Many thanks to the authors of the OG noir json library https://github.com/RontoSOFT/noir-json-parser
diff --git a/src/_table_generation/table_generation.md b/src/_table_generation/table_generation.md
new file mode 100644
index 0000000..a3b7753
--- /dev/null
+++ b/src/_table_generation/table_generation.md
@@ -0,0 +1,71 @@
+# Table Generation Documentation
+
+## Overview
+The JSON parser uses lookup tables to avoid branching logic and reduce gate count. These tables are generated from `src/_table_generation/make_tables.nr`.
+
+## Generation Process
+Tables are generated by simulating all possible input combinations from basic hardcoded tables and recording the expected outputs.
+
+## TOKEN_FLAGS_TABLE
+Maps (token, context) pairs to parsing flags:
+- `create_json_entry`: whether to create a JSON entry for this token; true if the token is a literal, number, non-key string, or the end of an array/object
+- `is_end_of_object_or_array`: whether the token ends an object/array
+- `is_start_of_object_or_array`: whether the token starts an object/array
+- `new_context`: which context to switch to (object is 0, array is 1)
+- `is_key_token`: whether the token is a key
+- `is_value_token`: whether the token is a value; true for STRING_TOKEN, NUMERIC_TOKEN and LITERAL_TOKEN
+- `preserve_num_entries`: boolean flag that controls whether the current token preserves the existing count of entries at the current depth or resets/increments it. Set to 1 for tokens like NO_TOKEN, KEY_TOKEN, STRING_TOKEN, NUMERIC_TOKEN, LITERAL_TOKEN, and to 0 for tokens like OBJECT_START_TOKEN, ARRAY_START_TOKEN, OBJECT_END_TOKEN, ARRAY_END_TOKEN
+
+## JSON_CAPTURE_TABLE
+Maps (escape_flag, scan_mode, ascii) triples to scanning actions:
+- `scan_token`: the next capture mode given the current capture mode; one of grammar_capture (`[`, `{`, `,`, `}`, `]`, `:`), string_capture, literal_capture, numeric_capture or error_capture. For example, if we are in string capture and the character is `"`, scan_token is set to grammar_capture because the string has ended and we return to grammar scanning. If we are in numeric capture and the current character is not 0-9, we likewise return to grammar scanning because the number has ended.
+- `push_transcript`: whether to add a token to the transcript. In grammar mode: true for all structural elements `[`, `{`, `,`, `}`, `]`, `:`. In string_capture: true for `"`, which signals the end of the string. In numeric/literal_capture: true for space, \t, \n, \r, `"` and comma. Note that the first scan does not fully capture numerics or literals because we don't know where they end, so we rely on the capture_missing_tokens function afterwards.
+- `increase_length`: whether to extend the current token. Always false in grammar_capture; true for 0-9 in numeric capture, for all characters except `"` in string_capture, and for the letters of true, false and null in literal_capture
+- `is_potential_escape_sequence`: true if the current character is a backslash (`\`) while in string_capture mode
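+
+Before the lookup, the (escape_flag, scan_mode, ascii) triple is flattened into a single table index. The sketch below mirrors the `encoded_ascii` computation in `build_transcript` (a 1-bit escape flag, 2-bit scan mode and 8-bit ASCII value pack into a 2,048-entry table); the concrete scan-mode numbers used in the asserts are assumed purely for illustration:
+
+```noir
+fn encode_ascii(previous_was_escape: bool, scan_mode: Field, ascii: u8) -> Field {
+    // pack (escape, mode, byte) into one index: escape * 1024 + mode * 256 + byte
+    (previous_was_escape as Field) * 1024 + scan_mode * 256 + ascii as Field
+}
+
+fn main() {
+    // the byte '{' (0x7b) while in scan mode 0
+    assert(encode_ascii(false, 0, 0x7b) == 0x7b);
+    // the same byte in scan mode 1 lands in a different row of JSON_CAPTURE_TABLE
+    assert(encode_ascii(false, 1, 0x7b) == 256 + 0x7b);
+}
+```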
+
+## Other tables
+While TOKEN_FLAGS_TABLE and JSON_CAPTURE_TABLE are the more important tables, they are built from foundational hardcoded tables in make_tables_subtables.nr:
+
+GRAMMAR_CAPTURE_TABLE: State transition table for grammar scan mode. Each entry specifies the next scan mode (GRAMMAR_CAPTURE, STRING_CAPTURE, NUMERIC_CAPTURE, LITERAL_CAPTURE or ERROR_CAPTURE) based on the encountered ASCII character. For example, "f" is mapped to LITERAL_CAPTURE because it indicates we have begun to scan the literal false.
+STRING_CAPTURE_TABLE, NUMERIC_CAPTURE_TABLE, LITERAL_CAPTURE_TABLE: the corresponding state transition tables for string, numeric and literal scan modes.
+
+GRAMMAR_CAPTURE_TOKEN: Maps characters seen in grammar mode to token types, converting ASCII characters into the appropriate JSON token types for structural elements, values and literals:
+- Structural characters ({, }, [, ], ,, :) → their respective structural tokens
+- Quote (") → STRING_TOKEN (start of string)
+- Digits (0-9) → NUMERIC_TOKEN (start of number)
+- Literal starters (f, t, n) → LITERAL_TOKEN (start of true/false/null)
+- Invalid characters → NO_TOKEN or error handling
+STRING_CAPTURE_TOKEN, NUMERIC_CAPTURE_TOKEN, LITERAL_CAPTURE_TOKEN: the corresponding character-to-token mappings for string, numeric and literal scan modes.
+
+STRING_CAPTURE_PUSH_TRANSCRIPT: Determines when to add tokens to the transcript while scanning inside a string. Only true for the closing quote ("), which signals the end of the string and triggers token creation. All other characters within the string (letters, numbers, punctuation, spaces) are false because they extend the current string token rather than creating new tokens.
+
+GRAMMAR_CAPTURE_PUSH_TRANSCRIPT: Determines when to add tokens to the transcript while scanning in grammar mode:
+- Comma (,) → true (value separator)
+- Colon (:) → true (key-value separator)
+- All other characters → false (including digits, quotes, and literal starters)
+
+NUMERIC_CAPTURE_PUSH_TRANSCRIPT: Determines when to add the current numeric token to the transcript while scanning a number:
+- Whitespace (space, tab, newline, carriage return) → true (end number)
+- Quote (") → true (end number, followed by string)
+- Comma (,) → true (end number, followed by next value)
+- All other characters → false (extend current number or error)
+
+LITERAL_CAPTURE_PUSH_TRANSCRIPT: Determines when to add the current literal token (true/false/null) to the transcript while scanning a literal. True for any grammar character: , [ ] { } " space tab newline. (This is only used in the first scan; in the second step, capture_missing_tokens, the literal and the following value separator are split into distinct tokens.)
+
+GRAMMAR_CAPTURE_INCREASE_LENGTH: Determines when to extend the current token length while scanning in grammar mode. True for digits (0-9), which start a numeric scan, and for the letters used by literals (f, t, n, r, u, e, a, l, s), which start a literal scan. Structural tokens don't track a length (it is always 1), and string tokens always begin with a " before any letters are seen.
+
+STRING_CAPTURE_INCREASE_LENGTH: Determines when to extend the current string token while scanning inside a string. True for all printable characters except the quote ("), which ends the string.
+NUMERIC_CAPTURE_INCREASE_LENGTH: True for 0-9.
+LITERAL_CAPTURE_INCREASE_LENGTH: True for t, r, u, e, f, a, l, s, n.
+
+GRAMMAR_CAPTURE_ERROR_FLAG, STRING_CAPTURE_ERROR_FLAG, NUMERIC_CAPTURE_ERROR_FLAG, LITERAL_CAPTURE_ERROR_FLAG: flag tables marking characters that are invalid in each scan mode.
+
+PROCESS_RAW_TRANSCRIPT_TABLE: Used to post-process the raw transcript and add missing grammar tokens that were not captured during the initial scan in build_transcript. Input: the encoded_ascii of the last token in each entry (scan mode + ASCII character). Output: `token`, the token type for this entry; `new_grammar`, whether to add a missing grammar token; and `scan_token`, the type of grammar token to add if needed, such as END_OBJECT_TOKEN (}) or VALUE_SEPARATOR_TOKEN (comma).
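+
+All of these per-character subtables boil down to small constant arrays indexed by an ASCII byte (or by an encoded index, as above). As a rough, hypothetical sketch of the shape of such a table (simplified names and layout, not the actual make_tables_subtables.nr code):
+
+```noir
+// A 256-entry per-character table: true only for the bytes '0' (48) to '9' (57),
+// matching the description of NUMERIC_CAPTURE_INCREASE_LENGTH above.
+fn numeric_capture_increase_length() -> [bool; 256] {
+    let mut table: [bool; 256] = [false; 256];
+    for c in 48..58 {
+        table[c] = true;
+    }
+    table
+}
+
+fn main() {
+    let table = numeric_capture_increase_length();
+    assert(table[0x35]);  // '5' extends the current numeric token
+    assert(!table[0x2c]); // ',' does not; it ends the number instead
+}
+```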
\ No newline at end of file diff --git a/src/json.nr b/src/json.nr index a39b7b7..754165e 100644 --- a/src/json.nr +++ b/src/json.nr @@ -77,8 +77,7 @@ impl(); +unconstrained fn __check_entry_ptr_bounds(entry_ptr: u32, max: u32) { // n.b. even though this assert is in an unconstrained function, an out of bounds error will be triggered when writing into self.key_data[entry_ptr] assert(entry_ptr as u32 < max - 1, "create_json_entries: MaxNumValues limit exceeded!"); } @@ -117,6 +116,7 @@ impl `preserve_num_entries = 0` (start new object/array) + // - When `is_end_of_object_or_array = 1` -> `preserve_num_entries = 0` (end object/array) + // - When `preserve_num_entries = 1` -> both flags = 0 (normal token) + // + // 4 gates + { + let old = current_identity_value; + current_identity_value = (next_identity_value * is_start_of_object_or_array); + std::as_witness(current_identity_value); + current_identity_value = current_identity_value + + (previous_stack_entry.current_identity * is_end_of_object_or_array); + std::as_witness(current_identity_value); + current_identity_value = current_identity_value + old * preserve_num_entries; + std::as_witness(current_identity_value); + // If the current token creates an object or array, subsequent entries will be a child of this object + // i.e. we need to assign them a new identifier so increase `next_identity_value` + next_identity_value = next_identity_value + is_start_of_object_or_array; + std::as_witness(next_identity_value); + } + // Update the number of entries in the parent object/array + // Pseudocode: + // if (!preserve_num_entries && is_value_token) { + // num_entries_at_current_depth += 1; + // } else if (is_end_of_object_or_array) { + // num_entries_at_current_depth = previous_stack_entry.num_entries + 1; + // } // 2 gates + // If we ses a value token (string/number/literal), we add 1 to count. If we see , or :, no change. + // If preserve_num_entries is 0 (i.e. start or end of object or array) then we reset variable to 0. num_entries_at_current_depth = num_entries_at_current_depth * preserve_num_entries + is_value_token; std::as_witness(num_entries_at_current_depth); @@ -364,13 +485,45 @@ impl object, context == 1 => array) + // If current token is END_OBJECT_TOKEN or END_ARRAY_TOKEN, set context to the context value in previous_stack_entry + // (i.e. restore the context to whatever the parent of the object/array is) + // Pseudocode: + // if (is_end_of_object_or_array) { + // context = previous_stack_entry.context + // } else { + // context = new_context + // } // 1 gate // if `is_end_of_object_or_array == 1`, `new_context = 0` so we can do something cheaper than a conditional select: + // If is_end_of_object_or_array is 1, then new_context is 0, so set context = previous_stack_entry.context + // If is_end_of_object_or_array is 0, then set context = new_context context = cast_num_to_u32( previous_stack_entry.context * is_end_of_object_or_array + new_context, ); std::as_witness(context as Field); + + // Update data that describes the key for the current token. + // If we are creating a JSON entry, we also populate `self.key_data` with info that describes the current entry's key + // key_data contains 3 members that are packed into a Field: + // * the key index (where in the original JSON blob does the key start?) + // * the key length (length of the key in bytes) + // * current_identity_value (unique identifier for the key's JSON object. 
starts at 0) + // * in the current parent object/array, how many JSON entries deep is the key's associated JSON object? + // TODO: would be much more readable if we have a custom struct `KeyData` that wrapped a Field elemenet with sensible helper methods + // Pseudocode: + // if (create_json_entry) { + // let mut new_key_data; + // if (is_value_token) { + // new_key_data = make_key(current_key_index_and_length, current_identity_value, num_entries_at_current_depth - 1); + // } else if (is_end_of_object_or_array) { + // new_key_data = make_key(previous_stack_entry.current_key_index_and_length, current_identity_value, num_entries_at_current_depth - 1); + // } + // self.key_data[entry_ptr] = new_key_data; + // } // 3 gates + // If context is 0 (object context), then don't take the num_entries_at_current_depth term into account + // because searching for a key only depends of the key name, not position, as opposed to array context where we need to look up by position/index. let common_term = current_identity_value + context as Field * (num_entries_at_current_depth - 1) * 0x1000000000000; std::as_witness(common_term); @@ -382,29 +535,14 @@ impl(self.json) }; + // steps to verify the transcript is correct // 14 gates per iteration, plus fixed cost for initing 2,048 size lookup table (4,096 gates) let mut previous_was_potential_escape_sequence: bool = false; for i in 0..NumBytes { let ascii = self.json[i]; // 1 gate - let encoded_ascii = previous_was_potential_escape_sequence as Field * 1024 + let encoded_ascii: Field = previous_was_potential_escape_sequence as Field * 1024 + scan_mode * 256 + ascii as Field; std::as_witness(encoded_ascii); @@ -459,10 +598,12 @@ impl(); + scan_mode.assert_max_bit_size::<2>(); JSON { json: self.json, @@ -507,7 +648,8 @@ impl Self { let bytes: [u8; 7] = f.to_be_bytes(); let create_json_entry = bytes[0] != 0; @@ -31,17 +51,18 @@ impl TokenFlags { } } + /// Convert a Field element that contains a packed TokenFlags object into a real TokenFlags object pub(crate) fn from_field(f: Field) -> Self { // 10 gates // Safety: check the comments below let r = unsafe { TokenFlags::__from_field(f) }; - // asserts the relation of r and f assert(r.to_field() == f); r } - // 4 gates + /// Pack a TokenFlags object into a Field element + /// 4 gates pub(crate) fn to_field(self) -> Field { (self.preserve_num_entries as Field) + (self.is_value_token as Field) * 0x100