| 1 | +# Advanced Heuristics and Contextual Understanding in JsonRemedy |
| 2 | + |
| 3 | +Building upon the probabilistic repair model, this document explores advanced heuristics and enhancements to contextual understanding. These ideas aim to further refine JsonRemedy's ability to make intelligent repair decisions, leading to more accurate and semantically correct JSON outputs. |
| 4 | + |
| 5 | +## Enriching `JsonContext` for Deeper Understanding |
| 6 | + |
| 7 | +The `JsonContext` is pivotal for nuanced repairs. Beyond the previously suggested `last_significant_char`, `last_token_type`, and `lookahead_buffer`, we can incorporate more sophisticated tracking: |
| 8 | + |
| 9 | +1. **N-gram Token History**: |
| 10 | + * Instead of just the `last_token_type`, maintain a short history (e.g., the last 2-3 tokens). `[:key, :colon, :string_value]` provides much more context than just `:string_value`. |
| 11 | + * This can help differentiate ambiguous situations. For example, a number may be a legitimate element of a list (`[1, 2, 3]`) or an error (the second number in `{"key": 1 2}`); the preceding tokens distinguish the two cases, so each candidate repair can be costed accordingly. |
| 12 | + |
| 13 | +2. **Structural Depth and Type Stack**: |
| 14 | + * Maintain the current nesting `depth`. |
| 15 | + * Keep a `type_stack` (e.g., `[:object, :array, :object]`). This is more robust than just `current_type`. |
| 16 | + * This helps in validating structural integrity and applying repairs that are sensitive to nesting levels (e.g., maximum depth constraints, typical array/object patterns). |
| 17 | + |
| 18 | +3. **Key Duplication Tracking**: |
| 19 | + * Within an object context, keep a set of keys already encountered at the current nesting level. |
| 20 | + * This allows the system to assign a higher cost to repairs that would result in duplicate keys, or to automatically rename a duplicate key (e.g., `key_1`, `key_2`) with an associated cost. |
| 21 | + |
| 22 | +4. **Value Type Affinity**: |
| 23 | + * For arrays, observe the types of the elements seen so far. In `[1, 2, "abc", 3]`, the string `"abc"` might be an error because every other element is a number. A heuristic could assign a cost to type inconsistencies within an array. |
| 24 | + * Similarly, if a key `age` consistently has integer values, encountering `{"age": "forty"}` might trigger a higher cost for keeping it as a string versus attempting a conversion or flagging. This borders on semantic understanding. |
| 25 | + |
| 26 | +5. **Whitespace and Comment Significance**: |
| 27 | + * Track if significant whitespace (e.g., multiple newlines) or comments separate tokens. This can sometimes indicate intended separation or grouping that typical JSON parsers ignore but might be relevant for repair heuristics. |
| 28 | + * *Example*: `{"key1": "value1"}` followed by a blank line and then `{"key2": "value2"}` is more likely two objects needing to be wrapped in an array than the run-together `{"key1": "value1"}{"key2": "value2"}`. |
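The tracking ideas above (token history, depth and type stack, per-level key sets, array type affinity) can be sketched as a small data structure. This is a minimal illustration in Python rather than JsonRemedy's own code: the class layout, the method names (`push_token`, `open_container`, `register_key`) and the helper `array_type_mismatches` are all assumptions introduced for this example.

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class JsonContext:
    """Hypothetical enriched context; every field name here is illustrative."""

    history_size: int = 3
    # Most recent token types, oldest first, e.g. ["key", "colon", "string_value"].
    token_history: deque = field(default_factory=deque)
    # Open containers, outermost first, e.g. ["object", "array", "object"].
    type_stack: list = field(default_factory=list)
    # One set of seen keys per open container (stays empty for arrays).
    seen_keys: list = field(default_factory=list)

    @property
    def depth(self) -> int:
        return len(self.type_stack)

    def push_token(self, token_type: str) -> None:
        """Append a token type and drop the oldest beyond history_size."""
        self.token_history.append(token_type)
        while len(self.token_history) > self.history_size:
            self.token_history.popleft()

    def open_container(self, kind: str) -> None:
        self.type_stack.append(kind)
        self.seen_keys.append(set())

    def close_container(self) -> str:
        self.seen_keys.pop()
        return self.type_stack.pop()

    def register_key(self, key: str) -> bool:
        """Record a key at the current level; True means it is a duplicate.

        Assumes at least one container is open.
        """
        keys = self.seen_keys[-1]
        duplicate = key in keys
        keys.add(key)
        return duplicate


def array_type_mismatches(values: list) -> int:
    """Count elements whose type differs from the array's most common type."""
    if not values:
        return 0
    counts = Counter(type(v).__name__ for v in values)
    return len(values) - counts.most_common(1)[0][1]


ctx = JsonContext()
ctx.open_container("object")
for token in ["key", "colon", "string_value", "comma"]:
    ctx.push_token(token)
print(list(ctx.token_history))                          # ['colon', 'string_value', 'comma']
print(ctx.register_key("id"), ctx.register_key("id"))   # False True
print(array_type_mismatches([1, 2, "abc", 3]))          # 1
```

A repair rule could then read `token_history` and `type_stack` to pick a cost, and treat a `True` result from `register_key` or a non-zero mismatch count as a reason to raise it.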
| 31 | + |
| 32 | +## Advanced Heuristics for the Declarative Rule Set |
| 33 | + |
| 34 | +The declarative rule set within `Layer3.SyntaxNormalization` (and potentially other layers) can be expanded with more sophisticated rules: |
| 35 | + |
| 36 | +1. **Context-Sensitive Auto-Correction of Common Typos**: |
| 37 | + * **Rule**: An unquoted literal such as `flase`, `ture`, `nill`, or `Nnoe` appears in a value context. |
| 38 | + * **Repair**: Correct to the nearest valid JSON literal: `false`, `true`, or `null`. |
| 39 | + * **Cost**: Low. |
| 40 | + * **Context**: `JsonContext` indicates it's a value position. |
| 41 | + |
| 42 | +2. **Intelligent Missing Comma/Colon Insertion**: |
| 43 | + * **Rule**: If `JsonContext.last_token_type` is `:string_value` and the next token is `:string_literal` (unquoted) in an object key context. |
| 44 | + * **Repair A**: Insert comma (treat as `value, new_key`). Cost: Medium. |
| 45 | + * **Repair B**: Insert colon (treat as `{"original_value_as_key": new_key}`). Cost: High. |
| 46 | + * The enriched context (N-gram token history) helps decide. If `last_tokens` were `[:key, :colon, :string_value]`, Repair A is more likely. |
| 47 | + |
| 48 | +3. **Handling of Concatenated JSON in Strings**: |
| 49 | + * **Rule**: A string value itself contains what appears to be a complete JSON object or array (e.g., `"{"inner_key": "inner_value"}"`). |
| 50 | + * **Repair A**: Keep as an escaped string (default). Cost: Low. |
| 51 | + * **Repair B**: Unescape and parse it as a nested structure. Cost: Medium-High (as it changes semantics). |
| 52 | + * **Condition**: This could be triggered by a user option or if the outer JSON structure is otherwise trivial (e.g., just one key-value pair). |
| 53 | + |
| 54 | +4. **Heuristics for Truncated Structures**: |
| 55 | + * **Rule**: Input ends abruptly while `JsonContext.type_stack` is not empty (e.g., `{"key": ["value1",` ). |
| 56 | + * **Repair**: Add appropriate closing delimiters (`]` and `}`). |
| 57 | + * **Cost**: Medium, increases with the number of delimiters to add. |
| 58 | + * **Refinement**: When reading from a stream, if `lookahead_buffer` suggests that more data may still arrive, the cost of closing could be raised, or the rule could emit a "wait/retry" candidate instead. |
| 59 | + |
| 60 | +5. **Repairing Numeric Value Errors**: |
| 61 | + * **Rule**: A number contains multiple decimal points (`1.2.3`) or commas (`1,234.56`). JSON does not allow commas inside numbers, even though some locales use them as thousands separators. |
| 62 | + * **Repair A**: Treat as string. Cost: Medium. |
| 63 | + * **Repair B**: Attempt to fix based on common patterns (e.g., keep first decimal, remove others). Cost: Medium-High. |
| 64 | + * **Repair C (for `1,234.56`):** If a "locale" or "number style" option is active, parse by removing group separators. Cost: Low-Medium. |
| 65 | + |
| 66 | +6. **Semantic Heuristics (More Experimental)**: |
| 67 | + * **Rule**: Key name suggests a type (e.g., `isActive`, `count`, `nameList`). |
| 68 | + * **Context**: `JsonContext` includes a (possibly configurable) dictionary of common key names and their expected value types (e.g., `isActive: boolean`, `count: number`, `nameList: array`). |
| 69 | + * **Repair**: If the actual value type mismatches, assign a higher cost to keeping it as is versus attempting a conversion or flagging. |
| 70 | + * *Example*: `{"isActive": "True_String"}`. If `isActive` is known to expect a boolean, keeping the value as a string costs more, and converting `"True_String"` to `true` (if possible) costs less. |
| 71 | + * This is complex as it borders on schema validation/inference. |
| 72 | + |
| 73 | +7. **Balancing Repairs for Unmatched Delimiters**: |
| 74 | + * **Rule**: An unmatched closing delimiter (e.g., `}`) is found. |
| 75 | + * **Context**: `JsonContext.type_stack` shows the current open structure (e.g., `[:object, :array]`). |
| 76 | + * **Repair A**: Delete the unmatched delimiter. Cost: Medium. |
| 77 | + * **Repair B**: Insert corresponding opening delimiter(s) earlier in the text (if a plausible point can be found). Cost: High. |
| 78 | + * The beam search would explore both. If deleting the `}` leads to a valid parse with lower cost, it's preferred. If inserting `{[` earlier resolves more issues, that path might win. |
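Two of the rules above, typo correction and closing truncated structures, are simple enough to sketch as functions that return a repaired value together with a cost. The following Python is an illustration only: the numeric cost values, the typo table, the function names, and the choice to drop a dangling trailing comma before closing are assumptions, not part of JsonRemedy. It also does not handle input that is cut off in the middle of a string.

```python
import json

# Assumed numeric scale for the "Low / Medium / High" costs named in the rules.
COSTS = {"low": 1.0, "medium": 3.0, "high": 6.0}
# Assumed extra cost per additional closing delimiter.
PER_DELIMITER = 0.5

# Assumed typo table; "Nnoe" is read as a misspelled None, hence null.
LITERAL_TYPOS = {"flase": "false", "ture": "true", "nill": "null", "Nnoe": "null"}

CLOSERS = {"object": "}", "array": "]"}


def repair_literal(token, in_value_position):
    """Return (replacement, cost) for a known typo in a value position, else None."""
    if in_value_position and token in LITERAL_TYPOS:
        return LITERAL_TYPOS[token], COSTS["low"]
    return None


def close_truncated(text, type_stack):
    """Close every open container, innermost first, and cost the repair.

    type_stack lists open containers outermost first, e.g. ["object", "array"].
    The cost starts at medium and grows with each extra delimiter added.
    """
    if not type_stack:
        return text, 0.0
    trimmed = text.rstrip()
    if trimmed.endswith(","):
        # Assumption: a dangling comma before the cut-off point is dropped.
        trimmed = trimmed[:-1]
    closers = "".join(CLOSERS[kind] for kind in reversed(type_stack))
    cost = COSTS["medium"] + (len(closers) - 1) * PER_DELIMITER
    return trimmed + closers, cost


print(repair_literal("flase", in_value_position=True))   # ('false', 1.0)

repaired, cost = close_truncated('{"key": ["value1",', ["object", "array"])
print(repaired, cost)        # {"key": ["value1"]} 3.5
print(json.loads(repaired))  # {'key': ['value1']}
```

Returning a `(result, cost)` pair keeps each rule independent of the search strategy: a beam search can collect the candidates from all applicable rules and rank them by accumulated cost.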
| 79 | + |
| 80 | +## Dynamic Heuristic Adjustment |
| 81 | + |
| 82 | +- **Feedback Loop**: If the `Validation` layer (Layer 4) frequently rejects candidates derived from a specific heuristic, the cost associated with that heuristic could be dynamically (or manually) increased. |
| 83 | +- **Source-Specific Profiles**: If JsonRemedy often processes data from a known source with idiosyncratic error patterns, users could define profiles that adjust costs for certain rules or enable source-specific heuristics. |
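One way to picture the feedback loop and the source profiles is a per-heuristic counter whose rejection rate scales the base cost, plus a dictionary of per-source overrides. This Python sketch is hypothetical throughout: `HeuristicStats`, `adjusted_cost`, `apply_profile`, the linear scaling formula, and the rule names in the usage lines are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeuristicStats:
    """Tracks how often the validation layer accepts a heuristic's candidates."""

    base_cost: float
    accepted: int = 0
    rejected: int = 0

    def record(self, accepted: bool) -> None:
        if accepted:
            self.accepted += 1
        else:
            self.rejected += 1

    def adjusted_cost(self, max_multiplier: float = 2.0, min_samples: int = 4) -> float:
        """Scale base_cost linearly with the rejection rate.

        Below min_samples observations the base cost is returned unchanged,
        so a few early rejections do not distort the ranking.
        """
        total = self.accepted + self.rejected
        if total < min_samples:
            return self.base_cost
        rejection_rate = self.rejected / total
        return self.base_cost * (1.0 + (max_multiplier - 1.0) * rejection_rate)


def apply_profile(default_costs: dict, profile: dict) -> dict:
    """Return a copy of default_costs with a source profile's overrides applied."""
    merged = dict(default_costs)
    merged.update(profile)
    return merged


stats = HeuristicStats(base_cost=4.0)
for outcome in [False, True, False, True]:
    stats.record(outcome)
print(stats.adjusted_cost())  # 6.0

defaults = {"insert_comma": 3.0, "close_truncated": 3.0}
print(apply_profile(defaults, {"close_truncated": 1.0}))
# {'insert_comma': 3.0, 'close_truncated': 1.0}
```

The linear scaling is only one option; a decaying average or a manually reviewed adjustment would fit the same interface.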
| 84 | + |
| 85 | +By combining a richer understanding of the JSON's context with a flexible set of advanced heuristics (each with a well-considered cost), JsonRemedy can significantly improve its ability to not just fix syntax but to infer the most probable *intended* structure of malformed JSON. |