
Commit 0fe7e83

Add documentation for innovative JSON repair ideas
This commit introduces several new markdown files in the root directory, expanding on innovative concepts for the JsonRemedy library, largely inspired by the existing `docs/design/1.md`. The new files are:

- `INNOVATIONS_IN_JSON_REPAIR.md`: Summarizes foundational concepts like the probabilistic repair model, cost system, beam search, enhanced contextual awareness, and declarative rule sets.
- `PROBABILISTIC_REPAIR_MODEL.md`: Elaborates on the mechanics of the cost system (how costs are defined and influenced) and the beam search engine (workflow, beam width trade-offs, interaction with layers).
- `ADVANCED_HEURISTICS.md`: Explores ideas for more sophisticated heuristics, methods for enriching the `JsonContext` (e.g., N-gram token history, structural depth, key duplication tracking), and examples of advanced rules for the declarative rule set.
- `ADAPTIVE_REPAIR.md`: Discusses potential future-state capabilities like dynamic cost adjustments, learning common non-standard patterns from specific sources, adaptive beam width, and using statistical heuristics from data corpora.

These documents aim to iterate on and build out the vision for a world-class, intelligent JSON repair library in Elixir.
1 parent 8906998 commit 0fe7e83

File tree

4 files changed: +322, -0 lines changed


ADAPTIVE_REPAIR.md

Lines changed: 65 additions & 0 deletions
# Adaptive Repair Mechanisms and Self-Learning for JsonRemedy

This document delves into more advanced, potentially future-state capabilities for JsonRemedy: adaptive repair mechanisms and self-learning. These concepts aim to enable the system to improve its repair strategies over time by learning from the data it processes and the success of its repair attempts.

## Core Idea: Learning from Experience

The probabilistic repair model, with its cost system and beam search, provides a strong foundation. Adaptive mechanisms would build on this by allowing the "costs" and even the "rules" to evolve.

### 1. Dynamic Cost Adjustments

- **Concept**: The costs associated with specific repair rules or heuristics would not be static but could be adjusted based on their effectiveness.
- **Mechanism**:
  - **Success Tracking**: When a repair path (a sequence of applied rules) leads to successfully validated JSON, the rules involved in that path could have their costs slightly decreased (making them "preferred" in the future).
  - **Failure Tracking**: If a candidate resulting from a specific rule consistently fails validation or leads to very high-cost paths that are pruned, the cost of that rule could be slightly increased.
  - **Feedback Granularity**: This feedback could be global (across all uses of JsonRemedy) if data can be aggregated, or local to a specific instance or session.
- **Challenges**:
  - Avoiding overfitting to specific datasets.
  - Ensuring stability and preventing costs from oscillating wildly.
  - Determining the appropriate learning rate or magnitude of cost adjustments.

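The success/failure mechanism above can be sketched as a small, bounded update. This is an illustrative Python sketch (JsonRemedy itself is written in Elixir); the function name, learning rate, and clamping bounds are assumptions, not library API:

```python
# Illustrative sketch only; names, rates, and bounds are assumptions.

def adjust_cost(current_cost, succeeded, learning_rate=0.05,
                min_cost=0.1, max_cost=10.0):
    """Nudge a rule's cost down after a validated success, up after a failure.

    The small learning rate and the clamping bounds address two of the
    challenges above: wild oscillation, and a cost collapsing toward zero
    (overfitting to one dataset).
    """
    direction = -1.0 if succeeded else 1.0
    new_cost = current_cost * (1.0 + direction * learning_rate)
    return max(min_cost, min(max_cost, new_cost))
```

A multiplicative update with clamping is one simple way to keep adjustments stable; a real implementation would also decide where the success/failure signal is aggregated (globally or per session).
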
### 2. Learning Common Non-Standard Patterns

- **Concept**: JsonRemedy could identify recurring non-standard patterns from a specific data source and learn to treat them as "normal" for that source, effectively creating source-specific repair profiles.
- **Mechanism**:
  - **Pattern Detection**: If the same sequence of high-cost repairs is frequently applied to inputs from a particular source (e.g., identified by a metadata tag or API endpoint), this sequence could be recognized as a "custom pattern" for that source.
  - **Rule Generation/Cost Lowering**:
    - A new, specific repair rule could be suggested or automatically generated to handle this pattern with a lower intrinsic cost *when that source profile is active*.
    - Alternatively, the costs of the existing rules that combine to fix this pattern could be temporarily lowered for that source.
- **Example**: A legacy system always outputs `{'key': 'value', 'date': 'YYYY/MM/DD'}` (using single quotes and a specific date format). JsonRemedy might initially use several high-cost rules. Over time, it could learn that for "LegacySystemX", single quotes are common (lower cost for `'` -> `"` conversion) and that `YYYY/MM/DD` is a valid date representation (lower cost for a rule that normalizes this specific date format).
- **User Interaction**: This would likely require user confirmation to prevent the system from learning incorrect patterns: "JsonRemedy has noticed that pattern X results in repair Y, 100 times from source Z. Would you like to create a specialized rule for this?"

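A source profile could be as simple as a table of per-source cost overrides layered over the defaults. The sketch below is illustrative Python (the library is Elixir); the rule names, costs, and profile format are all invented for the example:

```python
# Hypothetical per-source cost overrides; every name and value is invented.

DEFAULT_COSTS = {
    "single_to_double_quote": 3.0,
    "normalize_date_slashes": 5.0,
}

PROFILES = {
    # User-approved (or learned) overrides for the legacy source above.
    "LegacySystemX": {
        "single_to_double_quote": 0.5,
        "normalize_date_slashes": 1.0,
    },
}

def rule_cost(rule_name, source=None):
    """Effective cost of a rule, honoring the active source profile."""
    return PROFILES.get(source, {}).get(rule_name, DEFAULT_COSTS[rule_name])
```

With this shape, "learning" a pattern reduces to writing (or proposing) a new entry in a profile, which keeps the learned state inspectable and easy to reset.
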
### 3. Adaptive Beam Width

- **Concept**: The `beam_width` for the search engine could be adjusted dynamically.
- **Mechanism**:
  - If repair processes are consistently finding valid JSON quickly, with few candidates diverging significantly in cost, the beam width could be narrowed to improve performance.
  - If repairs are often failing, or many candidates have similar costs (indicating high ambiguity), the beam width could be temporarily widened to explore more possibilities.
  - This could also be influenced by the complexity or length of the input JSON.

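One way to operationalize "similar costs means high ambiguity" is to look at the cost spread of the surviving candidates. A minimal Python sketch, where the thresholds and bounds are assumptions:

```python
# Illustrative sketch; thresholds and width bounds are assumptions.

def next_beam_width(current_width, candidate_costs,
                    min_width=2, max_width=16, spread_threshold=1.0):
    """Widen the beam when surviving candidates have similar costs (high
    ambiguity); narrow it when one candidate clearly dominates."""
    if len(candidate_costs) < 2:
        return max(min_width, current_width - 1)
    spread = max(candidate_costs) - min(candidate_costs)
    if spread < spread_threshold:
        return min(max_width, current_width + 2)   # ambiguous: explore more
    return max(min_width, current_width - 1)       # clear winner: save work
```
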
## Statistical Heuristics from Data Corpora

- **Concept**: Analyze large corpora of known-bad and known-good JSON pairs (or just known-bad JSON that has been manually repaired) to derive statistical priors for repair costs.
- **Mechanism**:
  - Mine datasets like GitHub, Stack Overflow, or internal company logs for examples of malformed JSON and their fixes.
  - Calculate frequencies of certain errors (e.g., missing commas vs. unquoted keys).
  - Use these frequencies to inform the baseline costs of repair rules. More common errors might get slightly lower default costs.
- **Benefit**: This would make the default heuristics more aligned with real-world error distributions.

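Since the cost system already treats cost as a negative log-likelihood, mined frequencies translate directly into baseline costs. An illustrative Python sketch; the error names and counts are invented:

```python
import math

# Turn mined error frequencies into negative-log-likelihood baseline costs.
# Error names and counts below are invented for illustration.

def baseline_costs(error_counts):
    """More frequent errors receive lower default costs."""
    total = sum(error_counts.values())
    return {error: -math.log(count / total)
            for error, count in error_counts.items()}

costs = baseline_costs({"missing_comma": 60, "unquoted_key": 30, "bad_literal": 10})
# "missing_comma" gets the lowest cost, "bad_literal" the highest.
```
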
## Challenges and Considerations

- **Complexity**: Implementing self-learning mechanisms adds significant complexity to the system.
- **Performance**: Learning processes, especially if run synchronously, could impact repair performance. Asynchronous learning and updates would be preferred.
- **Transparency and Debuggability**: It must remain clear why the system made a particular repair. Learned adjustments should be inspectable.
- **User Control**: Users should be able to disable learning, reset learned adaptations, or explicitly approve/reject learned patterns.
- **Data Requirements**: Effective learning often requires substantial amounts of data.
- **Risk of "Bad Learning"**: If not carefully designed, the system could learn incorrect patterns, leading to worse, not better, repairs.

## Potential Implementation Stages

1. **Manual/Configurable Profiles**: Allow users to define source-specific cost adjustments or rule sets as a first step.
2. **Basic Success/Failure Cost Adjustments**: Implement simple dynamic cost changes based on rule success in validated JSON.
3. **Pattern Suggestion**: Introduce mechanisms to detect frequent, high-cost repair sequences and suggest them to the user for codification into a lower-cost rule or profile.
4. **Automated Learning (Experimental)**: More advanced, automated learning would be a long-term research area.

Adaptive repair and self-learning are ambitious goals but represent the frontier for making JsonRemedy a truly intelligent and evolving tool that not only fixes JSON but also adapts to the ever-changing landscape of data sources and their quirks.

ADVANCED_HEURISTICS.md

Lines changed: 85 additions & 0 deletions
# Advanced Heuristics and Contextual Understanding in JsonRemedy

Building upon the probabilistic repair model, this document explores advanced heuristics and enhancements to contextual understanding. These ideas aim to further refine JsonRemedy's ability to make intelligent repair decisions, leading to more accurate and semantically correct JSON outputs.

## Enriching `JsonContext` for Deeper Understanding

The `JsonContext` is pivotal for nuanced repairs. Beyond the previously suggested `last_significant_char`, `last_token_type`, and `lookahead_buffer`, we can incorporate more sophisticated tracking:

1. **N-gram Token History**:
   * Instead of just the `last_token_type`, maintain a short history (e.g., the last 2-3 tokens). `[:key, :colon, :string_value]` provides much more context than just `:string_value`.
   * This can help differentiate ambiguous situations. For example, a standalone number might be part of a list `[1, 2, 3]` or an error `{"key": 1 2}`. Token history can help assign costs.

2. **Structural Depth and Type Stack**:
   * Maintain the current nesting `depth`.
   * Keep a `type_stack` (e.g., `[:object, :array, :object]`). This is more robust than just `current_type`.
   * This helps in validating structural integrity and applying repairs that are sensitive to nesting levels (e.g., maximum depth constraints, typical array/object patterns).

3. **Key Duplication Tracking**:
   * Within an object context, keep a set of keys already encountered at the current nesting level.
   * This allows the system to assign a higher cost to repairs that would result in duplicate keys, or to automatically rename a duplicate key (e.g., `key_1`, `key_2`) with an associated cost.

4. **Value Type Affinity**:
   * For arrays, observe the types of initial elements. If an array starts with `[1, 2, "abc", 3]`, the string `"abc"` might be an error. A heuristic could assign a cost to type inconsistencies within an array.
   * Similarly, if a key `age` consistently has integer values, encountering `{"age": "forty"}` might trigger a higher cost for keeping it as a string versus attempting a conversion or flagging. This borders on semantic understanding.

5. **Whitespace and Comment Significance**:
   * Track whether significant whitespace (e.g., multiple newlines) or comments separate tokens. This can sometimes indicate intended separation or grouping that typical JSON parsers ignore but might be relevant for repair heuristics.
   * *Example*: `{"key1": "value1"}` separated from `{"key2": "value2"}` by blank lines is more likely two objects needing to be wrapped in an array than the directly concatenated `{"key1": "value1"}{"key2": "value2"}`.

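The enrichment ideas above can be combined into one context structure. The sketch below is illustrative Python (the real `JsonContext` would be an Elixir struct); every field and method name is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class JsonContext:
    """Hypothetical enriched parsing context; all names are illustrative."""
    token_history: list = field(default_factory=list)  # last N token types
    type_stack: list = field(default_factory=list)     # e.g. ["object", "array"]
    seen_keys: list = field(default_factory=list)      # per-object sets of keys

    def push_token(self, token_type, n=3):
        # Keep an n-gram window of the most recent token types.
        self.token_history = (self.token_history + [token_type])[-n:]

    def enter(self, container):
        # container is "object" or "array"; objects get a fresh key set.
        self.type_stack.append(container)
        if container == "object":
            self.seen_keys.append(set())

    def record_key(self, key):
        """Return True if `key` duplicates one already seen at this level."""
        duplicate = key in self.seen_keys[-1]
        self.seen_keys[-1].add(key)
        return duplicate

    @property
    def depth(self):
        return len(self.type_stack)
```

Heuristics would then consult `token_history` for n-gram decisions, `type_stack`/`depth` for structural repairs, and `record_key` to cost duplicate-key repairs.
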
## Advanced Heuristics for the Declarative Rule Set

The declarative rule set within `Layer3.SyntaxNormalization` (and potentially other layers) can be expanded with more sophisticated rules:

1. **Context-Sensitive Auto-Correction of Common Typos**:
   * **Rule**: An unquoted literal like `flase`, `ture`, `nill`, or `Nnoe` appears in a value context.
   * **Repair**: Correct to `false`, `true`, or `null`.
   * **Cost**: Low.
   * **Context**: `JsonContext` indicates it's a value position.

2. **Intelligent Missing Comma/Colon Insertion**:
   * **Rule**: `JsonContext.last_token_type` is `:string_value` and the next token is an unquoted `:string_literal` in an object key context.
   * **Repair A**: Insert a comma (treat as `value, new_key`). Cost: Medium.
   * **Repair B**: Insert a colon (treat as `{"original_value_as_key": new_key}`). Cost: High.
   * The enriched context (N-gram token history) helps decide. If `last_tokens` were `[:key, :colon, :string_value]`, Repair A is more likely.

3. **Handling of Concatenated JSON in Strings**:
   * **Rule**: A string value itself contains what appears to be a complete JSON object or array (e.g., `"{\"inner_key\": \"inner_value\"}"`).
   * **Repair A**: Keep as an escaped string (default). Cost: Low.
   * **Repair B**: Unescape and parse it as a nested structure. Cost: Medium-High (as it changes semantics).
   * **Condition**: This could be triggered by a user option, or if the outer JSON structure is otherwise trivial (e.g., just one key-value pair).

4. **Heuristics for Truncated Structures**:
   * **Rule**: Input ends abruptly while `JsonContext.type_stack` is not empty (e.g., `{"key": ["value1",`).
   * **Repair**: Add the appropriate closing delimiters (`]` and `}`).
   * **Cost**: Medium, increasing with the number of delimiters to add.
   * **Refinement**: If the `lookahead_buffer` (when reading from a stream) suggests more data might come, the cost of closing could be higher, or the engine might generate a "wait/retry" candidate.

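The truncation repair follows directly from the `type_stack`: close the innermost open structures first. An illustrative Python sketch; the cost constant is an assumption:

```python
# Illustrative sketch of rule 4; the per-delimiter cost is an assumption.

def close_truncated(text, type_stack, cost_per_delimiter=2.0):
    """Close every structure still open when input ends abruptly.

    The cost grows with the number of delimiters added, matching the
    "Medium, increasing with the number of delimiters" heuristic above.
    """
    closers = {"object": "}", "array": "]"}
    # Close innermost structures first, hence the reversed stack.
    suffix = "".join(closers[t] for t in reversed(type_stack))
    return text + suffix, cost_per_delimiter * len(suffix)

repaired, cost = close_truncated('{"key": ["value1"', ["object", "array"])
# repaired == '{"key": ["value1"]}', cost == 4.0
```
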
5. **Repairing Numeric Value Errors**:
   * **Rule**: A number contains multiple decimal points (`1.2.3`) or misplaced commas (`1,234.56` where the comma is not a thousands separator in the active locale).
   * **Repair A**: Treat as a string. Cost: Medium.
   * **Repair B**: Attempt to fix based on common patterns (e.g., keep the first decimal point, remove the others). Cost: Medium-High.
   * **Repair C (for `1,234.56`)**: If a "locale" or "number style" option is active, parse the number by removing group separators. Cost: Low-Medium.

6. **Semantic Heuristics (More Experimental)**:
   * **Rule**: A key name suggests a type (e.g., `isActive`, `count`, `nameList`).
   * **Context**: `JsonContext` includes a (possibly configurable) dictionary of common key names and their expected value types (e.g., `isActive: boolean`, `count: number`, `nameList: array`).
   * **Repair**: If the actual value type mismatches, assign a higher cost to keeping it as is versus attempting a conversion or flagging.
   * *Example*: `{"isActive": "True_String"}`. The cost to keep the value as a string is higher if `isActive` is known to expect a boolean; the cost to convert `"True_String"` to `true` (if possible) is lower.
   * This is complex, as it borders on schema validation/inference.

7. **Balancing Repairs for Unmatched Delimiters**:
   * **Rule**: An unmatched closing delimiter (e.g., `}`) is found.
   * **Context**: `JsonContext.type_stack` shows the current open structure (e.g., `[:object, :array]`).
   * **Repair A**: Delete the unmatched delimiter. Cost: Medium.
   * **Repair B**: Insert corresponding opening delimiter(s) earlier in the text (if a plausible point can be found). Cost: High.
   * The beam search would explore both. If deleting the `}` leads to a valid parse with lower cost, it's preferred. If inserting `{[` earlier resolves more issues, that path might win.

## Dynamic Heuristic Adjustment

- **Feedback Loop**: If the `Validation` layer (Layer 4) frequently rejects candidates derived from a specific heuristic, the cost associated with that heuristic could be dynamically (or manually) increased.
- **Source-Specific Profiles**: If JsonRemedy often processes data from a known source with idiosyncratic error patterns, users could define profiles that adjust costs for certain rules or enable source-specific heuristics.

By combining a richer understanding of the JSON's context with a flexible set of advanced heuristics (each with a well-considered cost), JsonRemedy can significantly improve its ability to not just fix syntax but to infer the most probable *intended* structure of malformed JSON.

INNOVATIONS_IN_JSON_REPAIR.md

Lines changed: 58 additions & 0 deletions
# Innovations in JSON Repair for JsonRemedy

This document outlines foundational innovative ideas for advancing the capabilities of the JsonRemedy library, drawing inspiration from the concepts detailed in `docs/design/1.md`. These ideas aim to elevate JsonRemedy to a world-class JSON repair tool by moving beyond deterministic fixes to a more intelligent, adaptable, and robust system.

## Core Concepts for Next-Generation JSON Repair

The central theme is a shift from a linear, deterministic repair pipeline to a **probabilistic and context-aware repair engine**. This engine would explore multiple potential fixes and select the most likely valid JSON structure.

### 1. Probabilistic Repair Model & Cost System

- **Concept**: Instead of a layer making a single, definitive change, it proposes multiple *repair candidates*.
- **Cost Assignment**: Each candidate is assigned a "cost" (or negative log-likelihood). This cost quantifies how drastic or unusual the repair is.
  - Simple fixes (e.g., correcting a quote type) have low costs.
  - Complex changes (e.g., deleting a significant portion of text or deeply restructuring nested elements) have high costs.
- **Goal**: To find the repair path that results in valid JSON with the minimum total cost. This reframes repair as finding the "most probable valid JSON given the malformed input."

### 2. Beam Search Engine

- **Concept**: To manage the exploration of multiple repair candidates without exponential complexity, a **beam search** algorithm would be employed.
- **Workflow**:
  1. The engine starts with the initial input as the first candidate (cost 0).
  2. Each layer processes the current set of promising candidates (those within the "beam").
  3. A layer can generate multiple new candidates from each input candidate.
  4. After each layer, the engine prunes the expanded list of candidates, keeping only the top `N` (the `beam_width`) lowest-cost candidates.
  5. This process continues through all layers.
- **Outcome**: The final selection is the candidate with the lowest cumulative cost that successfully validates as JSON.
- **Benefit**: This allows JsonRemedy to explore various repair hypotheses simultaneously and choose the globally most plausible one, rather than getting stuck on a locally optimal but incorrect fix.

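The workflow above can be sketched compactly. This is illustrative Python, not JsonRemedy's Elixir implementation; the layer protocol assumed here (a function from text to a list of `(cost_delta, new_text)` candidates) and the toy `quote_layer` are both invented for the example:

```python
import heapq
import json

def beam_search_repair(input_text, layers, beam_width=3):
    """Run candidates through each layer, pruning to the cheapest beam_width
    after every expansion, then return the cheapest candidate that validates."""
    beam = [(0.0, input_text)]                        # (cumulative_cost, text)
    for layer in layers:
        expanded = [(cost + delta, repaired)
                    for cost, text in beam
                    for delta, repaired in layer(text)]
        beam = heapq.nsmallest(beam_width, expanded)  # prune to the top N
    for cost, text in sorted(beam):                   # cheapest first
        try:
            json.loads(text)                          # Layer-4-style validation
            return text, cost
        except ValueError:
            continue
    return None, None

def quote_layer(text):
    # A toy layer: keep the input unchanged (cost 0) or normalize single
    # quotes to double quotes (cost 1).
    return [(0.0, text), (1.0, text.replace("'", '"'))]

repaired, cost = beam_search_repair("{'a': 1}", [quote_layer])
# The zero-cost candidate fails validation, so the cost-1 repair wins:
# repaired == '{"a": 1}', cost == 1.0
```

Note how the final validation pass realizes the "Outcome" step: the lowest-cost candidate is preferred, but only if it actually parses.
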
### 3. Enhanced Contextual Awareness

- **Concept**: To make more informed repair decisions and assign costs more accurately, the `JsonContext` (the data structure tracking the state of parsing) needs to be significantly enriched.
- **Enhancements**: Beyond basic state (e.g., inside an object, inside an array), the context should track:
  - `last_significant_char`: The last non-whitespace character encountered.
  - `last_token_type`: The type of the last logical JSON token (e.g., string, number, brace).
  - `lookahead_buffer`: A small buffer of upcoming characters to allow for more informed decisions without extensive re-parsing.
- **Benefit**: A richer context enables more nuanced heuristics. For example, the cost of inserting a comma can be very low if the `last_token_type` was a value and the next token also appears to be a value.

### 4. Declarative Rule Set for Heuristics

- **Concept**: Many complex repair heuristics currently embedded in imperative code (as seen in some other JSON repair libraries) should be codified as a **declarative, extensible rule set**.
- **Structure**: Each rule would define:
  - A `name` for the rule.
  - A `context_pattern` to match against the current `JsonContext`.
  - A `char_pattern` to match against the upcoming text.
  - The `repair` action to take (e.g., insert character, replace text).
  - The `cost` associated with applying this repair.
- **Benefit**: This approach makes the repair logic more transparent and easier to understand, extend, test, and maintain. Complex decision-making becomes a matter of defining and ordering rules.

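A rule with the five fields listed above can be expressed as pure data plus a tiny matcher. The sketch below is illustrative Python (not the library's actual API); the concrete rule, the pattern formats, and the matcher are all assumptions:

```python
# One hypothetical declarative rule plus a minimal matcher; every concrete
# value and the pattern formats are invented for illustration.

RULES = [
    {
        "name": "fix_true_typo",
        "context_pattern": {"position": "value"},  # must hold in the context
        "char_pattern": "ture",                    # upcoming text to match
        "repair": lambda text, i: text[:i] + "true" + text[i + 4:],
        "cost": 0.5,
    },
]

def apply_rules(text, i, context):
    """Yield a (repaired_text, cost) candidate for every rule matching at
    offset i, for the beam search engine to weigh against alternatives."""
    for rule in RULES:
        context_ok = all(context.get(key) == value
                         for key, value in rule["context_pattern"].items())
        if context_ok and text.startswith(rule["char_pattern"], i):
            yield rule["repair"](text, i), rule["cost"]

candidates = list(apply_rules('{"ok": ture}', 7, {"position": "value"}))
# candidates == [('{"ok": true}', 0.5)]
```

Because rules are data, adding a new heuristic means appending an entry rather than editing control flow, which is the extensibility benefit described above.
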
## Implications

Adopting these innovations would represent a significant evolution for JsonRemedy:

- **Increased Robustness**: Ability to handle more ambiguous and complex errors by exploring multiple solutions.
- **Greater Accuracy**: The cost system, informed by rich context, can lead to more semantically correct repairs.
- **Improved Extensibility**: New repair strategies can be added more easily via the declarative rule set.
- **Principled Design**: Moves from purely heuristic-based fixes to a more formal, tunable model of repair.

These ideas lay the groundwork for a truly intelligent JSON repair system. Further documents will explore specific aspects of this vision in more detail.
