update task solver agent

EvenSol · EvenSol · commit 6cc8e55d655a · 2026-04-09T22:10:02.000+02:00
diff --git a/.github/agents/solve.task.agent.md b/.github/agents/solve.task.agent.md
@@ -1357,7 +1357,18 @@ Document the independent check in `step2_analysis/notes.md` under a
     you MUST update these strings to match the latest results. Where possible, let
     conclusions come from `results.json["conclusions"]` instead of hardcoding.
 
-17. **Run the report generator** to produce the engineering report (Word + HTML):
+17. **Run consistency checker** (MANDATORY before report generation):
+    ```
+    Run in terminal: python devtools/consistency_checker.py task_solve/YYYY-MM-DD_slug/
+    ```
+    The consistency checker:
+    - Extracts numerical values from all notebooks and results.json
+    - Detects inconsistencies: numerical mismatches, scope mismatches (e.g., volumetric vs mass-based), contradictory claims
+    - Produces `consistency_report.json` in the task folder
+    - **Fix any CRITICAL issues before generating the report**
+    - Common issue: external study data (e.g., Gudrun paper) measuring different quantities than notebook calculations — these need clarification in the report, not "fixing"
+
+18. **Run the report generator** to produce the engineering report (Word + HTML):
     ```
     Run in terminal: python step3_report/generate_report.py
     ```
@@ -1376,19 +1387,19 @@ Document the independent check in `step2_analysis/notes.md` under a
     - All formatting renders automatically when corresponding keys exist in
       `results.json` — no custom rendering code needed per task
 
-18. **Update the task README** (`README.md` in the task folder):
+19. **Update the task README** (`README.md` in the task folder):
     - Fill in the Problem Statement
     - Check off completed steps
     - Write the Key Results section
 
 ### Phase 4: Knowledge Capture & Contribution
 
-19. **Identify reusable outputs**:
+20. **Identify reusable outputs**:
     - If the notebook is generally useful → mention it could go to `examples/notebooks/`
     - If a NeqSim API gap was found → document it for future development
     - If a new pattern was discovered → note it for `CODE_PATTERNS.md`
 
-20. **Fix and improve documentation** encountered during the task:
+21. **Fix and improve documentation** encountered during the task:
     - If you found **errors** in existing docs (wrong API signatures, outdated
       patterns, incorrect examples), fix them and include the fixes in the PR.
     - If you discovered **missing documentation** (undocumented classes, missing
@@ -1399,7 +1410,7 @@ Document the independent check in `step2_analysis/notes.md` under a
       when adding new doc pages.
     - Documentation fixes go in the **same PR** as the task outputs.
 
-21. **Draft a task log entry** (but don't write to the file directly):
+22. **Draft a task log entry** (but don't write to the file directly):
     ```
     ### YYYY-MM-DD — Task Title
     **Type:** X (TypeName)
@@ -1409,7 +1420,7 @@ Document the independent check in `step2_analysis/notes.md` under a
     ```
     Show this to the user for them to add to `docs/development/TASK_LOG.md`.
 
-22. **Create a Pull Request** (if the user asks, or if reusable outputs were produced):
+23. **Create a Pull Request** (if the user asks, or if reusable outputs were produced):
 
     When the task produces reusable code (tests, notebooks, docs, API extensions),
     offer to create a PR. If the user confirms, execute these steps:
diff --git a/.github/copilot-instructions.md b/.github/copilot-instructions.md
@@ -1365,7 +1365,13 @@ docs, or the workspace root.
 9. **For cost estimation:** Use component-level NeqSim classes (e.g., `SURFCostEstimator`, `SubseaCostEstimator`) instead of flat lump-sum estimates. Break down CAPEX into verifiable subcategories.
 10. **Self-review before delivering:** Re-read all formulas checking for sign errors, double-counting, wrong time indexing, and missing terms. Compare key outputs against industry benchmarks.
 11. **Benchmark validation (MANDATORY):** Create a separate benchmark notebook (`XX_benchmark_validation.ipynb`) comparing NeqSim results against independent reference data (NIST, textbook examples, published cases, industry benchmarks). Include at least 3 data points, a parity/deviation plot, and save `benchmark_validation` results to `results.json`. Include benchmark comparison in the final report.
-12. **Uncertainty analysis (MANDATORY):** Create a separate uncertainty notebook (`XX_uncertainty_risk_analysis.ipynb`) that:
+12. **Consistency check (MANDATORY before report):** Run `python devtools/consistency_checker.py task_solve/YYYY-MM-DD_slug/` before generating reports. This tool:
+    - Extracts numerical values from all notebooks and results.json
+    - Detects inconsistencies: numerical mismatches, scope mismatches (volumetric vs mass-based), contradictory claims
+    - Produces `consistency_report.json` with issues to fix
+    - **Fix CRITICAL issues before generating the report**
+    - Common issue: external study data (e.g., Gudrun) measuring different quantities than notebook calculations
+13. **Uncertainty analysis (MANDATORY):** Create a separate uncertainty notebook (`XX_uncertainty_risk_analysis.ipynb`) that:
     - Identifies key uncertain input parameters with realistic ranges (low/base/high or probability distributions)
     - **MUST use full NeqSim process simulations inside the Monte Carlo loop** — do NOT
       use simplified Python correlations when NeqSim classes exist for the calculation
diff --git a/AGENTS.md b/AGENTS.md
@@ -103,6 +103,14 @@ workspace root.
      - Saves `uncertainty` and `risk_evaluation` results to `results.json`
    - **Save results.json** in the task root (see pattern below)
 
+   **Step 2.5 — Consistency Check (MANDATORY before report)**
+   - Run `python devtools/consistency_checker.py task_solve/YYYY-MM-DD_slug/`
+   - The tool extracts numerical values from all notebooks and results.json
+   - Detects inconsistencies: numerical mismatches, scope mismatches (e.g., volumetric vs mass-based), contradictory claims
+   - Produces `consistency_report.json` in the task folder
+   - **Fix any CRITICAL issues before generating the report**
+   - Common issue: external study data measuring different quantities than notebook calculations (clarify this in the report rather than "fixing" it)
+
    **Step 3 — Report**
    - `generate_report.py` auto-reads `task_spec.md` and `results.json`
    - Run `python step3_report/generate_report.py` to produce a professional
@@ -294,6 +302,41 @@ if report.getWarningCount() > 0:
 assert report.isValid(), "results.json failed validation — fix errors above"
 ```
 
+### Iterative Updates to results.json
+
+When updating `results.json` incrementally over several iterations:
+
+1. **Load before Modifying** — Always read existing results.json before adding new data:
+   ```python
+   results_path = TASK_DIR / "results.json"  # assumes `import json` and TASK_DIR (a pathlib.Path) are defined earlier
+   if results_path.exists():
+       with open(results_path, "r") as f:
+           results = json.load(f)
+   else:
+       results = {}
+   ```
+
+2. **Use dict.update() for New Data** — Merge new results without losing existing:
+   ```python
+   results.setdefault("key_results", {}).update({"new_result": 42.5})
+   results.setdefault("figure_captions", {}).update({"new_plot.png": "Caption"})
+   ```
+
+3. **Append to Lists** — For discussion, tables, equations:
+   ```python
+   results.setdefault("figure_discussion", []).append(new_discussion)
+   results.setdefault("tables", []).append(new_table)
+   ```
+
+4. **Run Consistency Check** before report generation:
+   ```bash
+   python devtools/consistency_checker.py task_solve/YYYY-MM-DD_slug/
+   ```
+
+5. **Regenerate Report** — The report generator dynamically includes sections based on
+   what's present in results.json. Adding `uncertainty` or `risk_evaluation` automatically
+   creates those sections in the report.
+
 The report generator auto-reads this file to populate Results and Validation sections.
 - **key_results**: Rendered as styled table with auto-detected units (use suffixes like `_C`, `_bar`, `_kg`, `_hours`)
 - **validation**: Rendered as pass/fail table with color coding
diff --git a/devtools/README.md b/devtools/README.md
@@ -53,6 +53,7 @@ for a full explanation of the architecture and internals.
 | `neqsim_dev_setup.py` | JVM bootstrap, class imports, compile + kernel restart |
 | `pyproject.toml` | Makes it pip-installable (`pip install -e devtools/`) |
 | `new_task.py` | Create task-solving folders for the 4-step AI workflow |
+| `consistency_checker.py` | **Pre-report quality gate.** Extracts numerical values from notebooks and results.json, detects inconsistencies (numerical mismatches, scope mismatches, contradictory claims). Run before `generate_report.py`. Produces `consistency_report.json`. |
 | `unisim_reader.py` | UniSim/HYSYS .usc COM reader → NeqSim Python/notebook/EOT/JSON. 45+ op types, port-specific forward refs, auto-recycle wiring. |
 | `test_unisim_outputs.py` | 14 pytest tests for all UniSim converter output modes (no COM needed) |
 | `explore_unisim_com.py` | Diagnostic: dump UniSim COM object model from any .usc file |
diff --git a/devtools/consistency_checker.py b/devtools/consistency_checker.py
diff --git a/docs/development/TASK_SOLVING_GUIDE.md b/docs/development/TASK_SOLVING_GUIDE.md