feat: implement agentic power-steering analysis#2365
Conversation
Fixes #2355 Replace regex-based validation with Claude SDK intelligent analysis. Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
Repo Guardian - PassedAll changed files in this PR are legitimate production code, configuration, or test files. No ephemeral content detected. Analysis Summary:
This PR implements a feature (agentic power-steering analysis) with proper production code structure.
|
There was a problem hiding this comment.
Pull request overview
Implements a more context-aware (“agentic”) power-steering workflow invocation analysis using Claude SDK, replacing the prior regex-based workflow invocation validator and extending evidence/state-based completion verification to reduce false positives (Fixes #2355).
Changes:
- Add Claude-SDK-powered
analyze_workflow_invocation*()and wire it into workflow invocation checks. - Expand power-steering checking with evidence/state verification paths (PR merged/user confirmation/compaction-aware verification) and updated session-type heuristics/timeouts.
- Remove the obsolete regex-based
workflow_invocation_validator.pyand its pytest suite; bump project version to0.5.29.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| pyproject.toml | Patch version bump to 0.5.29. |
| .claude/tools/amplihack/hooks/claude_power_steering.py | Adds SDK workflow-invocation analysis (sync/async) and integrates with existing SDK wrappers. |
| .claude/tools/amplihack/hooks/power_steering_checker.py | Switches workflow invocation check to SDK-based analysis (no regex validator). |
| amplifier-bundle/tools/amplihack/hooks/claude_power_steering.py | Mirrors SDK workflow invocation analysis in bundle distribution. |
| amplifier-bundle/tools/amplihack/hooks/power_steering_checker.py | Mirrors checker updates in bundle distribution. |
| docs/claude/tools/amplihack/hooks/claude_power_steering.py | Documents/mirrors SDK workflow invocation analysis and prompt formatting changes. |
| docs/claude/tools/amplihack/hooks/power_steering_checker.py | Documents/mirrors checker updates including retry write helper and next-steps heuristics. |
| .claude/tools/amplihack/hooks/workflow_invocation_validator.py | Deletes obsolete regex-based validator. |
| .claude/tools/amplihack/hooks/tests/test_workflow_invocation_validator.py | Deletes tests for removed validator. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def _format_conversation_summary(conversation: list[dict], max_length: int | None = None) -> str: | ||
| """Format conversation summary for analysis. | ||
|
|
||
| Args: | ||
| conversation: List of message dicts |
There was a problem hiding this comment.
_format_conversation_summary() defaults max_length=None (unbounded), which can generate very large prompts for long transcripts and cause SDK calls to be slow/expensive or exceed context limits (leading to timeouts and fail-open). Consider a bounded default and deterministic truncation strategy.
| "INFO", | ||
| ) | ||
| return False # Work is INCOMPLETE | ||
| # Continue checking other messages (don't return immediately) | ||
| # Only STRUCTURED next steps should fail the check | ||
| break |
There was a problem hiding this comment.
In _check_next_steps(), a negation match only breaks out of the negation loop and the structured-next-steps regex still runs for the same message. This can incorrectly fail on completion statements with list formatting. Consider skipping structured detection for that message when negation matches.
| "INFO", | ||
| ) | ||
| return False # Work is INCOMPLETE | ||
| # Continue checking other messages (don't return immediately) | ||
| # Only STRUCTURED next steps should fail the check | ||
| break |
There was a problem hiding this comment.
In _check_next_steps(), a negation match only breaks out of the negation loop, but the structured-next-steps regex is still evaluated for the same message immediately afterward. This can reintroduce false failures on completion statements that include list formatting. Consider short-circuiting (e.g., continue to next message) when a negation pattern matches.
| def _format_conversation_summary(conversation: list[dict], max_length: int | None = None) -> str: | ||
| """Format conversation summary for analysis. | ||
|
|
||
| Args: | ||
| conversation: List of message dicts |
There was a problem hiding this comment.
_format_conversation_summary() defaults max_length=None (unbounded), which can build extremely large prompts for long transcripts and risk excessive latency/cost or exceeding the model context window. Consider a bounded default (chars/messages) and deterministic truncation strategy (e.g., most recent N messages + a brief header).
|
|
||
| # Check for INVOKED indicator | ||
| if "invoked:" in response_lower or "invoked" in response_lower[:50]: | ||
| return (True, None) | ||
|
|
||
| # Check for NOT INVOKED indicator | ||
| if "not invoked:" in response_lower or "not invoked" in response_lower[:50]: | ||
| # Extract reason from response | ||
| idx = response.lower().find("not invoked:") | ||
| if idx != -1: | ||
| reason = response[idx + 12 :].strip() | ||
| # Clean up and truncate | ||
| if reason and len(reason) > 10: | ||
| return (False, reason[:200]) | ||
| return (False, "Workflow not properly invoked") | ||
|
|
There was a problem hiding this comment.
In analyze_workflow_invocation(), the INVOKED check will also match "NOT INVOKED" responses because it searches for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking NOT INVOKED first and/or using strict prefix matching.
| # Check for INVOKED indicator | |
| if "invoked:" in response_lower or "invoked" in response_lower[:50]: | |
| return (True, None) | |
| # Check for NOT INVOKED indicator | |
| if "not invoked:" in response_lower or "not invoked" in response_lower[:50]: | |
| # Extract reason from response | |
| idx = response.lower().find("not invoked:") | |
| if idx != -1: | |
| reason = response[idx + 12 :].strip() | |
| # Clean up and truncate | |
| if reason and len(reason) > 10: | |
| return (False, reason[:200]) | |
| return (False, "Workflow not properly invoked") | |
| cleaned = response_lower.lstrip() | |
| # Check for NOT INVOKED indicator first to avoid matching "invoked" inside "not invoked" | |
| if cleaned.startswith("not invoked"): | |
| # Extract reason from response | |
| idx = response_lower.find("not invoked:") | |
| if idx != -1: | |
| reason = response[idx + len("not invoked:") :].strip() | |
| # Clean up and truncate | |
| if reason and len(reason) > 10: | |
| return (False, reason[:200]) | |
| return (False, "Workflow not properly invoked") | |
| # Check for INVOKED indicator | |
| if cleaned.startswith("invoked"): | |
| return (True, None) |
| filepath.write_text(data) | ||
| else: # append mode | ||
| with open(filepath, mode) as f: |
There was a problem hiding this comment.
_write_with_retry() writes text without specifying an encoding (Path.write_text default and open(...)). In non-UTF-8 locales this can yield inconsistent log/summary output or failures. Use an explicit encoding (e.g., utf-8) for both the write and append paths.
| filepath.write_text(data) | |
| else: # append mode | |
| with open(filepath, mode) as f: | |
| filepath.write_text(data, encoding="utf-8") | |
| else: # append mode | |
| with open(filepath, mode, encoding="utf-8") as f: |
| response: Full SDK response text | ||
|
|
||
| Returns: | ||
| Extracted reason string (truncated to 200 chars), or generic fallback |
There was a problem hiding this comment.
The _extract_reason_from_response() docstring says the returned reason is truncated to 200 chars, but the implementation now returns the full extracted reason. Update the docstring to match behavior, or reintroduce truncation if downstream output/logging expects a limit.
| Extracted reason string (truncated to 200 chars), or generic fallback | |
| Full extracted reason string, or generic fallback |
| response_lower = response.lower() | ||
|
|
||
| # Check for INVOKED indicator | ||
| if "invoked:" in response_lower or "invoked" in response_lower[:50]: | ||
| return (True, None) | ||
|
|
||
| # Check for NOT INVOKED indicator | ||
| if "not invoked:" in response_lower or "not invoked" in response_lower[:50]: | ||
| # Extract reason from response | ||
| idx = response.lower().find("not invoked:") | ||
| if idx != -1: | ||
| reason = response[idx + 12 :].strip() | ||
| # Clean up and truncate | ||
| if reason and len(reason) > 10: | ||
| return (False, reason[:200]) | ||
| return (False, "Workflow not properly invoked") | ||
|
|
There was a problem hiding this comment.
In analyze_workflow_invocation(), the INVOKED check will also match responses that start with "NOT INVOKED" because it looks for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking for NOT INVOKED first and/or using a strict prefix match (e.g., anchored regex for ^INVOKED: vs ^NOT INVOKED:).
| response_lower = response.lower() | |
| # Check for INVOKED indicator | |
| if "invoked:" in response_lower or "invoked" in response_lower[:50]: | |
| return (True, None) | |
| # Check for NOT INVOKED indicator | |
| if "not invoked:" in response_lower or "not invoked" in response_lower[:50]: | |
| # Extract reason from response | |
| idx = response.lower().find("not invoked:") | |
| if idx != -1: | |
| reason = response[idx + 12 :].strip() | |
| # Clean up and truncate | |
| if reason and len(reason) > 10: | |
| return (False, reason[:200]) | |
| return (False, "Workflow not properly invoked") | |
| # Normalize response for analysis | |
| response_stripped = response.lstrip() | |
| response_lower = response_stripped.lower() | |
| # Check for NOT INVOKED indicator first to avoid matching "invoked" inside "not invoked" | |
| if response_lower.startswith("not invoked:") or response_lower.startswith("not invoked"): | |
| # Extract reason from response | |
| idx = response_lower.find("not invoked:") | |
| if idx != -1: | |
| # 12 = len("not invoked:") | |
| reason = response_stripped[idx + 12 :].strip() | |
| # Clean up and truncate | |
| if reason and len(reason) > 10: | |
| return (False, reason[:200]) | |
| return (False, "Workflow not properly invoked") | |
| # Check for INVOKED indicator with strict prefix match | |
| if response_lower.startswith("invoked:") or response_lower.startswith("invoked"): | |
| return (True, None) |
| # Check for INVOKED indicator | ||
| if "invoked:" in response_lower or "invoked" in response_lower[:50]: | ||
| return (True, None) | ||
|
|
||
| # Check for NOT INVOKED indicator |
There was a problem hiding this comment.
In analyze_workflow_invocation(), the INVOKED check will also match "NOT INVOKED" responses because it searches for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking NOT INVOKED first and/or using strict prefix matching.
- Fix analyze_workflow_invocation(): check NOT INVOKED before INVOKED to prevent substring match false positives on "not invoked" responses - Fix _format_conversation_summary(): add bounded default max_length=50000 to prevent oversized SDK prompts for long transcripts - Fix _check_next_steps(): use negation_matched flag with continue to skip structured next-steps detection when negation pattern already matched, preventing false failures on completion statements with list formatting - Fix _write_with_retry(): add encoding="utf-8" to both write paths for consistent behavior across locales - Fix _extract_reason_from_response() docstring to match implementation (returns full reason string, not truncated to 200 chars) - Remove test_workflow_invocation_validator_simple.py (tests deleted module) - Remove test_validator_import() from checker unit tests (references deleted workflow_invocation_validator module) Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
25 behavioral tests verifying the 5 Copilot review issues are fixed: - TestWorkflowInvocationNotInvokedPriority: NOT INVOKED priority over INVOKED - TestFormatConversationSummaryBoundedLength: max_length=50000 bounded default - TestCheckNextStepsNegationLogic: negation prevents false next-steps failures - TestWriteWithRetryEncoding: UTF-8 encoding for cross-locale consistency - TestExtractReasonDocstringAccuracy: docstring matches actual behavior Also includes YAML scenario file for gadugi-agentic-test framework. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…encoding - Add MAX_CONVERSATION_SUMMARY_LENGTH = 512_000 as named constant (previously bare 50000 magic number; 512K is appropriate for 1M context window models) - Use named constant in _format_conversation_summary() default parameter - Fix log_file.write_text() missing encoding="utf-8" (same class as Fix 4, missed in previous commit) - Update behavioral test to validate 512K lower bound matches model reality Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
|
🤖 Auto-fixed version bump The version in If you need a minor or major version bump instead, please update |
Repo Guardian - PassedAll changed files in this PR are legitimate production code, tests, or configuration files. No ephemeral content detected. Analysis Summary:
This PR implements agentic power-steering analysis with proper production code structure and regression test coverage.
|
Fixes #2355
Summary
Replaces regex-based power-steering validation with intelligent Claude SDK analysis that understands context and intent.
Problem
Power-steering produced excessive false positives because regex patterns couldn't understand:
Solution
Fully Agentic Analysis: Use Claude Agent SDK for ALL validation
Changes
Enhanced claude_power_steering.py
analyze_workflow_invocation()with context-aware promptsUpdated power_steering_checker.py
Deleted obsolete files
workflow_invocation_validator.py(regex-based)Benefits
Testing
Files Changed
.claude/tools/amplihack/hooks/claude_power_steering.py(+2981, -966).claude/tools/amplihack/hooks/power_steering_checker.py(updated).claude/tools/amplihack/hooks/workflow_invocation_validator.py(deleted)