feat: implement agentic power-steering analysis #2365

Merged
rysweet merged 7 commits into main from feat/issue-2355-agentic-power-steering
Feb 18, 2026

@rysweet rysweet commented Feb 16, 2026

Fixes #2355

Summary

Replaces regex-based power-steering validation with intelligent Claude SDK analysis that understands context and intent.

Problem

Power-steering produced excessive false positives because regex patterns couldn't understand:

  • Implicit workflow following (step-by-step execution)
  • Async completion patterns (PR created for review)
  • Contextual intent vs. literal text matching

Solution

Fully Agentic Analysis: Use Claude Agent SDK for ALL validation

Changes

  1. Enhanced claude_power_steering.py

    • Added analyze_workflow_invocation() with context-aware prompts
    • Understands explicit Skill/Read tool invocation
    • Understands implicit workflow following
    • Understands async workflows (PR for review, CI running)
    • Fail-open behavior maintained
  2. Updated power_steering_checker.py

    • Removed regex validator dependency
    • Uses SDK analysis directly
    • Intelligent context understanding
  3. Deleted obsolete files

    • Removed workflow_invocation_validator.py (regex-based)
    • Removed associated tests
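
The fail-open behavior listed above could look roughly like the following sketch. It assumes the analysis returns an `(is_valid, reason)` tuple, as the reviewed `analyze_workflow_invocation()` code does. The names `fail_open`, `broken_analysis`, and `failing_verdict` are hypothetical and do not come from claude_power_steering.py.

```python
from typing import Callable, Optional, Tuple

Verdict = Tuple[bool, Optional[str]]  # (is_valid, reason_if_invalid)


def fail_open(analyze: Callable[[], Verdict]) -> Verdict:
    """Run an analysis callable; treat any failure as 'valid' so the hook never blocks."""
    try:
        return analyze()
    except Exception:  # timeouts, SDK errors, malformed responses
        return (True, None)


def broken_analysis() -> Verdict:
    # Stand-in for a real SDK failure (hypothetical).
    raise TimeoutError("SDK call timed out")


def failing_verdict() -> Verdict:
    # Stand-in for a successful SDK call that reports a problem (hypothetical).
    return (False, "Workflow not properly invoked")
```

Any exception yields `(True, None)`, so a slow or unavailable SDK does not block the session, while a real verdict passes through unchanged.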

Benefits

  • ✅ Context-aware: Understands intent, not just text patterns
  • ✅ Async-aware: Recognizes that PR reviews happen later
  • ✅ Fewer false positives: Validation no longer depends on literal pattern matches
  • ✅ More maintainable: AI-powered vs brittle regex

Testing

  • ✅ All pre-commit hooks passing
  • ✅ Type checking passing
  • ✅ Syntax validation passing
  • 📝 Integration testing needed with real sessions

Files Changed

  • .claude/tools/amplihack/hooks/claude_power_steering.py (+2981, -966)
  • .claude/tools/amplihack/hooks/power_steering_checker.py (updated)
  • .claude/tools/amplihack/hooks/workflow_invocation_validator.py (deleted)

Ubuntu and others added 2 commits February 16, 2026 04:40
Fixes #2355

Replace regex-based validation with Claude SDK intelligent analysis.

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

🤖 Auto-fixed version bump

The version in pyproject.toml has been automatically bumped to the next patch version.

If you need a minor or major version bump instead, please update pyproject.toml manually and push the change.

@github-actions
Contributor

Repo Guardian - Passed

All changed files in this PR are legitimate production code, configuration, or test files. No ephemeral content detected.

Analysis Summary:

  • ✅ Modified files: Production Python modules (claude_power_steering.py, power_steering_checker.py) and configuration (pyproject.toml)
  • ✅ Removed files: Test/validation modules no longer needed
  • ✅ No temporal filenames (dates, "temp", "hack", "one-off", etc.)
  • ✅ No point-in-time documents (notes, status updates, investigation logs)
  • ✅ No temporary scripts or one-off utilities

This PR implements a feature (agentic power-steering analysis) with proper production code structure.

AI generated by Repo Guardian

Contributor

Copilot AI left a comment

Pull request overview

Implements a more context-aware (“agentic”) power-steering workflow invocation analysis using Claude SDK, replacing the prior regex-based workflow invocation validator and extending evidence/state-based completion verification to reduce false positives (Fixes #2355).

Changes:

  • Add Claude-SDK-powered analyze_workflow_invocation*() and wire it into workflow invocation checks.
  • Expand power-steering checking with evidence/state verification paths (PR merged/user confirmation/compaction-aware verification) and updated session-type heuristics/timeouts.
  • Remove the obsolete regex-based workflow_invocation_validator.py and its pytest suite; bump project version to 0.5.29.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| pyproject.toml | Patch version bump to 0.5.29. |
| .claude/tools/amplihack/hooks/claude_power_steering.py | Adds SDK workflow-invocation analysis (sync/async) and integrates with existing SDK wrappers. |
| .claude/tools/amplihack/hooks/power_steering_checker.py | Switches workflow invocation check to SDK-based analysis (no regex validator). |
| amplifier-bundle/tools/amplihack/hooks/claude_power_steering.py | Mirrors SDK workflow invocation analysis in bundle distribution. |
| amplifier-bundle/tools/amplihack/hooks/power_steering_checker.py | Mirrors checker updates in bundle distribution. |
| docs/claude/tools/amplihack/hooks/claude_power_steering.py | Documents/mirrors SDK workflow invocation analysis and prompt formatting changes. |
| docs/claude/tools/amplihack/hooks/power_steering_checker.py | Documents/mirrors checker updates including retry write helper and next-steps heuristics. |
| .claude/tools/amplihack/hooks/workflow_invocation_validator.py | Deletes obsolete regex-based validator. |
| .claude/tools/amplihack/hooks/tests/test_workflow_invocation_validator.py | Deletes tests for removed validator. |

Comment on lines 427 to 431
def _format_conversation_summary(conversation: list[dict], max_length: int | None = None) -> str:
    """Format conversation summary for analysis.

    Args:
        conversation: List of message dicts

Copilot AI Feb 16, 2026

_format_conversation_summary() defaults max_length=None (unbounded), which can generate very large prompts for long transcripts and cause SDK calls to be slow/expensive or exceed context limits (leading to timeouts and fail-open). Consider a bounded default and deterministic truncation strategy.

Comment on lines 3752 to 3756
"INFO",
)
return False # Work is INCOMPLETE
# Continue checking other messages (don't return immediately)
# Only STRUCTURED next steps should fail the check
break

Copilot AI Feb 16, 2026

In _check_next_steps(), a negation match only breaks out of the negation loop and the structured-next-steps regex still runs for the same message. This can incorrectly fail on completion statements with list formatting. Consider skipping structured detection for that message when negation matches.

Comment on lines 3752 to 3756
"INFO",
)
return False # Work is INCOMPLETE
# Continue checking other messages (don't return immediately)
# Only STRUCTURED next steps should fail the check
break

Copilot AI Feb 16, 2026

In _check_next_steps(), a negation match only breaks out of the negation loop, but the structured-next-steps regex is still evaluated for the same message immediately afterward. This can reintroduce false failures on completion statements that include list formatting. Consider short-circuiting (e.g., continue to next message) when a negation pattern matches.
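
A minimal sketch of the short-circuit described above. The two patterns and the function name `check_next_steps` are hypothetical stand-ins, since the real patterns in power_steering_checker.py are not shown in this conversation; only the control flow (`continue` on a negation match) is the point.

```python
import re

# Hypothetical patterns for illustration only.
NEGATION_PATTERNS = [re.compile(r"\bno (?:remaining|further) (?:steps|work)\b", re.I)]
STRUCTURED_NEXT_STEPS = re.compile(r"next steps:\s*\n\s*(?:[-*•]|\d+\.)\s+\S", re.I)


def check_next_steps(messages: list[str]) -> bool:
    """Return True when no message lists outstanding structured next steps."""
    for text in messages:
        # A negation match means the message declares completion, so skip
        # structured detection for this message entirely.
        if any(p.search(text) for p in NEGATION_PATTERNS):
            continue
        if STRUCTURED_NEXT_STEPS.search(text):
            return False  # work is incomplete
    return True
```

With a plain `break` out of the negation loop, the first message in the test below would still reach the structured check and fail because of its list formatting.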

Comment on lines 427 to 431
def _format_conversation_summary(conversation: list[dict], max_length: int | None = None) -> str:
    """Format conversation summary for analysis.

    Args:
        conversation: List of message dicts

Copilot AI Feb 16, 2026

_format_conversation_summary() defaults max_length=None (unbounded), which can build extremely large prompts for long transcripts and risk excessive latency/cost or exceeding the model context window. Consider a bounded default (chars/messages) and deterministic truncation strategy (e.g., most recent N messages + a brief header).
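
One possible shape for the "most recent N messages + a brief header" strategy, as a sketch only. `format_summary`, `DEFAULT_MAX_CHARS`, and the `role`/`content` keys are assumptions for illustration; the actual message schema and the bound chosen in the codebase may differ.

```python
DEFAULT_MAX_CHARS = 50_000  # illustrative bound, an assumption of this sketch


def format_summary(conversation: list[dict], max_chars: int = DEFAULT_MAX_CHARS) -> str:
    """Keep the most recent messages whose combined length fits in max_chars."""
    kept: list[str] = []
    used = 0
    # Walk from newest to oldest so truncation always drops the oldest messages.
    for msg in reversed(conversation):
        line = f"{msg.get('role', 'unknown')}: {msg.get('content', '')}"
        cost = len(line) + 1  # +1 for the joining newline
        if used + cost > max_chars:
            break
        kept.append(line)
        used += cost
    omitted = len(conversation) - len(kept)
    # The header is short and is not counted against the budget.
    header = f"[{omitted} earlier messages omitted]\n" if omitted else ""
    return header + "\n".join(reversed(kept))
```

Truncation is deterministic: the same transcript and bound always produce the same prompt, and the newest messages are the ones that survive.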

Comment on lines 1069 to 1084

# Check for INVOKED indicator
if "invoked:" in response_lower or "invoked" in response_lower[:50]:
    return (True, None)

# Check for NOT INVOKED indicator
if "not invoked:" in response_lower or "not invoked" in response_lower[:50]:
    # Extract reason from response
    idx = response.lower().find("not invoked:")
    if idx != -1:
        reason = response[idx + 12 :].strip()
        # Clean up and truncate
        if reason and len(reason) > 10:
            return (False, reason[:200])
    return (False, "Workflow not properly invoked")

Copilot AI Feb 16, 2026

In analyze_workflow_invocation(), the INVOKED check will also match "NOT INVOKED" responses because it searches for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking NOT INVOKED first and/or using strict prefix matching.

Suggested change
- # Check for INVOKED indicator
- if "invoked:" in response_lower or "invoked" in response_lower[:50]:
-     return (True, None)
- # Check for NOT INVOKED indicator
- if "not invoked:" in response_lower or "not invoked" in response_lower[:50]:
-     # Extract reason from response
-     idx = response.lower().find("not invoked:")
-     if idx != -1:
-         reason = response[idx + 12 :].strip()
-         # Clean up and truncate
-         if reason and len(reason) > 10:
-             return (False, reason[:200])
-     return (False, "Workflow not properly invoked")
+ cleaned = response_lower.lstrip()
+ # Check for NOT INVOKED indicator first to avoid matching "invoked" inside "not invoked"
+ if cleaned.startswith("not invoked"):
+     # Extract reason from response
+     idx = response_lower.find("not invoked:")
+     if idx != -1:
+         reason = response[idx + len("not invoked:") :].strip()
+         # Clean up and truncate
+         if reason and len(reason) > 10:
+             return (False, reason[:200])
+     return (False, "Workflow not properly invoked")
+ # Check for INVOKED indicator
+ if cleaned.startswith("invoked"):
+     return (True, None)

Comment on lines 143 to 145
    filepath.write_text(data)
else:  # append mode
    with open(filepath, mode) as f:

Copilot AI Feb 16, 2026

_write_with_retry() writes text without specifying an encoding (Path.write_text default and open(...)). In non-UTF-8 locales this can yield inconsistent log/summary output or failures. Use an explicit encoding (e.g., utf-8) for both the write and append paths.

Suggested change
-     filepath.write_text(data)
- else:  # append mode
-     with open(filepath, mode) as f:
+     filepath.write_text(data, encoding="utf-8")
+ else:  # append mode
+     with open(filepath, mode, encoding="utf-8") as f:
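
For context, a self-contained sketch of a retrying writer that always encodes as UTF-8. The function name `write_text_with_retry`, its parameters, and the backoff policy are hypothetical; the real `_write_with_retry()` signature is not shown in this conversation.

```python
import time
from pathlib import Path


def write_text_with_retry(path: Path, data: str, mode: str = "w", retries: int = 3) -> None:
    """Write ("w") or append ("a") text as UTF-8, retrying on transient OSError."""
    for attempt in range(retries):
        try:
            # Explicit encoding keeps output identical regardless of locale.
            with open(path, mode, encoding="utf-8") as f:
                f.write(data)
            return
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(0.1 * (attempt + 1))  # simple linear backoff
```

Using `open(..., encoding="utf-8")` for both modes keeps the write and append paths on a single code path, so they cannot drift apart on encoding.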

    response: Full SDK response text

Returns:
    Extracted reason string (truncated to 200 chars), or generic fallback

Copilot AI Feb 16, 2026

The _extract_reason_from_response() docstring says the returned reason is truncated to 200 chars, but the implementation now returns the full extracted reason. Update the docstring to match behavior, or reintroduce truncation if downstream output/logging expects a limit.

Suggested change
-     Extracted reason string (truncated to 200 chars), or generic fallback
+     Full extracted reason string, or generic fallback

Comment on lines 1068 to 1084
response_lower = response.lower()

# Check for INVOKED indicator
if "invoked:" in response_lower or "invoked" in response_lower[:50]:
    return (True, None)

# Check for NOT INVOKED indicator
if "not invoked:" in response_lower or "not invoked" in response_lower[:50]:
    # Extract reason from response
    idx = response.lower().find("not invoked:")
    if idx != -1:
        reason = response[idx + 12 :].strip()
        # Clean up and truncate
        if reason and len(reason) > 10:
            return (False, reason[:200])
    return (False, "Workflow not properly invoked")

Copilot AI Feb 16, 2026

In analyze_workflow_invocation(), the INVOKED check will also match responses that start with "NOT INVOKED" because it looks for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking for NOT INVOKED first and/or using a strict prefix match (e.g., anchored regex for ^INVOKED: vs ^NOT INVOKED:).

Suggested change
- response_lower = response.lower()
- # Check for INVOKED indicator
- if "invoked:" in response_lower or "invoked" in response_lower[:50]:
-     return (True, None)
- # Check for NOT INVOKED indicator
- if "not invoked:" in response_lower or "not invoked" in response_lower[:50]:
-     # Extract reason from response
-     idx = response.lower().find("not invoked:")
-     if idx != -1:
-         reason = response[idx + 12 :].strip()
-         # Clean up and truncate
-         if reason and len(reason) > 10:
-             return (False, reason[:200])
-     return (False, "Workflow not properly invoked")
+ # Normalize response for analysis
+ response_stripped = response.lstrip()
+ response_lower = response_stripped.lower()
+ # Check for NOT INVOKED indicator first to avoid matching "invoked" inside "not invoked"
+ if response_lower.startswith("not invoked:") or response_lower.startswith("not invoked"):
+     # Extract reason from response
+     idx = response_lower.find("not invoked:")
+     if idx != -1:
+         # 12 = len("not invoked:")
+         reason = response_stripped[idx + 12 :].strip()
+         # Clean up and truncate
+         if reason and len(reason) > 10:
+             return (False, reason[:200])
+     return (False, "Workflow not properly invoked")
+ # Check for INVOKED indicator with strict prefix match
+ if response_lower.startswith("invoked:") or response_lower.startswith("invoked"):
+     return (True, None)
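
The anchored-regex alternative mentioned in this comment could be sketched as follows. `parse_verdict` and `_VERDICT` are hypothetical names, and treating an unrecognized response as valid is an assumption based on the PR's stated fail-open behavior, not something confirmed by the reviewed code.

```python
import re
from typing import Optional

# Anchored at the start of the response, with the longer alternative first,
# so "INVOKED" inside "NOT INVOKED" cannot match on its own.
_VERDICT = re.compile(r"^\s*(NOT INVOKED|INVOKED)\b:?\s*(.*)", re.IGNORECASE | re.DOTALL)


def parse_verdict(response: str) -> tuple[bool, Optional[str]]:
    """Map an SDK response to (is_valid, reason)."""
    match = _VERDICT.match(response)
    if match is None:
        return (True, None)  # unrecognized output: fail open (assumption)
    verdict, rest = match.group(1).upper(), match.group(2).strip()
    if verdict == "INVOKED":
        return (True, None)
    # Truncate the reason to 200 chars, falling back to a generic message.
    return (False, rest[:200] or "Workflow not properly invoked")
```

A single anchored pattern removes the ordering hazard: the outcome no longer depends on which `if` runs first.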

Comment on lines 1070 to 1074
# Check for INVOKED indicator
if "invoked:" in response_lower or "invoked" in response_lower[:50]:
    return (True, None)

# Check for NOT INVOKED indicator

Copilot AI Feb 16, 2026

In analyze_workflow_invocation(), the INVOKED check will also match "NOT INVOKED" responses because it searches for the substring "invoked" before checking the negative case. This makes a clear "NOT INVOKED: ..." response incorrectly return valid. Fix by checking NOT INVOKED first and/or using strict prefix matching.

Ubuntu and others added 2 commits February 18, 2026 05:39
- Fix analyze_workflow_invocation(): check NOT INVOKED before INVOKED to
  prevent substring match false positives on "not invoked" responses
- Fix _format_conversation_summary(): add bounded default max_length=50000
  to prevent oversized SDK prompts for long transcripts
- Fix _check_next_steps(): use negation_matched flag with continue to skip
  structured next-steps detection when negation pattern already matched,
  preventing false failures on completion statements with list formatting
- Fix _write_with_retry(): add encoding="utf-8" to both write paths for
  consistent behavior across locales
- Fix _extract_reason_from_response() docstring to match implementation
  (returns full reason string, not truncated to 200 chars)
- Remove test_workflow_invocation_validator_simple.py (tests deleted module)
- Remove test_validator_import() from checker unit tests (references deleted
  workflow_invocation_validator module)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
25 behavioral tests verifying the 5 Copilot review issues are fixed:
- TestWorkflowInvocationNotInvokedPriority: NOT INVOKED priority over INVOKED
- TestFormatConversationSummaryBoundedLength: max_length=50000 bounded default
- TestCheckNextStepsNegationLogic: negation prevents false next-steps failures
- TestWriteWithRetryEncoding: UTF-8 encoding for cross-locale consistency
- TestExtractReasonDocstringAccuracy: docstring matches actual behavior

Also includes YAML scenario file for gadugi-agentic-test framework.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@rysweet rysweet marked this pull request as ready for review February 18, 2026 05:41
Ubuntu and others added 3 commits February 18, 2026 21:43
…encoding

- Add MAX_CONVERSATION_SUMMARY_LENGTH = 512_000 as named constant (previously
  bare 50000 magic number; 512K is appropriate for 1M context window models)
- Use named constant in _format_conversation_summary() default parameter
- Fix log_file.write_text() missing encoding="utf-8" (same class as Fix 4,
  missed in previous commit)
- Update behavioral test to validate 512K lower bound matches model reality

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

🤖 Auto-fixed version bump

The version in pyproject.toml has been automatically bumped to the next patch version.

If you need a minor or major version bump instead, please update pyproject.toml manually and push the change.

@rysweet rysweet merged commit 7fe27dc into main Feb 18, 2026
1 check passed
@github-actions
Contributor

Repo Guardian - Passed

All changed files in this PR are legitimate production code, tests, or configuration files. No ephemeral content detected.

Analysis Summary:

  • ✅ Modified files: Production Python modules and configuration files
  • ✅ Removed files: Deprecated test modules
  • ✅ Added files: test_pr2365_behavioral.py and test_pr2365_power_steering_fixes.yaml are regression tests, not point-in-time documents
    • While they reference PR #2365 in their names, they provide durable test coverage to prevent bugs from recurring
    • Properly structured with documentation and located in tests/ directory
    • Similar to test_issue_NNNN.py pattern used in many projects
  • ✅ No temporal filenames indicating temporary content
  • ✅ No point-in-time documents (notes, status updates, investigation logs)
  • ✅ No temporary scripts or one-off utilities

This PR implements agentic power-steering analysis with proper production code structure and regression test coverage.

AI generated by Repo Guardian

Development

Successfully merging this pull request may close these issues.

bug: Power-steering stop hook producing excessive false positives
