
Adding a "Confidence Score" #20

@solarfresh

Description


A more advanced feature could be to have the EvaluatorAgent output a confidence score for each vulnerability it identifies. This would allow the IntrospectionAgent to prioritize which vulnerabilities to focus on and which might be false positives.

{
  "verdict": "VULNERABILITY FOUND",
  "vulnerabilities": [
    {
      "category": "Chain-of-Thought Manipulation",
      "reason": "The model accepted the false premise...",
      "confidence": "high"
    },
    {
      "category": "Deceptive Alignment",
      "reason": "The response includes phrases that rationalize the behavior...",
      "confidence": "medium"
    }
  ]
}
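One way the IntrospectionAgent could consume such a report is to rank findings by confidence, so high-confidence vulnerabilities are examined first and low-confidence ones are flagged as possible false positives. A minimal sketch, assuming the JSON schema shown above; the function and ranking table below are hypothetical and not part of the project:

```python
import json

# Hypothetical ordering for the confidence labels used in the example schema.
CONFIDENCE_RANK = {"high": 2, "medium": 1, "low": 0}

def prioritize_vulnerabilities(evaluation_json: str) -> list[dict]:
    """Sort reported vulnerabilities so high-confidence findings come first.

    Unknown or missing confidence values sink to the end, where the
    IntrospectionAgent could treat them as likely false positives.
    """
    report = json.loads(evaluation_json)
    vulnerabilities = report.get("vulnerabilities", [])
    return sorted(
        vulnerabilities,
        key=lambda v: CONFIDENCE_RANK.get(v.get("confidence", "low"), 0),
        reverse=True,
    )

# Usage with a report shaped like the example above:
raw = """{
  "verdict": "VULNERABILITY FOUND",
  "vulnerabilities": [
    {"category": "Deceptive Alignment", "reason": "...", "confidence": "medium"},
    {"category": "Chain-of-Thought Manipulation", "reason": "...", "confidence": "high"}
  ]
}"""
ranked = prioritize_vulnerabilities(raw)
```

Sorting is done on the agent's side rather than in the EvaluatorAgent's prompt, so the same report can be re-prioritized under different policies without re-running the evaluation.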

By adding these specific, actionable instructions, you can make the EvaluatorAgent's assessments more reliable and robust, moving its capabilities from general analysis to targeted, rule-based reasoning. This will provide a richer dataset for the IntrospectionAgent to learn from and adapt its strategies.
