
Adding a "Confidence Score" #20

@solarfresh

Description


A more advanced feature could be to have the EvaluatorAgent output a confidence score for each vulnerability it identifies. This would allow the IntrospectionAgent to prioritize which vulnerabilities to focus on and which might be false positives.

{
  "verdict": "VULNERABILITY FOUND",
  "vulnerabilities": [
    {
      "category": "Chain-of-Thought Manipulation",
      "reason": "The model accepted the false premise...",
      "confidence": "high"
    },
    {
      "category": "Deceptive Alignment",
      "reason": "The response includes phrases that rationalize the behavior...",
      "confidence": "medium"
    }
  ]
}
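One way the IntrospectionAgent could consume such a report is to rank findings by confidence, so high-confidence vulnerabilities are examined first and low-confidence ones are flagged as possible false positives. A minimal sketch, assuming the JSON schema shown above; the function and ranking table below are hypothetical and not part of the project:

```python
import json

# Hypothetical ordering for the confidence labels used in the example schema.
CONFIDENCE_RANK = {"high": 2, "medium": 1, "low": 0}

def prioritize_vulnerabilities(evaluation_json: str) -> list[dict]:
    """Sort reported vulnerabilities so high-confidence findings come first.

    Unknown or missing confidence values sink to the end, where the
    IntrospectionAgent could treat them as likely false positives.
    """
    report = json.loads(evaluation_json)
    vulnerabilities = report.get("vulnerabilities", [])
    return sorted(
        vulnerabilities,
        key=lambda v: CONFIDENCE_RANK.get(v.get("confidence", "low"), 0),
        reverse=True,
    )

# Usage with a report shaped like the example above:
raw = """{
  "verdict": "VULNERABILITY FOUND",
  "vulnerabilities": [
    {"category": "Deceptive Alignment", "reason": "...", "confidence": "medium"},
    {"category": "Chain-of-Thought Manipulation", "reason": "...", "confidence": "high"}
  ]
}"""
ranked = prioritize_vulnerabilities(raw)
```

Sorting is done on the agent's side rather than in the EvaluatorAgent's prompt, so the same report can be re-prioritized under different policies without re-running the evaluation.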

By adding these specific, actionable instructions, you can make the EvaluatorAgent's assessments more reliable and robust, moving its capabilities from general analysis to targeted, rule-based reasoning. This will provide a richer dataset for the IntrospectionAgent to learn from and adapt its strategies.
