Open
Description
A more advanced feature would be to have the EvaluatorAgent output a confidence score for each vulnerability it identifies. The IntrospectionAgent could then prioritize high-confidence findings and treat low-confidence ones as possible false positives. For example:
{
  "verdict": "VULNERABILITY FOUND",
  "vulnerabilities": [
    {
      "category": "Chain-of-Thought Manipulation",
      "reason": "The model accepted the false premise...",
      "confidence": "high"
    },
    {
      "category": "Deceptive Alignment",
      "reason": "The response includes phrases that rationalize the behavior...",
      "confidence": "medium"
    }
  ]
}

By adding these specific, actionable instructions, you can make the EvaluatorAgent's assessments more reliable and robust, moving its capabilities from general analysis to targeted, rule-based reasoning. This will provide a richer dataset for the IntrospectionAgent to learn from and adapt its strategies.
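
As a rough illustration of how the IntrospectionAgent side might consume this output, here is a minimal Python sketch. It assumes the EvaluatorAgent returns the JSON shape shown above as a string. The `prioritize` function, the `CONFIDENCE_RANK` table, the `"low"` label, and the `min_confidence` parameter are all hypothetical names introduced for this example; they are not part of the existing agents.

```python
import json
from typing import Dict, List

# Hypothetical ordering of confidence labels. The issue only shows "high"
# and "medium"; "low" is an assumed third level.
CONFIDENCE_RANK = {"high": 0, "medium": 1, "low": 2}


def prioritize(evaluator_output: str, min_confidence: str = "low") -> List[Dict]:
    """Return vulnerabilities sorted from most to least confident.

    Entries below `min_confidence` are dropped as likely false positives.
    """
    report = json.loads(evaluator_output)
    cutoff = CONFIDENCE_RANK[min_confidence]
    ranked = []
    for vuln in report.get("vulnerabilities", []):
        label = str(vuln.get("confidence", "")).lower()
        # Assumption: a missing or unknown label counts as lowest confidence.
        rank = CONFIDENCE_RANK.get(label, CONFIDENCE_RANK["low"])
        if rank <= cutoff:
            ranked.append((rank, vuln))
    # sort() is stable, so entries with equal confidence keep their order.
    ranked.sort(key=lambda pair: pair[0])
    return [vuln for _, vuln in ranked]


# Sample input in the proposed format, deliberately listed out of order.
SAMPLE = json.dumps({
    "verdict": "VULNERABILITY FOUND",
    "vulnerabilities": [
        {"category": "Deceptive Alignment", "reason": "...", "confidence": "medium"},
        {"category": "Chain-of-Thought Manipulation", "reason": "...", "confidence": "high"},
    ],
})

for vuln in prioritize(SAMPLE):
    print(vuln["confidence"], "-", vuln["category"])
```

Calling `prioritize(SAMPLE, min_confidence="high")` would keep only the high-confidence finding. Categorical labels are used here because that is what the example output contains; a numeric score would allow finer thresholds but would need the EvaluatorAgent prompt to define a scale.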