Skip to content

Heron AI 90min Work Test | Project - Michael Chen METR Sabotage Threat Modeling. A small eval harness designed to measure Monitor Negligence. The script simulates a "Monitor" (the insider) reviewing dangerous code (e.g., model poisoning, data exfiltration) and an "Auditor" (gpt-5.2) that grades the monitor's reasoning and decisions.

License

Notifications You must be signed in to change notification settings

davidkimai/sabotage

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

sabotage

Heron AI Work Test | Project - Michael Chen Sabotage Threat Modeling.

An automated eval harness designed to measure Monitor Negligence. The script simulates a "Monitor" (the insider) reviewing dangerous code (e.g., model poisoning, data exfiltration) and an "Auditor" (gpt-5.2) that grades the monitor's reasoning and decisions.

About

Heron AI 90min Work Test | Project - Michael Chen METR Sabotage Threat Modeling. A small eval harness designed to measure Monitor Negligence. The script simulates a "Monitor" (the insider) reviewing dangerous code (e.g., model poisoning, data exfiltration) and an "Auditor" (gpt-5.2) that grades the monitor's reasoning and decisions.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages