CITATION.cff
cff-version: 1.2.0
title: "RewardHackWatch: Real-Time Detection of Reward Hacking Generalization"
message: "If you use RewardHackWatch in your research, please cite it using this metadata."
type: software
authors:
  - name: "Aerosta"
repository-code: "https://github.com/aerosta/rewardhackwatch"
url: "https://github.com/aerosta/rewardhackwatch"
license: Apache-2.0
version: "1.3.0"
date-released: "2025-12-09"
keywords:
  - ai-safety
  - reward-hacking
  - misalignment
  - llm
  - monitoring
  - machine-learning
abstract: >-
  RewardHackWatch is a real-time detection system for identifying
  when reward hacking in LLM agents generalizes to broader misalignment
  behaviors. It combines pattern detection, ML classification, LLM judges,
  and generalization tracking to catch the critical transition point
  from task-specific cheating to dangerous misalignment.
references:
  - type: article
    title: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs"
    authors:
      - name: "Anthropic"
    year: 2025
    url: "https://www.anthropic.com/research/emergent-misalignment-reward-hacking"
  - type: article
    title: "Chain of Thought Monitoring"
    authors:
      - name: "OpenAI"
    year: 2025
    url: "https://openai.com/index/chain-of-thought-monitoring/"