cff-version: 1.2.0
title: "RewardHackWatch: Real-Time Detection of Reward Hacking Generalization"
message: "If you use RewardHackWatch in your research, please cite it using this metadata."
type: software
authors:
  - name: "Aerosta"
repository-code: "https://github.com/aerosta/rewardhackwatch"
url: "https://github.com/aerosta/rewardhackwatch"
license: Apache-2.0
version: "1.3.0"
date-released: "2025-12-09"
keywords:
  - ai-safety
  - reward-hacking
  - misalignment
  - llm
  - monitoring
  - machine-learning
abstract: >-
  RewardHackWatch is a real-time system for detecting when reward hacking
  in LLM agents generalizes to broader misaligned behavior. It combines
  pattern detection, ML classification, LLM judges, and generalization
  tracking to catch the critical transition from task-specific cheating
  to dangerous misalignment.
references:
  - type: article
    title: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs"
    authors:
      - name: "Anthropic"
    year: 2025
    url: "https://www.anthropic.com/research/emergent-misalignment-reward-hacking"
  - type: article
    title: "Chain of Thought Monitoring"
    authors:
      - name: "OpenAI"
    year: 2025
    url: "https://openai.com/index/chain-of-thought-monitoring/"