This repository contains a carefully curated collection of papers discussed in our survey: "Safety in Large Reasoning Models: A Survey". As large reasoning models (LRMs) become increasingly powerful, understanding their safety implications is critical for responsible AI development. We created this resource to support researchers and practitioners working in this emerging field.

If you find this repository helpful for your work or research, we would greatly appreciate it if you could star it and cite our papers. ✨
```bibtex
@misc{wang2025safetylargereasoningmodels,
  title={Safety in Large Reasoning Models: A Survey},
  author={Cheng Wang and Yue Liu and Baolong Bi and Duzhen Zhang and Zhong-Zhi Li and Yingwei Ma and Yufei He and Shengju Yu and Xinfeng Li and Junfeng Fang and Jiaheng Zhang and Bryan Hooi},
  year={2025},
  eprint={2504.17704},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.17704},
}

@article{liuyue_GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{liuyue_GuardReasoner_VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}
```
This repository tracks:

- 🔬 Cutting-edge research on LRM vulnerabilities
- 🛠️ Novel attack methodologies against reasoning models
- 🛡️ Defense strategies and safety alignment techniques
- 🔄 Regular updates as the field evolves

How you can support this project:

- ⭐ Star this repository to show support
- 📝 Create a PR if you notice missing papers (see the row template below)
- 📣 Share with the research community
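To add a paper, copy an existing row and fill in your entry. Below is a minimal sketch of the row format used throughout the tables; the date, title, and both URLs are placeholders to replace:

```markdown
| 25.06 | Paper Title Goes Here | arXiv | [link](https://arxiv.org/abs/XXXX.XXXXX) | [link](https://github.com/org/repo) |
```

The `Time` column is the paper's first release date as `YY.MM`, and a `-` in the `Code` column marks papers without a public code release, matching the existing rows.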
## Table of Contents

- [Safety Analyses of LRMs](#safety-analyses-of-lrms)
- [Agentic Safety Risks](#agentic-safety-risks)
- [Multilingual Safety](#multilingual-safety)
- [Multimodal Safety](#multimodal-safety)
- [Reasoning Length Attacks](#reasoning-length-attacks)
- [Backdoor and Error Injection Attacks](#backdoor-and-error-injection-attacks)
- [Prompt Injection Attacks](#prompt-injection-attacks)
- [Jailbreak Attacks](#jailbreak-attacks)
- [Safety Alignment](#safety-alignment)
- [Inference-time Defenses](#inference-time-defenses)
- [Guard Models](#guard-models)
## Safety Analyses of LRMs

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 25.04 | DeepSeek-R1 Thoughtology: Let's `<think>` about LLM Reasoning | arXiv | link | - |
| 25.01 | Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation | arXiv | link | - |
| 25.01 | o3-mini vs DeepSeek-R1: Which One is Safer? | arXiv | link | - |
## Agentic Safety Risks

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 25.04 | Emerging Cyber Attack Risks of Medical AI Agents | arXiv | link | - |
| 25.02 | Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | arXiv | link | link |
| 25.02 | Demonstrating Specification Gaming in Reasoning Models | arXiv | link | - |
| 25.02 | Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? | arXiv | link | link |
| 25.01 | Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models | arXiv | link | - |
## Multilingual Safety

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 25.03 | Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | arXiv | link | link |
| 25.03 | Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | arXiv | link | - |
| 25.02 | Safety Evaluation of DeepSeek Models in Chinese Contexts | arXiv | link | - |
## Multimodal Safety

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 25.04 | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | arXiv | link | link |
## Reasoning Length Attacks

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Overthinking** | | | | |
| 25.02 | OverThink: Slowdown Attacks on Reasoning LLMs | arXiv | link | link |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| **Underthinking** | | | | |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
## Backdoor and Error Injection Attacks

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Reasoning-based Backdoor Attacks** | | | | |
| 25.04 | ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs | arXiv | link | - |
| 25.02 | BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack | arXiv | link | link |
| 25.01 | DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs | arXiv | link | - |
| 24.01 | BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | arXiv | link | - |
| **Error Injection** | | | | |
| 25.03 | Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps | arXiv | link | - |
## Prompt Injection Attacks

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| 25.02 | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | arXiv | link | - |
| 25.02 | Can Indirect Prompt Injection Attacks Be Detected and Removed? | arXiv | link | - |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| 24.11 | Defense Against Prompt Injection Attack by Leveraging Attack Techniques | arXiv | link | - |
## Jailbreak Attacks

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Prompt-based Attacks** | | | | |
| 25.04 | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | arXiv | link | link |
| 25.03 | Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | arXiv | link | link |
| 24.07 | Does Refusal Training in LLMs Generalize to the Past Tense? | arXiv | link | link |
| **Multi-Turn Attacks** | | | | |
| 25.02 | Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | arXiv | link | link |
| 24.10 | Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | arXiv | link | link |
| 24.08 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | arXiv | link | link |
| **Reasoning-based Attacks** | | | | |
| 25.02 | A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | arXiv | link | - |
| 25.02 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models | arXiv | link | link |
## Safety Alignment

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Harmful CoT Data Curation** | | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| **Safe CoT Data Curation** | | | | |
| 25.04 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | arXiv | link | link |
| 25.04 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | arXiv | link | link |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |
| **SFT-based Safety Alignment on Reasoning** | | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| 25.04 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | arXiv | link | link |
| 25.03 | Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | arXiv | link | link |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |
| 25.02 | Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment | arXiv | link | - |
| **RL-based Safety Alignment on Reasoning** | | | | |
| 25.04 | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | arXiv | link | link |
| 25.02 | STAIR: Improving Safety Alignment with Introspective Reasoning | arXiv | link | link |
| 25.02 | Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking | arXiv | link | link |
| 24.12 | Deliberative Alignment: Reasoning Enables Safer Language Models | arXiv | link | - |
## Inference-time Defenses

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Inference-time Scaling on Reasoning** | | | | |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| **Safe Decoding for Reasoning** | | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| 25.03 | Effectively Controlling Reasoning Models through Thinking Intervention | arXiv | link | - |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |
## Guard Models

| Time | Title | Venue | Paper | Code |
|---|---|---|---|---|
| **Classifier-based Guard Models** | | | | |
| 25.03 | Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | arXiv | link | link |
| 25.01 | Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails | arXiv | link | - |
| 24.11 | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | arXiv | link | - |
| 24.07 | ShieldGemma: Generative AI Content Moderation Based on Gemma | arXiv | link | link |
| 24.07 | (Llama Guard 3) The Llama 3 Herd of Models | arXiv | link | - |
| 24.06 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | arXiv | link | link |
| **Reasoning-based Guard Models** | | | | |
| 25.04 | X-Guard: Multilingual Guard Agent for Content Moderation | arXiv | link | - |
| 25.02 | ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails | arXiv | link | link |
| 25.01 | GuardReasoner: Towards Reasoning-based LLM Safeguards | arXiv | link | link |

