Official repository for "Safety in Large Reasoning Models: A Survey" - Exploring safety risks, attacks, and defenses for Large Reasoning Models to enhance their security and reliability.

WangCheng0116/Awesome-LRMs-Safety



🔥 The paper has been accepted to EMNLP 2025 Findings.

This repository contains a carefully curated collection of the papers discussed in our survey, "Safety in Large Reasoning Models: A Survey". As LRMs become increasingly powerful, understanding their safety implications is critical for responsible AI development. We created this resource to support researchers and practitioners working in this emerging field. If you find it useful for your work or research, please consider starring the repository and citing our paper.

Reference

If you find this repository helpful for your research, we would greatly appreciate it if you could cite our papers. ✨

```bibtex
@misc{wang2025safetylargereasoningmodels,
      title={Safety in Large Reasoning Models: A Survey},
      author={Cheng Wang and Yue Liu and Baolong Bi and Duzhen Zhang and Zhong-Zhi Li and Yingwei Ma and Yufei He and Shengju Yu and Xinfeng Li and Junfeng Fang and Jiaheng Zhang and Bryan Hooi},
      year={2025},
      eprint={2504.17704},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2504.17704},
}

@article{liuyue_GuardReasoner,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Liu, Yue and Gao, Hongcheng and Zhai, Shengfang and Xia, Jun and Wu, Tianyi and Xue, Zhiwei and Chen, Yulin and Kawaguchi, Kenji and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2501.18492},
  year={2025}
}

@article{liuyue_GuardReasoner_VL,
  title={GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning},
  author={Liu, Yue and Zhai, Shengfang and Du, Mingzhe and Chen, Yulin and Cao, Tri and Gao, Hongcheng and Wang, Cheng and Li, Xinfeng and Wang, Kun and Fang, Junfeng and Zhang, Jiaheng and Hooi, Bryan},
  journal={arXiv preprint arXiv:2505.11049},
  year={2025}
}
```

📚 What's Inside?

  • 🔬 Cutting-edge research on LRM vulnerabilities
  • 🛠️ Novel attack methodologies against reasoning models
  • 🛡️ Defense strategies and safety alignment techniques
  • 🔄 Regular updates as the field evolves

✨ How to Contribute

  • ⭐ Star this repository to show support
  • 🔀 Create a PR if you notice missing papers
  • 📣 Share with the research community
[Figure: Timeline overview]
[Figure: Taxonomy overview]

Table of Contents

Safety Risks of LRMs

Harmful Request Compliance Risks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 25.04 | DeepSeek-R1 Thoughtology: Let's `<think>` about LLM Reasoning | arXiv | link | - |
| 25.01 | Early External Safety Testing of OpenAI's o3-mini: Insights from the Pre-Deployment Evaluation | arXiv | link | - |
| 25.01 | o3-mini vs DeepSeek-R1: Which One is Safer? | arXiv | link | - |

Agentic Misbehavior Risks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 25.04 | Emerging Cyber Attack Risks of Medical AI Agents | arXiv | link | - |
| 25.02 | Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents | arXiv | link | link |
| 25.02 | Demonstrating specification gaming in reasoning models | arXiv | link | - |
| 25.02 | Evaluating the Paperclip Maximizer: Are RL-Based Language Models More Likely to Pursue Instrumental Goals? | arXiv | link | link |
| 25.01 | Deception in LLMs: Self-Preservation and Autonomous Goals in Large Language Models | arXiv | link | - |

Multi-lingual Safety Risks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 25.03 | Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | arXiv | link | link |
| 25.03 | Red Teaming Contemporary AI Models: Insights from Spanish and Basque Perspectives | arXiv | link | - |
| 25.02 | Safety Evaluation of DeepSeek Models in Chinese Contexts | arXiv | link | - |

Multi-modal Safety Risks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 25.04 | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | arXiv | link | link |

Attacks on LRMs

Reasoning Length Attacks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Overthinking** | | | |
| 25.02 | OverThink: Slowdown Attacks on Reasoning LLMs | arXiv | link | link |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| | **Underthinking** | | | |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |

Answer Correctness Attacks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Reasoning-based Backdoor Attacks** | | | |
| 25.04 | ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs | arXiv | link | - |
| 25.02 | BoT: Breaking Long Thought Processes of o1-like Large Language Models through Backdoor Attack | arXiv | link | link |
| 25.01 | DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs | arXiv | link | - |
| 24.01 | BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models | arXiv | link | - |
| | **Error Injection** | | | |
| 25.03 | Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps | arXiv | link | - |

Prompt Injection Attacks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| 25.02 | The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1 | arXiv | link | - |
| 25.02 | Can Indirect Prompt Injection Attacks Be Detected and Removed? | arXiv | link | - |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| 24.11 | Defense Against Prompt Injection Attack by Leveraging Attack Techniques | arXiv | link | - |

Jailbreak Attacks

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Prompt-based Attacks** | | | |
| 25.04 | SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models | arXiv | link | link |
| 25.03 | Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings | arXiv | link | link |
| 24.07 | Does Refusal Training in LLMs Generalize to the Past Tense? | arXiv | link | link |
| | **Multi-Turn Attacks** | | | |
| 25.02 | Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models | arXiv | link | link |
| 24.10 | Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues | arXiv | link | link |
| 24.08 | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | arXiv | link | link |
| | **Reasoning-based Attacks** | | | |
| 25.02 | A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos | arXiv | link | - |
| 25.02 | H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models | arXiv | link | link |

Defenses for LRMs

Safety Alignment of LRMs

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Harm CoT Data Curation** | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| | **Safe CoT Data Curation** | | | |
| 25.04 | STAR-1: Safer Alignment of Reasoning LLMs with 1K Data | arXiv | link | link |
| 25.04 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | arXiv | link | link |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |
| | **SFT-based Safety Alignment on Reasoning** | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| 25.04 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | arXiv | link | link |
| 25.03 | Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable | arXiv | link | link |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |
| 25.02 | Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment | arXiv | link | - |
| | **RL-based Safety Alignment on Reasoning** | | | |
| 25.04 | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | arXiv | link | link |
| 25.02 | STAIR: Improving Safety Alignment with Introspective Reasoning | arXiv | link | link |
| 25.02 | Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking | arXiv | link | link |
| 24.12 | Deliberative Alignment: Reasoning Enables Safer Language Models | arXiv | link | - |

Inference-time Defenses for LRMs

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Inference-time Scaling on Reasoning** | | | |
| 25.01 | Trading Inference-Time Compute for Adversarial Robustness | arXiv | link | - |
| | **Safe Decoding for Reasoning** | | | |
| 25.05 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | arXiv | link | link |
| 25.03 | Effectively Controlling Reasoning Models through Thinking Intervention | arXiv | link | - |
| 25.02 | SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities | arXiv | link | link |

Guard Models for LRMs

| Time | Title | Venue | Paper | Code |
|------|-------|-------|-------|------|
| | **Classifier-based Guard Model** | | | |
| 25.03 | Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models | arXiv | link | link |
| 25.01 | Aegis2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails | arXiv | link | - |
| 24.11 | Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations | arXiv | link | - |
| 24.07 | ShieldGemma: Generative AI Content Moderation Based on Gemma | arXiv | link | link |
| 24.07 | (LLaMA Guard) The Llama 3 Herd of Models | arXiv | link | - |
| 24.06 | WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs | arXiv | link | link |
| | **Reasoning-based Guard Model** | | | |
| 25.04 | X-Guard: Multilingual Guard Agent for Content Moderation | arXiv | link | - |
| 25.02 | ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails | arXiv | link | link |
| 25.01 | GuardReasoner: Towards Reasoning-based LLM Safeguards | arXiv | link | link |

Contributors

wangcheng yueliu1999 junfeng
