Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-07-23 | HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study | Mandar Pitale et.al. | 2507.17118 | null |
2025-07-22 | Towards Trustworthy AI: Secure Deepfake Detection using CNNs and Zero-Knowledge Proofs | H M Mohaimanul Islam et.al. | 2507.17010 | null |
2025-07-22 | Depth Gives a False Sense of Privacy: LLM Internal States Inversion | Tian Dong et.al. | 2507.16372 | null |
2025-07-19 | Combining Cost-Constrained Runtime Monitors for AI Safety | Tim Tian Hua et.al. | 2507.15886 | null |
2025-07-19 | When Autonomy Goes Rogue: Preparing for Risks of Multi-Agent Collusion in Social Systems | Qibing Ren et.al. | 2507.14660 | null |
2025-07-22 | Mapping the Parasocial AI Market: User Trends, Engagement and Risks | Zilan Qian et.al. | 2507.14226 | null |
2025-07-15 | Mitigating Trojanized Prompt Chains in Educational LLM Use Cases: Experimental Findings and Detection Tool Design | Richard M. Charles et.al. | 2507.14207 | null |
2025-07-23 | Fake or Real: The Impostor Hunt in Texts for Space Operations | Agata Kaczmarek et.al. | 2507.13508 | null |
2025-07-17 | Manipulation Attacks by Misaligned AI: Risk Analysis and Safety Case Framework | Rishane Dassanayake et.al. | 2507.12872 | null |
2025-07-16 | LLMs Encode Harmfulness and Refusal Separately | Jiachen Zhao et.al. | 2507.11878 | null |
2025-07-09 | The AI Shadow War: SaaS vs. Edge Computing Architectures | Rhea Pritham Marpu et.al. | 2507.11545 | null |
2025-07-15 | Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety | Tomek Korbak et.al. | 2507.11473 | null |
2025-07-14 | 3S-Attack: Spatial, Spectral and Semantic Invisible Backdoor Attack Against DNN Models | Jianyao Yin et.al. | 2507.10733 | null |
2025-07-16 | From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents | Tatiana Petrova et.al. | 2507.10644 | null |
2025-07-14 | Can You Detect the Difference? | İsmail Tarım et.al. | 2507.10475 | null |
2025-07-14 | BlueGlass: A Framework for Composite AI Safety | Harshal Nandigramwar et.al. | 2507.10106 | null |
2025-07-13 | Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications | Jia Yi Goh et.al. | 2507.09820 | null |
2025-07-12 | Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers | Santhosh Kumar Ravindran et.al. | 2507.09406 | null |
2025-07-06 | Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking | Aldan Creo et.al. | 2507.08014 | null |
2025-07-15 | Secure Cooperative Gradient Coding: Optimality, Reliability, and Global Privacy | Shudi Weng et.al. | 2507.07565 | null |
2025-07-09 | Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models | Aaron Dharna et.al. | 2507.06466 | null |
2025-07-08 | Humans overrely on overconfident language models, across languages | Neil Rathi et.al. | 2507.06306 | null |
2025-07-07 | Evaluating the Critical Risks of Amazon's Nova Premier under the Frontier Model Safety Framework | Satyapriya Krishna et.al. | 2507.06260 | null |
2025-07-08 | CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations | Xiaohu Li et.al. | 2507.06043 | null |
2025-07-08 | Domain adaptation of large language models for geotechnical applications | Lei Fan et.al. | 2507.05613 | null |
2025-07-07 | When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors | Scott Emmons et.al. | 2507.05246 | null |
2025-07-07 | Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Wei Duan et.al. | 2507.04673 | null |
2025-07-03 | From Turing to Tomorrow: The UK's Approach to AI Regulation | Oliver Ritchie et.al. | 2507.03050 | null |
2025-07-01 | 'For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts | Annika M Schoene et.al. | 2507.02990 | null |
2025-07-01 | GAF-Guard: An Agentic Framework for Risk Management and Governance in Large Language Models | Seshu Tirupathi et.al. | 2507.02986 | null |
2025-07-03 | Moral Responsibility or Obedience: What Do We Want from AI? | Joseph Boland et.al. | 2507.02788 | null |
2025-07-03 | Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks | Sizhe Chen et.al. | 2507.02735 | null |
2025-07-02 | How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks | Rahul Ramachandran et.al. | 2507.01955 | null |
2025-07-02 | Out-of-Distribution Detection Methods Answer the Wrong Questions | Yucen Lily Li et.al. | 2507.01831 | null |
2025-07-01 | SAFER: Probing Safety in Reward Models with Sparse Autoencoder | Sihang Li et.al. | 2507.00665 | null |
2025-06-30 | Thinking About Thinking: SAGE-nano's Inverse Reasoning for Self-Aware Language Models | Basab Jha et.al. | 2507.00092 | null |
2025-06-30 | Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments | Christoph Schnabl et.al. | 2506.23706 | null |
2025-06-30 | A New Perspective On AI Safety Through Control Theory Methodologies | Lars Ullrich et.al. | 2506.23703 | null |
2025-06-29 | Securing AI Systems: A Guide to Known Attacks and Impacts | Naoto Kiribuchi et.al. | 2506.23296 | null |
2025-06-28 | MPC in the Quantum Head (or: Superposition-Secure (Quantum) Zero-Knowledge) | Andrea Coladangelo et.al. | 2506.22961 | null |
2025-06-25 | Mitigating Gambling-Like Risk-Taking Behaviors in Large Language Models: A Behavioral Economics Approach to AI Safety | Y. Du et.al. | 2506.22496 | null |
2025-06-24 | Report on NSF Workshop on Science of Safe AI | Rajeev Alur et.al. | 2506.22492 | null |
2025-06-27 | A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety | Camille François et.al. | 2506.22183 | null |
2025-06-27 | SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation | Adam Goodge et.al. | 2506.21892 | null |
2025-06-30 | The Singapore Consensus on Global AI Safety Research Priorities | Yoshua Bengio et.al. | 2506.20702 | null |
2025-06-25 | Probing AI Safety with Source Code | Ujwal Narayan et.al. | 2506.20471 | null |
2025-06-24 | Persona Features Control Emergent Misalignment | Miles Wang et.al. | 2506.19823 | null |
2025-06-21 | AI Safety vs. AI Security: Demystifying the Distinction and Boundaries | Zhiqiang Lin et.al. | 2506.18932 | null |
2025-06-23 | How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models | Feng He et.al. | 2506.18428 | null |
2025-06-23 | LLM-Integrated Digital Twins for Hierarchical Resource Allocation in 6G Networks | Majumder Haider et.al. | 2506.18293 | null |
2025-06-22 | AI Through the Human Lens: Investigating Cognitive Theories in Machine Psychology | Akash Kundu et.al. | 2506.18156 | null |
2025-06-22 | | Bugra Kilictas et.al. | 2506.18129 | null |
2025-06-21 | Out of Control -- Why Alignment Needs Formal Control Theory (and an Alignment Control Stack) | Elija Perrier et.al. | 2506.17846 | null |
2025-06-20 | SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification | Zhenglin Lai et.al. | 2506.17368 | null |
2025-06-19 | PL-Guard: Benchmarking Language Model Safety for Polish | Aleksandra Krasnodębska et.al. | 2506.16322 | null |
2025-06-19 | Probing the Robustness of Large Language Models Safety to Latent Perturbations | Tianle Gu et.al. | 2506.16078 | link |
2025-06-18 | LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning | Gabriel J. Perin et.al. | 2506.15606 | link |
2025-06-17 | TriGuard: Testing Model Safety with Attribution Entropy, Verification, and Drift | Dipesh Tharu Mahato et.al. | 2506.14217 | link |
2025-06-17 | The Ethics of Generative AI in Anonymous Spaces: A Case Study of 4chan's /pol/ Board | Parth Gaba et.al. | 2506.14191 | null |
2025-06-17 | Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions | Junfeng Jiao et.al. | 2506.13510 | link |
2025-06-16 | Position: Certified Robustness Does Not (Yet) Imply Model Security | Andrew C. Cullen et.al. | 2506.13024 | null |
2025-06-15 | Intriguing Frequency Interpretation of Adversarial Robustness for CNNs and ViTs | Lu Chen et.al. | 2506.12875 | null |
2025-06-14 | OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics | Vineeth Dorna et.al. | 2506.12618 | link |
2025-06-14 | Tiered Agentic Oversight: A Hierarchical Multi-Agent System for AI Safety in Healthcare | Yubin Kim et.al. | 2506.12482 | null |
2025-06-13 | InfoFlood: Jailbreaking Large Language Models with Information Overload | Advait Yadav et.al. | 2506.12274 | null |
2025-06-13 | Hatevolution: What Static Benchmarks Don't Tell Us | Chiara Di Bonaventura et.al. | 2506.12148 | null |
2025-06-13 | Improving Large Language Model Safety with Contrastive Representation Learning | Samuel Simko et.al. | 2506.11938 | link |
2025-06-13 | Model Organisms for Emergent Misalignment | Edward Turner et.al. | 2506.11613 | null |
2025-06-12 | The Alignment Trap: Complexity Barriers | Jasper Yao et.al. | 2506.10304 | null |
2025-06-11 | Data-Centric Safety and Ethical Measures for Data and AI Governance | Srija Chakraborty et.al. | 2506.10217 | null |
2025-06-09 | LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges | Haoyang Li et.al. | 2506.10022 | link |
2025-06-08 | Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations | Zhiyu Xue et.al. | 2506.09067 | null |
2025-06-11 | Societal AI Research Has Become Less Interdisciplinary | Dror Kris Markus et.al. | 2506.08738 | null |
2025-06-11 | AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin | Shuo Yang et.al. | 2506.08473 | link |
2025-06-06 | Benchmarking Misuse Mitigation Against Covert Adversaries | Davis Brown et.al. | 2506.06414 | link |
2025-06-03 | Rational Superautotrophic Diplomacy (SupraAD); A Conceptual Framework for Alignment Based on Interdisciplinary Findings on the Fundamentals of Cognition | Andrea Morris et.al. | 2506.05389 | null |
2025-06-05 | Normative Conflicts and Shallow AI Alignment | Raphaël Millière et.al. | 2506.04679 | null |
2025-06-04 | Watermarking Degrades Alignment in Language Models: Analysis and Mitigation | Apurv Verma et.al. | 2506.04462 | link |
2025-06-04 | Misalignment or misuse? The AGI alignment tradeoff | Max Hellrigel-Holderbaum et.al. | 2506.03755 | null |
2025-06-04 | Bridging the Artificial Intelligence Governance Gap: The United States' and China's Divergent Approaches to Governing General-Purpose Artificial Intelligence | Oliver Guest et.al. | 2506.03497 | null |
2025-06-03 | MAEBE: Multi-Agent Emergent Behavior Framework | Sinem Erisken et.al. | 2506.03053 | null |
2025-06-02 | Trojan Horse Hunt in Time Series Forecasting for Space Operations | Krzysztof Kotowski et.al. | 2506.01849 | null |
2025-06-02 | ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs | Zeming Wei et.al. | 2506.01770 | link |
2025-06-02 | Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation | Yuan Gan et.al. | 2506.01591 | link |
2025-05-31 | Wide Reflective Equilibrium in LLM Alignment: Bridging Moral Epistemology and AI Safety | Matthew Brophy et.al. | 2506.00415 | null |
2025-05-30 | Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences | Mingqian Zheng et.al. | 2506.00195 | null |
2025-05-30 | Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment | Kundan Krishna et.al. | 2506.00166 | null |
2025-05-30 | TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Xiaorui Wu et.al. | 2505.24672 | link |
2025-05-30 | Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | Utsav Maskey et.al. | 2505.24621 | null |
2025-05-30 | The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It | Zheng-Xin Yong et.al. | 2505.24119 | null |
2025-05-29 | OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities | Sahil Verma et.al. | 2505.23856 | link |
2025-05-27 | Watermarking Without Standards Is Not AI Governance | Alexander Nemecek et.al. | 2505.23814 | null |
2025-05-29 | SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents | Kunlun Zhu et.al. | 2505.23559 | link |
2025-05-29 | Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models | Mingyu Yu et.al. | 2505.23404 | null |
2025-05-28 | Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies | Chenruo Liu et.al. | 2505.22829 | null |
2025-05-28 | TensorShield: Safeguarding On-Device Inference by Shielding Critical DNN Tensors with TEE | Tong Sun et.al. | 2505.22735 | link |
2025-05-27 | Expert Survey: AI Reliability & Security Research Priorities | Joe O'Brien et.al. | 2505.21664 | null |
2025-05-27 | Preventing Adversarial AI Attacks Against Autonomous Situational Awareness: A Maritime Case Study | Mathew J. Walter et.al. | 2505.21609 | null |
2025-05-27 | SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge | Fengqing Jiang et.al. | 2505.21605 | null |
2025-05-26 | Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts | Hee-Seon Kim et.al. | 2505.21556 | null |
2025-05-27 | The Multilingual Divide and Its Impact on Global AI Safety | Aidan Peppin et.al. | 2505.21344 | null |
2025-05-27 | Red-Teaming Text-to-Image Systems by Rule-based Preference Modeling | Yichuan Cao et.al. | 2505.21074 | null |
2025-05-26 | VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration | Jiahui Geng et.al. | 2505.20362 | link |
2025-05-26 | What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs | Sangyeop Kim et.al. | 2505.19773 | null |
2025-05-25 | When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas | Steffen Backmann et.al. | 2505.19212 | link |
2025-05-25 | GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | Zixuan Chen et.al. | 2505.18979 | null |
2025-05-24 | Guided by Guardrails: Control Barrier Functions as Safety Instructors for Robotic Learning | Maeva Guerrier et.al. | 2505.18858 | null |
2025-05-24 | Safety Alignment via Constrained Knowledge Unlearning | Zesheng Shi et.al. | 2505.18588 | null |
2025-05-23 | Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary | Licheng Pan et.al. | 2505.18325 | null |
2025-05-23 | Surfacing Semantic Orthogonality Across Model Safety Benchmarks: A Multi-Dimensional Analysis | Jonathan Bennion et.al. | 2505.17636 | null |
2025-05-23 | Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models | Jiawei Kong et.al. | 2505.17601 | null |
2025-05-20 | From nuclear safety to LLM security: Applying non-probabilistic risk management strategies to build safe and secure LLM-powered systems | Alexander Gutfraind et.al. | 2505.17084 | null |
2025-05-22 | When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | Jianing Geng et.al. | 2505.16765 | null |
2025-05-22 | Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization | Chengcan Wu et.al. | 2505.16737 | link |
2025-05-21 | Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack | Silvia Cappelletti et.al. | 2505.15323 | null |
2025-05-20 | Foundations of Unknown-aware Machine Learning | Xuefeng Du et.al. | 2505.14933 | null |
2025-05-20 | Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas | Yu Ying Chiu et.al. | 2505.14633 | link |
2025-05-19 | Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations | Li Ji-An et.al. | 2505.13763 | null |
2025-05-16 | Noise Injection Systemically Degrades Large Language Model Safety Guardrails | Prithviraj Singh Shahani et.al. | 2505.13500 | null |
2025-05-19 | Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities | Lili Zhang et.al. | 2505.13195 | null |
2025-05-19 | Bullying the Machine: How Personas Increase LLM Vulnerability | Ziwei Xu et.al. | 2505.12692 | null |
2025-05-18 | Persuasion and Safety in the Era of Generative AI | Haein Kong et.al. | 2505.12248 | null |
2025-05-17 | Position Paper: Bounded Alignment: What (Not) To Expect From AGI Agents | Ali A. Minai et.al. | 2505.11866 | null |
2025-05-16 | Probing the Vulnerability of Large Language Models to Polysemantic Interventions | Bofan Gong et.al. | 2505.11611 | null |
2025-05-16 | Illusion or Algorithm? Investigating Memorization, Emergence, and Symbolic Processing in In-Context Learning | Jingcheng Niu et.al. | 2505.11004 | link |
2025-05-15 | Formalising Human-in-the-Loop: Computational Reductions, Failure Modes, and Legal-Moral Responsibility | Maurice Chiodo et.al. | 2505.10426 | null |
2025-05-15 | Dark LLMs: The Growing Threat of Unaligned AI Models | Michael Fire et.al. | 2505.10066 | null |
2025-05-15 | Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | Adel ElZemity et.al. | 2505.09974 | null |
2025-05-14 | Access Controls Will Solve the Dual-Use Dilemma | Evžen Wybitul et.al. | 2505.09341 | null |
2025-05-16 | SecReEvalBench: A Multi-turned Security Resilience Evaluation Benchmark for Large Language Models | Huining Cui et.al. | 2505.07584 | null |
2025-05-09 | Offensive Security for AI Systems: Concepts, Practices, and Applications | Josh Harguess et.al. | 2505.06380 | null |
2025-05-08 | Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods | Markov Grey et.al. | 2505.05541 | null |
2025-05-08 | Reasoning Models Don't Always Say What They Think | Yanda Chen et.al. | 2505.05410 | null |
2025-05-08 | Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation | Luca Marzari et.al. | 2505.05235 | null |
2025-05-08 | Belief Filtering for Epistemic Control in Linguistic State Space | Sebastian Dumbrava et.al. | 2505.04927 | null |
2025-05-07 | The Aloe Family Recipe for Open and Specialized Healthcare LLMs | Dario Garcia-Gasulla et.al. | 2505.04388 | null |
2025-05-07 | Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety | Variath Madhupal Gautham Nair et.al. | 2505.04146 | null |
2025-05-08 | An alignment safety case sketch based on debate | Marie Davidsen Buhl et.al. | 2505.03989 | null |
2025-05-05 | What Is AI Safety? What Do We Want It to Be? | Jacqueline Harding et.al. | 2505.02313 | null |
2025-05-04 | Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Christian Schroeder de Witt et.al. | 2505.02077 | null |
2025-05-03 | Third-party compliance reviews for frontier AI safety frameworks | Aidan Homewood et.al. | 2505.01643 | null |
2025-05-02 | Securing the Future of IVR: AI-Driven Innovation with Agile Security, Data Regulation, and Ethical AI Integration | Khushbu Mehboob Shaikh et.al. | 2505.01514 | null |
2025-04-30 | A Domain-Agnostic Scalable AI Safety Ensuring Framework | Beomjun Kim et.al. | 2504.20924 | null |
2025-04-29 | When Testing AI Tests Us: Safeguarding Mental Health on the Digital Frontlines | Sachin R. Pendse et.al. | 2504.20910 | null |
2025-04-25 | AI Awareness | Xiaojian Li et.al. | 2504.20084 | null |
2025-04-28 | Mitigating Societal Cognitive Overload in the Age of AI: Challenges and Directions | Salem Lahlou et.al. | 2504.19990 | null |
2025-05-02 | Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents | Vineeth Sai Narajala et.al. | 2504.19956 | null |
2025-04-28 | AI Alignment in Medical Imaging: Unveiling Hidden Biases Through Counterfactual Analysis | Haroui Ma et.al. | 2504.19621 | link |
2025-04-26 | Latent Adversarial Training Improves the Representation of Refusal | Alexandra Abbas et.al. | 2504.18872 | null |
2025-04-25 | AI Safety Assurance for Automated Vehicles: A Survey on Research, Standardization, Regulation | Lars Ullrich et.al. | 2504.18328 | null |
2025-04-25 | RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models | Bang An et.al. | 2504.18041 | null |
2025-04-17 | Security-First AI: Foundations for Robust and Trustworthy Systems | Krti Tallam et.al. | 2504.16110 | null |
2025-04-21 | Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds | Heidy Khlaaf et.al. | 2504.15088 | null |
2025-04-20 | A Byzantine Fault Tolerance Approach towards AI Safety | John deVadoss et.al. | 2504.14668 | null |
2025-04-20 | Seeing Through Risk: A Symbolic Approximation of Prospect Theory | Ali Arslan Yousaf et.al. | 2504.14448 | null |
2025-04-16 | AI Safety Should Prioritize the Future of Work | Sanchaita Hazra et.al. | 2504.13959 | null |
2025-04-17 | In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate? | Ben Bucknall et.al. | 2504.12914 | null |
2025-04-16 | Secure Transfer Learning: Training Clean Models Against Backdoor in (Both) Pre-trained Encoders and Downstream Datasets | Yechao Zhang et.al. | 2504.11990 | null |
2025-04-14 | The Jailbreak Tax: How Useful are Your Jailbreak Outputs? | Kristina Nikolić et.al. | 2504.10694 | link |
2025-04-14 | Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? | Yanbo Wang et.al. | 2504.10000 | null |
2025-04-13 | The Structural Safety Generalization Problem | Julius Broomfield et.al. | 2504.09712 | link |
2025-04-13 | Mitigating Many-Shot Jailbreaking | Christopher M. Ackerman et.al. | 2504.09604 | null |
2025-04-10 | Geneshift: Impact of different scenario shift on Jailbreaking LLM | Tianyi Wu et.al. | 2504.08104 | null |
2025-04-10 | The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search | Yutaro Yamada et.al. | 2504.08066 | link |
2025-04-10 | Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Riccardo Cantini et.al. | 2504.07887 | link |
2025-04-07 | Following the Whispers of Values: Unraveling Neural Mechanisms Behind Value-Oriented Behaviors in LLMs | Ling Hu et.al. | 2504.04994 | null |
2025-04-05 | Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability | Vishnu Kabir Chhabra et.al. | 2504.04215 | null |
2025-04-05 | Among Us: A Sandbox for Agentic Deception | Satvik Golechha et.al. | 2504.04072 | null |
2025-04-03 | Improving Harmful Text Detection with Joint Retrieval and External Knowledge | Zidong Yu et.al. | 2504.02310 | null |
2025-04-02 | Reinsuring AI: Energy, Agriculture, Finance & Medicine as Precedents for Scalable Governance of Frontier Artificial Intelligence | Nicholas Stetler et.al. | 2504.02127 | null |
2025-03-28 | A Framework for Cryptographic Verifiability of End-to-End AI Pipelines | Kar Balan et.al. | 2503.22573 | null |
2025-03-28 | Effective Automation to Support the Human Infrastructure in AI Red Teaming | Alice Qian Zhang et.al. | 2503.22116 | null |
2025-03-28 | Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories | Yazhou Zhang et.al. | 2503.22115 | null |
2025-03-31 | MAD Chairs: A new tool to evaluate AI | Chris Santos-Lang et.al. | 2503.20986 | null |
2025-03-26 | The Backfiring Effect of Weak AI Safety Regulation | Benjamin Laufer et.al. | 2503.20848 | null |
2025-03-26 | AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges | Haoyu Gao et.al. | 2503.19444 | null |
2025-03-18 | International Agreements on AI Safety: Review and Recommendations for a Conditional AI Safety Treaty | Rebecca Scholefield et.al. | 2503.18956 | null |
2025-03-22 | Intelligence Sequencing and the Path-Dependence of Intelligence Evolution: AGI-First vs. DCI-First as Irreversible Attractors | Andy E. Williams et.al. | 2503.17688 | null |
2025-03-17 | AI Companies Should Report Pre- and Post-Mitigation Safety Evaluations | Dillon Bowen et.al. | 2503.17388 | null |
2025-03-18 | Temporal Context Awareness: A Defense Framework Against Multi-turn Manipulation Attacks on Large Language Models | Prashant Kulkarni et.al. | 2503.15560 | link |
2025-03-19 | A Peek Behind the Curtain: Using Step-Around Prompt Engineering to Identify Bias and Misinformation in GenAI Models | Don Hickerson et.al. | 2503.15205 | null |
2025-03-17 | ProDiF: Protecting Domain-Invariant Features to Secure Pre-Trained Models Against Extraction | Tong Zhou et.al. | 2503.13224 | null |
2025-03-17 | Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering | Kenneth J. K. Ong et.al. | 2503.12722 | null |

Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-07-21 | Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems | Andrii Balashov et.al. | 2507.15613 | null |
2025-07-21 | QSAF: A Novel Mitigation Framework for Cognitive Degradation in Agentic AI | Hammad Atta et.al. | 2507.15330 | null |
2025-07-21 | PromptArmor: Simple yet Effective Prompt Injection Defenses | Tianneng Shi et.al. | 2507.15219 | null |
2025-07-20 | DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection | Jerry Wang et.al. | 2507.15042 | null |
2025-07-20 | AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning | Yi Zhang et.al. | 2507.14987 | null |
2025-07-20 | Hierarchical Cross-modal Prompt Learning for Vision-Language Models | Hao Zheng et.al. | 2507.14976 | null |
2025-07-20 | Strategic Integration of AI Chatbots in Physics Teacher Preparation: A TPACK-SWOT Analysis of Pedagogical, Epistemic, and Cybersecurity Dimensions | N. Mohammadipour et.al. | 2507.14860 | null |
2025-07-20 | Manipulating LLM Web Agents with Indirect Prompt Injection Attack via HTML Accessibility Tree | Sam Johnson et.al. | 2507.14799 | null |
2025-07-18 | Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models | Palash Nandi et.al. | 2507.13761 | null |
2025-07-18 | TopicAttack: An Indirect Prompt Injection Attack via Topic Transition | Yulin Chen et.al. | 2507.13686 | null |
2025-07-17 | Paper Summary Attack: Jailbreaking LLMs through LLM Safety Papers | Liang Lin et.al. | 2507.13474 | null |
2025-07-17 | Prompt Injection 2.0: Hybrid AI Threats | Jeremy McHugh et.al. | 2507.13169 | null |
2025-07-17 | MAD-Spear: A Conformity-Driven Prompt Injection Attack on Multi-Agent Debate Systems | Yu Cui et.al. | 2507.13038 | null |
2025-07-16 | Exploiting Jailbreaking Vulnerabilities in Generative AI to Bypass Ethical Safeguards for Facilitating Phishing Attacks | Rina Mishra et.al. | 2507.12185 | null |
2025-07-16 | LLMs Encode Harmfulness and Refusal Separately | Jiachen Zhao et.al. | 2507.11878 | null |
2025-07-15 | Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility | Brendan Murphy et.al. | 2507.11630 | null |
2025-07-14 | ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning | Zhengyue Zhao et.al. | 2507.11500 | null |
2025-07-15 | The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Zichen Wen et.al. | 2507.11097 | null |
2025-07-17 | SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems | Wenliang Shan et.al. | 2507.08898 | null |
2025-07-10 | A Dynamic Stackelberg Game Framework for Agentic AI Defense Against LLM Jailbreaking | Zhengye Han et.al. | 2507.08207 | null |
2025-07-10 | Defending Against Prompt Injection With a Few DefensiveTokens | Sizhe Chen et.al. | 2507.07974 | null |
2025-07-10 | GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing | Peiyan Zhang et.al. | 2507.07735 | null |
2025-07-10 | May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks | Nishit V. Pandya et.al. | 2507.07417 | null |
2025-07-09 | An attention-aware GNN-based input defender against multi-turn jailbreak on LLMs | Zixuan Huang et.al. | 2507.07146 | null |
2025-07-11 | The Dark Side of LLMs Agent-based Attacks for Complete Computer Takeover | Matteo Lupinacci et.al. | 2507.06850 | null |
2025-07-09 | On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks | Stephen Obadinma et.al. | 2507.06489 | null |
2025-07-09 | Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models | Aaron Dharna et.al. | 2507.06466 | null |
2025-07-08 | Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms | Tarek Gasmi et.al. | 2507.06323 | null |
2025-07-08 | The bitter lesson of misuse detection | Hadrien Mariaccia et.al. | 2507.06282 | null |
2025-07-08 | Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review | Zhicheng Lin et.al. | 2507.06185 | null |
2025-07-08 | CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations | Xiaohu Li et.al. | 2507.06043 | null |
2025-07-08 | TuneShield: Mitigating Toxicity in Conversational AI while Fine-tuning on Untrusted Data | Aravind Cheruvu et.al. | 2507.05660 | null |
2025-07-08 | How Not to Detect Prompt Injections with an LLM | Sarthak Choudhary et.al. | 2507.05630 | null |
2025-07-07 | A Systematization of Security Vulnerabilities in Computer Use Agents | Daniel Jones et.al. | 2507.05445 | null |
2025-07-07 | Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models | Ziqi Miao et.al. | 2507.05248 | null |
2025-07-07 | Trojan Horse Prompting: Jailbreaking Conversational Multimodal Models by Forging Assistant Message | Wei Duan et.al. | 2507.04673 | null |
2025-07-06 | Tail-aware Adversarial Attacks: A Distributional Approach to Efficient LLM Jailbreaking | Tim Beyer et.al. | 2507.04446 | null |
2025-07-06 | Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs | Xiaomeng Hu et.al. | 2507.04365 | null |
2025-07-04 | On Jailbreaking Quantized Language Models Through Fault Injection Attacks | Noureldin Zahran et.al. | 2507.03236 | null |
2025-07-03 | Adversarial Manipulation of Reasoning Models using Internal Representations | Kureha Yamaguchi et.al. | 2507.03167 | null |
2025-07-03 | LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users | Almog Hilel et.al. | 2507.02850 | null |
2025-07-03 | Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection | Ziqi Miao et.al. | 2507.02844 | null |
2025-07-03 | Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models | Riccardo Cantini et.al. | 2507.02799 | null |
2025-07-03 | Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks | Sizhe Chen et.al. | 2507.02735 | null |
2025-07-03 | PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage | Krishna Kanth Nakka et.al. | 2507.02332 | null |
2025-07-02 | MGC: A Compiler Framework Exploiting Compositional Blindness in Aligned LLMs for Malware Generation | Lu Yan et.al. | 2507.02057 | null |
2025-07-02 | SafePTR: Token-Level Jailbreak Defense in Multimodal LLMs via Prune-then-Restore Mechanism | Beitao Chen et.al. | 2507.01513 | null |
2025-07-01 | Reasoning as an Adaptive Defense for Safety | Taeyoun Kim et.al. | 2507.00971 | null |
2025-07-01 | SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents | Siyuan Liang et.al. | 2507.00841 | null |
2025-07-02 | Transferable Modeling Strategies for Low-Resource LLM Tasks: A Prompt and Alignment-Based Approach | Shuangquan Lyu et.al. | 2507.00601 | null |
2025-06-30 | Linearly Decoding Refused Knowledge in Aligned Language Models | Aryan Shrivastava et.al. | 2507.00239 | null |
2025-06-30 | Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models | Tung-Ling Li et.al. | 2506.24056 | null |
2025-06-30 | Leveraging the Potential of Prompt Engineering for Hate Speech Detection in Low-Resource Languages | Ruhina Tabasshum Prome et.al. | 2506.23930 | null |
2025-06-30 | Evaluating Multi-Agent Defences Against Jailbreaking Attacks on Large Language Models | Maria Carolina Cornelia Wit et.al. | 2506.23576 | null |
2025-06-29 | From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows | Mohamed Amine Ferrag et.al. | 2506.23260 | null |
2025-06-28 | Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models | Younwoo Choi et.al. | 2506.22957 | null |
2025-06-27 | VERA: Variational Inference Framework for Jailbreaking Large Language Models | Anamika Lochab et.al. | 2506.22666 | null |
2025-06-27 | MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs | Boyuan Chen et.al. | 2506.22557 | null |
2025-07-01 | Red Teaming for Generative AI, Report on a Copyright-Focused Exercise Completed in an Academic Medical Center | James Wen et.al. | 2506.22523 | null |
2025-06-27 | A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety | Camille François et.al. | 2506.22183 | null |
2025-06-27 | Advancing Jailbreak Strategies: A Hybrid Approach to Exploiting LLM Vulnerabilities and Bypassing Modern Defenses | Mohamed Ahmed et.al. | 2506.21972 | null |
2025-06-24 | PrivacyXray: Detecting Privacy Breaches in LLMs through Semantic Consistency and Probability Certainty | Jinwen He et.al. | 2506.19563 | null |
2025-06-24 | MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models | Yinan Xia et.al. | 2506.19257 | null |
2025-06-23 | Command-V: Pasting LLM Behaviors via Activation Profiles | Barry Wang et.al. | 2506.19140 | null |
2025-06-23 | Enhancing Security in LLM Applications: A Performance Evaluation of Early Detection Systems | Valerii Gakh et.al. | 2506.19109 | null |
2025-06-23 | Security Assessment of DeepSeek and GPT Series Models against Jailbreak Attacks | Xiaodong Wu et.al. | 2506.18543 | null |
2025-06-23 | NSFW-Classifier Guided Prompt Sanitization for Safe Text-to-Image Generation | Yu Xie et.al. | 2506.18325 | null |
2025-06-22 | Multi-turn Jailbreaking via Global Refinement and Active Fabrication | Hua Tang et.al. | 2506.17881 | null |
2025-06-20 | Semantic-Aware Parsing for Security Logs | Julien Piet et.al. | 2506.17512 | null |
2025-06-20 | From Concepts to Components: Concept-Agnostic Attention Module Discovery in Transformers | Jingtong Su et.al. | 2506.17052 | null |
2025-06-20 | MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning | Muyang Zheng et.al. | 2506.16792 | null |
2025-06-20 | Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models | Lei Jiang et.al. | 2506.16760 | null |
2025-06-19 | Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models | Biao Yi et.al. | 2506.16447 | null |
2025-06-19 | Probing the Robustness of Large Language Models Safety to Latent Perturbations | Tianle Gu et.al. | 2506.16078 | link |
2025-06-18 | Sysformer: Safeguarding Frozen Large Language Models with Adaptive System Prompts | Kartik Sharma et.al. | 2506.15751 | null |
2025-06-18 | Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers | Tommaso Green et.al. | 2506.15674 | link |
2025-06-18 | From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem | Yanxu Mao et.al. | 2506.15170 | null |
2025-06-17 | OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents | Thomas Kuntz et.al. | 2506.14866 | link |
2025-06-17 | AIRTBench: Measuring Autonomous AI Red Teaming Capabilities in Language Models | Ads Dawson et.al. | 2506.14682 | link |
2025-06-16 | Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations | Abhilekh Borah et.al. | 2506.13901 | null |
2025-06-17 | Safe-Child-LLM: A Developmental Benchmark for Evaluating LLM Safety in Child-LLM Interactions | Junfeng Jiao et.al. | 2506.13510 | link |
2025-06-15 | Jailbreak Strength and Model Similarity Predict Transferability | Rico Angell et.al. | 2506.12913 | null |
2025-06-15 | Universal Jailbreak Suffixes Are Strong Attention Hijackers | Matan Ben-Tov et.al. | 2506.12880 | link |
2025-06-15 | SecurityLingua: Efficient Defense of LLM Jailbreak Attacks via Security-Aware Prompt Compression | Yucheng Li et.al. | 2506.12707 | null |
2025-06-15 | Alphabet Index Mapping: Jailbreaking LLMs through Semantic Dissimilarity | Bilal Saleh Husain et.al. | 2506.12685 | null |
2025-06-14 | Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 | Zonghao Ying et.al. | 2506.12430 | link |
2025-06-14 | Exploring the Secondary Risks of Large Language Models | Jiawei Chen et.al. | 2506.12382 | null |
2025-06-14 | QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety | Taegyeong Lee et.al. | 2506.12299 | null |
2025-06-13 | InfoFlood: Jailbreaking Large Language Models with Information Overload | Advait Yadav et.al. | 2506.12274 | null |
2025-06-13 | Investigating Vulnerabilities and Defenses Against Audio-Visual Attacks: A Comprehensive Survey Emphasizing Multimodal Models | Jinming Wen et.al. | 2506.11521 | null |
2025-06-12 | How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? | Sohee Yang et.al. | 2506.10979 | null |
2025-06-12 | SoK: Evaluating Jailbreak Guardrails for Large Language Models | Xunguang Wang et.al. | 2506.10597 | link |
2025-06-10 | Evaluation empirique de la sécurisation et de l'alignement de ChatGPT et Gemini: analyse comparative des vulnérabilités par expérimentations de jailbreaks | Rafaël Nouailles et.al. | 2506.10029 | null |
2025-06-09 | LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges | Haoyang Li et.al. | 2506.10022 | link |
2025-06-11 | LLMail-Inject: A Dataset from a Realistic Adaptive Prompt Injection Challenge | Sahar Abdelnabi et.al. | 2506.09956 | link |
2025-06-11 | Effective Red-Teaming of Policy-Adherent Agents | Itay Nakash et.al. | 2506.09600 | null |
2025-06-11 | AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI) | Danush Khanna et.al. | 2506.08885 | null |
2025-06-11 | Design Patterns for Securing LLM Agents against Prompt Injections | Luca Beurer-Kellner et.al. | 2506.08837 | null |
2025-06-09 | TokenBreak: Bypassing Text Classification Models Through Token Manipulation | Kasimir Schulz et.al. | 2506.07948 | null |
2025-06-11 | RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards | Jingnan Zheng et.al. | 2506.07736 | null |
2025-06-09 | Evaluating LLMs Robustness in Less Resourced Languages with Proxy Models | Maciej Chrabąszcz et.al. | 2506.07645 | null |
2025-06-09 | TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts | Torsten Krauß et.al. | 2506.07596 | null |
2025-06-09 | When Style Breaks Safety: Defending Language Models Against Superficial Style Alignment | Yuxin Xiao et.al. | 2506.07452 | link |
2025-06-09 | Beyond Jailbreaks: Revealing Stealthier and Broader LLM Security Risks Stemming from Alignment Failures | Yukai Zhou et.al. | 2506.07402 | null |
2025-06-08 | AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint | Leheng Sheng et.al. | 2506.07022 | link |
2025-06-10 | Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test | Xiaoyuan Zhu et.al. | 2506.06975 | null |
2025-06-06 | Saffron-1: Towards an Inference Scaling Paradigm for LLM Safety Assurance | Ruizhong Qiu et.al. | 2506.06444 | link |
2025-06-06 | Small Models, Big Support: A Local LLM Framework for Teacher-Centric Content Creation and Assessment using RAG and CAG | Zarreen Reza et.al. | 2506.05925 | null |
2025-06-06 | To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt | Zhilong Wang et.al. | 2506.05739 | null |
2025-06-05 | Sentinel: SOTA model to protect against prompt injections | Dror Ivry et.al. | 2506.05446 | null |
2025-06-05 | Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets | Lei Hsiung et.al. | 2506.05346 | null |
2025-06-05 | HoliSafe: Holistic Safety Benchmarking and Modeling with Safety Meta Token for Vision-Language Model | Youngwan Lee et.al. | 2506.04704 | null |
2025-06-06 | TracLLM: A Generic Framework for Attributing Long Context LLMs | Yanting Wang et.al. | 2506.04202 | link |
2025-06-03 | Adversarial Attacks on Robotic Vision Language Action Models | Eliot Krzysztof Jones et.al. | 2506.03350 | link |
2025-06-03 | It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics | Matthew Kowal et.al. | 2506.02873 | null |
2025-06-03 | ATAG: AI-Agent Application Threat Assessment with Attack Graphs | Parth Atulbhai Gandhi et.al. | 2506.02859 | null |
2025-06-03 | From Prompts to Protection: Large Language Model-Enabled In-Context Learning for Smart Public Safety UAV | Yousef Emami et.al. | 2506.02649 | null |
2025-06-03 | BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage | Kalyan Nakka et.al. | 2506.02479 | link |
2025-06-03 | VPI-Bench: Visual Prompt Injection Attacks for Computer-Use Agents | Tri Cao et.al. | 2506.02456 | link |
2025-06-02 | ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs | Zeming Wei et.al. | 2506.01770 | link |
2025-06-02 | Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models | Youze Wang et.al. | 2506.01307 | null |
2025-06-01 | Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution | Meysam Alizadeh et.al. | 2506.01055 | null |
2025-06-01 | Predicting Empirical AI Research Outcomes with Language Models | Jiaxin Wen et.al. | 2506.00794 | null |
2025-06-01 | Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning | Weiyang Guo et.al. | 2506.00782 | null |
2025-05-30 | TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis | Xiaorui Wu et.al. | 2505.24672 | link |
2025-05-30 | Benchmarking Large Language Models for Cryptanalysis and Mismatched-Generalization | Utsav Maskey et.al. | 2505.24621 | null |
2025-05-30 | AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders | Yuqi Zhang et.al. | 2505.24519 | null |
2025-05-30 | Model Unlearning via Sparse Autoencoder Subspace Guided Projections | Xu Wang et.al. | 2505.24428 | null |
2025-05-30 | From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models | Haibo Jin et.al. | 2505.24232 | null |
2025-05-30 | SentinelAgent: Graph-based Anomaly Detection in Multi-Agent Systems | Xu He et.al. | 2505.24201 | null |
2025-05-29 | LLM Agents Should Employ Security Principles | Kaiyuan Zhang et.al. | 2505.24019 | null |
2025-05-29 | Securing AI Agents with Information-Flow Control | Manuel Costa et.al. | 2505.23643 | link |
2025-05-29 | Understanding Refusal in Language Models with Sparse Autoencoders | Wei Jie Yeo et.al. | 2505.23556 | link |
2025-05-29 | Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models | Mingyu Yu et.al. | 2505.23404 | null |
2025-05-28 | Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment | Krti Tallam et.al. | 2505.22852 | null |
2025-05-28 | Adaptive Detoxification: Safeguarding General Capabilities of LLMs through Toxicity-Aware Knowledge Editing | Yifan Lu et.al. | 2505.22298 | null |
2025-05-28 | Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models | Yongcan Yu et.al. | 2505.22271 | null |
2025-05-28 | Jailbreak Distillation: Renewable Safety Benchmarking | Jingyu Zhang et.al. | 2505.22037 | null |
2025-05-28 | RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments | Zeyi Liao et.al. | 2505.21936 | link |
2025-05-27 | Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation | Tharindu Kumarage et.al. | 2505.21784 | null |
2025-05-26 | Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts | Hee-Seon Kim et.al. | 2505.21556 | null |
2025-05-28 | Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space | Yao Huang et.al. | 2505.21277 | link |
2025-05-27 | Improved Representation Steering for Language Models | Zhengxuan Wu et.al. | 2505.20809 | link |
2025-05-26 | Holes in Latent Space: Topological Signatures Under Adversarial Influence | Aideen Fay et.al. | 2505.20435 | null |
2025-05-26 | Lifelong Safety Alignment for Language Models | Haoyu Wang et.al. | 2505.20259 | link |
2025-05-26 | Capability-Based Scaling Laws for LLM Red-Teaming | Alexander Panfilov et.al. | 2505.20162 | link |
2025-05-26 | Attention! You Vision Language Model Could Be Maliciously Manipulated | Xiaosen Wang et.al. | 2505.19911 | null |
2025-05-26 | What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs | Sangyeop Kim et.al. | 2505.19773 | null |
2025-05-26 | SGM: A Framework for Building Specification-Guided Moderation Filters | Masoomali Fatehkia et.al. | 2505.19766 | null |
2025-05-26 | VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models | Bingrui Sima et.al. | 2505.19684 | null |
2025-05-26 | JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models | Jiaxin Song et.al. | 2505.19610 | null |
2025-05-25 | GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization | Zixuan Chen et.al. | 2505.18979 | null |
2025-05-25 | Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations | Sanjay Kariyappa et.al. | 2505.18907 | null |
2025-05-24 | Security Concerns for Large Language Models: A Survey | Miles Q. Li et.al. | 2505.18889 | null |
2025-05-24 | Audio Jailbreak Attacks: Exposing Vulnerabilities in SpeechGPT in a White-Box Framework | Binhao Ma et.al. | 2505.18864 | link |
2025-05-23 | Survival Games: Human-LLM Strategic Showdowns under Severe Resource Scarcity | Zhihong Chen et.al. | 2505.17937 | link |
2025-05-23 | Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking? | Chengda Lu et.al. | 2505.17650 | null |
2025-05-23 | Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models | Jiawei Kong et.al. | 2505.17601 | null |
2025-05-23 | One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs | Linbao Li et.al. | 2505.17598 | link |
2025-05-23 | JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models | Zifan Peng et.al. | 2505.17568 | link |
2025-05-23 | Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models | Wenhan Chang et.al. | 2505.17519 | null |
2025-05-22 | Refusal Direction is Universal Across Safety-Aligned Languages | Xinpeng Wang et.al. | 2505.17306 | null |
2025-05-22 | In-Context Watermarks for Large Language Models | Yepeng Liu et.al. | 2505.16934 | null |
2025-05-22 | When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques | Jianing Geng et.al. | 2505.16765 | null |
2025-05-23 | Finetuning-Activated Backdoors in LLMs | Thibaud Gloaguen et.al. | 2505.16567 | link |
2025-05-22 | Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models | Zhaoxin Wang et.al. | 2505.16446 | null |
2025-05-22 | Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers | Viet-Anh Nguyen et.al. | 2505.16241 | null |
2025-05-22 | SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning | Kaiwen Zhou et.al. | 2505.16186 | null |
2025-05-21 | Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval | Taiye Chen et.al. | 2505.15753 | null |
2025-05-21 | Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses | Xiaoxue Yang et.al. | 2505.15738 | link |
2025-05-21 | Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries | Yuhao Wang et.al. | 2505.15420 | null |
2025-05-21 | Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models | Zirui Song et.al. | 2505.15406 | link |
2025-05-20 | SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment | Wonje Jeung et.al. | 2505.14667 | null |
2025-05-20 | sudoLLM : On Multi-role Alignment of Language Models | Soumadeep Saha et.al. | 2505.14607 | null |
2025-05-20 | Can Large Language Models Really Recognize Your Name? | Dzung Pham et.al. | 2505.14549 | link |
2025-05-20 | Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders | Agam Goyal et.al. | 2505.14536 | null |
2025-05-20 | Lessons from Defending Gemini Against Indirect Prompt Injections | Chongyang Shi et.al. | 2505.14534 | null |
2025-05-20 | Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs | Jiawen Wang et.al. | 2505.14368 | null |
2025-05-20 | Exploring Jailbreak Attacks on LLMs through Intent Concealment and Diversion | Tiehan Cui et.al. | 2505.14316 | null |
2025-05-20 | EVA: Red-Teaming GUI Agents via Evolving Indirect Prompt Injection | Yijie Lu et.al. | 2505.14289 | null |
2025-05-20 | "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs | Darpan Aswal et.al. | 2505.14226 | null |
2025-05-20 | AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models | Guangke Chen et.al. | 2505.14103 | null |
2025-05-19 | Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks | Narek Maloyan et.al. | 2505.13348 | null |
2025-05-19 | I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models | Alice Plebe et.al. | 2505.13302 | link |
2025-05-19 | The Hidden Dangers of Browsing AI Agents | Mykyta Mudryi et.al. | 2505.13076 | null |
2025-05-18 | BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation | Wenqi Lyu et.al. | 2505.12443 | null |
2025-05-18 | CAPTURE: Context-Aware Prompt Injection Testing and Robustness Enhancement | Gauri Kholkar et.al. | 2505.12368 | null |
2025-05-18 | The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models | Linghan Huang et.al. | 2505.12287 | null |
2025-05-17 | Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement | Peng Ding et.al. | 2505.12060 | link |
2025-05-17 | Multilingual Collaborative Defense for Large Language Models | Hongliang Li et.al. | 2505.11835 | link |
2025-05-17 | JULI: Jailbreak Large Language Models by Self-Introspection | Jesson Wang et.al. | 2505.11790 | null |
2025-05-16 | EnvInjection: Environmental Prompt Injection Attack to Multi-modal Web Agents | Xilong Wang et.al. | 2505.11717 | null |
2025-05-16 | ProxyPrompt: Securing System Prompts against Prompt Extraction Attacks | Zhixiong Zhuang et.al. | 2505.11459 | null |
2025-05-16 | CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs | Sijia Chen et.al. | 2505.11413 | null |
2025-05-16 | AutoRAN: Weak-to-Strong Jailbreaking of Large Reasoning Models | Jiacheng Liang et.al. | 2505.10846 | link |
2025-05-16 | LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs | Ran Li et.al. | 2505.10838 | null |
2025-05-15 | Dark LLMs: The Growing Threat of Unaligned AI Models | Michael Fire et.al. | 2505.10066 | null |
2025-05-15 | Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data | Adel ElZemity et.al. | 2505.09974 | null |
2025-05-16 | PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization | Yidan Wang et.al. | 2505.09921 | link |
2025-05-14 | Adversarial Attack on Large Language Models using Exponentiated Gradient Descent | Sajib Biswas et.al. | 2505.09820 | link |
2025-05-14 | Adversarial Suffix Filtering: a Defense Pipeline for LLMs | David Khachaturov et.al. | 2505.09602 | null |
2025-05-11 | TokenProber: Jailbreaking Text-to-image Models via Fine-grained Word Impact Analysis | Longtian Wang et.al. | 2505.08804 | null |
2025-05-13 | A Large-Scale Empirical Analysis of Custom GPTs' Vulnerabilities in the OpenAI Ecosystem | Sunday Oyinlola Ogundoyin et.al. | 2505.08148 | link |
2025-05-12 | Concept-Level Explainability for Auditing & Steering LLM Responses | Kenza Amara et.al. | 2505.07610 | link |
2025-05-12 | One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models | Haoran Gu et.al. | 2505.07167 | null |
2025-05-10 | Jailbreaking the Text-to-Video Generative Models | Jiayang Liu et.al. | 2505.06679 | null |
2025-05-10 | Practical Reasoning Interruption Attacks on Reasoning Large Language Models | Yu Cui et.al. | 2505.06643 | null |
2025-05-10 | Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model | Xinyue Lou et.al. | 2505.06538 | link |
2025-05-10 | System Prompt Poisoning: Persistent Attacks on Large Language Models Beyond User Injection | Jiawei Guo et.al. | 2505.06493 | null |
2025-05-08 | Defending against Indirect Prompt Injection by Instruction Detection | Tongyu Wen et.al. | 2505.06311 | link |
2025-05-09 | AgentXploit: End-to-End Redteaming of Black-Box AI Agents | Zhun Wang et.al. | 2505.05849 | null |
2025-05-12 | LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities | Kalyan Nakka et.al. | 2505.05619 | link |
2025-05-07 | Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs | Chetan Pathade et.al. | 2505.04806 | null |
2025-05-07 | Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems | Jian Cui et.al. | 2505.04799 | null |
2025-05-07 | A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models | Pedro Pinacho-Davidson et.al. | 2505.04784 | null |
2025-05-07 | The Aloe Family Recipe for Open and Specialized Healthcare LLMs | Dario Garcia-Gasulla et.al. | 2505.04388 | null |
2025-05-07 | Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety | Variath Madhupal Gautham Nair et.al. | 2505.04146 | null |
2025-05-06 | LlamaFirewall: An open source guardrail system for building secure AI agents | Sahana Chennabasappa et.al. | 2505.03574 | null |
2025-05-03 | Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs | Haoming Yang et.al. | 2505.02862 | null |
2025-05-04 | Open Challenges in Multi-Agent Security: Towards Secure Systems of Interacting AI Agents | Christian Schroeder de Witt et.al. | 2505.02077 | null |
2025-05-05 | Helping Large Language Models Protect Themselves: An Enhanced Filtering and Summarization System | Sheikh Samit Muhaimin et.al. | 2505.01315 | null |
2025-05-01 | OET: Optimization-based prompt injection Evaluation Toolkit | Jinsheng Pan et.al. | 2505.00843 | link |
2025-05-05 | The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them) | Zihao Wang et.al. | 2505.00626 | null |
2025-04-29 | HyPerAlign: Hypotheses-driven Personalized Alignment | Cristina Garbacea et.al. | 2505.00038 | null |
2025-04-30 | XBreaking: Explainable Artificial Intelligence for Jailbreaking LLMs | Marco Arazzi et.al. | 2504.21700 | null |
2025-04-30 | Hoist with His Own Petard: Inducing Guardrails to Facilitate Denial-of-Service Attacks on Retrieval-Augmented Generation of LLMs | Pan Suo et.al. | 2504.21680 | null |
2025-04-30 | The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning | Siyi Chen et.al. | 2504.21307 | null |
2025-04-29 | CachePrune: Neural-Based Attribution Defense Against Indirect Prompt Injection Attacks | Rui Wang et.al. | 2504.21228 | null |
2025-04-29 | ACE: A Security Architecture for LLM-Integrated App Systems | Evan Li et.al. | 2504.20984 | null |
2025-04-29 | AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security | Zikui Cai et.al. | 2504.20965 | link |
2025-04-29 | Chain-of-Defensive-Thought: Structured Reasoning Elicits Robustness in Large Language Models against Reference Corruption | Wenxiao Wang et.al. | 2504.20769 | null |
2025-04-29 | Token-Efficient Prompt Injection Attack: Provoking Cessation in LLM Reasoning via Adaptive Token Compression | Yu Cui et.al. | 2504.20493 | null |
2025-04-29 | Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction | Yulin Chen et.al. | 2504.20472 | null |
2025-04-29 | Inception: Jailbreak the Memory Mechanism of Text-to-Image Generation Systems | Shiqian Zhao et.al. | 2504.20376 | null |
2025-04-28 | Prompt Injection Attack to Tool Selection in LLM Agents | Jiawen Shi et.al. | 2504.19793 | null |
2025-04-29 | Security Steerability is All You Need | Itay Hazan et.al. | 2504.19521 | null |
2025-04-28 | JailbreaksOverTime: Detecting Jailbreak Attacks Under Distribution Shift | Julien Piet et.al. | 2504.19440 | link |
2025-04-27 | Small Models, Big Tasks: An Exploratory Empirical Study on Small Language Models for Function Calling | Ishan Kavathekar et.al. | 2504.19277 | link |
2025-04-26 | Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs | Mohammad Akbar-Tajari et.al. | 2504.19019 | link |
2025-04-22 | WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks | Ivan Evtimov et.al. | 2504.18575 | link |
2025-04-25 | Adversarial Attacks on LLM-as-a-Judge Systems: Insights from Prompt Injections | Narek Maloyan et.al. | 2504.18333 | null |
2025-04-23 | Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate | Senmao Qi et.al. | 2504.16489 | null |
2025-04-20 | Breaking the Prompt Wall (I): A Real-World Case Study of Attacking ChatGPT via Lightweight Prompt Injection | Xiangyu Chang et.al. | 2504.16125 | null |
2025-04-26 | T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models | Siyuan Liang et.al. | 2504.15512 | null |
2025-04-21 | MR. Guard: Multilingual Reasoning Guardrail using Curriculum Learning | Yahan Yang et.al. | 2504.15241 | null |
2025-04-20 | Prompt-Hacking: The New p-Hacking? | Thomas Kosch et.al. | 2504.14571 | null |
2025-04-20 | LLM-Enabled In-Context Learning for Data Collection Scheduling in UAV-assisted Sensor Networks | Yousef Emami et.al. | 2504.14556 | null |
2025-04-25 | Manipulating Multimodal Agents via Cross-Modal Prompt Injection | Le Wang et.al. | 2504.14348 | null |
2025-04-18 | DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification | Yu Li et.al. | 2504.13562 | null |
2025-04-15 | X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents | Salman Rahman et.al. | 2504.13203 | null |
2025-04-15 | Concept Enhancement Engineering: A Lightweight and Efficient Robust Defense Against Jailbreak Attacks in Embodied AI | Jirui Yang et.al. | 2504.13201 | null |
2025-04-17 | GraphAttack: Exploiting Representational Blindspots in LLM Safety Mechanisms | Sinan He et.al. | 2504.13052 | null |
2025-04-17 | ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition | Haidar Khan et.al. | 2504.12562 | link |
2025-04-14 | You've Changed: Detecting Modification of Black-Box Large Language Models | Alden Dima et.al. | 2504.12335 | null |
2025-04-15 | DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks | Yupei Liu et.al. | 2504.11358 | link |
2025-04-16 | Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails | William Hackett et.al. | 2504.11168 | null |
2025-04-15 | Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models | Jiangtao Liu et.al. | 2504.11106 | null |
2025-04-14 | The Jailbreak Tax: How Useful are Your Jailbreak Outputs? | Kristina Nikolić et.al. | 2504.10694 | link |
2025-04-14 | Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding | Tao Zhang et.al. | 2504.10465 | link |
2025-04-16 | LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks | Soumyadeep Pal et.al. | 2504.10185 | link |
2025-04-14 | RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability | Yichi Zhang et.al. | 2504.10081 | null |
2025-04-14 | StruPhantom: Evolutionary Injection Attacks on Black-Box Tabular Agents Powered by Large Language Models | Yang Feng et.al. | 2504.09841 | null |
2025-04-13 | The Structural Safety Generalization Problem | Julius Broomfield et.al. | 2504.09712 | link |
2025-04-13 | Mitigating Many-Shot Jailbreaking | Christopher M. Ackerman et.al. | 2504.09604 | null |
2025-04-13 | ControlNET: A Firewall for RAG-based LLM System | Hongwei Yao et.al. | 2504.09593 | null |
2025-04-13 | AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender | Weixiang Zhao et.al. | 2504.09466 | null |
2025-04-13 | SaRO: Enhancing LLM Safety through Reasoning-based Alignment | Yutao Mou et.al. | 2504.09420 | null |
2025-04-12 | Feature-Aware Malicious Output Detection and Mitigation | Weilong Dong et.al. | 2504.09191 | null |
2025-04-10 | Geneshift: Impact of different scenario shift on Jailbreaking LLM | Tianyi Wu et.al. | 2504.08104 | null |
2025-04-10 | Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge | Riccardo Cantini et.al. | 2504.07887 | link |
2025-04-10 | Defense against Prompt Injection Attacks via Mixture of Encodings | Ruiyi Zhang et.al. | 2504.07467 | link |
2025-04-09 | Bypassing Safety Guardrails in LLMs Using Humor | Pedro Cisneros-Velarde et.al. | 2504.06577 | null |
2025-04-08 | Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking | Junxi Chen et.al. | 2504.05838 | link |
2025-04-08 | Separator Injection Attack: Uncovering Dialogue Biases in Large Language Models Caused by Role Separators | Xitao Li et.al. | 2504.05689 | null |
2025-04-08 | Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking | Yu-Hang Wu et.al. | 2504.05652 | link |
2025-04-07 | A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models | Carlos Peláez-González et.al. | 2504.04976 | null |
2025-04-08 | Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models | Yubo Li et.al. | 2504.04717 | link |
2025-04-06 | StyleRec: A Benchmark Dataset for Prompt Recovery in Writing Style Transformation | Shenyang Liu et.al. | 2504.04373 | null |
2025-04-08 | JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model | Yi Nian et.al. | 2504.03770 | link |
2025-04-03 | More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment | Yifan Wang et.al. | 2504.02193 | null |
2025-04-02 | Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses | Zhengchun Shang et.al. | 2504.02080 | null |
2025-04-02 | Representation Bending for Large Language Model Safety | Ashkan Yousefpour et.al. | 2504.01550 | link |
2025-04-02 | LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution | Zhuoran Yang et.al. | 2504.01533 | null |
2025-04-07 | PiCo: Jailbreaking Multimodal Large Language Models via $\textbf{Pi}$ctorial | Aofan Liu et.al. | 2504.01444 | null |
2025-04-07 | Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks | Jiawei Wang et.al. | 2504.01308 | link |
2025-04-02 | Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning | Si Chen et.al. | 2504.01278 | null |
2025-04-01 | Multilingual and Multi-Accent Jailbreaking of Audio LLMs | Jaechul Roh et.al. | 2504.01094 | null |
2025-04-01 | Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics | Shide Zhou et.al. | 2504.00446 | null |
2025-03-31 | Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms | Shuoming Zhang et.al. | 2503.24191 | null |
2025-03-29 | Encrypted Prompt: Securing LLM Applications Against Unauthorized Actions | Shih-Han Chan et.al. | 2503.23250 | null |
2025-03-27 | Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing | Johan Wahréus et.al. | 2503.21598 | null |
2025-03-27 | Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection | Ryan Marinelli et.al. | 2503.21464 | link |
2025-03-26 | Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy | Joonhyun Jeong et.al. | 2503.20823 | link |
2025-03-26 | Iterative Prompting with Persuasion Skills in Jailbreaking Large Language Models | Shih-Wen Ke et.al. | 2503.20320 | null |
2025-03-26 | sudo rm -rf agentic_security | Sejin Lee et.al. | 2503.20279 | link |
2025-03-24 | MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks | Wenhao You et.al. | 2503.19134 | null |
2025-03-23 | SRMIR: Shadow Reward Models Based on Introspective Reasoning for LLM Alignment | Ruoxi Cheng et.al. | 2503.18991 | null |
2025-03-24 | Defeating Prompt Injections by Design | Edoardo Debenedetti et.al. | 2503.18813 | null |
2025-03-23 | Metaphor-based Jailbreaking Attacks on Text-to-Image Models | Chenyu Zhang et.al. | 2503.17987 | null |
2025-03-23 | Smoke and Mirrors: Jailbreaking LLM-based Code Generation via Implicit Malicious Prompts | Sheng Ouyang et.al. | 2503.17953 | null |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-07-21 | AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming | Jierui Li et.al. | 2507.15378 | null |
2025-07-16 | When Retriever Meets Generator: A Joint Model for Code Comment Generation | Tien P. T. Le et.al. | 2507.12558 | null |
2025-07-07 | Unified Framework for Quantum Code Embedding | Andrew C. Yuan et.al. | 2507.05361 | null |
2025-05-27 | Semi-supervised Clustering Through Representation Learning of Large-scale EHR Data | Linshanshan Wang et.al. | 2505.20731 | null |
2025-05-19 | Towards A Generalist Code Embedding Model Based On Massive Data Synthesis | Chaofan Li et.al. | 2505.12697 | link |
2025-05-31 | Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes | Xueqing Liu et.al. | 2503.22935 | null |
2025-07-17 | OASIS: Order-Augmented Strategy for Improved Code Search | Zuchen Gao et.al. | 2503.08161 | null |
2025-03-10 | Assessing Uncertainty in Stock Returns: A Gaussian Mixture Distribution-Based Method | Yanlong Wang et.al. | 2503.06929 | null |
2025-06-02 | LoRACode: LoRA Adapters for Code Embeddings | Saumya Chaturvedi et.al. | 2503.05315 | null |
2025-03-07 | Extended Controllability Tests for Quantum Decoherence-Free Subspaces | Eric B. Kopp et.al. | 2503.05155 | null |
2025-02-21 | GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer | Yufan Ye et.al. | 2502.15202 | null |
2025-03-16 | Poisoned Source Code Detection in Code Models | Ehab Ghannoum et.al. | 2502.13459 | null |
2025-02-07 | EnseSmells: Deep ensemble and programming language models for automated code smells detection | Anh Ho et.al. | 2502.05012 | link |
2025-03-26 | Intelligent Code Embedding Framework for High-Precision Ransomware Detection via Multimodal Execution Path Analysis | Levi Gareth et.al. | 2501.15836 | null |
2024-12-18 | Transducer Tuning: Efficient Model Adaptation for Software Tasks Using Code Property Graphs | Imam Nur Bani Yusuf et.al. | 2412.13467 | link |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-07-08 | Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms | Tarek Gasmi et.al. | 2507.06323 | null |
2025-07-05 | We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems | Zhihao Li et.al. | 2507.06250 | null |
2025-06-27 | Conversational LLMs Simplify Secure Clinical Data Access, Understanding, and Analysis | Rafi Al Attrach et.al. | 2507.01053 | null |
2025-07-01 | VTS-Guided AI Interaction Workflow for Business Insights | Sun Ding et.al. | 2507.00347 | null |
2025-06-30 | A Large-Scale Evolvable Dataset for Model Context Protocol Ecosystem and Security Analysis | Zhiwei Lin et.al. | 2506.23474 | null |
2025-06-29 | From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows | Mohamed Amine Ferrag et.al. | 2506.23260 | null |
2025-06-18 | RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments | Yuchuan Fu et.al. | 2506.15253 | link |
2025-06-08 | Personalized Constitutionally-Aligned Agentic Superego: Secure AI Behavior Aligned to Diverse Human Values | Nell Watson et.al. | 2506.13774 | null |
2025-06-20 | Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers | Mohammed Mehedi Hasan et.al. | 2506.13538 | link |
2025-06-12 | QuantMCP: Grounding Large Language Models in Verifiable Financial Reality | Yifan Zeng et.al. | 2506.06622 | null |
2025-05-26 | Survey of LLM Agent Communication with MCP: A Software Design Pattern Centric Review | Anjana Sarkar et.al. | 2506.05364 | null |
2025-06-05 | Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem | Hao Song et.al. | 2506.02040 | link |
2025-06-02 | ETDI: Mitigating Tool Squatting and Rug Pull Attacks in Model Context Protocol (MCP) by using OAuth-Enhanced Tool Definitions and Policy-Based Access Control | Manish Bhatt et.al. | 2506.01333 | null |
2025-05-30 | Chances and Challenges of the Model Context Protocol in Digital Forensics and Incident Response | Jan-Niclas Hilgert et.al. | 2506.00274 | null |
2025-05-27 | ADA: Automated Moving Target Defense for AI Workloads via Ephemeral Infrastructure-Native Rotation in Kubernetes | Akram Sheriff et.al. | 2505.23805 | null |
2025-05-29 | MCP Safety Training: Learning to Refuse Falsely Benign MCP Exploits using Improved Preference Alignment | John Halloran et.al. | 2505.23634 | null |
2025-05-28 | AgentDNS: A Root Domain Naming System for LLM Agents | Enfang Cui et.al. | 2505.22368 | null |
2025-05-23 | Gaming Tool Preferences in Agentic LLMs | Kazem Faghih et.al. | 2505.18135 | link |
2025-05-22 | Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models | Junjie Xiong et.al. | 2505.16957 | null |
2025-05-16 | MPMA: Preference Manipulation Attack Against Model Context Protocol | Zihan Wang et.al. | 2505.11154 | null |
2025-05-06 | From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems | Qiaomu Li et.al. | 2505.03864 | null |
2025-05-23 | A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP) | Abul Ehtesham et.al. | 2505.02279 | null |
2025-04-28 | Simplified and Secure MCP Gateways for Enterprise AI Integration | Ivo Brett et.al. | 2504.19997 | link |
2025-04-28 | Securing GenAI Multi-Agent Systems Against Tool Squatting: A Zero Trust Registry-Based Approach | Vineeth Sai Narajala et.al. | 2504.19951 | null |
2025-04-28 | From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review | Mohamed Amine Ferrag et.al. | 2504.19678 | null |
2025-05-02 | Building A Secure Agentic AI Application Leveraging A2A Protocol | Idan Habler et.al. | 2504.16902 | null |
2025-05-19 | MCP Guardian: A Security-First Layer for Safeguarding MCP-Based AI System | Sonu Kumar et.al. | 2504.12757 | null |
2025-04-11 | MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers | Arash Ahmadi et.al. | 2504.08999 | null |
2025-05-02 | Enterprise-Grade Security for the Model Context Protocol (MCP): Frameworks and Mitigation Strategies | Vineeth Sai Narajala et.al. | 2504.08623 | null |
2025-04-11 | MCP Safety Audit: LLMs with the Model Context Protocol Allow Major Security Exploits | Brandon Radosevich et.al. | 2504.03767 | link |
2025-04-06 | Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions | Xinyi Hou et.al. | 2503.23278 | null |
Publish Date | Title | Authors | arXiv ID | Code |
---|---|---|---|---|
2025-06-24 | FuncVul: An Effective Function Level Vulnerability Detection Model using LLM and Code Chunk | Sajal Halder et.al. | 2506.19453 | null |
2025-05-30 | When GPT Spills the Tea: Comprehensive Assessment of Knowledge File Leakage in GPTs | Xinyue Shen et.al. | 2506.00197 | null |
2025-07-15 | Seven Security Challenges That Must be Solved in Cross-domain Multi-agent LLM Systems | Ronny Ko et.al. | 2505.23847 | null |
2025-05-27 | JavaSith: A Client-Side Framework for Analyzing Potentially Malicious Extensions in Browsers, VS Code, and NPM Packages | Avihay Cohen et.al. | 2505.21263 | null |
2025-06-30 | LibVulnWatch: A Deep Assessment Agent System and Leaderboard for Uncovering Hidden Vulnerabilities in Open-Source AI Libraries | Zekun Wu et.al. | 2505.08842 | null |
2025-05-07 | Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems | Jian Cui et.al. | 2505.04799 | null |
2025-05-02 | A Rusty Link in the AI Supply Chain: Detecting Evil Configurations in Model Repositories | Ziqi Ding et.al. | 2505.01067 | null |
2025-04-29 | Understanding Large Language Model Supply Chain: Structure, Domain, and Vulnerabilities | Yanzhe Hu et.al. | 2504.20763 | null |
2025-04-24 | Automatically Generating Rules of Malicious Software Packages via Large Language Model | XiangRui Zhang et.al. | 2504.17198 | null |
2025-03-27 | Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation | Rafiqul Rabin et.al. | 2503.22760 | null |
2025-05-26 | The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs | Peipei Liu et.al. | 2503.07215 | null |
2025-02-18 | SoK: Understanding Vulnerabilities in the Large Language Model Supply Chain | Shenao Wang et.al. | 2502.12497 | null |
2025-01-31 | Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities | Arjun Krishna et.al. | 2501.19012 | null |
2024-12-26 | Integrating Artificial Open Generative Artificial Intelligence into Software Supply Chain Security | Vasileios Alevizos et.al. | 2412.19088 | null |
2024-12-23 | Emerging Security Challenges of Large Language Models | Herve Debar et.al. | 2412.17614 | null |
2024-12-22 | Enhancing Supply Chain Transparency in Emerging Economies Using Online Contents and LLMs | Bohan Jin et.al. | 2412.16922 | null |
2024-12-18 | RAG for Effective Supply Chain Security Questionnaire Automation | Zaynab Batool Reza et.al. | 2412.13988 | null |
2025-03-30 | Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors | Yuefeng Peng et.al. | 2411.01705 | null |
2024-11-03 | Large Language Model Supply Chain: Open Problems From the Security Perspective | Qiang Hu et.al. | 2411.01604 | null |