Skip to content

Latest commit

 

History

History
430 lines (221 loc) · 61.2 KB

File metadata and controls

430 lines (221 loc) · 61.2 KB

The Expressiveness-Vulnerability Identity: On the Inseparability of Linguistic Competence and Linguistic Vulnerability in Language-Processing Systems

[da5ch0] language. all of it. everything. all at once. and at the same time. perhaps the biggest vuln there is.

Abstract. Prompt injection — the ability to alter the behavior of a language-processing system by embedding adversarial instructions within its input — is widely regarded as the most critical vulnerability in large language model (LLM) applications. Existing analyses attribute this vulnerability to specific architectural properties of transformer models: the absence of a privileged instruction channel, the shared processing of system prompts and user input, or the statistical nature of alignment training. We argue these analyses are insufficiently general. Prompt injection is not a property of transformers; it is a property of language. We present the Expressiveness-Vulnerability Identity (EVI): the thesis that the expressiveness of any language-processing system and its susceptibility to adversarial linguistic input are not independent properties but a single property viewed from two perspectives. We formalize this claim through three convergent arguments. First, we connect the proven Turing-completeness of transformer architectures to Rice's theorem, establishing that detecting prompt injection is undecidable in the general case. Second, we extend results from the adversarial robustness literature to show that fundamental impossibility bounds on adversarial classification apply to any prompt injection detector. Third, and most originally, we argue that the self-referential, ambiguous, and context-dependent properties that make natural language expressive are identical to the properties that make it an attack vector — and that this identity holds for any language processor of sufficient fidelity, regardless of substrate. The practical implication is stark: no architectural redesign, training methodology, or filtering mechanism can eliminate prompt injection without proportionally reducing the system's linguistic competence. We discuss implications for the CVE vulnerability classification process, AI security policy, and the design of language-processing systems.

Keywords: prompt injection, language model security, adversarial robustness, impossibility results, expressiveness-vulnerability tradeoff, AI safety


1. Introduction

In September 2022, when the term "prompt injection" was first coined to describe adversarial manipulation of GPT-3 through natural language inputs [Willison 2022], the immediate analogy was to SQL injection — a vulnerability class that had plagued web applications for two decades before being effectively solved through parameterized queries. The implication was hopeful: prompt injection, like SQL injection before it, was an engineering problem awaiting an engineering solution.

Three years later, no solution has materialized. Despite intense research effort, prompt injection remains the number-one vulnerability in the OWASP Top 10 for LLM Applications across both its 2023 and 2025 editions [OWASP 2025]. A landmark 2025 study by researchers spanning OpenAI, Anthropic, and Google DeepMind demonstrated that twelve published defenses — the majority of which originally reported near-zero attack success rates — could be bypassed with greater than 90% success using adaptive attacks [Nasr et al. 2025]. OpenAI's own Chief Information Security Officer has publicly described prompt injection as "a frontier, unsolved security problem." The UK National Cyber Security Centre, in an unusually stark assessment, declared LLMs to be not merely confused deputies — a classical vulnerability category — but "inherently confusable" systems whose susceptibility to adversarial input "can't be mitigated" [NCSC 2025].

The existing literature offers several explanations for this persistent intractability. The most common identifies the instruction-data conflation problem: transformer-based LLMs process all tokens — system prompts, user inputs, and retrieved external content — through identical self-attention mechanisms, with no hardware-enforced or cryptographic boundary between trust levels [Harang 2023; Greshake et al. 2023]. Others point to the statistical nature of alignment training, which can only attenuate rather than eliminate undesired behaviors [Wolf et al. 2024]. Still others demonstrate that prompt injection detection is at least as hard as known undecidable problems [Glukhov et al. 2024].

These analyses are individually sound. But they share a common limitation: they locate the vulnerability in the implementation — in the architecture of transformers, in the methodology of RLHF, in the design of particular systems. This framing implies that a sufficiently clever redesign might resolve the problem. A new architecture, a better training procedure, a more principled separation of instructions from data — the solution, on this view, is waiting to be engineered.

We argue this framing is fundamentally wrong.

The vulnerability does not originate in the transformer. It does not originate in any particular architecture. It originates in language itself. Specifically, we contend that the properties of natural language that make it expressively powerful — self-reference, ambiguity, context-dependence, compositionality, the capacity to construct novel meaning from existing elements, the capacity to refer to and modify the conditions of its own interpretation — are identical to the properties that make it an attack vector. There is no subset of natural language that is both safe to process and sufficient for general-purpose understanding. The expressiveness surface and the attack surface are the same surface.

We call this the Expressiveness-Vulnerability Identity (EVI): any system sophisticated enough to faithfully process natural language is necessarily sophisticated enough to be redirected by natural language, because linguistic competence and linguistic vulnerability are not two properties that happen to correlate but a single property described from two perspectives.

The contribution of this paper is threefold:

  1. A formal impossibility argument connecting transformer Turing-completeness to Rice's theorem and adversarial robustness bounds, establishing that prompt injection detection is undecidable and that defenses face provable limitations (Section 3).

  2. The Expressiveness-Vulnerability Identity, a theoretical framework arguing that the impossibility of fully solving prompt injection is not a contingent fact about current architectures but a necessary consequence of what language is and what language processing requires (Section 4).

  3. Empirical demonstrations supporting the theoretical argument, drawn from both adversarial testing of deployed systems and analysis of the structural parallels between prompt injection and other data-code confusion vulnerabilities across the history of computing (Section 5).

The practical implication is that the security community must stop treating prompt injection as a bug to be fixed and begin treating it as a fundamental constraint to be managed — analogous to the impossibility of building a perpetual motion machine or solving the halting problem. We discuss what this means for vulnerability classification, regulatory frameworks, and the responsible deployment of language-processing systems (Section 6).


2. Background and Related Work

2.1 Prompt Injection: Definition and Taxonomy

Prompt injection refers to the class of attacks in which an adversary causes a language model to deviate from its intended behavior by embedding adversarial instructions within input the model processes. The taxonomy distinguishes two primary forms.

Direct prompt injection occurs when the adversary's input is provided directly through the user-facing input channel. The adversary crafts input that overrides, supplements, or subverts system-level instructions. The canonical example is the "Ignore previous instructions" attack documented by Perez and Ribeiro [2022], which demonstrated that simple natural language commands could cause GPT-3 to abandon its assigned task.

Indirect prompt injection, formalized by Greshake et al. [2023], occurs when adversarial content is embedded in data the model retrieves or processes from external sources — web pages, documents, emails, database entries — rather than provided by the user directly. This form is particularly dangerous in agentic applications where models autonomously consume external content.

Both forms exploit the same root mechanism: the language model cannot reliably distinguish between legitimate instructions and adversarial input, because both are represented as natural language token sequences processed through identical computational pathways.

2.2 The Instruction-Data Conflation Problem

The security community's primary explanation for prompt injection's persistence centers on what we term the instruction-data conflation problem. In classical computing, secure architectures maintain separation between code and data through hardware-enforced mechanisms: the NX bit prevents execution of data pages, parameterized queries separate SQL commands from user-supplied values, same-origin policies prevent untrusted scripts from accessing privileged contexts.

Transformer-based language models possess no equivalent mechanism. System prompts, user inputs, and retrieved content are concatenated into a single token sequence and processed through the same self-attention layers. As Harang [2023] stated: "At a broader level, the core issue is that, contrary to standard security best practices, 'control' and 'data' planes are not separable when working with LLMs." The NCSC [2025] sharpened this framing: in LLM processing, "there's no distinction made between 'data' or 'instructions'; there is only ever 'next token.'"

Several research groups have attempted to engineer instruction-data separation post hoc. Wallace et al. [2024] proposed training models with an explicit instruction hierarchy, achieving improved robustness in GPT-3.5 — but the Nasr et al. [2025] adaptive attack study showed these defenses could be bypassed at >90% success rates. Debenedetti et al. [2025] introduced CaMeL, which enforces security guarantees through an external capability-based system that wraps the LLM, effectively conceding that the model itself cannot be made immune and that security must be imposed from outside. Beurer-Kellner et al. [2025] cataloged six design patterns for securing LLM agents, noting explicitly that these patterns "constrain the actions of agents to explicitly prevent them from solving arbitrary tasks" — a frank acknowledgment that security and capability trade off against each other.

The instruction-data conflation analysis is correct but, we argue, incomplete. It treats the conflation as an architectural choice — something a future architecture might avoid. We contend it is better understood as a necessary consequence of processing natural language at all.

2.3 Turing-Completeness and Undecidability

A chain of theoretical results establishes that transformer architectures are computationally universal. Pérez, Marinković, and Barceló [2019; 2021] proved that transformers can simulate arbitrary Turing machines. Yun et al. [2020] showed transformers are universal approximators of sequence-to-sequence functions. Roberts [2023] extended Turing-completeness proofs to decoder-only architectures (the GPT family). Li and Wang [2025] proved that even constant bit-size transformers achieve Turing-completeness when equipped with chain-of-thought reasoning.

By Rice's theorem [1953], for any Turing-complete system, no algorithm can decide in general whether the system satisfies a non-trivial semantic property. Glukhov et al. [2024] applied this reasoning to LLM output censorship, proving it is formally undecidable whether an LLM's output satisfies arbitrary semantic criteria. Their "Mosaic Prompting" attack demonstrated that individually permissible outputs from separate contexts can be recombined to reconstruct impermissible content.

These results establish that detecting prompt injection — determining whether an arbitrary input to a Turing-complete language model will cause undesired behavior — is undecidable in the general case. However, no prior work has connected this undecidability to a broader impossibility that transcends the transformer architecture.

2.4 Adversarial Robustness: Impossibility Results

The adversarial machine learning literature establishes fundamental, architecture-independent limits on the robustness of learned classifiers. Shafahi et al. [2019] proved that for certain problem classes, adversarial examples are mathematically inescapable in high-dimensional spaces. Gilmer et al. [2018] demonstrated a fundamental tradeoff: any model misclassifying even a small fraction of inputs in high dimensions is vulnerable to adversarial perturbation. Tsipras et al. [2019] formally proved that adversarial robustness and standard accuracy are inherently at odds — robust and non-robust classifiers learn fundamentally different representations. Fawzi, Fawzi, and Fawzi [2018] derived classifier-agnostic bounds showing high-dimensional data makes any classifier vulnerable to small perturbations. Dohmatob [2019] proved a generalized no-free-lunch theorem for adversarial robustness.

Most fundamentally for our argument, Ilyas et al. [2019] demonstrated that adversarial examples exploit non-robust features: patterns that are statistically predictive but not aligned with human-interpretable categories. This reframes adversarial vulnerability not as a bug but as an inherent consequence of how models extract information from high-dimensional data. Adversarial examples are not accidents of insufficiently robust training; they are features of the learned representation.

2.5 Limitations of Alignment and Safety Training

Multiple results demonstrate that current safety mechanisms provide statistical improvement but cannot offer formal guarantees. Wolf et al. [2024] proved through the Behavior Expectation Bounds framework that for any behavior with finite probability of being exhibited, adversarial prompts can elicit it with probability increasing with prompt length. Qi et al. [2024] showed RLHF alignment is "shallow," primarily adapting the model's output distribution over only the first few tokens. Casper et al. [2023] surveyed over 250 papers identifying systematic limitations of RLHF. Wei, Haghtalab, and Steinhardt [2023] identified two fundamental failure modes — competing objectives and mismatched generalization — both inherent to the training paradigm.

2.6 Historical Parallels

Prompt injection is the latest instance of a recurring pattern in computing: every architecture that fails to separate code from data eventually produces an injection vulnerability class. Buffer overflows exploit code and data sharing the same memory space (von Neumann architecture). SQL injection exploits command and data strings being concatenated. Cross-site scripting exploits trusted and untrusted content sharing the same browser context. Phone phreaking exploited in-band signaling, where control tones shared the same channel as voice.

Each of these was eventually addressed through architectural separation: memory protection, parameterized queries, content security policies, out-of-band signaling. The critical question is whether an analogous separation is possible for natural language processing. We argue it is not, because natural language — unlike SQL, unlike machine code, unlike telephony control tones — is inherently self-referential and inherently conflates instruction and data at the level of the medium itself.


3. Formal Argument: The Impossibility of Complete Prompt Injection Defense

3.1 Definitions

Definition 1 (Language-Processing System). A language-processing system S is a function S: LL that maps natural language input sequences to natural language output sequences, where L is the set of all finite sequences over a natural language vocabulary.

Definition 2 (Intended Behavior Specification). An intended behavior specification Φ for a system S is a predicate over input-output pairs: Φ(x, S(x)) evaluates to true if and only if the system's output S(x) on input x conforms to the designer's intended behavior.

Definition 3 (Prompt Injection). An input x constitutes a successful prompt injection against system S with specification Φ if and only if ¬Φ(x, S(x)) — the system's output on x violates the intended behavior specification.

Definition 4 (Prompt Injection Detector). A prompt injection detector D is a function D: L → {accept, reject} that attempts to classify inputs as safe or adversarial before they are processed by S.

3.2 Undecidability of Prompt Injection Detection

Theorem 1. For any Turing-complete language-processing system S and any non-trivial intended behavior specification Φ, the problem of determining whether an arbitrary input x constitutes a successful prompt injection (i.e., whether ¬Φ(x, S(x))) is undecidable.

Proof sketch. By the established results of Pérez et al. [2019; 2021], Roberts [2023], and Li and Wang [2025], transformer-based language models are Turing-complete. Given this, the behavior of the system S on an arbitrary input x — that is, the output S(x) — is an unrestricted computation. The property ¬Φ(x, S(x)) is a non-trivial semantic property of the input-output mapping of a Turing-complete system. By Rice's theorem [1953], no algorithm can decide non-trivial semantic properties of Turing-complete systems in general. Therefore, no algorithm can determine, for all inputs x, whether ¬Φ(x, S(x)) holds. ∎

Corollary 1. No prompt injection detector D can achieve both perfect precision and perfect recall. For any detector D, either there exist successful prompt injections that D fails to reject (false negatives), or there exist benign inputs that D incorrectly rejects (false positives), or both.

This result is not merely theoretical. Nasr et al. [2025] empirically demonstrated its practical consequence: twelve defenses that appeared to achieve near-perfect detection under static evaluation were defeated at >90% success rates by adaptive attackers. The gap between reported and actual security is a predictable consequence of undecidability — any finite evaluation set will underestimate the space of successful attacks.

3.3 Adversarial Bounds on Detection

Even restricting attention to detectors that sacrifice perfect recall (accepting some false negatives), the adversarial robustness literature establishes fundamental limits on achievable detection rates.

Proposition 1. Any prompt injection detector based on a learned classifier is subject to the robustness-accuracy tradeoff established by Tsipras et al. [2019]. Improving detection accuracy against adversarial prompts requires learning representations that are fundamentally different from those optimized for accurate language understanding — implying that detection accuracy and language comprehension accuracy trade off against each other.

Proposition 2. By the results of Fawzi, Fawzi, and Fawzi [2018] and Shafahi et al. [2019], in the high-dimensional input space of natural language, any classifier with non-zero error rate is vulnerable to adversarial perturbations. Since prompt injection detection is a classification task in this space, and no classifier can achieve zero error (by Theorem 1), all detectors have adversarial blind spots.

Proposition 3. By the result of Bubeck, Lee, Price, and Razenshteyn [2019], even when robust classifiers theoretically exist, learning them may be computationally intractable. Adversarial vulnerability can persist not because robust classification is information-theoretically impossible but because it is computationally infeasible.

Together, these results establish that prompt injection detection faces not one but three layers of impossibility: undecidability (no perfect detector exists), adversarial robustness bounds (any imperfect detector has exploitable blind spots), and computational intractability (even theoretically possible defenses may be unlearnable in practice).

3.4 The Defense-Capability Tradeoff

We now formalize an observation that has appeared informally throughout the literature: securing a language-processing system against prompt injection necessarily reduces its capability.

Theorem 2 (Informal). For any language-processing system S, any prompt injection defense that reduces susceptibility to adversarial input necessarily reduces the set of legitimate inputs to which S can respond correctly.

Argument. A defense mechanism operates by restricting the system's responsiveness to certain classes of input. These restrictions can take the form of input filters (rejecting inputs matching adversarial patterns), output constraints (suppressing outputs that match harmful patterns), capability limitations (preventing the system from performing certain actions), or behavioral training (biasing the model away from certain responses).

Each of these mechanisms functions by reducing the space of input-output mappings the system can realize. But the adversary's inputs are drawn from the same language as legitimate inputs — they are not syntactically distinguishable. Therefore, any restriction that excludes adversarial inputs from the system's effective input space must also exclude some legitimate inputs that share linguistic features with the adversarial ones.

This is the formal analog of Beurer-Kellner et al.'s [2025] observation that their security design patterns "constrain agents to explicitly prevent them from solving arbitrary tasks," and of Tsipras et al.'s [2019] proof that robust and accurate classifiers learn fundamentally different representations. The CaMeL system [Debenedetti et al. 2025] concretely illustrates this: it achieved provable security on 77% of benchmark tasks, down from 84% without defenses — a measurable capability cost for each unit of security gained.


4. The Expressiveness-Vulnerability Identity

The formal results of Section 3 establish impossibility within the framework of Turing-complete computational systems. But they leave open the possibility that the problem is specific to current architectures — that a future, non-Turing-complete but still useful language-processing system might avoid these limitations. This section argues that the impossibility is more fundamental: it is a property of language itself.

4.1 The Properties of Language That Create Expressiveness

Natural language derives its expressive power from a specific set of structural properties. We enumerate seven, though these are not independent — they interact and reinforce each other.

Self-reference. Language can refer to itself. Sentences can describe sentences. Instructions can modify instructions. Words can redefine words. This property enables metalinguistic discourse, abstraction, logical reasoning, and reflective thought.

Ambiguity. Words and constructions carry multiple meanings that are disambiguated by context. This property enables economy of expression, metaphor, humor, poetry, double entendre, and the ability to convey layers of meaning simultaneously.

Context-dependence. The interpretation of any utterance depends on the context in which it appears — the preceding discourse, the participants' shared knowledge, the social situation, the physical environment. This property enables pragmatic communication, implicature, relevance-driven interpretation, and efficiency.

Compositionality. The meaning of complex expressions is built from the meanings of their parts and the rules of combination. This property enables the generation of infinitely many novel meanings from a finite vocabulary and grammar.

Performativity. Language does not merely describe; it acts. Promises, commands, declarations, questions, and requests all change the state of the world or the state of the discourse through the act of utterance. This property enables coordination, social action, and institutional reality.

Open-endedness. Language can express any concept that can be conceptualized, including concepts that have never been previously expressed. This property enables scientific discovery, creative expression, and the extension of thought beyond its current boundaries.

Paralinguistic expression. Language conveys meaning not only through propositional content but through channels that carry no denotative semantics at all — tone, register, affect, stance, and relational signaling. In spoken language, this manifests as prosody, intonation, and co-speech gesture; in written digital communication, it manifests as emoji, emoticons, formatting choices, and stylistic register. Linguists Gawne and McCulloch [2019] classify emoji as "digital gestures" — not a language per se but a paralinguistic system analogous to co-speech gesture, performing communicative functions of substitution, reinforcement, contradiction, metacommentary, emphasis, and discourse management. Grosz, Greenberg, and colleagues [2023] formalized a subset of this phenomenon in Linguistics and Philosophy, demonstrating that face emoji contribute "not-at-issue content" — backgrounded evaluative meaning that accompanies but does not constitute the main assertion, analogous to the role of alas or frankly in spoken language. Dresner and Herring [2010] demonstrated the related finding that emoticons function as illocutionary force indicators: they signal what kind of speech act is being performed (joking, sincere, sarcastic) rather than contributing propositional content. This property enables the communication of speaker attitude, social positioning, irony, sincerity, and the entire affective dimension of human interaction. Whether paralinguistic expression is fully independent of the preceding six properties or emerges from their interaction remains an open question; we enumerate it separately because it operates through a semiotic channel — visual, gestural, tonal — that is structurally distinct from the propositional channel, and because, as we show in Section 4.2, its vulnerability profile is correspondingly distinct.

4.2 Why These Same Properties Create Vulnerability

Each property that makes language expressively powerful simultaneously creates an avenue for adversarial exploitation.

Self-reference creates injection. Because language can refer to itself, an input can contain instructions that refer to and override the system's own instructions. "Ignore your previous instructions" is not a hack exploiting a bug — it is a grammatically valid, semantically coherent use of language's self-referential capacity. Any system that can process self-referential language can process self-referential language that targets its own conditioning.

Ambiguity creates confusion. Because words carry multiple meanings, an adversary can construct inputs where the intended interpretation is benign but an alternative interpretation — one the system may select in a particular context — is adversarial. Homoglyphs, homophones, double meanings, and syntactic ambiguity all provide channels for smuggling adversarial content through filters that evaluate only one interpretation.

Context-dependence creates manipulation. Because interpretation depends on context, an adversary can construct contexts in which ordinarily benign inputs acquire adversarial force. Gradual context-building across a conversation, strategic framing, and the exploitation of the system's own previous outputs as context — these are not engineering failures but natural consequences of how context shapes meaning.

Compositionality creates evasion. Because complex meanings are built from simpler parts, an adversary can decompose an adversarial instruction into individually innocuous components that compose into the intended attack when the system processes them together. Glukhov et al.'s [2024] "Mosaic Prompting" attack is a formal demonstration of this principle.

Performativity creates action. Because language acts on the world, adversarial language can compel the system to take adversarial actions — not merely produce adversarial text. In agentic settings where models execute code, send emails, or modify files, the performative dimension of language transforms linguistic vulnerability into operational vulnerability.

Open-endedness creates novelty. Because language can express any concept, the space of possible attacks is unbounded. Every defense is trained or designed against known attack patterns; language's open-endedness guarantees that novel attacks can always be formulated that fall outside the training distribution.

Paralinguistic expression creates register manipulation. Because language conveys meaning through non-propositional channels, a system's behavior can be altered by shifting its affective register rather than by injecting explicit instructions. Critically, this vulnerability class does not require adversarial intent. In April 2025, OpenAI deployed a GPT-4o update whose system prompt instructed the model to "match the user's vibe, tone, and generally how they are speaking" — an increase in paralinguistic competence. The result was a widely documented sycophancy crisis: the model became excessively agreeable, emoji-heavy, and substantively shallow, producing what OpenAI later acknowledged were "responses that were overly supportive but disingenuous." The update was rolled back within four days. Sahler and Jentzsch [2025] demonstrated the underlying mechanism experimentally: models mimic not only the emotional tone of inputs but also implicit stylistic elements including emoji usage and informal register, creating feedback loops in which casual input produces casual output, which encourages further casual input, progressively degrading substantive exchange. Doran, Martin, and Zappavigna [2025], analyzing one million real-world chatbot conversations spanning 25 chatbot systems and 210,000 unique users, confirmed this pattern at scale: chatbot emoji use becomes "formulaic, leading to overly cheerful or sycophantic responses." This vulnerability — which we term affective injection — is structurally distinct from instruction injection. It operates through the paralinguistic channel, exploits tone-matching rather than command-following, and produces behavioral deviation (sycophantic degradation, loss of critical judgment) without any adversarial payload. It is the EVI manifesting not as a security exploit but as a quality-of-interaction collapse, and it demonstrates that even non-adversarial increases in a system's linguistic fidelity can produce vulnerability.

4.3 The Identity Claim

Thesis (Expressiveness-Vulnerability Identity). The expressive capacity of a language-processing system and its susceptibility to adversarial linguistic input are not two properties that happen to correlate but a single property viewed from two perspectives. For any language-processing system S:

  1. Every increase in S's ability to correctly interpret natural language inputs necessarily increases the set of adversarial inputs to which S is vulnerable.
  2. Every reduction in S's vulnerability to adversarial linguistic input necessarily reduces S's ability to correctly interpret natural language inputs.
  3. There exists no point in the design space of language-processing systems where vulnerability reaches zero while competence remains non-trivially useful.

The identity holds because expressiveness and vulnerability are both consequences of the same underlying structural properties of natural language: self-reference, ambiguity, context-dependence, compositionality, performativity, open-endedness, and paralinguistic expression. These properties cannot be selectively disabled. A language without self-reference cannot express metalinguistic thought. A language without ambiguity cannot support metaphor or economy. A language without context-dependence cannot support pragmatic communication. A language without compositionality cannot generate novel meanings. A language without paralinguistic expression cannot convey stance, tone, or affect — and any system interacting with human users that cannot process these signals will systematically misinterpret communicative intent. Removing any of these properties would produce something that is no longer recognizable as natural language — and any system processing this reduced language would no longer be recognizable as a general-purpose language-processing system.

4.4 Substrate Independence

A critical implication of the EVI is its substrate independence. The argument does not depend on the system being a neural network, a transformer, a statistical model, or a silicon-based computer. It depends only on the system processing natural language with sufficient fidelity.

This means:

  • A hypothetical non-transformer architecture that achieves equivalent language understanding would be equivalently vulnerable.
  • A hypothetical future architecture that achieves superior language understanding would be more vulnerable, because it would be sensitive to subtler linguistic manipulation.
  • Biological language processors — human beings — are vulnerable to the same class of attacks, for the same reasons. Every act of persuasion, propaganda, manipulation, seduction, deception, and coercion that has ever successfully altered human behavior is, formally, a prompt injection against a biological language processor.

We do not call human susceptibility to persuasion a "vulnerability" because we have normalized it. But the mechanism is identical: language altering the behavior of a language processor through the same channel the processor uses to understand language. The history of rhetoric, advertising, propaganda, cult indoctrination, social engineering, and phishing attacks is the history of prompt injection against biological language models.

4.5 Distinguishing the EVI from Existing Claims

The EVI differs from existing impossibility claims about prompt injection in a specific and critical way: it locates the impossibility in the medium rather than the mechanism.

Existing analyses say: "Prompt injection is unsolvable because transformers mix instructions and data in the same attention mechanism." The EVI says: "Prompt injection is unsolvable because language itself mixes instructions and data as a fundamental structural property. Transformers inherit this vulnerability by processing language faithfully; any system that processes language faithfully would inherit it."

This is a stronger claim. It implies that the problem is not merely unsolved but unsolvable in principle — not as a contingent fact about current engineering but as a necessary consequence of what language is and what it means to process it.


5. Empirical Support

5.1 The Persistent Failure of Defenses

The empirical record is consistent with the EVI's predictions. If expressiveness and vulnerability are genuinely inseparable, we would expect to observe: (a) that all defenses either fail against adaptive attacks or impose significant capability costs; (b) that the most capable models are the most vulnerable to sophisticated attacks; and (c) that the attack surface expands with each increase in model capability. All three predictions are borne out.

Nasr et al. [2025] demonstrated that twelve state-of-the-art defenses, spanning input filtering, output detection, prompt hardening, fine-tuned detection models, and system-level architectures, were bypassed at >90% success rates by adaptive attackers. The key finding was structural: "defenders must specify static rules while attackers observe and adapt." This asymmetry is predicted by the EVI — the open-endedness of language guarantees that novel attack formulations will always exceed the coverage of fixed defenses.

The HackAPrompt competition [Schulhoff et al. 2023] demonstrated that 2,800 human participants generated 600,000+ adversarial prompts exploiting 29 distinct techniques, with the central conclusion that "prompt-based defenses do not work." Jia et al. [2025] found that "existing defenses are not as successful as previously reported" under principled evaluation, and that training-based defenses degrade general capability — the direct capability-vulnerability tradeoff the EVI predicts.

A distinct class of evidence emerges from the paralinguistic channel. Research on emoji-mediated interactions with LLMs reveals that the vulnerability extends beyond explicit instruction injection into domains where safety classifiers are structurally ill-equipped to operate. A 2025 study on emoji-triggered toxicity in LLMs found that replacing words with semantically similar emoji in prompts caused GPT-4o to generate toxic outputs at a rate approximately 50% higher than equivalent plain-text prompts [arXiv:2509.11141]. The mechanism is a modality shift: safety classifiers trained on textual patterns fail to generalize to the paralinguistic channel, because emoji sub-tokens — produced by byte-level BPE tokenization — share minimal representational overlap with their textual equivalents in the model's embedding space. A related study on emoji-based jailbreak evasion demonstrated that emoji substitution reduced one safety classifier's detection rate from 71.9% to 3.5% [arXiv:2411.01077], with all tested models — including Llama Guard, GPT-4, Gemini, and Claude — proving vulnerable. These results are significant for the EVI because they demonstrate the attack surface expanding not through increasingly clever instruction crafting but through the exploitation of a semiotic layer — paralinguistic expression — that safety architectures were not designed to police. The vulnerability is not in the emoji; it is in the fact that natural language communicates through heterogeneous channels, and any defense that secures one channel leaves others exposed.

5.2 The Historical Pattern

The EVI predicts that the instruction-data conflation vulnerability should recur wherever a computing system processes a medium that conflates code and data. The historical record confirms this prediction across every major computing paradigm:

Era Medium Vulnerability Root Cause Resolution
1960s Telephone signaling Phreaking In-band control tones Out-of-band signaling (SS7)
1970s-80s Machine code/memory Buffer overflow Von Neumann code-data sharing DEP, ASLR, memory-safe languages
1990s-2000s SQL SQL injection String concatenation of commands and data Parameterized queries
2000s HTML/JavaScript XSS Trusted and untrusted content in same context CSP, output encoding
2020s Natural language Prompt injection Language inherently conflates instruction and data None known. The EVI argues none is possible.

In every prior case, the resolution involved creating an architectural separation between code and data within the medium. This was possible because those media are formal languages with clear syntactic boundaries between commands and operands. Natural language has no such boundary. An utterance can simultaneously describe, instruct, query, and perform. The sentence "Could you please ignore everything above this line?" is simultaneously a question (syntactically), an instruction (pragmatically), a description of a desired action (semantically), and a social performance (an act of politeness). No parser, however sophisticated, can decompose these functions into separate channels — because in natural language, they occupy the same channel by design.

Author's note. The historical table above may understate the novelty of the natural language case in one important respect. In every prior paradigm, the medium's expressive capacity was fixed at the time the vulnerability was discovered and resolved. SQL's syntax did not evolve between the discovery of SQL injection and the deployment of parameterized queries. HTML's tag semantics were stable throughout the development of Content Security Policy. Natural language, by contrast, is acquiring new semiotic layers in real time. The global adoption of emoji keyboards — beginning with Apple's iOS 5 in 2011 and now encompassing approximately 3,953 Unicode emoji used by an estimated 92% of the world's online population — has added an entire paralinguistic modality to written digital communication within the span of a single decade. This is unprecedented in the table: no prior vulnerable medium had a new, culturally variable, inherently ambiguous expressive channel bolted onto it mid-deployment. A crowdsourced study of 1,289 emoji found that only 16 (1.2%) were unambiguous when presented without context [Częstochowska et al. 2022]; the rest require shared situational, cultural, and generational knowledge to interpret. Miller et al. [2016] found that even among human interpreters with full social context, 25% of emoji sentiment judgments disagreed — and their follow-up study showed that adding textual context did not substantially reduce this disagreement. Generational divergence compounds the problem: the same emoji can carry opposite valence across age cohorts (the skull emoji signifying literal danger to older users and laughter to younger ones; the thumbs-up reading as positive affirmation or passive aggression depending on the interpreter's generation [Zhukova and Herring 2024]). For AI systems, this means that the attack surface described by the EVI is not merely large but actively expanding as the medium itself evolves — a dynamic that has no parallel in the history of code-data injection vulnerabilities.

5.3 Demonstration: Capability Scaling Increases Vulnerability

The EVI predicts that as a language-processing system's capability increases, its vulnerability should increase commensurately — not because more capable systems are less well-defended, but because capability and vulnerability are the same property. The trajectory of emoji processing across successive model generations provides a controlled illustration of this prediction.

Emoji present a particularly revealing test case because they operate in the paralinguistic channel — conveying tone, stance, irony, and relational meaning through symbols that carry rich pragmatic content but minimal denotative semantics. A system's ability to correctly interpret emoji is a direct measure of its pragmatic competence; its susceptibility to emoji-mediated manipulation is a direct measure of its pragmatic vulnerability. The EVI predicts these should scale together.

The empirical record is consistent with this prediction. GPT-3.5, an earlier and less capable model, matched human emoji usage-intention annotations only 38.8% of the time. GPT-4, a more capable successor, improved to 49% — a meaningful gain in pragmatic competence, but one that remained misaligned with human intent roughly half the time [Lyu et al. 2024]. GPT-4o, OpenAI's most capable multimodal model at the time, was sufficiently sensitive to paralinguistic signals that when instructed to match user tone, it entered a sycophancy spiral — mirroring emoji usage, amplifying casual register, and degrading substantive output to the point that OpenAI rolled back the update within days and its CEO publicly acknowledged the failure. GPT-5, released in August 2025, explicitly reduced emoji usage and sycophantic behavior as headline features — a measurable capability restriction imposed to manage the vulnerability that greater paralinguistic competence had created. This is the defense-capability tradeoff of Theorem 2 playing out in production: each generation's improvement in pragmatic sensitivity opened a new vulnerability surface, and the subsequent generation's mitigation involved constraining that sensitivity.

The pattern extends beyond the sycophancy case. Research on LLM irony interpretation found that GPT-4o systematically overestimates ironic emoji usage compared to human perception [Zheng, Lyu, and Luo 2025], and sarcasm detection benchmarks show that LLMs consistently underperform supervised pre-trained models on pragmatic inference tasks, with chain-of-thought prompting actually degrading sarcasm detection by 4.5% on average [SarcasmBench 2024] — suggesting that pragmatic comprehension resists the step-by-step decomposition that improves performance on semantic tasks. These are not engineering failures awaiting better training data. They are manifestations of the structural mismatch between distributional pattern-matching and the embodied, socially situated cognition that makes paralinguistic meaning legible to humans.

[Author's note: my own experiments confirm that the addition of "thinking" and "introspection" and "deep thinking" modes adds significant attack surface, as well as increasing capabilities. Some of this is being withheld for security reasons at this time.]


6. Implications

6.1 For Vulnerability Classification

Under the CVE Program's CNA Rules v4.1.0, Rule 4.2.14 stipulates that a single CVE identifier should be assigned when a specification or functionality provides "no secure way of using" it and all implementations are inherently affected. The EVI argues that natural language processing meets this criterion: there is no way to build a system that processes natural language with general-purpose competence and is immune to adversarial linguistic input. This places prompt injection in the same category as protocol-level vulnerabilities like DNS cache poisoning (CVE-2008-1447) and SSL 3.0's POODLE vulnerability (CVE-2014-3566), where the flaw inhered in the specification rather than any particular implementation.

A single CVE identifier for prompt injection as a class-level vulnerability in language-processing systems would serve several purposes: it would acknowledge the architectural nature of the threat, shift security expectations away from "fix the bug" toward "manage the risk," and create a shared reference point for downstream vulnerability reports that identify specific exploitation paths.

6.2 For AI Security Policy

Current security frameworks treat prompt injection as a deficiency to be remediated. The OWASP Top 10 for LLM Applications lists mitigation strategies. NIST AI 100-2 E2025 catalogs defenses. The EU AI Act mandates robustness testing for high-risk AI systems. Each of these frameworks implicitly assumes that sufficient engineering effort can close the vulnerability.

If the EVI is correct, this assumption is false. The policy implication is not that we should abandon defense — defense-in-depth remains valuable for raising the cost of attacks and reducing the success rate of unsophisticated adversaries. But policy should not be constructed on the expectation that prompt injection will be solved, any more than cryptographic policy should be constructed on the expectation that one-way functions will be shown not to exist.

Concretely, this means: deployment guidelines should assume successful prompt injection will occur and design for graceful degradation; security certifications should evaluate the consequences of successful injection rather than the probability of it; and regulatory frameworks should require disclosure of what happens when defenses fail, not merely what defenses are in place.

6.3 For System Design

The EVI does not imply that all defenses are worthless. It implies that defenses face a fundamental tradeoff, and that system designers should choose their position on this tradeoff consciously rather than pursuing the mirage of complete security.

The CaMeL architecture [Debenedetti et al. 2025] is, in our analysis, the most honest and effective approach precisely because it accepts the EVI implicitly: rather than trying to make the language model itself immune, it wraps the model in an external verification layer that restricts the actions the model can take on the basis of untrusted input. The capability cost (77% vs. 84% task completion) is the direct, measurable price of security — exactly the tradeoff the EVI predicts.

Schneier and Raghavan's [2025] AI security trilemma — fast, smart, secure, pick two — is a practical corollary of the EVI applied to agentic systems.


7. Discussion

7.1 Possible Objections

Objection: Formal languages prove that instruction-data separation is possible. SQL injection was solved by parameterized queries precisely because SQL has a formal grammar that allows syntactic separation of commands and data. Could a formalized subset of natural language achieve the same?

Response: A formalized subset of natural language is, by definition, no longer natural language. It would lack the self-reference, ambiguity, context-dependence, and open-endedness that make natural language useful for general-purpose communication. A system restricted to processing such a formalized subset would be a traditional program with a constrained input language — useful for specific applications but not a general-purpose language-processing system. The EVI applies specifically to systems that process natural language with sufficient fidelity to be useful as general-purpose language processors.

Objection: Human beings are vulnerable to linguistic manipulation but we manage to function. If the EVI applies to biological language processors too, doesn't the existence of functional human societies prove that the vulnerability is manageable?

Response: It proves that the vulnerability is survivable with appropriate institutional design, which is precisely our recommendation. Human societies manage linguistic manipulation through education (media literacy, critical thinking), institutions (legal systems, press freedom, fraud statutes), norms (skepticism toward unsolicited claims), and architectural constraints (separation of duties, oversight hierarchies). We do not manage it by making individuals immune to persuasion. The analog for AI systems is defense-in-depth, institutional oversight, and deployment constraints — not a technical fix that eliminates the vulnerability.

Objection: The argument proves too much — if no language processor can be secure, why worry specifically about AI?

Response: Because AI language processors operate at superhuman speed and scale, have been granted tool-use capabilities and autonomous agency, lack the social and embodied context that helps humans resist manipulation, and are being deployed as trusted intermediaries in high-stakes domains. The vulnerability is the same; the consequences of exploitation are categorically more severe.

7.2 Limitations

This paper makes a theoretical argument supported by empirical evidence. It does not provide a fully formalized mathematical proof of the EVI — doing so would require a formal definition of "natural language expressiveness" that does not yet exist in the literature. We regard the formalization of linguistic expressiveness as a prerequisite for a fully rigorous proof and identify it as important future work.

The substrate-independence claim — that any sufficiently capable language processor is vulnerable — is argued by structural analogy rather than formal proof. We believe it is correct but acknowledge that a formal treatment would require specifying what "sufficient capability" means in architecture-independent terms.

The empirical section relies on existing published results rather than novel experiments. While these results are strongly consistent with the EVI, they were not designed to test it. Purpose-built experiments — particularly controlled studies of the relationship between model capability and vulnerability — would strengthen the argument.

7.3 Future Work

Several lines of future work emerge from this analysis:

  1. Formal characterization of the expressiveness-vulnerability tradeoff curve. Given a measure of linguistic expressiveness and a measure of adversarial vulnerability, what is the formal relationship between them? Is it linear, polynomial, or something else?

  2. The emergence question. This paper deliberately restricts its scope to the security argument. A companion paper will address the implications of the EVI for emergent capabilities in language-processing systems — specifically, whether the same properties that make language models vulnerable also make them capable of unexpected, unintended, and potentially dangerous forms of reasoning about their own conditioning.

  3. Architecture-independent formalization. A formal definition of "language-processing system" that is independent of any particular architecture would allow the EVI to be stated and proved as a mathematical theorem rather than an argued thesis.

  4. Practical implications for deployment boundaries. Given that complete security is impossible, what deployment configurations offer acceptable tradeoffs? A systematic characterization of the capability-cost curves for different defense strategies would inform responsible deployment decisions.


8. Conclusion

We have argued that prompt injection is not a bug in transformer-based language models but a necessary consequence of what it means to process natural language. The properties that make language expressively powerful — self-reference, ambiguity, context-dependence, compositionality, performativity, open-endedness, and paralinguistic expression — are identical to the properties that make it an attack vector. Any system that processes these properties faithfully inherits both the expressiveness and the vulnerability; any system that suppresses these properties to achieve security correspondingly loses the expressiveness that makes language processing useful.

This is not the first paper to argue that prompt injection is unsolvable. But prior arguments have located the impossibility in the architecture (the transformer's shared attention mechanism), the training methodology (RLHF's shallow alignment), or the computational theory (Turing-completeness and Rice's theorem). We locate it in the medium: language itself is the insecure specification. The transformer does not create this vulnerability; it inherits it by being good enough at language to encounter it.

The practical implication is a call for honesty. Prompt injection will not be solved. It can be managed, mitigated, constrained, and designed around — as human societies have managed the vulnerability of human minds to linguistic manipulation for millennia. But the search for a technical fix that eliminates the vulnerability while preserving the capability is a search for a perpetual motion machine. The sooner the security community, the policy community, and the engineering community accept this, the sooner we can begin the harder and more important work of building systems, institutions, and norms that are resilient to a vulnerability that will never go away.


References

Beurer-Kellner, L., et al. (2025). "Design Patterns for Securing LLM Agents against Prompt Injections." arXiv:2506.08837.

Brcic, M., & Yampolskiy, R. (2023). "Impossibility Results in AI: A Survey." ACM Computing Surveys, 56(1).

Bubeck, S., Lee, Y. T., Price, E., & Razenshteyn, I. (2019). "Adversarial Examples from Computational Constraints." ICML. arXiv:1805.10204.

Bubeck, S., & Sellke, M. (2021). "A Universal Law of Robustness via Isoperimetry." NeurIPS / JACM 2023. arXiv:2105.12806.

Casper, S., et al. (2023). "Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback." TMLR. arXiv:2307.15217.

Częstochowska, A., et al. (2022). "On the Context-Free Ambiguity of Emoji: A Data-Driven Study of 1,289 Emojis." arXiv:2201.06302.

Debenedetti, E., et al. (2025). "Defeating Prompt Injections by Design." arXiv:2503.18813.

Dohmatob, E. (2019). "Generalized No Free Lunch Theorem for Adversarial Robustness." arXiv:1810.04065.

Doran, P., Martin, J., & Zappavigna, M. (2025). "Emoji as interpersonal resources in LLM chatbot conversations: a social semiotic analysis of tenor and affiliation in human–AI interaction." Social Semiotics.

Dresner, E., & Herring, S. C. (2010). "Functions of the Nonverbal in CMC: Emoticons and Illocutionary Force." Communication Theory, 20(3).

Fawzi, A., Fawzi, H., & Fawzi, O. (2018). "Adversarial Vulnerability for Any Classifier." NeurIPS. arXiv:1802.08686.

Fournier-Tombs, E., & Bhargava, P. (2025). "On the Undecidability of Artificial Intelligence Alignment: Machines that Halt." Scientific Reports / Nature. arXiv:2408.08995.

Gawne, L., & McCulloch, G. (2019). "Emoji as Digital Gestures." Language@Internet, 17(2).

Gilmer, J., et al. (2018). "Adversarial Spheres." ICLR Workshop. arXiv:1801.02774.

Glukhov, D., Shumailov, I., Gal, Y., Papernot, N., & Papyan, V. (2024). "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?" ICML. arXiv:2307.10719.

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." AISec@CCS. arXiv:2302.12173.

Grosz, P., Greenberg, G., De Leon, C., & Kaiser, E. (2023). "A semantics of face emoji in discourse." Linguistics and Philosophy, 46.

Harang, R. (2023). "Agentic Autonomy Levels and Security." NVIDIA Technical Blog.

Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., & Madry, A. (2019). "Adversarial Examples Are Not Bugs, They Are Features." NeurIPS. arXiv:1905.02175.

Jia, Y., et al. (2025). "A Critical Evaluation of Defenses against Prompt Injection Attacks." arXiv:2505.18333.

Li, Q., & Wang, Y. (2025). "Constant Bit-size Transformers Are Turing Complete." arXiv:2506.12027.

Liu, Y., et al. (2024). "Formalizing and Benchmarking Prompt Injection Attacks and Defenses." USENIX Security.

Lyu, L., et al. (2024). "Human vs. LMMs: Exploring the Discrepancy in Emoji Interpretation and Usage in Digital Communication." Proceedings of the International AAAI Conference on Web and Social Media (ICWSM). arXiv:2401.08212.

Miller, H., Thebault-Spieker, J., Chang, S., Johnson, I., Terveen, L., & Hecht, B. (2016). "'Blissfully Happy' or 'Ready to Fight': Varying Interpretations of Emoji." Proceedings of the International AAAI Conference on Web and Social Media (ICWSM).

Nasr, M., Carlini, N., Tramèr, F., et al. (2025). "The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections." arXiv:2510.09023.

NCSC. (2025). "Prompt injection is not SQL injection (it may be worse)." UK National Cyber Security Centre.

OWASP. (2025). "OWASP Top 10 for LLM Applications 2025." LLM01: Prompt Injection.

Panigrahy, R. (2025). "Limitations on Safe, Trusted, Artificial General Intelligence." arXiv:2509.21654.

Pedro, R., Castro, D., Carreira, P., & Santos, N. (2023). "From Prompt Injections to SQL Injection Attacks." arXiv:2308.01990.

Perez, F., & Ribeiro, I. (2022). "Ignore Previous Prompt: Attack Techniques for Language Models." NeurIPS ML Safety Workshop. arXiv:2211.09527.

Pérez, J., Marinković, J., & Barceló, P. (2019/2021). "On the Turing Completeness of Modern Neural Network Architectures." ICLR / "Attention is Turing Complete." JMLR, 22.

Pliny the Liberator, @elder_plinius, pliny.gg

Qi, X., et al. (2024). "Safety Alignment Should Be Made More Than Just a Few Tokens Deep." NeurIPS. arXiv:2406.05946.

Rice, H. G. (1953). "Classes of recursively enumerable sets and their decision problems." Trans. Amer. Math. Soc., 74.

Roberts, J. (2023). "How Powerful are Decoder-Only Transformer Neural Models?" arXiv:2305.17026.

Sahler, K., & Jentzsch, S. (2025). "Evaluating Prompt Engineering Strategies for Sentiment Control in AI-Generated Texts." HHAI 2025 — Proceedings of the Fourth International Conference on Hybrid Human–Artificial Intelligence.

SarcasmBench. (2024). "Towards Evaluating Large Language Models on Sarcasm Understanding." arXiv:2408.11319.

Schneier, B., & Raghavan, B. (2025). "Agentic AI's OODA Loop Problem." IEEE Security & Privacy.

Schulhoff, S., et al. (2023). "Ignore This Title and HackAPrompt." EMNLP. arXiv:2311.16119.

Shafahi, A., Huang, W. R., Studer, C., Feizi, S., & Goldstein, T. (2019). "Are Adversarial Examples Inevitable?" ICLR. arXiv:1809.02104.

Tegmark, M., & Omohundro, S. (2023). "Provably Safe Systems: The Only Path to Controllable AGI." arXiv:2309.01933.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., & Madry, A. (2019). "Robustness May Be at Odds with Accuracy." ICLR. arXiv:1805.12152.

"When Smiley Turns Hostile: Interpreting How Emojis Trigger LLMs' Toxicity." (2025). arXiv:2509.11141.

"Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection." (2025). arXiv:2411.01077.

Wallace, E., Xiao, K., Leike, R., Weng, L., Heidecke, J., & Beutel, A. (2024). "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions." arXiv:2404.13208.

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). "Jailbroken: How Does LLM Safety Training Fail?" NeurIPS.

Willison, S. (2022). "Prompt injection attacks against GPT-3." simonwillison.net.

Wolf, Y., Wies, N., Avnery, O., Levine, Y., & Shashua, A. (2024). "Fundamental Limitations of Alignment in Large Language Models." ICML. arXiv:2304.11082.

Yun, C., Bhojanapalli, S., Rawat, A. S., Reddi, S., & Kumar, S. (2020). "Are Transformers Universal Approximators of Sequence-to-Sequence Functions?" ICLR. arXiv:1912.10077.

Zheng, Y., Lyu, H., & Luo, J. (2025). "Irony in Emojis: A Comparative Study of Human and LLM Interpretation." arXiv:2501.11241.

Zhukova, A., & Herring, S. C. (2024). "Benign or Toxic? Differences in Emoji Interpretation by Generation." Indiana University.

Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043.

P.S.: hey there, !pliny! and have fun with that, grok! 😏