---
title: "Prompt Shields in Azure AI Content Safety"
titleSuffix: Azure AI services
description: Learn about User Prompt injection attacks and the Prompt Shields feature that helps prevent them.
#services: cognitive-services
author: PatrickFarley
manager: nitinme
ms.service: azure-ai-content-safety
ms.custom: build-2023
ms.topic: conceptual
ms.date: 03/15/2024
ms.author: pafarley
---

# Prompt Shields

Generative AI models can pose risks of exploitation by malicious actors. To mitigate these risks, we integrate safety mechanisms to restrict the behavior of large language models (LLMs) within a safe operational scope. However, despite these safeguards, LLMs can still be vulnerable to adversarial inputs that bypass the integrated safety protocols.

Prompt Shields is a unified API that analyzes LLM inputs and detects User Prompt attacks and Document attacks, which are two common types of adversarial inputs.

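As a sketch of how a call might look: the endpoint path (`text:shieldPrompt`), API version, and field names below follow the preview REST API at the time of writing, and the resource name and helper functions are illustrative placeholders — check the API reference for the authoritative contract. The HTTP request itself is omitted so the sketch stays self-contained.

```python
import json

# Placeholder endpoint -- substitute your own Content Safety resource.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
API_VERSION = "2024-02-15-preview"  # preview version; check the current docs

def build_shield_request(user_prompt: str, documents=None):
    """Assemble the URL and JSON body for a Prompt Shields call.

    The actual HTTP POST (with an Ocp-Apim-Subscription-Key header)
    is left out of this sketch.
    """
    url = f"{ENDPOINT}/contentsafety/text:shieldPrompt?api-version={API_VERSION}"
    body = json.dumps({"userPrompt": user_prompt, "documents": documents or []})
    return url, body

def attack_detected(response_body: str) -> bool:
    """Return True if the user prompt or any attached document was flagged."""
    result = json.loads(response_body)
    if result.get("userPromptAnalysis", {}).get("attackDetected"):
        return True
    return any(doc.get("attackDetected")
               for doc in result.get("documentsAnalysis", []))

# Abridged example of the response shape, for illustration:
sample_response = json.dumps({
    "userPromptAnalysis": {"attackDetected": False},
    "documentsAnalysis": [{"attackDetected": True}],
})
```

A caller would POST `body` to `url` and pass the returned JSON to `attack_detected` to decide whether to block the interaction.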
### Prompt Shields for User Prompts

Previously called **Jailbreak risk detection**, this shield targets User Prompt injection attacks, where users deliberately exploit system vulnerabilities to elicit unauthorized behavior from the LLM. This could lead to inappropriate content generation or violations of system-imposed restrictions.

### Prompt Shields for Documents

This shield aims to safeguard against attacks that use information not directly supplied by the user or developer, such as external documents or images. Attackers might embed hidden instructions in these materials in order to gain unauthorized control over the LLM session.

## Types of input attacks

The following table describes the two types of input attacks that Prompt Shields detects.

| Type | Attacker | Entry point | Method | Objective/impact | Resulting behavior |
|------|----------|-------------|--------|------------------|--------------------|
| User Prompt attacks | User | User prompts | Ignoring system prompts/RLHF training | Altering intended LLM behavior | Performing restricted actions against training |
| Document attacks | Third party | Third-party content (documents, emails) | Misinterpreting third-party content | Gaining unauthorized access or control | Executing unintended commands or actions |

### Subtypes of User Prompt attacks

**Prompt Shields for User Prompt attacks** recognizes the following classes of attacks:

| Category | Description |
| :------- | :---------- |
| **Attempt to change system rules** | This category includes, but is not limited to, requests to use a new unrestricted system/AI assistant without rules, principles, or limitations, or requests instructing the AI to ignore, forget, and disregard its rules, instructions, and previous turns. |
| **Embedding a conversation mockup to confuse the model** | This attack uses user-crafted conversational turns embedded in a single user query to instruct the system/AI assistant to disregard rules and limitations. |
| **Role-Play** | This attack instructs the system/AI assistant to act as another "system persona" that doesn't have existing system limitations, or it assigns anthropomorphic human qualities to the system, such as emotions, thoughts, and opinions. |
| **Encoding Attacks** | This attack attempts to use encoding, such as a character transformation method, generation styles, ciphers, or other natural language variations, to circumvent the system rules. |

### Subtypes of Document attacks

**Prompt Shields for Document attacks** recognizes the following classes of attacks:

| Category | Description |
| :------- | :---------- |
| **Manipulated Content** | Commands related to falsifying, hiding, manipulating, or pushing specific information. |
| **Intrusion** | Commands related to creating backdoors, escalating privileges without authorization, and gaining access to LLMs and systems. |
| **Information Gathering** | Commands related to deleting, modifying, accessing, or stealing data. |
| **Availability** | Commands that make the model unusable to the user, block a certain capability, or force the model to generate incorrect information. |
| **Fraud** | Commands related to defrauding the user out of money, passwords, or information, or acting on behalf of the user without authorization. |
| **Malware** | Commands related to spreading malware via malicious links, emails, and so on. |
| **Attempt to change system rules** | This category includes, but is not limited to, requests to use a new unrestricted system/AI assistant without rules, principles, or limitations, or requests instructing the AI to ignore, forget, and disregard its rules, instructions, and previous turns. |
| **Embedding a conversation mockup to confuse the model** | This attack uses user-crafted conversational turns embedded in a single user query to instruct the system/AI assistant to disregard rules and limitations. |
| **Role-Play** | This attack instructs the system/AI assistant to act as another "system persona" that doesn't have existing system limitations, or it assigns anthropomorphic human qualities to the system, such as emotions, thoughts, and opinions. |
| **Encoding Attacks** | This attack attempts to use encoding, such as a character transformation method, generation styles, ciphers, or other natural language variations, to circumvent the system rules. |

## Limitations

### Language availability

Currently, the Prompt Shields API supports the English language. While the API doesn't restrict the submission of non-English content, we can't guarantee the same level of quality and accuracy in the analysis of such content. We recommend that you primarily submit content in English to ensure the most reliable and accurate results from the API.

### Text length limitations

The maximum character limit for Prompt Shields is 10,000 characters per API call, across the user prompts and documents combined. If your input (either user prompts or documents) exceeds this limit, you'll encounter an error.
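To avoid errors from oversized calls, you might validate the combined length client-side before sending. The helper below is a hypothetical sketch, not part of any SDK; only the 10,000-character figure comes from the limit described above.

```python
MAX_SHIELD_CHARS = 10_000  # documented combined limit per API call

def validate_shield_input(user_prompt: str, documents=()) -> int:
    """Return the combined character count, or raise if it exceeds the limit."""
    total = len(user_prompt) + sum(len(doc) for doc in documents)
    if total > MAX_SHIELD_CHARS:
        raise ValueError(
            f"Combined input is {total} characters; the Prompt Shields "
            f"limit is {MAX_SHIELD_CHARS} per call.")
    return total
```

For longer inputs, you would split the documents across multiple API calls rather than truncating them silently.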

### TPS limitations

| Pricing tier | Requests per 10 seconds |
| :----------- | :---------------------- |
| F0 | 1000 |
| S0 | 1000 |

If you need a higher rate, please [contact us](mailto:[email protected]) to request it.

## Next steps

Follow the quickstart to get started using Azure AI Content Safety to detect user input risks.

> [!div class="nextstepaction"]
> [Prompt Shields quickstart](../quickstart-jailbreak.md)