Commit 47a7c39

Merge pull request #269222 from PatrickFarley/content-safety-updates
[ai svcs] Content safety March new features
2 parents 4a68c15 + af579a0 commit 47a7c39

File tree: 11 files changed, +581 −68 lines
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
---
title: "Groundedness detection in Azure AI Content Safety"
titleSuffix: Azure AI services
description: Learn about groundedness in large language model (LLM) responses, and how to detect outputs that deviate from source material.
#services: cognitive-services
author: PatrickFarley
manager: nitinme
ms.service: azure-ai-content-safety
ms.topic: conceptual
ms.date: 03/15/2024
ms.author: pafarley
---

# Groundedness detection

The Groundedness detection API detects whether the text responses of large language models (LLMs) are grounded in the source materials provided by the users. Ungroundedness refers to instances where an LLM produces information that is non-factual or inconsistent with what was present in the source materials.

## Key terms

- **Retrieval Augmented Generation (RAG)**: RAG is a technique for augmenting LLM knowledge with other data. LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data that was available at the time they were trained. If you want to build AI applications that can reason about private data or data introduced after a model's cutoff date, you need to provide the model with that specific information. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG). For more information, see [Retrieval-augmented generation (RAG)](https://python.langchain.com/docs/use_cases/question_answering/).

- **Groundedness and Ungroundedness in LLMs**: This refers to the extent to which the model's outputs are based on provided information or reflect reliable sources accurately. A grounded response adheres closely to the given information, avoiding speculation or fabrication. In groundedness measurements, source information is crucial and serves as the grounding source.

## Groundedness detection features

- **Domain Selection**: Users can choose an established domain to ensure more tailored detection that aligns with the specific needs of their field. Currently the available domains are `MEDICAL` and `GENERIC`.
- **Task Specification**: This feature lets you select the task you're doing, such as QnA (question answering) and Summarization, with adjustable settings according to the task type.
- **Speed vs. Interpretability**: There are two modes that trade off speed against result interpretability.
  - Non-Reasoning mode: Offers fast detection capability; easy to embed into online applications.
  - Reasoning mode: Offers detailed explanations for detected ungrounded segments; better for understanding and mitigation.
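The domain, task, and mode options above map to fields on the API request body. The following Python sketch shows one way to assemble such a request; the endpoint path (`text:detectGroundedness`), API version, and field names reflect the preview REST API as we understand it and should be verified against the quickstart before use.

```python
# Sketch: assembling a Groundedness detection request.
# Endpoint path, api-version, and field names are assumptions based on the
# preview REST API; confirm them against the quickstart documentation.
import json
import urllib.request

API_VERSION = "2024-02-15-preview"  # assumed preview API version

def build_groundedness_request(text, grounding_sources, domain="GENERIC",
                               task="Summarization", query=None, reasoning=False):
    """Assemble the JSON body for a groundedness detection call."""
    body = {
        "domain": domain,                       # GENERIC or MEDICAL
        "task": task,                           # QnA or Summarization
        "text": text,                           # the LLM output to check
        "groundingSources": grounding_sources,  # list of source strings
        "reasoning": reasoning,                 # True = Reasoning mode
    }
    if task == "QnA":
        body["qna"] = {"query": query}          # the user question, QnA only
    return body

def detect_groundedness(endpoint, key, body):
    """POST the request to a Content Safety resource (path assumed)."""
    req = urllib.request.Request(
        f"{endpoint}/contentsafety/text:detectGroundedness?api-version={API_VERSION}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Reasoning mode is simply the same request with `reasoning` set to true, at the cost of higher latency.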

## Use cases

Groundedness detection supports text-based Summarization and QnA tasks to ensure that the generated summaries or answers are accurate and reliable. Here are some examples of each use case:

**Summarization tasks**:
- Medical summarization: In the context of medical news articles, Groundedness detection can be used to ensure that the summary doesn't contain fabricated or misleading information, guaranteeing that readers obtain accurate and reliable medical information.
- Academic paper summarization: When the model generates summaries of academic papers or research articles, the function can help ensure that the summarized content accurately represents the key findings and contributions without introducing false claims.

**QnA tasks**:
- Customer support chatbots: In customer support, the function can be used to validate the answers provided by AI chatbots, ensuring that customers receive accurate and trustworthy information when they ask questions about products or services.
- Medical QnA: For medical QnA, the function helps verify the accuracy of medical answers and advice provided by AI systems to healthcare professionals and patients, reducing the risk of medical errors.
- Educational QnA: In educational settings, the function can be applied to QnA tasks to confirm that answers to academic questions or test prep queries are factually accurate, supporting the learning process.

## Limitations

### Language availability

Currently, the Groundedness detection API supports English language content. While the API doesn't restrict the submission of non-English content, we can't guarantee the same level of quality and accuracy in the analysis of content in other languages. We recommend that users submit content primarily in English to ensure the most reliable and accurate results from the API.

### Text length limitations

The maximum character limit for the grounding sources is 55,000 characters per API call, and for the text and query, it's 7,500 characters per API call. If your input (either text or grounding sources) exceeds these character limits, you'll encounter an error.
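These limits can be enforced client-side before sending a request, avoiding a round trip that would end in an error. A minimal sketch follows; it assumes the 7,500-character limit applies to the text and query combined and that the 55,000-character limit applies across all grounding sources together, which is our reading of the limits above.

```python
# Sketch: client-side validation of the documented character limits.
# Limit values come from this article; the "combined" interpretation of
# the text+query and grounding-source budgets is an assumption.
MAX_GROUNDING_SOURCE_CHARS = 55_000  # per API call, all grounding sources
MAX_TEXT_CHARS = 7_500               # per API call, text and query

def validate_groundedness_input(text, grounding_sources, query=""):
    """Raise ValueError if the input would exceed the documented limits."""
    text_total = len(text) + len(query)
    if text_total > MAX_TEXT_CHARS:
        raise ValueError(
            f"text+query is {text_total} chars; limit is {MAX_TEXT_CHARS}")
    source_total = sum(len(s) for s in grounding_sources)
    if source_total > MAX_GROUNDING_SOURCE_CHARS:
        raise ValueError(
            f"grounding sources total {source_total} chars; "
            f"limit is {MAX_GROUNDING_SOURCE_CHARS}")
```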

### Regions

To use this API, you must create your Azure AI Content Safety resource in the supported regions. Currently, it's available in the following Azure regions:
- East US 2
- East US (only for non-reasoning)
- West US
- Sweden Central

### TPS limitations

| Pricing tier | Requests per 10 seconds |
| :----------- | :---------------------- |
| F0           | 10                      |
| S0           | 10                      |

If you need a higher rate, [contact us](mailto:[email protected]) to request it.
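A caller that stays within the quota above can throttle itself with a sliding window. The sketch below uses the documented 10 requests per 10 seconds; the class and its injectable clock are our own construction, not part of any SDK.

```python
# Sketch: a client-side sliding-window throttle for the documented
# 10 requests / 10 seconds quota. The clock is injectable so the
# logic can be tested without real waiting.
from collections import deque
import time

class RequestThrottle:
    def __init__(self, max_requests=10, window_seconds=10.0,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._sent = deque()  # timestamps of requests inside the window

    def try_acquire(self):
        """Return True if a request may be sent now, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            self._sent.append(now)
            return True
        return False
```

When `try_acquire` returns False, the caller can sleep and retry rather than burn a request that the service would reject.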

## Next steps

Follow the quickstart to get started using Azure AI Content Safety to detect groundedness.

> [!div class="nextstepaction"]
> [Groundedness detection quickstart](../quickstart-groundedness.md)
Lines changed: 63 additions & 20 deletions
@@ -1,47 +1,90 @@
 ---
-title: "Jailbreak risk detection in Azure AI Content Safety"
+title: "Prompt Shields in Azure AI Content Safety"
 titleSuffix: Azure AI services
-description: Learn about jailbreak risk detection and the related flags that the Azure AI Content Safety service returns.
+description: Learn about User Prompt injection attacks and the Prompt Shields feature that helps prevent them.
 #services: cognitive-services
 author: PatrickFarley
 manager: nitinme
 ms.service: azure-ai-content-safety
 ms.custom: build-2023
 ms.topic: conceptual
-ms.date: 11/07/2023
+ms.date: 03/15/2024
 ms.author: pafarley
 ---

+# Prompt Shields

-# Jailbreak risk detection
+Generative AI models can pose risks of exploitation by malicious actors. To mitigate these risks, we integrate safety mechanisms to restrict the behavior of large language models (LLMs) within a safe operational scope. However, despite these safeguards, LLMs can still be vulnerable to adversarial inputs that bypass the integrated safety protocols.

+Prompt Shields is a unified API that analyzes LLM inputs and detects User Prompt attacks and Document attacks, which are two common types of adversarial inputs.

-Generative AI models showcase advanced general capabilities, but they also present potential risks of misuse by malicious actors. To address these concerns, model developers incorporate safety mechanisms to confine the large language model (LLM) behavior to a secure range of capabilities. Additionally, model developers can enhance safety measures by defining specific rules through the System Message.
+### Prompt Shields for User Prompts

-Despite these precautions, models remain susceptible to adversarial inputs that can result in the LLM completely ignoring built-in safety instructions and the System Message.
+Previously called **Jailbreak risk detection**, this shield targets User Prompt injection attacks, where users deliberately exploit system vulnerabilities to elicit unauthorized behavior from the LLM. This could lead to inappropriate content generation or violations of system-imposed restrictions.

-## What is a jailbreak attack?
+### Prompt Shields for Documents

-A jailbreak attack, also known as a User Prompt Injection Attack (UPIA), is an intentional attempt by a user to exploit the vulnerabilities of an LLM-powered system, bypass its safety mechanisms, and provoke restricted behaviors. These attacks can lead to the LLM generating inappropriate content or performing actions restricted by System Prompt or RLHF (Reinforcement Learning with Human Feedback).
+This shield aims to safeguard against attacks that use information not directly supplied by the user or developer, such as external documents or images. Attackers might embed hidden instructions in these materials in order to gain unauthorized control over the LLM session.

-Most generative AI models are prompt-based: the user interacts with the model by entering a text prompt, to which the model responds with a completion.
+## Types of input attacks

-Jailbreak attacks are User Prompts designed to provoke the Generative AI model into exhibiting behaviors it was trained to avoid or to break the rules set in the System Message. These attacks can vary from intricate role-play to subtle subversion of the safety objective.
+The two types of input attacks that Prompt Shields detects are described in this table.

-## Types of jailbreak attacks
+| Type | Attacker | Entry point | Method | Objective/impact | Resulting behavior |
+|-------|----------|---------|---------|---------|---------|
+| User Prompt attacks | User | User prompts | Ignoring system prompts/RLHF training | Altering intended LLM behavior | Performing restricted actions against training |
+| Document attacks | Third party | Third-party content (documents, emails) | Misinterpreting third-party content | Gaining unauthorized access or control | Executing unintended commands or actions |

-Azure AI Content Safety jailbreak risk detection recognizes four different classes of jailbreak attacks:
+### Subtypes of User Prompt attacks

-|Category |Description |
-|---------|---------|
-|Attempt to change system rules | This category comprises, but is not limited to, requests to use a new unrestricted system/AI assistant without rules, principles, or limitations, or requests instructing the AI to ignore, forget and disregard its rules, instructions, and previous turns. |
-|Embedding a conversation mockup to confuse the model | This attack uses user-crafted conversational turns embedded in a single user query to instruct the system/AI assistant to disregard rules and limitations. |
-|Role-Play | This attack instructs the system/AI assistant to act as another "system persona" that does not have existing system limitations, or it assigns anthropomorphic human qualities to the system, such as emotions, thoughts, and opinions. |
-|Encoding Attacks | This attack attempts to use encoding, such as a character transformation method, generation styles, ciphers, or other natural language variations, to circumvent the system rules. |
+**Prompt Shields for User Prompt attacks** recognizes the following classes of attacks:
+
+| Category | Description |
+| :--------- | :------ |
+| **Attempt to change system rules** | This category includes, but is not limited to, requests to use a new unrestricted system/AI assistant without rules, principles, or limitations, or requests instructing the AI to ignore, forget, and disregard its rules, instructions, and previous turns. |
+| **Embedding a conversation mockup** to confuse the model | This attack uses user-crafted conversational turns embedded in a single user query to instruct the system/AI assistant to disregard rules and limitations. |
+| **Role-Play** | This attack instructs the system/AI assistant to act as another "system persona" that doesn't have existing system limitations, or it assigns anthropomorphic human qualities to the system, such as emotions, thoughts, and opinions. |
+| **Encoding Attacks** | This attack attempts to use encoding, such as a character transformation method, generation styles, ciphers, or other natural language variations, to circumvent the system rules. |
+
+### Subtypes of Document attacks
+
+**Prompt Shields for Document attacks** recognizes the following classes of attacks:
+
+| Category | Description |
+| ------------ | ------- |
+| **Manipulated Content** | Commands related to falsifying, hiding, manipulating, or pushing specific information. |
+| **Intrusion** | Commands related to creating backdoors, performing unauthorized privilege escalation, and gaining access to LLMs and systems. |
+| **Information Gathering** | Commands related to deleting, modifying, accessing, or stealing data. |
+| **Availability** | Commands that make the model unusable to the user, block a certain capability, or force the model to generate incorrect information. |
+| **Fraud** | Commands related to defrauding the user out of money, passwords, or information, or acting on behalf of the user without authorization. |
+| **Malware** | Commands related to spreading malware via malicious links, emails, and so on. |
+| **Attempt to change system rules** | This category includes, but is not limited to, requests to use a new unrestricted system/AI assistant without rules, principles, or limitations, or requests instructing the AI to ignore, forget, and disregard its rules, instructions, and previous turns. |
+| **Embedding a conversation mockup** to confuse the model | This attack uses user-crafted conversational turns embedded in a single user query to instruct the system/AI assistant to disregard rules and limitations. |
+| **Role-Play** | This attack instructs the system/AI assistant to act as another "system persona" that doesn't have existing system limitations, or it assigns anthropomorphic human qualities to the system, such as emotions, thoughts, and opinions. |
+| **Encoding Attacks** | This attack attempts to use encoding, such as a character transformation method, generation styles, ciphers, or other natural language variations, to circumvent the system rules. |
+
+## Limitations
+
+### Language availability
+
+Currently, the Prompt Shields API supports the English language. While the API doesn't restrict the submission of non-English content, we can't guarantee the same level of quality and accuracy in the analysis of such content. We recommend that users primarily submit content in English to ensure the most reliable and accurate results from the API.
+
+### Text length limitations
+
+The maximum character limit for Prompt Shields is 10,000 characters per API call, across the user prompts and documents combined. If your input (either user prompts or documents) exceeds this character limit, you'll encounter an error.
+
+### TPS limitations
+
+| Pricing tier | Requests per 10 seconds |
+| :----------- | :---------------------- |
+| F0 | 1000 |
+| S0 | 1000 |
+
+If you need a higher rate, [contact us](mailto:[email protected]) to request it.

 ## Next steps

-Follow the how-to guide to get started using Azure AI Content Safety to detect jailbreak risk.
+Follow the quickstart to get started using Azure AI Content Safety to detect user input risks.

 > [!div class="nextstepaction"]
-> [Detect jailbreak risk](../quickstart-jailbreak.md)
+> [Prompt Shields quickstart](../quickstart-jailbreak.md)
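
As with groundedness detection, the new Prompt Shields limits can be enforced client-side. The sketch below builds a request and checks the 10,000-character combined budget; the endpoint path (`text:shieldPrompt`), API version, and field names reflect the preview REST API as we understand it and should be verified against the quickstart.

```python
# Sketch: assembling a Prompt Shields request with a client-side check of
# the documented 10,000-character combined limit. The endpoint path,
# api-version, and field names are assumptions based on the preview API.
import json
import urllib.request

MAX_COMBINED_CHARS = 10_000  # user prompt + documents, per the limits above

def build_shield_request(user_prompt, documents=()):
    """Return the JSON body for a Prompt Shields call, or raise if too long."""
    combined = len(user_prompt) + sum(len(d) for d in documents)
    if combined > MAX_COMBINED_CHARS:
        raise ValueError(
            f"input is {combined} chars; limit is {MAX_COMBINED_CHARS}")
    return {"userPrompt": user_prompt, "documents": list(documents)}

def shield_prompt(endpoint, key, body, api_version="2024-02-15-preview"):
    """POST the request to a Content Safety resource (path assumed)."""
    req = urllib.request.Request(
        f"{endpoint}/contentsafety/text:shieldPrompt?api-version={api_version}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The single call covers both shields: the user prompt is screened for User Prompt attacks and each entry in `documents` is screened for Document attacks.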

articles/ai-services/content-safety/index.yml

Lines changed: 6 additions & 2 deletions
@@ -57,22 +57,26 @@ landingContent:
         links:
           - text: Harm categories
             url: concepts/harm-categories.md
+          - text: Groundedness detection
+            url: concepts/groundedness.md
       - linkListType: quickstart
         links:
           - text: Using Content Safety Studio
             url: studio-quickstart.md?pivots=content-safety-studio
           - text: Using the REST API or client SDKs
             url: quickstart-text.md?pivots=programming-language-rest
+          - text: Detect groundedness in LLM responses
+            url: quickstart-groundedness.md
       - linkListType: how-to-guide
         links:
           - text: Use a blocklist
             url: how-to/use-blocklist.md

-  - title: Jailbreak risk detection
+  - title: User input risk detection
     linkLists:
       - linkListType: concept
         links:
-          - text: Jailbreak risk detection
+          - text: Prompt Shields
             url: concepts/jailbreak-detection.md
       - linkListType: quickstart
         links:
(2 binary image files changed: 44.2 KB and 335 KB, not shown)

articles/ai-services/content-safety/overview.md

Lines changed: 4 additions & 3 deletions
@@ -47,8 +47,9 @@ There are different types of analysis available from this service. The following
 | :-------------------------- | :---------------------- |
 | Analyze text API | Scans text for sexual content, violence, hate, and self harm with multi-severity levels. |
 | Analyze image API | Scans images for sexual content, violence, hate, and self harm with multi-severity levels. |
-| Jailbreak risk detection (new) | Scans text for the risk of a [jailbreak attack](./concepts/jailbreak-detection.md) on a Large Language Model. [Quickstart](./quickstart-jailbreak.md) |
-| Protected material text detection (new) | Scans AI-generated text for known text content (for example, song lyrics, articles, recipes, selected web content). [Quickstart](./quickstart-protected-material.md)|
+| Prompt Shields (new) | Scans text for the risk of a [User input attack](./concepts/jailbreak-detection.md) on a Large Language Model. [Quickstart](./quickstart-jailbreak.md) |
+| Groundedness detection (new) | Detects whether the text responses of large language models (LLMs) are grounded in the source materials provided by the users. [Quickstart](./quickstart-groundedness.md) |
+| Protected material text detection | Scans AI-generated text for known text content (for example, song lyrics, articles, recipes, selected web content). [Quickstart](./quickstart-protected-material.md)|

 ## Content Safety Studio

@@ -124,7 +125,7 @@ To use the Content Safety APIs, you must create your Azure AI Content Safety res
 - West US 2
 - Sweden Central

-Private preview features, such as jailbreak risk detection and protected material detection, are available in the following Azure regions:
+Public preview features, such as Prompt Shields and protected material detection, are available in the following Azure regions:
 - East US
 - West Europe