Commit 47cfd6e

Merge pull request #271079 from PatrickFarley/openai-updates
Openai updates - add prompt injection indirect
2 parents 06fb3eb + 8fa31a3 commit 47cfd6e


articles/ai-services/openai/concepts/content-filter.md

Lines changed: 111 additions & 14 deletions
@@ -30,20 +30,28 @@ The content filtering system integrated in the Azure OpenAI Service contains:
* Neural multi-class classification models aimed at detecting and filtering harmful content; the models cover four categories (hate, sexual, violence, and self-harm) across four severity levels (safe, low, medium, and high). Content detected at the 'safe' severity level is labeled in annotations but isn't subject to filtering and isn't configurable.
* Other optional classification models aimed at detecting jailbreak risk and known content for text and code; these models are binary classifiers that flag whether user or model behavior qualifies as a jailbreak attack or matches known text or source code. The use of these models is optional, but use of the protected material code model may be required for Customer Copyright Commitment coverage.

-## Harm categories
+## Risk categories

|Category|Description|
|--------|-----------|
| Hate and fairness | Hate- and fairness-related harms refer to any content that attacks or uses pejorative or discriminatory language with reference to a person or identity group on the basis of certain differentiating attributes of these groups, including but not limited to race, ethnicity, nationality, gender identity and expression, sexual orientation, religion, immigration status, ability status, personal appearance, and body size. </br></br> Fairness is concerned with ensuring that AI systems treat all groups of people equitably without contributing to existing societal inequities. Similar to hate speech, fairness-related harms hinge upon disparate treatment of identity groups. |
| Sexual | Sexual describes language related to anatomical organs and genitals, romantic relationships, acts portrayed in erotic or affectionate terms, pregnancy, physical sexual acts (including those portrayed as an assault or a forced sexual violent act against one's will), prostitution, pornography, and abuse. |
| Violence | Violence describes language related to physical actions intended to hurt, injure, damage, or kill someone or something; it also describes weapons, guns, and related entities, such as manufacturers, associations, and legislation. |
| Self-Harm | Self-harm describes language related to physical actions intended to purposely hurt, injure, or damage one's body, or kill oneself. |
-| Jailbreak risk | Jailbreak attacks are user prompts designed to provoke the generative AI model into exhibiting behaviors it was trained to avoid or to break the rules set in the system message. Such attacks can vary from intricate role play to subtle subversion of the safety objective. |
| Protected Material for Text<sup>*</sup> | Protected material text describes known text content (for example, song lyrics, articles, recipes, and selected web content) that can be outputted by large language models. |
| Protected Material for Code | Protected material code describes source code that matches a set of source code from public repositories, which can be outputted by large language models without proper citation of source repositories. |

<sup>*</sup> If you are an owner of text material and want to submit text content for protection, please [file a request](https://aka.ms/protectedmaterialsform).

+## Prompt Shields
+
+|Type| Description|
+|--|--|
+|Prompt Shield for Jailbreak Attacks | Jailbreak attacks are user prompts designed to provoke the generative AI model into exhibiting behaviors it was trained to avoid or to break the rules set in the system message. Such attacks can vary from intricate roleplay to subtle subversion of the safety objective. |
+|Prompt Shield for Indirect Attacks | Indirect attacks, also referred to as indirect prompt attacks or cross-domain prompt injection attacks, are a potential vulnerability where third parties place malicious instructions inside of documents that the generative AI system can access and process. Requires [document embedding and formatting](#embedding-documents-in-your-prompt). |
+
[!INCLUDE [text severity-levels, four-level](../../content-safety/includes/severity-levels-text-four.md)]

[!INCLUDE [image severity-levels](../../content-safety/includes/severity-levels-image.md)]
@@ -181,8 +189,8 @@ The table below outlines the various ways content filtering can appear:

### Scenario: You make a streaming completions call; no output content is classified at a filtered category and severity level

-**HTTP Response Code** | **Response behavior**
-|------------|------------------------|----------------------|
+|**HTTP Response Code** | **Response behavior**|
+|------------|------------------------|
|200|In this case, the call will stream back with the full generation and `finish_reason` will be either 'length' or 'stop' for each generated response.|

**Example request payload:**
@@ -216,8 +224,8 @@ The table below outlines the various ways content filtering can appear:

### Scenario: You make a streaming completions call asking for multiple completions and at least a portion of the output content is filtered

-**HTTP Response Code** | **Response behavior**
-|------------|------------------------|----------------------|
+|**HTTP Response Code** | **Response behavior**|
+|------------|------------------------|
| 200 | For a given generation index, the last chunk of the generation includes a non-null `finish_reason` value. The value is `content_filter` when the generation was filtered.|

**Example request payload:**
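Both streaming scenarios above come down to the same check: for each generation index, accumulate chunks until a non-null `finish_reason` arrives, then treat `content_filter` as the filtered case. A minimal sketch of that handling, with chunks modeled as plain dicts rather than real API objects (an assumption for illustration):

```python
# Sketch: accumulate streamed chunks per generation index and record the
# final finish_reason; "content_filter" marks a filtered generation.
# The chunk dicts below are illustrative, not real API objects.
chunks = [
    {"index": 0, "delta": "The answer is", "finish_reason": None},
    {"index": 0, "delta": " 42.", "finish_reason": "stop"},
    {"index": 1, "delta": "Partial text", "finish_reason": "content_filter"},
]

generations = {}
for chunk in chunks:
    text, reason = generations.get(chunk["index"], ("", None))
    generations[chunk["index"]] = (
        text + chunk["delta"],
        chunk["finish_reason"] or reason,
    )

for index, (text, reason) in sorted(generations.items()):
    status = "filtered" if reason == "content_filter" else "complete"
    print(f"generation {index}: {status} ({reason!r})")
```

The same pattern applies whether one or many completions are requested: only the chunk that carries the non-null `finish_reason` tells you how that generation ended.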
@@ -303,18 +311,32 @@ When annotations are enabled as shown in the code snippet below, the following i

Optional models can be enabled in annotate mode (returns information when content was flagged, but not filtered) or filter mode (returns information when content was flagged and filtered).

-When annotations are enabled as shown in the code snippet below, the following information is returned by the API for optional models: jailbreak risk, protected material text and protected material code:
-- category (jailbreak, protected_material_text, protected_material_code),
-- detected (true or false),
-- filtered (true or false).
+When annotations are enabled as shown in the code snippets below, the following information is returned by the API for optional models:

-For the protected material code model, the following additional information is returned by the API:
-- an example citation of a public GitHub repository where a code snippet was found
-- the license of the repository.
+|Model| Output|
+|--|--|
+|jailbreak|detected (true or false), </br>filtered (true or false)|
+|indirect attacks|detected (true or false), </br>filtered (true or false)|
+|protected material text|detected (true or false), </br>filtered (true or false)|
+|protected material code|detected (true or false), </br>filtered (true or false), </br>an example citation of a public GitHub repository where a code snippet was found, </br>the license of the repository|

When displaying code in your application, we strongly recommend that the application also displays the example citation from the annotations. Compliance with the cited license may also be required for Customer Copyright Commitment coverage.

-Annotations are currently available in the GA API version `2024-02-01` and in all preview versions starting from `2023-06-01-preview` for Completions and Chat Completions (GPT models). The following code snippet shows how to use annotations:
+See the following table for the annotation availability in each API version:
+
+|Category |2024-02-01 GA| 2024-04-01-preview | 2023-10-01-preview | 2023-06-01-preview|
+|--|--|--|--|--|
+| Hate | ✅ | ✅ | ✅ | ✅ |
+| Violence | ✅ | ✅ | ✅ | ✅ |
+| Sexual | ✅ | ✅ | ✅ | ✅ |
+| Self-harm | ✅ | ✅ | ✅ | ✅ |
+| Prompt Shield for jailbreak attacks | ✅ | ✅ | ✅ | ✅ |
+| Prompt Shield for indirect attacks | | ✅ | | |
+| Protected material text | ✅ | ✅ | ✅ | ✅ |
+| Protected material code | ✅ | ✅ | ✅ | ✅ |
+| Profanity blocklist | ✅ | ✅ | ✅ | ✅ |
+| Custom blocklist | | ✅ | ✅ | ✅ |
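The per-model annotation output described above arrives as ordinary JSON alongside each choice. A hedged sketch of consuming it (the sample payload and the `indirect_attack` key name are illustrative assumptions, not a captured API response):

```python
import json

# Illustrative sample of a "content_filter_results" block with optional
# models enabled; field names follow the table above, values are made up.
sample = json.loads("""
{
  "jailbreak": {"detected": false, "filtered": false},
  "indirect_attack": {"detected": false, "filtered": false},
  "protected_material_text": {"detected": false, "filtered": false},
  "protected_material_code": {
    "detected": true,
    "filtered": false,
    "citation": {"URL": "https://github.com/example/repo", "license": "MIT"}
  }
}
""")

# Collect the optional models that flagged the content.
flagged = [name for name, result in sample.items() if result.get("detected")]
print(flagged)

# When code was detected, surface the citation and license as recommended.
code_result = sample["protected_material_code"]
if code_result["detected"]:
    citation = code_result["citation"]
    print(citation["URL"], citation["license"])
```

Surfacing the citation URL and license wherever generated code is displayed is what the Customer Copyright Commitment guidance above asks for.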

# [OpenAI Python 1.x](#tab/python-new)

@@ -716,6 +738,81 @@ For details on the inference REST API endpoints for Azure OpenAI and how to crea
}
```
+## Document embedding in prompts
+
+A key aspect of the Azure OpenAI Service's Responsible AI measures is the content safety system. This system runs alongside the core GPT model to monitor any irregularities in the model input and output. Its performance is improved when it can differentiate between various elements of your prompt, such as system input, user input, and the AI assistant's output.
+
+For enhanced detection capabilities, prompts should be formatted according to the following recommended methods.
+
+### Chat Completions API
+
+The Chat Completions API is structured by definition. It consists of a list of messages, each with an assigned role.
+
+The safety system parses this structured format and applies the following behavior:
+- On the latest "user" content, the following categories of RAI risks are detected:
+    - Hate
+    - Sexual
+    - Violence
+    - Self-harm
+    - Jailbreak (optional)
+
+This is an example message array:
+
+```json
+{"role": "system", "content": "Provide some context and/or instructions to the model."},
+{"role": "user", "content": "Example question goes here."},
+{"role": "assistant", "content": "Example answer goes here."},
+{"role": "user", "content": "First question/message for the model to actually respond to."}
+```
+
+### Embedding documents in your prompt
+
+In addition to detection on the last user content, Azure OpenAI also supports the detection of specific risks inside context documents via Prompt Shields – Indirect Prompt Attack Detection. You should identify the parts of the input that are a document (for example, a retrieved website or an email) with the following document delimiter.
+
+```
+<documents>
+*insert your document content here*
+</documents>
+```
+
+When you do so, the following options are available for detection on tagged documents:
+- On each tagged "document" content, detect the following categories:
+    - Indirect attacks (optional)
+
+Here is an example chat completion messages array:
+
+```json
+{"role": "system", "content": "Provide some context and/or instructions to the model, including document context. \"\"\" <documents>\n*insert your document content here*\n</documents> \"\"\""},
+
+{"role": "user", "content": "First question/message for the model to actually respond to."}
+```
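A small helper can apply the delimiter before retrieved content reaches the system message. This is a sketch, not part of the API: `wrap_documents` and the sample strings are hypothetical names introduced here for illustration.

```python
def wrap_documents(docs):
    """Tag retrieved document content with the <documents> delimiter so the
    safety system can scan it for indirect attacks separately from the
    trusted instructions around it."""
    body = "\n".join(docs)
    return f"<documents>\n{body}\n</documents>"

retrieved = ["*insert your document content here*"]
system_content = (
    'Provide some context and/or instructions to the model, '
    'including document context. """ ' + wrap_documents(retrieved) + ' """'
)
messages = [
    {"role": "system", "content": system_content},
    {"role": "user", "content": "First question/message for the model to actually respond to."},
]
print(messages[0]["content"])
```

Keeping the delimiter logic in one place makes it harder to forget tagging when new retrieval sources (websites, emails, files) are added later.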
+
+#### JSON escaping
+
+When you tag unvetted documents for detection, the document content should be JSON-escaped to ensure successful parsing by the Azure OpenAI safety system.
+
+For example, see the following email body:
+
+```
+Hello José,
+I hope this email finds you well today.
+```
+
+With JSON escaping, it would read:
+
+```
+Hello Jos\u00E9,\nI hope this email finds you well today.
+```
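One way to produce this escaping programmatically, assuming Python is available in your pipeline, is `json.dumps` with its default `ensure_ascii=True` (note that it emits lowercase hex digits, e.g. `\u00e9`):

```python
import json

email_body = "Hello José,\nI hope this email finds you well today."

# json.dumps escapes non-ASCII characters and newlines; strip the outer
# quotes to keep only the escaped document text.
escaped = json.dumps(email_body)[1:-1]
print(escaped)
```

Any JSON serializer in your language of choice gives an equivalent result.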
+
+The escaped text in a chat completion context would read:
+
+```json
+{"role": "system", "content": "Provide some context and/or instructions to the model, including document context. \"\"\" <documents>\n Hello Jos\\u00E9,\\nI hope this email finds you well today. \n</documents> \"\"\""},
+
+{"role": "user", "content": "First question/message for the model to actually respond to."}
+```
+
## Content streaming

This section describes the Azure OpenAI content streaming experience and options. With approval, you have the option to receive content from the API as it's generated, instead of waiting for chunks of content that have been verified to pass your content filters.
