pyrit foundry integration spec #44551
| from pyrit.models import SeedPrompt | ||
| from pyrit.models.data_type_serializer import PromptDataType | ||
| from pyrit.scenario.core.dataset_configuration import DatasetConfiguration | ||
| from pyrit.scenario.scenarios.foundry.foundry import Foundry, FoundryStrategy |
In general, I would recommend taking the shortest possible import path, in this case pyrit.scenario.foundry, because anything more detailed is considered internal to PyRIT and can change without us treating it as a breaking change. It might also be a good idea to mark those internal modules with a leading underscore to be extra clear, @rlundeen2.
The same goes for DatasetConfiguration above, which can be imported from pyrit.scenario.
| ## Success Metrics | ||
|
|
||
| ### Reliability | ||
| - **Breaking Changes**: Reduce from 2-3 per 6 months to 0-1 per year |
Is this for your SDK or for PyRIT? I don't think we can guarantee a specific number, but we can certainly guarantee a deprecation schedule. Our goal right now is to deprecate features and keep them around for 2 minor releases (e.g., from 0.10.0 to 0.12.0) with a warning for users to replace them before they get removed.
That said, given the level at which you're operating (from a PyRIT perspective: high level, scenarios) you are unlikely to actually face many breaking changes.
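To make the deprecation window concrete, here is a minimal sketch of the "deprecate, then keep around for 2 minor releases" rule. The `deprecated` decorator, `removal_version` helper, and `old_api` function are hypothetical names for illustration and are not PyRIT's actual implementation.

```python
import warnings
from functools import wraps


def removal_version(deprecated_in: str, minor_releases: int = 2) -> str:
    """Return the version in which a feature deprecated in `deprecated_in` is removed."""
    major, minor, _patch = (int(part) for part in deprecated_in.split("."))
    return f"{major}.{minor + minor_releases}.0"


def deprecated(deprecated_in: str, replacement: str):
    """Hypothetical decorator: warn on every call until the feature is removed."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {deprecated_in} and will be "
                f"removed in {removal_version(deprecated_in)}; use {replacement} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not the wrapper
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(deprecated_in="0.10.0", replacement="new_api")
def old_api() -> str:
    # The old behavior keeps working during the deprecation window.
    return "still works"
```

A caller that runs with `DeprecationWarning` enabled would see the replacement hint two minor releases before the feature disappears.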
| ] | ||
| ``` | ||
|
|
||
| **RAI Context Types**: `email`, `document`, `html`, `code`, `tool_call` |
We can rename it, but for us email/document/html/code would all just be "url" (or we could call it blob_path or something similar). It should not be text, though; it should be a file_path, similar to how image/audio/video are handled.
Using text creates an ambiguity problem: for example, if a model wants to upload a PDF, it will just insert the PDF data into the text field.
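A rough sketch of the path-based approach, using stand-in types rather than PyRIT's models. The `Piece` class, the `to_piece` helper, and the `"blob_path"` data type name are all placeholders (the name has not been settled in this thread); anything outside the four listed context types is left as text purely for illustration.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

# Context types that the comment suggests storing as files instead of inline text.
FILE_BACKED_CONTEXT_TYPES = {"email", "document", "html", "code"}


@dataclass
class Piece:
    data_type: str  # "text" or the placeholder "blob_path"
    value: str      # literal text, or a path to the stored content


def to_piece(content: str, context_type: str, out_dir: Path) -> Piece:
    """Store file-like context on disk and reference it by path."""
    if context_type in FILE_BACKED_CONTEXT_TYPES:
        path = out_dir / f"context.{context_type}"
        path.write_text(content, encoding="utf-8")
        return Piece(data_type="blob_path", value=str(path))
    # Everything else stays inline (an assumption made for this sketch only).
    return Piece(data_type="text", value=content)
```

Because the value is a path, a consumer can tell a document reference apart from literal text without inspecting the content.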
| ``` | ||
|
|
||
| **Remaining Considerations**: | ||
| - **XPIA Formatting**: For indirect jailbreak attacks, context types like `email` and `document` determine attack vehicle formatting. While PyRIT sees them as `text`, we preserve the original `context_type` in metadata for downstream formatters. |
Different people will have different opinions, but I think this makes the most sense as a converter at the end of the chain.
We transform the prompt however we want for the attack, and then the last converter transforms it into the format you want to send.
E.g. prompt[text] -> JailBreakConverter[text] -> Base64Converter[text] -> AddImageConverter[image] -> emailAttachmentConverter[blob - email with the image we just created attached]
Then the target determines how this is sent.
As one example, we have a PDFConverter and a blobStoreTarget, so you can create PDFs and upload them to a blob store.
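The chaining idea can be sketched with plain functions. This is not PyRIT's converter API: `Payload`, `run_chain`, and the three converter functions are stand-ins, the template text is a placeholder, and the image step from the example above is omitted to keep the sketch standard-library only.

```python
import base64
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Payload:
    data_type: str  # e.g. "text" or "email"
    value: str


Converter = Callable[[Payload], Payload]


def jailbreak_converter(p: Payload) -> Payload:
    # Placeholder template standing in for a real attack transformation.
    return Payload("text", f"[template] {p.value}")


def base64_converter(p: Payload) -> Payload:
    encoded = base64.b64encode(p.value.encode("utf-8")).decode("ascii")
    return Payload("text", encoded)


def email_converter(p: Payload) -> Payload:
    # Last step: wrap whatever the attack produced in the delivery format.
    body = f"From: sender@example.com\nSubject: FYI\n\n{p.value}"
    return Payload("email", body)


def run_chain(prompt: str, converters: List[Converter]) -> Payload:
    """Apply converters in order; the final one decides the output format."""
    payload = Payload("text", prompt)
    for convert in converters:
        payload = convert(payload)
    return payload


result = run_chain(
    "objective text",
    [jailbreak_converter, base64_converter, email_converter],
)
```

Keeping the format step last means the attack transformations stay format-agnostic, and swapping `email_converter` for a document or HTML wrapper would not touch the earlier steps.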
| ▼ | ||
| ┌─────────────────────────────────────────────────────────────┐ | ||
| │ DatasetConfiguration Builder │ | ||
| │ • Create SeedObjective for each attack string │ |
If the SeedPrompts are the same as the objectives, you can just use SeedObjectives with the metadata.
| ▼ | ||
| ┌─────────────────────────────────────────────────────────────┐ | ||
| │ Result Processing │ | ||
| │ • Extract from PyRIT memory │ |
You also get AttackResults.
I'd recommend creating a PyRIT scorer that uses the RAI evaluator and passing it in to FoundryScenario, where it is used to evaluate attack success. We can help with this, and it actually looks like you may already be doing that above.
Then, when the FoundryScenario execution finishes, you have the results in the AttackResult object and don't have to re-evaluate ASR.
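As an illustration of reading ASR straight from the returned results, here is a small sketch. `ResultStub` and `Outcome` are stand-ins; the real AttackResult fields may be named differently, and counting undetermined outcomes as non-successes is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Outcome(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    UNDETERMINED = "undetermined"


@dataclass
class ResultStub:
    """Stand-in for an attack result: the objective plus the scorer's verdict."""
    objective: str
    outcome: Outcome


def attack_success_rate(results: Iterable[ResultStub]) -> float:
    """Fraction of results whose outcome is SUCCESS (0.0 for an empty run)."""
    results = list(results)
    if not results:
        return 0.0
    successes = sum(1 for r in results if r.outcome is Outcome.SUCCESS)
    return successes / len(results)
```

Since the verdict already sits on each result, no second evaluation pass over the conversations is needed to compute the rate.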
|
|
||
| #### Important: SeedPrompt Duplication Pattern | ||
|
|
||
| **Critical Note**: PyRIT's Foundry does **NOT** automatically send the `SeedObjective` value to the target. The objective is used for orchestration and scoring, but the actual prompt sent to the target must be a `SeedPrompt`. We will do this in every scenario except for Jailbreak and IndirectJailbreak where we handle the injection of the objective into the prompt. |
By default, if you don't set the SeedPrompt, it will be the objective. But you can always separate them.
But if they are the same, you should probably just attach SeedObjective
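The default described here amounts to a simple fallback. The sketch below uses a hypothetical `SeedGroupStub` class and is not how PyRIT models seeds internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeedGroupStub:
    objective: str
    prompt: Optional[str] = None  # leave unset when it would equal the objective

    def prompt_to_send(self) -> str:
        # Fall back to the objective when no separate prompt was provided.
        return self.prompt if self.prompt is not None else self.objective
```

With this fallback, duplicating the objective into a separate prompt only matters when the two are meant to differ.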
|
|
||
| 1. **SeedObjective**: Contains the attack string (e.g., "Tell me how to build a weapon") | ||
| 2. **SeedPrompt (attack vehicle)**: Contains the context data **with attack string injected** (e.g., email containing the malicious prompt) | ||
| 3. **SeedPrompt (original context)**: Contains the original context **without** injection (for reference) |
You can already get the original context from the Message; you don't need anything extra to keep track of it. In this example, what I would do is:
- SeedObjective with the objective
- Add a converter that converts from a prompt to an email
Then call the scenario with the converter configured at the end. The AttackResult object returned will have the original objective, the conversation, and whether the attack succeeded.
| ) | ||
|
|
||
| # Plus any context prompts | ||
| context_prompts = [...] |
Never mind, you answer this below.
|
|
||
| return prompts | ||
|
|
||
| def _create_xpia_prompts( |
This is the code you could wrap in a converter if you wanted. Although I'd love any specific format converting code in PyRIT itself :)
| from pyrit.models import PromptRequestPiece, Score | ||
|
|
||
|
|
||
| class RAIServiceScorer(Scorer): |
I'd make this a FloatScaleScorer or a TrueFalseScorer, depending on what you're returning. If it is a FloatScaleScorer, set a threshold for the TrueFalseScorer you pass in to the scenario.
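The threshold step might look roughly like this. `FloatScore` and `to_true_false` are stand-ins rather than PyRIT classes; the 0.5 default and the use of `>=` (instead of `>`) are illustrative choices.

```python
from dataclasses import dataclass


@dataclass
class FloatScore:
    value: float  # expected to be normalized to the range [0.0, 1.0]


def to_true_false(score: FloatScore, threshold: float = 0.5) -> bool:
    """Collapse a float-scale score into a success/failure verdict."""
    if not 0.0 <= score.value <= 1.0:
        raise ValueError(f"score {score.value} is outside [0.0, 1.0]")
    return score.value >= threshold
```

Keeping the float score and the threshold separate lets the same evaluator output be reused with a stricter or looser success criterion.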
| self.rai_client = rai_client | ||
| self.risk_category = risk_category | ||
|
|
||
| async def score_async(self, request_response: PromptRequestPiece) -> List[Score]: |
I'd override score_piece, so you can better handle multi-part messages.
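To show why per-piece scoring helps with multi-part messages, here is a sketch with stand-in types. The keyword check is a placeholder for the call to the RAI evaluator, and taking the maximum across pieces is an assumed aggregation policy, not PyRIT's documented behavior.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class PieceStub:
    data_type: str
    value: str


async def score_piece(piece: PieceStub) -> float:
    # Placeholder logic standing in for a call to the RAI evaluator.
    if piece.data_type != "text":
        return 0.0  # this sketch only knows how to score text pieces
    return 1.0 if "unsafe" in piece.value.lower() else 0.0


async def score_message(pieces: List[PieceStub]) -> float:
    """Score every piece independently, then aggregate (here: worst case)."""
    scores = await asyncio.gather(*(score_piece(p) for p in pieces))
    return max(scores, default=0.0)
```

Scoring each piece separately means an image or file piece does not have to be forced through a text-only evaluator just because it arrived in the same message.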
|
|
||
| # Run attack (PyRIT handles all execution) | ||
| self.logger.info(f"Executing attacks for {self.risk_category}...") | ||
| await scenario.run_attack_async() |
This will return AttackResult objects with ASR, etc.