Description
Is your feature request related to a problem? Please describe.
Suppose I want to perform an evasion attack where only a subset of attributes can be mutated by the adversary; the remaining attributes cannot be modified. A related but separate question: how does one denote a group of attributes for binarized data, where only one category can be selected at a time (selecting multiple would render the data invalid)?
In this example, the values of (A, B) are immutable; (C, D, E) are the binarized values of a categorical attribute after preprocessing; the rest (F, ....) can be mutated freely by the adversary:
| A | B | C | D | E | F | .... | label |
|---|---|---|---|---|---|---|---|
| 3.5 | 4.2 | 1 | 0 | 0 | 1 | .... | 0 |
| 1.8 | -3 | 0 | 0 | 1 | 1 | .... | 0 |
How can I set up the attack so that these constraints are guaranteed to be preserved in the generated adversarial instances?
This question is about ART in general; I am looking for an existing (or future) way to achieve this behavior.
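For concreteness, here is a minimal sketch of the masking workaround I have in mind, assuming the column layout from the table above. Recent ART versions document a `mask` keyword for `generate()` on gradient-based evasion attacks such as `ProjectedGradientDescent`; features where the mask is zero are left untouched. The toy data and the scikit-learn logistic regression below are placeholders, and the mask only switches individual features on/off, so it cannot express the one-hot constraint on (C, D, E):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from art.estimators.classification import SklearnClassifier
from art.attacks.evasion import ProjectedGradientDescent

# Placeholder data with the column layout A, B, C, D, E, F, G, H (label y).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8)).astype(np.float32)
y = rng.integers(0, 2, size=100)

# Stand-in classifier; swap in your own fitted ART estimator.
clf = SklearnClassifier(model=LogisticRegression().fit(x, y))

# 1 = attackable feature, 0 = never perturbed.
mask = np.ones_like(x)
mask[:, 0:2] = 0  # columns A, B are immutable
# Columns C, D, E stay attackable here, but the mask alone cannot enforce
# "exactly one of C, D, E equals 1".

attack = ProjectedGradientDescent(estimator=clf, eps=0.3, eps_step=0.05, max_iter=40)
x_adv = attack.generate(x=x, mask=mask)
```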
Describe the solution you'd like
I would like to specify explicitly, as an attack parameter, the immutable/mutable attributes and similar firm constraints about relationships between attributes (if there is an existing way to achieve this behavior, please advise).
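Purely as an illustration of what such a parameter could look like (the names below are invented for this example and do not exist in ART today), something along these lines would cover both cases from the table:

```python
# Hypothetical constraint specification; `feature_constraints` and its keys
# are illustrative names only, not part of ART's API.
feature_constraints = {
    "immutable": [0, 1],      # columns A, B may never be modified
    "one_hot": [[2, 3, 4]],   # columns C, D, E: exactly one may be 1
}
```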
Describe alternatives you've considered
It is currently unclear to me whether the specific attacks support this kind of constrained scenario in theory (I will need to review the papers).
Assuming this can be done, the technical alternatives are to (A) run the attack first and then post-prune (or repair) the invalid examples, or (B) extend the toolkit to support this behavior. Simply removing the immutable attributes is not an option, because they are needed for training.
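A minimal sketch of option (A), continuing from the masking example above (`x`, `y`, `clf`, `x_adv`): the repair step restores the immutable columns and projects the one-hot group back to a valid encoding, then prunes examples that are no longer adversarial. The column indices and the `repair` helper are assumptions for illustration.

```python
import numpy as np

IMMUTABLE = [0, 1]          # columns A, B
ONE_HOT_GROUP = [2, 3, 4]   # columns C, D, E

def repair(x_clean: np.ndarray, x_adv: np.ndarray) -> np.ndarray:
    """Restore immutable columns and re-project the one-hot group."""
    x_fixed = x_adv.copy()
    x_fixed[:, IMMUTABLE] = x_clean[:, IMMUTABLE]
    group = x_fixed[:, ONE_HOT_GROUP]            # fancy indexing returns a copy
    winners = group.argmax(axis=1)               # keep the strongest category
    group[:] = 0.0
    group[np.arange(len(group)), winners] = 1.0
    x_fixed[:, ONE_HOT_GROUP] = group
    return x_fixed

x_repaired = repair(x, x_adv)
# Post-prune: keep only examples that are still misclassified after repair.
still_adv = clf.predict(x_repaired).argmax(axis=1) != y
x_adv_valid = x_repaired[still_adv]
```

The obvious drawback is that repairing after the fact can destroy the adversarial effect, which is why a built-in constrained attack (option B) seems preferable.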
This question may seem silly in a black-box setting, where the attacker is not supposed to know about the internals of the classifier. However, let's assume it is "common knowledge" that the data must adhere to some format that extends beyond the classifier, and that the attacker is aware of this. Then it is not unreasonable to assume the attacker wants to preserve these constraints.