This project implements a framework for generating and evaluating transferable adversarial attacks on the CIFAR-10 dataset. It focuses on training a U-Net based generator to produce perturbations that can fool a "black-box" victim model by leveraging distillation and advanced transferability techniques.
The pipeline consists of three main stages:
Knowledge is extracted from a pre-trained Victim Model (ResNet-50) and distilled into two Surrogate Models (ResNet-18 and VGG-16). This is achieved using soft distillation (KL Divergence loss) on a subset of the CIFAR-10 training data. These surrogates act as white-box proxies for the generator to attack.
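The soft-distillation objective described above can be sketched as a temperature-scaled KL divergence between teacher and student output distributions. This is a minimal sketch, not the code from `distill.py`; the temperature value is an assumed hyperparameter.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Soft distillation: KL divergence between softened teacher and student
    distributions. `temperature=4.0` is an assumed value, not from distill.py."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_student = F.log_softmax(student_logits / temperature, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
```

In a training loop this loss would be minimized on surrogate (student) logits while the victim (teacher) runs in `eval` mode with gradients disabled.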
A U-Net Generator is trained to produce adversarial perturbations. To ensure these perturbations generalize to the unseen victim model, the training incorporates:
- Input Diversity: Randomly resizing and padding inputs (DIM) to prevent overfitting to specific pixel geometries.
- Ghost Networks: Maintaining dropout layers in "train" mode during the backward pass to simulate an ensemble of network architectures.
- C&W Margin Loss: Optimizing the margin between the correct class and the most likely wrong class.
- Norm Regularization: An MSE loss on the perturbation to keep it visually subtle.
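Two of the components above, Input Diversity (DIM) and the C&W margin loss, can be sketched as follows. This is an illustrative implementation under assumed defaults (32x32 CIFAR-10 inputs, resize lower bound 24, transform probability 0.5, margin `kappa=0`), not the exact code in `train_generator.py`.

```python
import torch
import torch.nn.functional as F

def input_diversity(x, low=24, high=32, prob=0.5):
    """DIM: with probability `prob`, randomly downscale the batch and pad it
    back to `high` x `high`. Sizes assume 32x32 CIFAR-10 images."""
    if torch.rand(1).item() > prob:
        return x
    size = torch.randint(low, high, (1,)).item()
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = high - size
    left = torch.randint(0, pad + 1, (1,)).item()
    top = torch.randint(0, pad + 1, (1,)).item()
    # F.pad order for 4D tensors: (left, right, top, bottom)
    return F.pad(resized, (left, pad - left, top, pad - top), value=0.0)

def cw_margin_loss(logits, labels, kappa=0.0):
    """C&W margin: minimize the gap between the true-class logit and the
    highest wrong-class logit, clamped at -kappa."""
    true_logit = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    # Exclude the true class before taking the max over wrong classes
    masked.scatter_(1, labels.unsqueeze(1), float("-inf"))
    wrong_logit = masked.max(dim=1).values
    return torch.clamp(true_logit - wrong_logit, min=-kappa).mean()
```

Driving `cw_margin_loss` toward `-kappa` pushes the surrogate's prediction off the correct class, while DIM randomizes pixel geometry each step so the generator cannot overfit to one input layout.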
The generator's performance is compared against a PGD (Projected Gradient Descent) baseline. The evaluation measures:
- Attack Success Rate (ASR): The percentage of images misclassified by the victim after perturbation.
- Inference Speed: The time required to generate an adversarial batch (Single forward pass vs. iterative optimization).
- Visual Fidelity: Comparing clean, Generator-produced, and PGD-produced images.
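A minimal sketch of the ASR metric listed above. Note the assumption that ASR is computed only over images the victim classified correctly when clean; `attack_eval.py` may instead count misclassifications over the whole batch.

```python
import torch

@torch.no_grad()
def attack_success_rate(victim, clean_x, adv_x, labels):
    """Fraction of initially-correct images that the perturbation flips.
    Restricting to initially-correct images is an assumed convention."""
    clean_pred = victim(clean_x).argmax(dim=1)
    adv_pred = victim(adv_x).argmax(dim=1)
    correct = clean_pred == labels            # victim was right on the clean image
    fooled = correct & (adv_pred != labels)   # ...and the attack changed its answer
    return fooled.sum().item() / max(correct.sum().item(), 1)
```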
- `models.py`: Definitions for the Victim, Surrogates, and U-Net Generator.
- `distill.py`: Script for training the victim and distilling knowledge into surrogates.
- `train_generator.py`: Implementation of the generator training loop with transferability enhancements.
- `attack_eval.py`: Evaluation script providing metrics and visual comparisons.
- `victim.pth`, `generator_best.pth`: Model weights (not included in the repository).
The Generator provides a significant speedup over iterative attacks like PGD while maintaining a high success rate against the black-box victim.
| Attack Method | Success Rate | Avg. Time per Batch |
|---|---|---|
| PGD (10 Step) | 100% | 0.45s |
| Generator | 68% | 0.0014s |
Representative adversarial examples generated by this framework:
- Train/Distill Models: `python distill.py`
- Train Generator: `python train_generator.py`
- Evaluate: `python attack_eval.py`
