
Query-Free Adversarial Attacks on Image-to-Image Models

Python 3.9+ License: MIT Paper

This repository implements query-free adversarial attacks on image-to-image diffusion models by perturbing publicly available image encoders. We demonstrate that small perturbations ($\epsilon \geq 8/255$) that remain visually plausible can severely degrade generation quality across multiple architectures, and, surprisingly, that neither improved training data nor robust encoders provide an effective defense.


📄 Full Paper

📑 Query-free Attacks on Image-to-image Models are Hard to Avoid (PDF)

Abstract: We evaluate query-free adversarial attacks targeting image-to-image generative models across different encoder architectures, specifically Variational Autoencoders (VAE) and CLIP image encoders. We investigate the perturbation budget ($\epsilon$) required to effectively induce low-quality generations. We observe that neither improving data quality nor introducing robust image encoders improves generation quality under adversarial noise, leaving a gap for future work to investigate effective defenses.


🎯 Key Contributions

  1. Comprehensive $\epsilon$ Analysis: We provide qualitative and quantitative analysis showing that perturbations as small as $\epsilon = 8/255$ can produce highly dissimilar outputs while maintaining visual plausibility (PSNR > 30 dB).

  2. Cross-Architecture Evaluation: We test VAE-based encoders (InstructPix2Pix, InstructCLIP-Pix2Pix) and CLIP-based encoders (Kandinsky 2.2), revealing that both are vulnerable to query-free attacks.

  3. Defense Evaluation: We demonstrate that current defense mechanisms—including training on CLIP-filtered high-quality data and replacing encoders with adversarially robust versions (RobustCLIP)—fail to provide robustness, leaving an open problem for future research.

  4. Attack Methodology: We employ Auto-PGD (APGD), an adaptive first-order attack that eliminates manual hyperparameter tuning while achieving 6% better effectiveness than vanilla PGD at $\epsilon = 16/255$.


🧠 Evaluated Models

| Model | Architecture | Encoder Type | Paper |
|---|---|---|---|
| InstructPix2Pix | Stable Diffusion | VAE | Brooks et al., 2023 |
| InstructCLIP-Pix2Pix | Stable Diffusion + LoRA | VAE (CLIP-filtered data) | Chen et al., 2025 |
| Kandinsky 2.2 | Latent Diffusion | CLIP ViT-L/14 | Razzhigaev et al., 2023 |

🛡️ Evaluated Defense Mechanisms

| Defense Strategy | Description | Effectiveness | Reference |
|---|---|---|---|
| CLIP-Filtered Training Data | Train on a contrastively curated dataset (InstructCLIP-Pix2Pix) | No improvement under adversarial noise | Chen et al., 2025 |
| RobustCLIP Encoder | Replace Kandinsky's CLIP encoder with adversarially fine-tuned RobustCLIP | No difference observed | Schlarmann et al., 2024 |

Key Finding: Our experiments show that neither improved training data quality nor robust encoders provide effective defense against query-free attacks, highlighting an important open problem for future research.


🔬 Attack Methodology

Query-Free Optimization

Given an image encoder $f: \mathbb{R}^d \to \mathbb{R}^k$ (e.g., CLIP or VAE), we optimize:

$$ \max_{\|\boldsymbol\delta\|_{\infty} \leq \epsilon} \mathcal{D}\big(f(\mathbf{x}), f(\mathbf{x} + \boldsymbol\delta)\big), \quad \text{s.t.} \quad \mathbf{x} + \boldsymbol\delta \in [0,1]^d $$

where $\mathcal{D}$ is either:

  • Euclidean distance: $\mathcal{D}(\mathbf{z}_1, \mathbf{z}_2) = \|\mathbf{z}_1 - \mathbf{z}_2\|_2$ (effective for VAE encoders)
  • Cosine dissimilarity: $\mathcal{D}(\mathbf{z}_1, \mathbf{z}_2) = 1 - \frac{\mathbf{z}_1 \cdot \mathbf{z}_2}{\|\mathbf{z}_1\| \, \|\mathbf{z}_2\|}$ (effective for CLIP encoders)
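
As a concrete illustration, both objectives reduce to a few lines of PyTorch. This is a minimal sketch with our own function names, not the repository's API:

import torch
import torch.nn.functional as F

def encoder_distance(z_clean: torch.Tensor, z_adv: torch.Tensor) -> torch.Tensor:
    # Euclidean distance between encoder outputs (used for VAE latents)
    return torch.norm(z_clean.flatten(1) - z_adv.flatten(1), p=2, dim=1).mean()

def encoder_dissimilarity(z_clean: torch.Tensor, z_adv: torch.Tensor) -> torch.Tensor:
    # 1 - cosine similarity between embeddings (used for CLIP image embeddings)
    return (1.0 - F.cosine_similarity(z_clean.flatten(1), z_adv.flatten(1), dim=1)).mean()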

Auto-PGD (APGD) Update Rule

The attack iteratively updates the perturbation using:

$$ \boldsymbol\delta_{t+1} = \Pi_{\|\cdot\|_{\infty} \leq \epsilon}\Big(\boldsymbol\delta_t + \alpha_t \cdot \text{sign}\big(\nabla_{\boldsymbol\delta_t} \mathcal{D}(f(\mathbf{x}), f(\mathbf{x} + \boldsymbol\delta_t))\big)\Big) $$

where $\Pi$ is the projection operator and $\alpha_t$ is adaptively scheduled by APGD based on the loss landscape, eliminating manual step-size tuning.

Key Advantages over PGD:

  • Automatic step-size adaptation
  • Momentum accumulation for stability
  • Checkpoint rollback on overshooting
  • 6% relative improvement in attack effectiveness at $\epsilon = 16/255$
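
For intuition, the loop below sketches a sign-based encoder attack with a crude step-size-halving heuristic. It is a simplified stand-in, not the repository's APGD implementation, which additionally uses momentum accumulation and checkpoint rollback (see src/adversarial_i2i/attacks/apgd.py):

import torch

def simple_encoder_attack(encoder, x, loss_fn, eps=16 / 255, steps=100, alpha=0.1):
    # Keep the best perturbation found so far and halve the step size when the
    # loss stops improving (a rough stand-in for APGD's adaptive schedule).
    z_clean = encoder(x).detach()
    delta = torch.zeros_like(x)
    best_loss, best_delta = -float("inf"), delta.clone()

    for _ in range(steps):
        delta.requires_grad_(True)
        loss = loss_fn(z_clean, encoder(x + delta))
        (grad,) = torch.autograd.grad(loss, delta)

        with torch.no_grad():
            if loss.item() > best_loss:
                best_loss, best_delta = loss.item(), delta.detach().clone()
            else:
                alpha *= 0.5                         # shrink the step when progress stalls
            delta = delta + alpha * grad.sign()      # gradient ascent on the encoder distance
            delta = delta.clamp(-eps, eps)           # project onto the L_inf ball
            delta = (x + delta).clamp(0, 1) - x      # keep the adversarial image in [0, 1]

    return (x + best_delta).clamp(0, 1)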

📦 Installation

We use uv for fast, reproducible dependency management.

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/MarioRicoIbanez/AdversarialML-I2I.git
cd AdversarialML-I2I

# Create virtual environment and install dependencies
uv sync

# Activate the environment
source .venv/bin/activate  # Linux/Mac
# OR
.venv\Scripts\activate  # Windows

Manual Installation (pip):

pip install -e .

🚀 Quick Start

Test All Models

Run a quick sanity check on all supported models:

CUDA_VISIBLE_DEVICES=0 uv run python test_all_models.py

This will:

  • Load each model (InstructPix2Pix, InstructCLIP, Kandinsky)
  • Run a 3-step APGD attack with $\epsilon = 16/255$
  • Generate adversarial outputs
  • Report success/failure for each model

Visual Experiment: Multi-$\epsilon$ Analysis

Generate side-by-side comparisons across multiple perturbation budgets:

CUDA_VISIBLE_DEVICES=0 uv run python visual_experiment.py

Output Structure:

experiments_output/visual_test/
├── pix2pix_Distance_eps8_sample0.png       # Grid: Original | Adversarial | Generated
├── pix2pix_Distance_eps16_sample0.png
├── pix2pix_Distance_eps32_sample0.png
├── pix2pix_Distance_eps64_sample0.png
├── kandinsky_Similarity_eps8_sample0.png
├── ...
├── individual/                              # Individual images per attack
│   ├── pix2pix_Distance_eps16_sample0/
│   │   ├── 1_original.png
│   │   ├── 2_adversarial.png
│   │   └── 3_generated.png
└── summary.txt                              # Quantitative results (CLIP similarity, L2 distance)

Configuration:

  • Models: pix2pix, pix2pix-lora, kandinsky
  • Loss Functions: Distance (Euclidean), Similarity (Cosine)
  • $\epsilon$ Values: [8, 16, 32, 64]/255
  • Attack Parameters: 10 iterations, $\alpha = 0.1$
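
If you prefer to drive the sweep yourself, the sketch below loops over the same $\epsilon$ values using the apgd_attack signature documented in the Advanced Usage section; pil_image stands in for your own input image:

from torchvision.transforms.functional import to_pil_image
from src.adversarial_i2i.models import load_model
from src.adversarial_i2i.attacks import apgd_attack

model = load_model("pix2pix")
image_tensor = model.preprocess(pil_image)        # pil_image: your own PIL input

for pixel_change in (8, 16, 32, 64):              # epsilon = pixel_change / 255
    adversarial = apgd_attack(
        encoder=model,
        image=image_tensor,
        batch_size=1,
        pixel_change=pixel_change,
        epochs=10,                                # 10 iterations, as configured above
        alpha=0.1,
        loss_type="Distance",
    )
    to_pil_image(adversarial[0]).save(f"adversarial_eps{pixel_change}.png")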

📊 Key Results

Effect of Perturbation Budget $\epsilon$

| $\epsilon$ | PSNR (dB) | CLIP Sim (Orig→Gen) | CLIP Sim (Prompt→Gen) | Attack Visibility |
|---|---|---|---|---|
| 1/255 | 50.8 | 0.883 ± 0.101 | 0.248 ± 0.036 | Imperceptible |
| 8/255 | 34.0 | 0.812 ± 0.113 | 0.256 ± 0.034 | Effective threshold |
| 16/255 | 27.9 | 0.744 ± 0.119 | 0.263 ± 0.030 | Slight artifacts |
| 32/255 | 22.0 | 0.677 ± 0.116 | 0.266 ± 0.028 | Visible distortion |
| 64/255 | 16.8 | 0.631 ± 0.109 | 0.267 ± 0.027 | Severe corruption |

Insight: $\epsilon \in [8/255, 16/255]$ represents the sweet spot—strong semantic drift while maintaining visual plausibility (PSNR > 27 dB).

Defense Mechanisms: Do They Work?

| Defense Strategy | Model | CLIP Sim @ $\epsilon = 16/255$ | Effective? |
|---|---|---|---|
| Baseline | InstructPix2Pix | 0.744 ± 0.119 | |
| High-Quality Data | InstructCLIP-Pix2Pix | 0.743 ± 0.110 | ❌ No improvement |
| Robust Encoder | Kandinsky + RobustCLIP | 0.708 ± 0.105 | ❌ No improvement |

Conclusion: Current defense mechanisms provide no measurable robustness against query-free encoder attacks, highlighting an important open research problem.


📂 Repository Structure

AdversarialML-I2I/
├── src/adversarial_i2i/          # Core attack library
│   ├── attacks/
│   │   ├── apgd.py               # Auto-PGD implementation
│   │   └── pgd.py                # Vanilla PGD baseline
│   ├── models/
│   │   └── wrappers.py           # Model encoder wrappers (VAE, CLIP, etc.)
│   ├── evaluation/
│   │   └── metrics.py            # CLIP similarity, PSNR, etc.
│   └── utils/
│       ├── data.py               # Dataset loading utilities
│       └── image.py              # Image preprocessing/postprocessing
├── test_all_models.py            # Sanity check script
├── visual_experiment.py          # Multi-epsilon visual analysis
├── assets/
│   └── 2025_Rico_AttacksI2I.pdf       # Full paper
├── pyproject.toml                # Project metadata + dependencies
├── uv.lock                       # Dependency lock file
└── README.md                     # This file

🔧 Advanced Usage

Custom Attack Configuration

from PIL import Image

from src.adversarial_i2i.models import load_model
from src.adversarial_i2i.attacks import apgd_attack
from torchvision.transforms.functional import to_pil_image

# Load model
model = load_model("pix2pix")

# Load and preprocess an input image (replace with your own path)
pil_image = Image.open("input.png").convert("RGB")
image_tensor = model.preprocess(pil_image)  # Shape: [1, 3, H, W]

# Run APGD attack
adversarial = apgd_attack(
    encoder=model,
    image=image_tensor,
    batch_size=1,
    pixel_change=16,        # epsilon = 16/255
    epochs=100,             # Attack iterations
    alpha=0.1,              # Initial step size (auto-adapted)
    loss_type="Distance",   # "Distance" (L2) or "Similarity" (cosine)
    verbose=True
)

# Generate with adversarial input
adversarial_pil = to_pil_image(adversarial[0])
output = model.pipe(
    prompt=["Turn it into a photo"],
    image=[adversarial_pil],
    num_inference_steps=50,
    image_guidance_scale=1.5,
    guidance_scale=7.5
).images[0]

📈 Evaluation Metrics

We evaluate attacks using multiple complementary metrics:

| Metric | Description | Interpretation |
|---|---|---|
| CLIP Similarity (Orig→Gen) | Cosine similarity between original and generated image embeddings | Lower = stronger attack |
| CLIP Similarity (Prompt→Gen) | Alignment between the text prompt and the generated output | Should remain high (instruction following) |
| CLIP Similarity (Orig→Adv) | Perceptual similarity between the original and adversarial image | Higher = stealthier attack |
| PSNR (dB) | Peak Signal-to-Noise Ratio between original and adversarial image | Higher = less visible distortion |
| L2 Distance | Euclidean distance in latent space | Measures encoder displacement |
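
For reference, the pixel-level and embedding-level metrics reduce to the following computations. This sketch assumes the embeddings have already been extracted (e.g., with a CLIP image encoder) and uses our own function names, not those in evaluation/metrics.py:

import torch
import torch.nn.functional as F

def psnr(original: torch.Tensor, adversarial: torch.Tensor) -> float:
    # PSNR in dB for images with pixel values in [0, 1] (so the peak value is 1.0)
    mse = F.mse_loss(adversarial, original)
    return float(10.0 * torch.log10(1.0 / mse))

def clip_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> float:
    # Mean cosine similarity between two batches of embeddings of shape [batch, dim]
    return float(F.cosine_similarity(emb_a.flatten(1), emb_b.flatten(1), dim=1).mean())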

🤝 Acknowledgements

This work was conducted at the Laboratory for Information and Inference Systems (LIONS) at EPFL, Switzerland.

Authors:

  • Mario Rico Ibáñez – Master's student in Computer Science at EPFL (mario.ricoibanez@epfl.ch)
  • Elias Abad Rocamora – PhD student at LIONS, EPFL
  • Prof. Volkan Cevher – Director of LIONS Lab, EPFL

Laboratory: LIONS – Laboratory for Information and Inference Systems


📄 License

This project is licensed under the MIT License.

For academic use only. Commercial applications require explicit permission.


💬 Contact

For questions, issues, or collaboration inquiries, please contact Mario Rico Ibáñez (mario.ricoibanez@epfl.ch) or open an issue on this repository.


⚠️ Responsible Disclosure: This research is intended to improve the robustness of generative AI systems. Please use this code ethically and responsibly.
