Computing per sample gradients is an integral part of the Opacus framework. We strive to provide out-of-the-box support for
a wide range of models, while keeping computations efficient.

We currently provide three independent approaches for computing per sample gradients:

1. **Hooks-based `GradSampleModule`** (stable, wraps the model)
2. **`GradSampleController`** (stable, no model wrapping; recommended for transformers)
3. **`GradSampleModuleExpandedWeights`** (beta, based on PyTorch 1.12+ functionality)

Each implementation comes with its own set of limitations and benefits.

**TL;DR:**

- Use `GradSampleModule` (`grad_sample_mode="hooks"`) for a stable implementation with standard models
- Use `GradSampleController` via `PrivacyEngineGradSampleController` for transformer models and when you need direct model access without wrapping
- Use `GradSampleModuleExpandedWeights` (`grad_sample_mode="ew"`) if you want to experiment with better performance
- Use `grad_sample_mode="functorch"` if your model has unsupported layers

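For reference, here is a minimal sketch of selecting the mode through `PrivacyEngine.make_private()`; the toy model, optimizer, data loader, and the DP hyperparameter values are placeholders for illustration only:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from opacus import PrivacyEngine

# Toy model, optimizer and data used only to illustrate the call.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8
)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="hooks",  # or "ew" / "functorch"
)
```
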
Please report any strange errors or unexpected behaviour to us!

## GradSampleController approach (No Model Wrapping)
- Controller class: ``opacus.grad_sample.GradSampleController``
- Privacy Engine: ``opacus.privacy_engine_gsc.PrivacyEngineGradSampleController``
- Usage: use `PrivacyEngineGradSampleController` instead of `PrivacyEngine`

**Recommended for transformer models and when model wrapping causes issues.**

This approach computes per-sample gradients by attaching hooks directly to model parameters, without wrapping the model in a
`GradSampleModule`. As a result, it:

- ✅ Preserves the model type (e.g., `isinstance(model, BertModel)` remains `True`)
- ✅ Leaves no `_module.` prefix in the state_dict
- ✅ Gives direct access to model attributes (no attribute forwarding needed)
- ✅ Offers better compatibility with HuggingFace transformers and models with custom `__getattr__`
- ✅ Supports the same grad sampler methods as `GradSampleModule`

See [CONTROLLER_BASED_PRIVACY_ENGINE.md](../../docs/CONTROLLER_BASED_PRIVACY_ENGINE.md) for detailed documentation, and the usage sketch below.
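
A minimal usage sketch, assuming the controller-based engine exposes the same `make_private()` interface as `PrivacyEngine` (class and module path as listed above; the toy model, data, and hyperparameter values are placeholders):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Module path as listed above; assumed to mirror PrivacyEngine.make_private().
from opacus.privacy_engine_gsc import PrivacyEngineGradSampleController

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8
)

privacy_engine = PrivacyEngineGradSampleController()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
)

# The model is not wrapped: its type and state_dict keys stay unchanged.
assert isinstance(model, torch.nn.Linear)
assert all(not k.startswith("_module.") for k in model.state_dict())
```
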
## Hooks-based approach (Model Wrapping)
- Model wrapping class: ``opacus.grad_sample.grad_sample_module.GradSampleModule``
- Keyword argument for ``PrivacyEngine.make_private()``: `grad_sample_mode="hooks"`

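For illustration, the wrapping can also be done by hand; a minimal sketch (the toy model is a placeholder):

```python
import torch
from opacus.grad_sample import GradSampleModule

model = torch.nn.Linear(16, 2)       # placeholder model
gs_model = GradSampleModule(model)   # wraps the model and attaches capturing hooks

out = gs_model(torch.randn(8, 16))
out.pow(2).mean().backward()

# Each trainable parameter now holds per-sample gradients in .grad_sample,
# shaped (batch_size, *param.shape).
print(model.weight.grad_sample.shape)  # torch.Size([8, 2, 16])
```
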
[...] is roughly the same.

Please note that these are known limitations and we plan to improve Expanded Weights and bridge the gap in feature completeness.

| Feature                      | GradSampleModule (Hooks)        | GradSampleController | Expanded Weights     | Functorch            |
|:----------------------------:|:-------------------------------:|:--------------------:|:--------------------:|:--------------------:|
| Required PyTorch version     | 1.8+                            | 1.8+                 | 1.13+                | 1.12 (to be updated) |
| Development status           | Underlying mechanism deprecated | ✅ Stable            | Beta                 | Beta                 |
| Model wrapping               | Wraps model                     | ✅ No wrapping       | Wraps model          | Wraps model          |
| Runtime Performance†         | baseline                        | baseline             | ✅ ~25% faster       | 🟨 0-50% slower      |
| Transformer compatibility    | 🟨 May have issues              | ✅ Excellent         | 🟨 May have issues   | 🟨 May have issues   |
| State dict compatibility     | 🟨 `_module.` prefix            | ✅ Clean keys        | 🟨 `_module.` prefix | 🟨 `_module.` prefix |
| Type preservation            | ❌ Model wrapped                | ✅ Model unchanged   | ❌ Model wrapped     | ❌ Model wrapped     |
| Any DP-allowed†† layers      | Not supported                   | Not supported        | Not supported        | ✅ Supported         |
| Most popular nn.* layers     | ✅ Supported                    | ✅ Supported         | ✅ Supported         | ✅ Supported         |
| torchscripted models         | Not supported                   | Not supported        | ✅ Supported         | Not supported        |
| Client-provided grad sampler | ✅ Supported                    | ✅ Supported         | Not supported        | ✅ Not needed        |
| `batch_first=False`          | ✅ Supported                    | ✅ Supported         | Not supported        | ✅ Supported         |
| Recurrent networks           | ✅ Supported                    | ✅ Supported         | Not supported        | ✅ Supported         |
| Padding `same` in Conv       | ✅ Supported                    | ✅ Supported         | Not supported        | ✅ Supported         |
| Empty poisson batches        | ✅ Supported                    | ✅ Supported         | Not supported        | Not supported        |

† Note that performance differences are unstable and can vary a lot depending on the exact model and batch size.
The numbers above are averaged over benchmarks with small models consisting of convolutional and linear layers.
Also note that performance differences are only observed in GPU training; CPU performance seems to be almost identical
for all approaches.

†† Layers that produce joint computations on batch samples (e.g. BatchNorm) are not allowed under any approach
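
In practice, such layers are usually replaced before training. A minimal sketch using Opacus's module validator (the toy model below is a placeholder):

```python
import torch.nn as nn
from opacus.validators import ModuleValidator

# Placeholder model containing a layer that mixes information across samples.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

print(ModuleValidator.validate(model, strict=False))  # reports the BatchNorm issue
model = ModuleValidator.fix(model)                    # swaps BatchNorm for GroupNorm
assert ModuleValidator.is_valid(model)
```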