
Commit 6fae8fd

Merge pull request #2 from ClashLuke/functional
Functional
2 parents 0a14e38 + 508906e commit 6fae8fd

File tree

7 files changed (+1774 / -275 lines)


README.md

Lines changed: 86 additions & 39 deletions
@@ -12,11 +12,81 @@ python3 -m pip install truegrad
 
 ## Examples
 
+TrueGrad supports various backends, each with their own tradeoffs:
+
+| Name | Advantages | Disadvantages |
+|------|------------|---------------|
+| [truegrad.nn](#nn) | * What you see is what you get - Modules not in truegrad.nn and truegrad.nn.functional are not supported<br/>* Custom forward/backward for some fused functions<br/>* Optimized backward passes | * Limited applicability - custom modules can't be used<br/>* Requires code modification |
+| [truegrad.utils.patch_torch](#patch-torch) | * Uses truegrad.nn under the hood<br/>* Works for many (off-the-shelf!) torch models<br/>* No code modification necessary | * Uncertainty if model is compatible |
+| [backpack](#backpack) | * Highest stability<br/>* Loud warnings and errors<br/>* Battle-tested<br/>* Simple to extend further | * High memory usage<br/>* High compute usage<br/>* Sparse support for torch operations |
+| [truegrad.utils.patch_model](#patch-custom-models) | * Best compatibility | * Fails silently on fused functions<br/>* More costly than truegrad.nn |
+
+Below, you'll find examples for each of these backends, as well as a [general strategy](#partial-truegrad) allowing
+partial application of TrueGrad.
+
+### nn
+
+The preferred method of using TrueGrad is by replacing `torch.nn` with performant `truegrad.nn` modules. While other
+methods add compute and memory overheads, `truegrad.nn` and `truegrad.nn.functional` have hand-crafted gradients. This
+is the most powerful method, although it requires code modifications.
+
+```PYTHON
+import torch
+from truegrad import nn
+from truegrad.optim import TGAdamW
+
+# define model by mixing truegrad.nn and torch.nn
+model = torch.nn.Sequential(nn.Linear(1, 10),
+                            nn.LayerNorm([1, 10]),
+                            torch.nn.ReLU(),
+                            nn.Linear(10, 1))
+optim = TGAdamW(model.parameters()) # truegrad.optim.TGAdamW instead of torch.optim.AdamW
+
+# standard training loop
+while True:
+    input = torch.randn((16, 1))
+    model(input).mean().backward()
+    optim.step()
+```
+
+### Patch Torch
+
+In some cases, you can't modify the model's source. For example, when importing models from `torchvision`. If that's the
+case, or if you simply want to try out TrueGrad, you can use `truegrad.utils.patch_torch()`, to
+replace `torch.nn.Module`'s with `truegrad.nn.Module`'s where possible. For example, the code below can be used to train
+a ResNet-18:
+
+```PYTHON
+import torch
+from torchvision.models import resnet18
+
+from truegrad.optim import TGAdamW
+from truegrad.utils import patch_torch
+
+patch_torch() # call before model creation, otherwise complete freedom
+model = resnet18().cuda()
+optim = TGAdamW(model.parameters(), lr=1e-7, weight_decay=0)
+
+# constant input/output to overfit
+inp = torch.randn((2, 3, 224, 224)).cuda()
+tgt = torch.randint(0, 1000, (2,)).cuda()
+
+# standard training loop
+i = 0
+while True:
+    loss = torch.nn.functional.cross_entropy(model(inp), tgt)
+    loss.backward()
+    optim.step()
+    i += 1
+    if i % 5 == 0:
+        print(i, loss.item())
+```
+
 ### BackPack
 
-The preferred method to integrate TrueGrad is using [BackPack](https://github.com/f-dangel/backpack). BackPack is a
-third-party library that automatically computes the sum of gradient squares and works for most models by implementing
-custom backward rules for many `torch.nn.Module`'s.
+The most stable although also memory hungry method to compute TrueGrad statistics is to use
+[BackPack](https://github.com/f-dangel/backpack). BackPack is a third-party library that automatically computes the sum
+of gradient squares and works for most models by implementing custom backward rules for many `torch.nn.Module`'s.
 
 ```PYTHON
 import backpack
@@ -25,10 +95,10 @@ from torch.nn import CrossEntropyLoss
 from truegrad.optim import TGAdamW
 from torchvision.models import alexnet
 
-model = alexnet()
+model = alexnet() # BatchNorm and in-place ops (like ResNet's residual path) aren't supported
 optim = TGAdamW(model.parameters(), lr=1e-7, weight_decay=0)
 
-# backpack can't handle inplace ops like nn.ReLU(inplace=True) and `x += y`
+# replace inplace ops like nn.ReLU(inplace=True) where possible
 for mod in model.modules():
     if hasattr(mod, "inplace"):
         mod.inplace = False
@@ -62,12 +132,13 @@ your model has any layer called `.output` or you're using PyTorch >= 1.13, you w
 ### Patch Custom Models
 
 Another option to integrate TrueGrad into existing models is to patch them using `truegrad.utils.patch_model()`.
-`patch_model()` will go through all`torch.nn.Module`'s in PyTorch model and convert their `torch.nn.Parameter`'s to
+`patch_model()` will go through all `torch.nn.Module`'s in PyTorch model and convert their `torch.nn.Parameter`'s to
 `truegrad.nn.TrueGradParameter`'s. A `TrueGradParameter` acts largely the same as a `torch.nn.Parameter`, but adds
-required operations into the model's backward pass.\
+required operations into the model's backward pass. Note that this doesn't give the most effective computation graph,
+but works well for many custom models.\
 Importantly, be aware that this does not work for fused functions, such as `torch.nn.LayerNorm`
-and `torch.nn.MultiheadAttention`. However, unfused functions which directly access a parameter, such as multiplication
-and work well. Therefore, torch.nn.Linear and HuggingFace's attention work as expected.
+and `torch.nn.MultiheadAttention`. However, unfused functions which directly access a parameter, such as multiplication,
+work well. Therefore, torch.nn.Linear and HuggingFace's attention work as expected.
 
 ```PYTHON
 import transformers
@@ -87,35 +158,6 @@ for sample in ["Hello", "World", "!"]:
     optim.step()
 ```
 
-### nn
-
-Patching existing PyTorch computation graphs on the fly might add unnecessary memory and computation or even fail
-unexpectedly. That's why a pre-patched alternative of `torch.nn` with hand-crafted gradients exists alongside the
-`truegrad.utils` module. Compared to `truegrad.utils.patch_model()`, `truegrad.nn` offers higher speeds and lower
-memory usage, although it might require code alterations and doesn't support all models. You cannot (currently) use
-`truegrad.nn` with `truegrad.utils`, as both use different ways to arrive at the same value. However, you can
-combine `torch.nn.Modules` and `truegrad.nn.Modules` and use the truegrad information only where it is available (
-see [Partial TrueGrad](#Partial-TrueGrad)).
-
-```PYTHON
-import torch
-from truegrad import nn
-from truegrad.optim import TGAdamW
-
-# define model by mixing truegrad.nn and torch.nn
-model = torch.nn.Sequential(nn.Linear(1, 10),
-                            nn.LayerNorm([1, 10]),
-                            torch.nn.ReLU(),
-                            nn.Linear(10, 1))
-optim = TGAdamW(model.parameters()) # truegrad.optim.TGAdamW instead of torch.optim.AdamW
-
-# standard training loop
-while True:
-    input = torch.randn((16, 1))
-    model(input).mean().backward()
-    optim.step()
-```
-
 ### Partial TrueGrad
 
 Unfortunately, it's not always sensible to apply TrueGrad, as some backward passes are too slow, and sometimes it's
@@ -138,8 +180,13 @@ model = torch.nn.Sequential(nn.Linear(1, 10), # Weights coming from truegrad.nn
 optim = TGAdamW(model.parameters(), default_to_adam=True)
 
 # standard training loop
+i = 0
 while True:
     input = torch.randn((16, 1))
-    model(input).mean().backward()
+    loss = model(input).mean()
+    loss.backward()
     optim.step()
+    i += 1
+    if i % 5 == 0:
+        print(i, loss.item())
 ```
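
A note on the table's `patch_torch` caveat ("Uncertainty if model is compatible"): one quick way to probe coverage is to run a single forward/backward pass and count which parameters picked up TrueGrad statistics. The sketch below is illustrative rather than part of this commit; it assumes, as the `truegrad/functional.py` changes further down suggest, that supported weights expose a `sum_grad_squared` attribute after `backward()`.

```PYTHON
# Illustrative compatibility probe (not part of this commit).
# Assumption: parameters handled by TrueGrad expose `sum_grad_squared` after backward().
import torch
from torchvision.models import resnet18

from truegrad.utils import patch_torch

patch_torch()  # must run before the model is constructed
model = resnet18()

model(torch.randn((2, 3, 224, 224))).mean().backward()

tracked = [name for name, p in model.named_parameters()
           if getattr(p, "sum_grad_squared", None) is not None]
total = sum(1 for _ in model.named_parameters())
print(f"{len(tracked)} of {total} parameters carry TrueGrad statistics")
```

Parameters that never receive the statistic can still be trained by passing `default_to_adam=True` to `TGAdamW`, as in the Partial TrueGrad example above.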

setup.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
     name='truegrad',
     license='BSD',
     description='PyTorch interface for TrueGrad-AdamW',
-    version='0.1.0',
+    version='1.0.0',
     long_description=README,
     url='https://github.com/clashluke/truegrad',
     packages=setuptools.find_packages(),

truegrad/functional.py

Lines changed: 110 additions & 19 deletions
@@ -1,20 +1,61 @@
-from typing import List, Tuple
+from typing import Any, Callable, List, Tuple
 
 import torch
+from torch.utils._pytree import tree_map
+
+
+def _unpack(x: Any) -> Any:
+    if isinstance(x, TrueGradTensor):
+        return x.data
+    return x
+
+
+_base_torch_function = torch.Tensor.__torch_function__
+
+
+class TrueGradTensor(torch.Tensor):
+    sum_grad_squared: torch.Tensor
+    data: torch.Tensor
+    requires_grad: bool
+
+    __slots__ = ['sum_grad_squared', "data", "requires_grad"]
+
+    @staticmethod
+    def __new__(cls, data: torch.Tensor):
+        meta = data.new_empty((0,))
+        meta.set_(meta.storage(), 0, data.size(), data.stride())
+        r = torch.Tensor._make_subclass(cls, meta, data.requires_grad)
+        r.data = data
+        r.sum_grad_squared = None
+        r.activated = False
+        r.requires_grad = data.requires_grad
+        return r
+
+    def __repr__(self):
+        return f"TrueGradTensor({self.data})"
+
+    @classmethod
+    def __torch_function__(cls, func, types, args=(), kwargs=None):
+        if kwargs is None:
+            kwargs = {}
+        out = _base_torch_function(func, [], tree_map(_unpack, args), tree_map(_unpack, kwargs))
+        return out
 
 
 class MulFn(torch.autograd.Function):
     @staticmethod
     def forward(ctx, inp: torch.Tensor, weight: torch.Tensor):
         if weight.requires_grad:
-            ctx.save_for_backward(inp, weight)
+            ctx.save_for_backward(inp)
+            ctx.weight = weight
         return inp * weight
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor):
         if not ctx.saved_tensors:
             return None, None
-        inp, weight = ctx.saved_tensors
+        inp, = ctx.saved_tensors
+        weight = ctx.weight
         diff = inp.ndim - weight.ndim
         summed = list(range(diff)) + [i for i, dim in enumerate(weight.shape, diff) if dim == 1]
         weight_grad = dy * inp
@@ -33,14 +74,15 @@ def forward(ctx, inp: torch.Tensor, weight: torch.Tensor):
             diff = inp.ndim - weight.ndim
             ctx.summed = list(range(diff)) + [i for i, dim in enumerate(weight.shape, diff) if dim == 1]
             ctx.batch_size = inp.size(0)
-            ctx.save_for_backward(weight)
+            ctx.weight = weight
+
         return inp + weight
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor):
-        if not ctx.saved_tensors:
+        if not hasattr(ctx, "weight"):
             return None, None
-        weight, = ctx.saved_tensors
+        weight = ctx.weight
         weight_grad = dy
         weight.sum_grad_squared = dy.square()
         if ctx.summed:
@@ -54,15 +96,17 @@ class EinsumFn(torch.autograd.Function):
     @staticmethod
     def forward(ctx, spec: str, inp: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
         if weight.requires_grad:
-            ctx.save_for_backward(inp, weight)
+            ctx.save_for_backward(inp)
+            ctx.weight = weight
         ctx.spec = spec
         return torch.einsum(spec, inp, weight)
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor) -> Tuple[None, torch.Tensor, torch.Tensor]:
         if not ctx.saved_tensors:
             return None, None, None
-        inp, wgt = ctx.saved_tensors
+        inp, = ctx.saved_tensors
+        wgt = ctx.weight
         inputs, output = ctx.spec.split('->')
         lhs, rhs = inputs.split(',')
 
@@ -76,14 +120,16 @@ class GatherFn(torch.autograd.Function):
     @staticmethod
     def forward(ctx, inp: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
         if weight.requires_grad:
-            ctx.save_for_backward(inp, weight)
+            ctx.save_for_backward(inp)
+            ctx.weight = weight
         return torch.gather(weight, 0, inp)
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor) -> Tuple[None, torch.Tensor]:
         if not ctx.saved_tensors:
             return None, None
-        inp, wgt = ctx.saved_tensors
+        inp, = ctx.saved_tensors
+        wgt = ctx.weight
         wgt_grad = torch.zeros_like(wgt)
         wgt.sum_grad_squared = wgt_grad.scatter_add(0, inp, dy.square())
         wgt_grad.scatter_add_(0, inp, dy)
@@ -93,45 +139,90 @@ def backward(ctx, dy: torch.Tensor) -> Tuple[None, torch.Tensor]:
 class ReshapeFn(torch.autograd.Function):
     @staticmethod
     def forward(ctx, weight: torch.Tensor, new_shape: List[int]) -> torch.Tensor:
+        out = TrueGradTensor(weight.reshape(new_shape).detach().requires_grad_(True))
         if weight.requires_grad:
             ctx.save_for_backward(weight)
+            ctx.out = out
             ctx.original_shape = weight.size()
-        return weight.reshape(new_shape)
+        return out
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor) -> Tuple[None, torch.Tensor]:
         if not ctx.saved_tensors:
-            return None
+            return None, None
         wgt, = ctx.saved_tensors
-        if hasattr(wgt, "sum_grad_squared"):
-            wgt.sum_grad_squared = wgt.sum_grad_squared.reshape(ctx.original_shape)
-        return dy.reshape(ctx.original_shape)
+        if ctx.out.sum_grad_squared is not None:
+            wgt.sum_grad_squared = ctx.out.sum_grad_squared.reshape(ctx.original_shape)
+        return dy.reshape(ctx.original_shape), None
 
 
 class ExpandFn(torch.autograd.Function):
     @staticmethod
     def forward(ctx, weight: torch.Tensor, new_shape: List[int]) -> torch.Tensor:
+        out = TrueGradTensor(weight.expand(new_shape))
         if weight.requires_grad:
             ctx.save_for_backward(weight)
+            ctx.out = out
             ctx.summed = [i for i, d in enumerate(new_shape) if d != -1]
-        return weight.reshape(new_shape)
+        return out
 
     @staticmethod
     def backward(ctx, dy: torch.Tensor) -> Tuple[None, torch.Tensor]:
         if not ctx.saved_tensors:
-            return None
+            return None, None
         wgt, = ctx.saved_tensors
-        if hasattr(wgt, "sum_grad_squared") and ctx.summed:
-            wgt.sum_grad_squared = wgt.sum_grad_squared.sum(ctx.summed)
+        if ctx.out.sum_grad_squared is not None and ctx.summed:
+            wgt.sum_grad_squared = ctx.out.sum_grad_squared.sum(ctx.summed)
         return dy.sum(ctx.summed)
 
 
+class WrapFn(torch.autograd.Function):
+    @staticmethod
+    def forward(ctx, fn, args, kwargs) -> torch.Tensor:
+        ctx.fn = fn
+        ctx.args = args
+        ctx.kwargs = kwargs
+        return fn(*args, **kwargs)
+
+    @staticmethod
+    def backward(ctx, dy: torch.Tensor) -> Tuple[None, None, None, None]:
+        def _backward(fn: Callable[[torch.Tensor], torch.Tensor], attr: str):
+            def _fn(x: torch.Tensor):
+                if isinstance(x, torch.nn.Parameter):
+                    x = x.data
+                if not isinstance(x, torch.Tensor) or not torch.is_floating_point(x):
+                    return x
+                x = fn(x.detach())
+                x.requires_grad_(True)
+                return x
+
+            args = tree_map(_fn, ctx.args)
+            kwargs = tree_map(_fn, ctx.kwargs)
+
+            with torch.enable_grad():
+                out = ctx.fn(args, kwargs)
+                torch.autograd.backward(out, tree_map(_fn, dy))
+
+            for p, a in zip(list(ctx.args) + list(ctx.kwargs.values()), list(args) + list(kwargs.values())):
+                if not isinstance(p, torch.nn.Parameter):
+                    continue
+                if hasattr(p, attr) and getattr(p, attr) is not None:
+                    a.grad = getattr(p, attr) + a.grad
+                setattr(p, attr, a.grad)
+
+        _backward(torch.square, "sum_grad_squared")
+        _backward(lambda x: x, "grad")
+
+        return None, None, None, None
+
+
 mul = MulFn.apply
 add = AddFn.apply
 einsum = EinsumFn.apply
 gather = GatherFn.apply
 reshape = ReshapeFn.apply
 expand = ExpandFn.apply
+wrap = WrapFn.apply
 
 
 def matmul(inp: torch.Tensor, wgt: torch.Tensor):
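
The recurring `ctx.weight` / `sum_grad_squared` pattern in this file is the contract the rest of the library builds on: each autograd Function stashes a sum-of-squared-gradients statistic on the weight it touched, which is presumably what `TGAdamW` consumes. A minimal sketch of that contract through `truegrad.functional.mul` (i.e. `MulFn.apply`) is shown below; it is illustrative only and assumes the truncated part of `MulFn.backward` populates `sum_grad_squared` the same way `AddFn` and `GatherFn` visibly do.

```PYTHON
# Illustrative sketch (not part of this commit): after backward(), the weight passed to
# truegrad.functional.mul should carry a `sum_grad_squared` statistic next to `.grad`.
import torch
from truegrad.functional import mul

weight = torch.randn(3, requires_grad=True)
inp = torch.randn(8, 3)  # batch of 8 rows, broadcast against the 3-element weight

out = mul(inp, weight)   # MulFn.apply: saves inp, stashes the weight on ctx
out.sum().backward()

print(weight.grad.shape)  # ordinary gradient, shape (3,)
sgs = getattr(weight, "sum_grad_squared", None)
print(None if sgs is None else sgs.shape)  # expected: batch-summed squared gradient, shape (3,)
```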