-
How would you suggest computing the per-example gradient on models that contain BatchNorm? The documentation suggests applying … For example, in the resnet50 example, I would like to replace the return value of …
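For readers following along, here is a minimal sketch of the standard per-sample-gradient recipe this question is presumably referring to (assuming the torch.func API; the toy model, loss, and data below are my own placeholders, not taken from the resnet50 example). With BatchNorm layers in training mode this recipe runs into trouble, because each example's output depends on the statistics of the whole batch:

```python
import torch
from torch import nn
from torch.func import functional_call, grad, vmap

# Hypothetical model and data, only to illustrate the recipe being discussed.
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def compute_loss(params, buffers, sample, target):
    # Treat the single sample as a batch of size 1.
    batch = sample.unsqueeze(0)
    targets = target.unsqueeze(0)
    preds = functional_call(model, (params, buffers), (batch,))
    return nn.functional.cross_entropy(preds, targets)

# One gradient per example: vmap over the batch dimension of (data, targets).
per_sample_grads = vmap(grad(compute_loss), in_dims=(None, None, 0, 0))

data, targets = torch.randn(16, 10), torch.randint(0, 2, (16,))
grads = per_sample_grads(params, buffers, data, targets)
```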
-
Can you write out, mathematically and precisely, what the quantity you'd like to compute is? I think that what you're asking for might not exist. I believe the subtle point with batch norm is that the inputs in a batch are all combined, such that the gradient of the loss with respect to the parameters, given a vector of inputs, … You can construct a loss function which would take a single input …
This would only help in the cases where you're trying to compute a gradient with respect to an individual input.
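To make the "inputs in a batch are all combined" point concrete, here is a small sketch (my own illustrative setup, not code from the thread): with a BatchNorm layer in training mode, the gradient of example 0's loss with respect to the parameters changes when the other examples in the batch change, so a batch-independent "per-example gradient" is not well defined.

```python
import torch
from torch import nn

torch.manual_seed(0)
# Tiny hypothetical model containing BatchNorm, used only to illustrate the point.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Linear(4, 1))

def grad_of_example_0(batch):
    model.zero_grad()
    out = model(batch)        # BatchNorm normalizes with statistics of the whole batch
    out[0].sum().backward()   # "loss" restricted to example 0 only
    return model[0].weight.grad.clone()

x = torch.randn(8, 4)
g1 = grad_of_example_0(x)
# Same example 0, different companions in the batch:
g2 = grad_of_example_0(torch.cat([x[:1], torch.randn(7, 4)]))
print(torch.allclose(g1, g2))  # False: example 0's gradient depends on the rest of the batch
```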
-
Great, your example makes things clearer.
In your example, $A$ represents the parameters of the network, $x$ represents a batch of inputs, and $f(x)$ represents a batch of outputs. So $f(x)$ is the function mapping an input batch to an output batch. (I think it makes sense to assume $m = n$ here.)
Essentially, the quantity you want to compute is the diagonal of the Jacobian matrix. In the general case, computing a Jacobian will take $\mathcal{O}(n)$ Jacobian-vector or vector-Jacobian products, each of which is roughly the same order of cost as evaluating the network on the batch, and e.g. $\mathcal{O}(n^2)$ memory, since each batch has $n$ inputs.
The key point is that for a typical network without batchnorm, the matrix … I'm not an expert on autodiff upper/lower bounds, but I believe that the fact that the Jacobian is known to be diagonal in the non-batchnorm case is what allows the speed-up in computing the per-example gradients. You have to store …
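Here is a small sketch of that "diagonal Jacobian" point (my own illustrative check, not from the thread; the two tiny networks are hypothetical). Treating the network as a map from an input batch to an output batch, the cross-example blocks $\partial f(x)_i / \partial x_j$ with $i \neq j$ are exactly zero for a network without batchnorm, and become nonzero once a BatchNorm layer is inserted:

```python
import torch
from torch import nn
from torch.autograd.functional import jacobian

torch.manual_seed(0)
x = torch.randn(4, 3)  # a batch of n = 4 inputs

# Two small hypothetical networks, identical apart from the BatchNorm layer.
plain = nn.Sequential(nn.Linear(3, 3), nn.Tanh(), nn.Linear(3, 1))
with_bn = nn.Sequential(nn.Linear(3, 3), nn.BatchNorm1d(3), nn.Linear(3, 1))

def off_diagonal_mass(net, inputs):
    # Jacobian of the batch-to-batch map f: (n, d) -> (n, 1); result shape (n, 1, n, d).
    jac = jacobian(net, inputs)
    n = inputs.shape[0]
    # Sum |d f(x)_i / d x_j| over pairs i != j; zero iff the examples never interact.
    mask = 1.0 - torch.eye(n).view(n, 1, n, 1)
    return (jac.abs() * mask).sum().item()

print(off_diagonal_mass(plain, x))    # 0.0 -> block-diagonal Jacobian, examples independent
print(off_diagonal_mass(with_bn, x))  # > 0 -> BatchNorm couples every example to the batch
```

This is why, without batchnorm, per-example gradients can be computed in roughly the cost of one backward pass over the batch, whereas with batchnorm there is no batch-independent per-example quantity to extract in the first place.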