Implementing Kaczmarz in Jax #11547
-
Kaczmarz is popular outside of DL (aka normalized LMS in Signal Processing, ART in tomography). What would be a good way of implementing it in Jax? Basic idea:
A batched version with multiclass predictor g would look as follows:
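For reference, the classical single-equation Kaczmarz update (the normalized-LMS step) is `w ← w + (y − aᵀw) / ‖a‖² · a`; a minimal sketch of just that step (the function name and signature are illustrative, not from the discussion):

```python
import jax.numpy as jnp

def kaczmarz_update(w, a, y):
    # One classical Kaczmarz / NLMS step for the single linear equation a @ w = y:
    # project w onto that equation's solution hyperplane.
    return w + (y - a @ w) / (a @ a) * a
```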
-
Assume `y = g(x, w)` has `num_classes` elements:

```python
import jax

dg_vec = jax.jacrev(lambda w: g(x, w))(w)
df_vec_T = jax.jacfwd(lambda y: f(y, w))(y)
# dg_vec[c] = dg
# df_vec_T[..., c] = df, assume w is an array
```
-
Thanks so much for this question! I pulled @froystig in to think about this, and to teach me what Kaczmarz is. We wrote this comment together!

Let's change notation and clarify the problem statement: let's say we have `h = f ∘ g`, where `g: R^d -> R^n` is the prediction function, with `d` the dimension of the parameters and `n` the number of classes (suppressing the dependence on input data `x` for convenience), and `f: R^n -> R` is the loss function. (Notice `f` shouldn't have an input of dimension `d`, i.e. the weights; a typo in the OP we think!) Say `w ∈ R^d` is the current parameter. Notice that `∇f(g(w)) ∈ R^n`, and `∂g(w) ∈ R^{n x d}`, where the latter is just notation for the Jacobian matrix.

To draw an analogy to the linear least squares setting, we can make a multiclass generalization of Eq. 5.2 of this paper. In that setting, with rows `a_i ∈ R^d` and targets `b_i`, the per-row Kaczmarz update is `w ← w + (b_i − ⟨a_i, w⟩) / ‖a_i‖² · a_i`, and the multiclass generalization applies this row-normalized correction over all `n` rows at once. Back in the nonlinear and non-least-squares setting, substituting back the rows of `∂g(w)` for the `a_i` and `∇f(g(w))` for the residual, the step is `∂g(w)ᵀ D⁻¹ ∇f(g(w))` with `D = diag(‖∂g(w)_1‖², …, ‖∂g(w)_n‖²)`:

```python
import jax
import jax.numpy as jnp

def kaczmarz_step(f, g, w, x):
    y, g_vjp = jax.vjp(lambda w: g(w, x), w)
    A, = jax.vmap(g_vjp)(jnp.eye(y.shape[0]))  # assumes w.ndim >> y.ndim
    return (jax.grad(f)(y) / (A * A).sum(1)) @ A
```

If we want to reduce memory usage and not instantiate `∂g(w)` explicitly, we can apply the VJP one basis vector at a time:

```python
import jax
import jax.numpy as jnp

def kaczmarz_step(f, g, w, x):
    y, g_vjp = jax.vjp(lambda w: g(w, x), w)
    normsq = jax.lax.map(lambda e: (g_vjp(e)[0] ** 2).sum(), jnp.eye(y.shape[0]))
    return g_vjp(jax.grad(f)(y) / normsq)[0]
```

What do you think? Is this generalization of Kaczmarz (to nonlinear and multiclass) the one you're looking for? It would be great to validate this numerically against some known reference implementation and/or problem.
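To make the calling convention concrete, here is a hypothetical usage sketch on a tiny softmax-regression model with a flat parameter vector; the model, the loss, and the `w - step` sign convention are my assumptions, not something fixed by the thread:

```python
import jax
import jax.numpy as jnp

num_classes, num_features = 3, 5

def g(w, x):
    # Multiclass predictor for one input x: flat params reshaped to a (classes, features) matrix.
    return w.reshape(num_classes, num_features) @ x

def f(y):
    # Example loss: cross-entropy of the logits y against class 0.
    return -jax.nn.log_softmax(y)[0]

w = jnp.zeros(num_classes * num_features)
x = jax.random.normal(jax.random.PRNGKey(0), (num_features,))

step = kaczmarz_step(f, g, w, x)
w = w - step  # apply the returned step as a descent update (sign convention assumed)
```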
-
Thanks for the in-depth response!
batch-Kaczmarz with least-squares loss converges in 1 step for any orthogonal set of examples (whereas GD needs them to be orthonormal). I've tested your implementation, and it passes this test -- colab
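For what it's worth, a sketch of that one-step check, reusing `kaczmarz_step` from the comment above; the way the orthogonal (but not orthonormal) examples are constructed here is my own choice, not necessarily what the linked colab does:

```python
import jax
import jax.numpy as jnp

def g(w, x):
    # Batched linear model: one output per example row of x.
    return x @ w

# Orthogonal but not orthonormal example rows: orthonormal rows from a QR factorization, rescaled.
q, _ = jnp.linalg.qr(jax.random.normal(jax.random.PRNGKey(0), (8, 8)))
x = q[:4] * jnp.arange(1.0, 5.0)[:, None]
b = jax.random.normal(jax.random.PRNGKey(1), (4,))

f = lambda y: 0.5 * jnp.sum((y - b) ** 2)  # least-squares loss
w = jnp.zeros(8)

w_new = w - kaczmarz_step(f, g, w, x)
print(jnp.allclose(g(w_new, x), b, atol=1e-5))  # one step should fit all examples exactly
```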
-
Turns out there's a much better formulation of Kaczmarz for multiclass problems. For `f(g(x))`, where `f` is the loss function and `g(x)` is the model, we have the following update. Computing this update has almost the same FLOPs as a regular gradient step, but I suspect wall-clock time is 2x worse, since the sum over weighted per-example gradients …
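As a rough illustration of the wall-clock point (the `weights` below are placeholders, not the actual Kaczmarz weights from this comment's update): per-example gradients have to be materialized, e.g. via `jax.vmap(jax.grad(...))`, before the weighted reduction, whereas a plain batched gradient fuses the sum into a single backward pass. A sketch assuming a flat parameter array:

```python
import jax
import jax.numpy as jnp

def weighted_grad_sum(loss_fn, w, xs, ys, weights):
    # loss_fn(w, x, y) -> scalar; compute one gradient per example, then a weighted sum over the batch.
    per_example_grads = jax.vmap(jax.grad(loss_fn), in_axes=(None, 0, 0))(w, xs, ys)
    return jnp.tensordot(weights, per_example_grads, axes=1)
```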