Hi all,
I want to mimic the behaviour of an online learning algorithm where my data is made of sequences $x_{1:T} = (x_1, ..., x_T)$. More precisely, I want to do stochastic gradient descent on some parameter $\theta$ in the form $\theta_{t+1} = \theta_t - \nabla_\theta L_{\theta_t}(x_{1:t+1})$, where $L_\theta$ is some objective function. In practice I don't have a recursion to express $L_{\theta_t}(x_{1:t+1})$ as a function of $L_{\theta_t}(x_{1:t})$ and $x_{t+1}$, so for now I recompute the entire loss on the new subsequence $x_{1:t+1}$ and take the gradient by autodifferentiation. I'm still interested in the kind of values of $\theta$ I get when there's a gradient step for every subsequence. In practice I have many sequences $(x_{1:T}^i)_{i \leq N}$.
Here's the relevant part of the code, where I work with minibatches $(x_{1:T}^j)_{j \in B}$:
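(This is a simplified sketch of the structure rather than my exact code; `loss_fn` stands in for `self.loss`, and the model/objective are placeholders.)

```python
import jax
import jax.numpy as jnp

def loss_fn(params, prefix):
    # Placeholder objective standing in for self.loss: it recomputes the
    # full loss on the prefix x_{1:t}, of shape (B, t, D).
    preds = jnp.einsum('btd,d->bt', prefix, params)
    return jnp.mean((preds - prefix[..., 0]) ** 2)

grad_fn = jax.grad(loss_fn)

def train_on_batch(params, batch, lr=1e-2):
    # batch has shape (B, T, D); one gradient step per prefix length t.
    T = batch.shape[1]
    for t in range(1, T + 1):
        prefix = batch[:, :t]  # the prefix shape changes at every iteration
        params = params - lr * grad_fn(params, prefix)
    return params

key = jax.random.PRNGKey(0)
batch = jax.random.normal(key, (8, 20, 4))   # B=8, T=20, D=4
params = train_on_batch(jnp.zeros(4), batch)
```

As far as I understand, under `jax.jit` every prefix length is a distinct input shape, so the gradient step gets retraced and recompiled once per value of $t$, which I assume is why the compile times blow up (see below).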
The code works, but:

- under JIT the compile times are unbearably long when $T$ gets larger (even for $T=20$ it's already hard to work with);
- without JIT the code runs out of the box, but very slowly.
I'm used to using `lax.scan` for fixed-sized inputs. I tried using `batch_up_to_timestep = jax.lax.dynamic_slice_in_dim(batch, 0, timestep, axis=1)` when `batch` is in the `carry` and `timestep` in the `x` of the operand for `scan`, but then I get an error that you can't index with a tracer object, and I understand why. From what I gathered on the forum there's no workaround to JIT that kind of code, because there's no way to reduce the operations inside the `for` loop to a single HLO call. So it's not a syntax issue, it's a lower-level limitation of any compiler. Can anyone confirm this, or am I still missing some simple trick? If not, I'm surprised that JAX is that slow whenever you can't write JIT-able code. Is there anything I'm missing that would make this run faster? Maybe I could manually JIT all operations inside `self.loss` that involve fixed-sized inputs, but I feel it's not going to be of any use for the `jax.grad` operation, which is my main interest.
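For concreteness, here is roughly the attempt (a minimal, self-contained reconstruction with made-up surrounding names; only the `dynamic_slice_in_dim` call is verbatim):

```python
import jax
import jax.numpy as jnp

def prefix_losses_scan(batch):
    # batch: (B, T, D). Scan over t = 1..T, trying to slice out x_{1:t}.
    T = batch.shape[1]

    def step(carry, timestep):
        # `carry` is the full batch; `timestep` comes from the scanned-over xs.
        # This is the line that breaks: the slice *size* argument of
        # dynamic_slice_in_dim has to be a static Python int, but inside scan
        # `timestep` is a tracer, so JAX refuses to build a slice whose shape
        # depends on it.
        batch_up_to_timestep = jax.lax.dynamic_slice_in_dim(carry, 0, timestep, axis=1)
        loss = jnp.mean(batch_up_to_timestep ** 2)   # placeholder for self.loss
        return carry, loss

    _, losses = jax.lax.scan(step, batch, jnp.arange(1, T + 1))
    return losses
```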
Thanks a lot in advance!