Hey, I have a training loop that leaks memory somewhere: while the loop is running, host RAM usage increases constantly. With jax.profiler.save_device_memory_profile I can only profile GPU memory, which looks as expected. If I point the profiler at the CPU device instead, I just get an empty graph (probably because it can't track host memory when the main device is the GPU)?
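This is roughly how I'm saving the device profile (the filename is just an example):

import jax.profiler

# snapshot the current device (GPU) memory; the resulting .prof file
# can be inspected with pprof
jax.profiler.save_device_memory_profile("memory.prof")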
I tried using memory_profiler, but all I get is the following output:
The roughly 6000 MB shows that there is a leak (I am only working with CIFAR, which shouldn't need this much RAM), but it's very hard to tell where the leak appears. update_step is the JIT-compiled training step of a score-based generative model (SGM), and I don't see how it could leak memory:
from functools import partial

import jax
import jax.numpy as jnp
import optax
from jax import jit, random

# R, mean_factor, var, optimizer and score_model are defined elsewhere in the script

def loss_fn(params, model, rng, batch):
    rng, step_rng = random.split(rng)
    N_batch = batch.shape[0]
    # sample a discrete time step per example and rescale it to (0, 1]
    t = random.randint(step_rng, (N_batch, 1), 1, R) / (R - 1)
    mean_coeff = mean_factor(t)
    # is it right to have the square root here for the loss?
    vs = var(t)
    stds = jnp.sqrt(vs)
    rng, step_rng = random.split(rng)
    noise = random.normal(step_rng, batch.shape)
    # broadcast the per-example coefficients over the image dimensions
    stds = stds[:, :, None, None]
    mean_coeff = mean_coeff[:, :, None, None]
    xt = batch * mean_coeff + noise * stds
    # apply the model passed in as the static argument
    output = model.apply(params, xt, t.flatten())
    loss = jnp.mean((noise + output * stds) ** 2)
    return loss

@partial(jit, static_argnums=[4])
def update_step(params, rng, batch, opt_state, model):
    val, grads = jax.value_and_grad(loss_fn)(params, model, rng, batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
    return val, params, opt_state
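For context, the outer loop that drives update_step is essentially just this (a simplified sketch; num_epochs and train_batches stand in for my actual epoch count and CIFAR batch iterator):

# simplified sketch of the outer training loop
for epoch in range(num_epochs):
    for batch in train_batches:  # CIFAR batches as host-side numpy arrays
        rng, step_rng = random.split(rng)
        val, params, opt_state = update_step(
            params, step_rng, batch, opt_state, score_model)
        train_loss = float(val)  # pull the scalar loss back to the host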
Any pointers on where I could be leaking memory, or on how to go about profiling this, are greatly appreciated!
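For completeness, this is roughly how I'm running memory_profiler (the script name train.py is a placeholder):

from memory_profiler import profile

@profile
def train():
    # the training loop from above goes here
    ...

# line-by-line report:  python -m memory_profiler train.py
# memory over time:     mprof run train.py && mprof plot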