Efficient per example training with combination of shared and per example params #6093
Unanswered
adam-hartshorne asked this question in Q&A
Replies: 1 comment 4 replies
-
Would moving the gradient inside the vmap, then accumulating the shared grad manually work? i.e.:

```python
import numpy as np
import jax
import jax.numpy as jnp
import optax


def cost_func(shared_p, example_p, example):
    return jnp.inner(shared_p, example) + jnp.inner(example_p, example)


def step(params, optimizer, optimizer_state, data):
    # Per-example value and gradients: the shared parameter is broadcast
    # (in_axes=None), the per-example parameter and data are mapped over axis 0.
    value, (example_shared_grads, per_example_grads) = jax.vmap(
        jax.value_and_grad(cost_func, argnums=(0, 1)),
        (None, 0, 0)
    )(params['shared_param'], params['per_example_param'], data)
    # Accumulate the shared parameter's gradient over all examples.
    shared_grad = jnp.sum(example_shared_grads, axis=0)
    grads = {'shared_param': shared_grad, 'per_example_param': per_example_grads}
    updates, opt_state = optimizer.update(grads, optimizer_state, params)
    return value, optax.apply_updates(params, updates), opt_state


def main():
    params = {
        'shared_param': jnp.ones(10),
        'per_example_param': jnp.ones((123, 10)),
    }
    optimizer = optax.adam(0.1)
    opt_state = optimizer.init(params)
    data = jnp.array(np.random.randn(123, 10))
    step(params, optimizer, opt_state, data)


if __name__ == '__main__':
    main()
```
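A note on why the manual accumulation is valid: the batch loss is a sum of independent per-example losses, so its gradient with respect to the shared parameter is the sum of the per-example gradients returned by vmap, while each row of per_example_param only ever receives the gradient from its own example. For actual training you would probably also jit the step; below is a minimal sketch (not from the original reply), assuming the `step`, `optimizer`, `opt_state`, `params` and `data` defined above are in scope, and closing over the optimizer rather than passing it through jit, since it is a container of Python functions rather than a pytree of arrays:

```python
import functools
import jax

# Bind the optax optimizer via functools.partial so the jitted callable
# only receives pytrees of arrays as traced arguments.
jitted_step = jax.jit(functools.partial(step, optimizer=optimizer))

# Remaining arguments are passed by keyword so they do not collide with
# the already-bound `optimizer` slot.
value, params, opt_state = jitted_step(
    params, optimizer_state=opt_state, data=data)
```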
-
If I wish to optimize a model on a per-example basis, what is the best way to handle a combination of shared parameters and parameters that are optimized separately for each example in the dataset?
Currently, I pass all of the data at once, together with the parameter dict, into my cost function, extract the parameters from the dict, and then use vmap to call an inner function that computes the per-example cost from each example, its per-example parameter, and the shared parameters. This is highly memory-inefficient, because all of the data goes through the cost function at once and gradients for the entire dataset are computed together, rather than on a per-example basis.
e.g. a minimal example of the sort of poorly devised structure I currently have (a sketch of the corresponding cost function follows the snippet):
```python
# N = number of examples in the dataset
# M = number of data points in each example
data = jnp.array(np.random.random((N, M)))
params = {'shared_param': jnp.ones(M),
          'per_example_param': jnp.ones((N, M))}
```
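For concreteness, here is a minimal sketch of the full-batch structure just described; the inner cost and the names `inner_cost` / `full_batch_cost` are illustrative placeholders, not the actual model:

```python
import numpy as np
import jax
import jax.numpy as jnp

N, M = 123, 10  # placeholder sizes for the sketch


def inner_cost(shared_p, example_p, example):
    # Illustrative per-example cost combining shared and per-example params.
    return jnp.inner(shared_p, example) + jnp.inner(example_p, example)


def full_batch_cost(params, data):
    # vmap the inner cost over every example and its per-example parameter,
    # broadcasting the shared parameter (in_axes=None).
    per_example_losses = jax.vmap(inner_cost, in_axes=(None, 0, 0))(
        params['shared_param'], params['per_example_param'], data)
    return jnp.sum(per_example_losses)


data = jnp.array(np.random.random((N, M)))
params = {'shared_param': jnp.ones(M),
          'per_example_param': jnp.ones((N, M))}

# Differentiating the summed loss materialises gradients for the whole
# dataset in one backward pass, which is where the memory cost comes from.
loss, grads = jax.value_and_grad(full_batch_cost)(params, data)
```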
Any help would be much appreciated on how I should set this up so that a) I can call the cost function on a per-example basis, and b) the per_example_param is correctly trained.