Ergonomic way to extract a single iteration output from a scan #20054

davisyoshida · 2024-03-03T23:52:55Z

davisyoshida
Mar 3, 2024
Collaborator

It's common to extract the activations for a single hidden layer in a network, but this gets annoying when using a scan over parameters. Here's a toy example:

import jax
import jax.numpy as jnp

layers = 16
dim = 128
Ws = jax.random.normal(jax.random.PRNGKey(0), (layers, dim, dim))
x = jax.random.normal(jax.random.PRNGKey(1), (dim,))

def f(carry, W):
    h = W @ carry
    return h, h

final_state, all_hidden = jax.lax.scan(f, x, Ws)
the_state_I_want = all_hidden[3]

I'm not 100% sure, but I think this requires XLA to instantiate the entire layers x dim stacked output just so it can index into it. Is this true? If not, please ignore the rest of this question.
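The size of the stacked output in question can be checked without allocating anything. The sketch below uses `jax.eval_shape` on the toy example. The names `x_spec` and `Ws_spec` are placeholders introduced here, and float32 inputs are assumed. Each step emits a vector of length `dim`, so the stacked result has one such vector per layer.

```python
import jax
import jax.numpy as jnp

layers, dim = 16, 128

def f(carry, W):
    h = W @ carry
    return h, h

# Abstract stand-ins for the inputs; no arrays are allocated here.
x_spec = jax.ShapeDtypeStruct((dim,), jnp.float32)
Ws_spec = jax.ShapeDtypeStruct((layers, dim, dim), jnp.float32)

# eval_shape traces the scan and returns only shapes and dtypes.
final_spec, stacked_spec = jax.eval_shape(
    lambda x, Ws: jax.lax.scan(f, x, Ws), x_spec, Ws_spec
)
print(final_spec.shape)    # (128,)
print(stacked_spec.shape)  # (16, 128)
```

This only reports what the scan returns logically; whether XLA actually keeps that buffer alive is a separate question about the compiled program.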

Usually, returning a list of all the activations from your network is fine because users can grab whichever ones they want and rely on DCE to avoid keeping the unused ones around. To get the same memory requirement with scan, I think you have to do something like:

def f2(carry, W):
    x, counter, write_out = carry
    h = W @ x
    write_out = jnp.where(counter == 3, h, write_out)
    counter = counter + 1
    return (h, counter, write_out), None

init_carry = (x, 0, jnp.zeros_like(x))
(final_state, _, the_state_I_want), _ = jax.lax.scan(f2, init_carry, Ws)
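As a sanity check, the carry-based version can be compared against the stack-then-index version. This is a minimal sketch that restates both step functions so it runs on its own. The `1/sqrt(dim)` scaling of `Ws` and the `target` name are additions made for this check (they keep the activations at a moderate scale so a fixed tolerance is meaningful) and are not part of the original example.

```python
import jax
import jax.numpy as jnp
import numpy as np

layers, dim, target = 16, 128, 3
# Scaled weights are an assumption of this check, not of the original post.
Ws = jax.random.normal(jax.random.PRNGKey(0), (layers, dim, dim)) / jnp.sqrt(dim)
x = jax.random.normal(jax.random.PRNGKey(1), (dim,))

def f(carry, W):
    h = W @ carry
    return h, h

def f2(carry, W):
    h_prev, counter, write_out = carry
    h = W @ h_prev
    # Overwrite the stored value only on the target iteration.
    write_out = jnp.where(counter == target, h, write_out)
    return (h, counter + 1, write_out), None

final_a, all_hidden = jax.lax.scan(f, x, Ws)
(final_b, _, picked), _ = jax.lax.scan(f2, (x, 0, jnp.zeros_like(x)), Ws)

# Both routes should agree up to floating-point tolerance.
np.testing.assert_allclose(picked, all_hidden[target], rtol=1e-4, atol=1e-4)
np.testing.assert_allclose(final_b, final_a, rtol=1e-4, atol=1e-4)
```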

When you're scanning more complex functions, it's pretty intrusive to implement something like this, since the layer implementation needs to be aware it's being used for scan. Does anyone have any ideas for a cleaner way to accomplish this?
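One alternative that leaves the layer function untouched, not proposed in this thread, is to split the scan at the iteration of interest and read off the intermediate carry. The sketch below assumes `k` is a static Python int smaller than `layers - 1` and that `xs` is a single array rather than a general pytree. `scan_with_tap` is a hypothetical helper name.

```python
import jax
import jax.numpy as jnp

def scan_with_tap(f, init, xs, k):
    """Scan over xs, also returning the carry after iteration k (0-indexed)."""
    # First scan covers iterations 0..k; its final carry is the value we want.
    tapped, head_ys = jax.lax.scan(f, init, xs[: k + 1])
    # Second scan resumes from that carry for the remaining iterations.
    final, tail_ys = jax.lax.scan(f, tapped, xs[k + 1 :])
    return final, tapped, (head_ys, tail_ys)

def layer(carry, W):
    # The layer is unaware that an intermediate value is being extracted.
    return W @ carry, None

layers, dim = 16, 128
# Scaled weights are an assumption of this sketch, to keep values moderate.
Ws = jax.random.normal(jax.random.PRNGKey(0), (layers, dim, dim)) / jnp.sqrt(dim)
x = jax.random.normal(jax.random.PRNGKey(1), (dim,))

final_state, the_state_I_want, _ = scan_with_tap(layer, x, Ws, k=3)

# Reference: stack every hidden state, then index.
ref_final, all_hidden = jax.lax.scan(lambda c, W: (W @ c, W @ c), x, Ws)
```

The trade-offs are real: the body is traced twice, and slicing `xs` may copy the scanned parameters, so whether this saves memory depends on how large the activations are relative to the weights. It also handles only a fixed, static index.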


davisyoshida · 2024-03-05T03:52:38Z

davisyoshida
Mar 5, 2024
Collaborator Author

Here's a wrapper that does it, although I'll need to jump through some more hoops to make it work with Flax.

from functools import wraps

import jax
import jax.numpy as jnp

def scan_and_extract_carry(f, init, xs, iters):
    @wraps(f)
    def inner(carry, x):
        orig_carry, iter_to_val, counter = carry
        new_carry, out = f(orig_carry, x)
        # For each requested iteration i, overwrite its slot when counter == i.
        iter_to_val = {
            i: jax.tree_util.tree_map(
                lambda curr_val, new_val: jnp.where(counter == i, new_val, curr_val),
                curr_tree, new_carry
            )
            for i, curr_tree in iter_to_val.items()
        }
        return (new_carry, iter_to_val, counter + 1), out

    # One zero-initialized copy of the carry per requested iteration.
    storage = jax.tree_util.tree_map(jnp.zeros_like, init)
    iter_to_val = {i: storage for i in iters}

    # Augment the carry with the storage dict and an iteration counter.
    new_init = (init, iter_to_val, 0)
    (final_carry, stored_carries, _), outputs = jax.lax.scan(inner, new_init, xs)
    return final_carry, stored_carries, outputs

def f(val, x):
    carry = val + x
    return carry, -x

final, stored, outputs = scan_and_extract_carry(
    f,
    init=0,
    xs=jnp.arange(10),
    iters=[3, 5, 7]
)
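For reference, the values this call should store can be cross-checked against the naive stack-then-index approach. The sketch below is self-contained and does not use the wrapper. `step_emitting_carry` and `reference` are names introduced here, and an explicit int32 initial carry is assumed. Since the carry is a running sum of 0..9, the carries after iterations 3, 5 and 7 are 6, 15 and 28.

```python
import jax
import jax.numpy as jnp

def f(val, x):
    carry = val + x
    return carry, -x

def step_emitting_carry(val, x):
    # Same update as f, but emit the new carry so every step gets stacked.
    new_carry, _ = f(val, x)
    return new_carry, new_carry

init = jnp.zeros((), dtype=jnp.int32)
final, all_carries = jax.lax.scan(step_emitting_carry, init, jnp.arange(10))

# Index the stacked carries at the iterations of interest.
reference = {i: int(all_carries[i]) for i in [3, 5, 7]}
print(final, reference)  # 45 {3: 6, 5: 15, 7: 28}
```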

I think this probably does save memory: the compiled output for a simple scan->slice combination appears to hold all of the carries in memory at once (see the input to wrapped_slice_computation below):

Example

@jax.jit
def slice_hidden(init, xs):
    final_state, all_hidden = jax.lax.scan(lambda carry, x: (carry + x, carry), init, xs)
    return final_state, all_hidden[3]

lowered = jax.jit(slice_hidden).lower(0, jnp.arange(10))
compiled = lowered.compile()
print(compiled.as_text())

HloModule jit_slice_hidden, is_scheduled=true, entry_computation_layout={(s32[], s32[10]{0})->(s32[], s32[])}, allow_spmd_sharding_propagation_to_output={true,true}, frontend_attributes={fingerprint_before_lhs="997bbf096621d5518b893011c95151c7"}

%fused_dynamic_update_slice (param_0: s32[10], param_1.8: s32[], param_2.6: s32[]) -> s32[10] {
  %param_0 = s32[10]{0} parameter(0)
  %param_1.8 = s32[] parameter(1)
  %bitcast.56 = s32[1]{0} bitcast(s32[] %param_1.8)
  %param_2.6 = s32[] parameter(2)
  %constant_8 = s32[] constant(0)
  %compare.4 = pred[] compare(s32[] %param_2.6, s32[] %constant_8), direction=LT, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/lt" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %constant_7 = s32[] constant(10)
  %add.6 = s32[] add(s32[] %param_2.6, s32[] %constant_7), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %select.3 = s32[] select(pred[] %compare.4, s32[] %add.6, s32[] %param_2.6), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/select_n" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  ROOT %dynamic-update-slice.1 = s32[10]{0} dynamic-update-slice(s32[10]{0} %param_0, s32[1]{0} %bitcast.56, s32[] %select.3), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/dynamic_update_slice" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
}

%fused_add (param_0.1: s32[], param_1.4: s32[10], param_2.8: s32[]) -> s32[] {
  %param_0.1 = s32[] parameter(0)
  %param_1.4 = s32[10]{0} parameter(1)
  %param_2.8 = s32[] parameter(2)
  %constant_16 = s32[] constant(0)
  %compare.6 = pred[] compare(s32[] %param_2.8, s32[] %constant_16), direction=LT, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/lt" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %constant_15 = s32[] constant(10)
  %add.8 = s32[] add(s32[] %param_2.8, s32[] %constant_15), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %select.5 = s32[] select(pred[] %compare.6, s32[] %add.8, s32[] %param_2.8), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/select_n" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %dynamic-slice.1 = s32[1]{0} dynamic-slice(s32[10]{0} %param_1.4, s32[] %select.5), dynamic_slice_sizes={1}, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/dynamic_slice[slice_sizes=(1,)]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %bitcast.57 = s32[] bitcast(s32[1]{0} %dynamic-slice.1)
  ROOT %add.3 = s32[] add(s32[] %param_0.1, s32[] %bitcast.57), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
}

%wrapped_add_computation (param_0.9: s32[], param_1.9: s32[]) -> s32[] {
  %param_0.9 = s32[] parameter(0)
  %param_1.9 = s32[] parameter(1)
  ROOT %add.9 = s32[] add(s32[] %param_0.9, s32[] %param_1.9), metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
}

%region_0.3 (arg_tuple.4: (s32[], s32[], s32[10], s32[10])) -> (s32[], s32[], s32[10], s32[10]) {
  %constant_9 = s32[] constant(1)
  %arg_tuple.4 = (s32[], s32[], s32[10]{0}, s32[10]{0}) parameter(0)
  %get-tuple-element.12 = s32[] get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %arg_tuple.4), index=0
  %get-tuple-element.13 = s32[] get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %arg_tuple.4), index=1
  %get-tuple-element.19 = s32[10]{0} get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %arg_tuple.4), index=3
  %get-tuple-element.14 = s32[10]{0} get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %arg_tuple.4), index=2
  %loop_dynamic_update_slice_fusion = s32[10]{0} fusion(s32[10]{0} %get-tuple-element.14, s32[] %get-tuple-element.13, s32[] %get-tuple-element.12), kind=kLoop, calls=%fused_dynamic_update_slice, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/dynamic_update_slice" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %loop_add_fusion = s32[] fusion(s32[] %get-tuple-element.13, s32[10]{0} %get-tuple-element.19, s32[] %get-tuple-element.12), kind=kLoop, calls=%fused_add, control-predecessors={%loop_dynamic_update_slice_fusion}, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %wrapped_add = s32[] fusion(s32[] %get-tuple-element.12, s32[] %constant_9), kind=kLoop, calls=%wrapped_add_computation, control-predecessors={%loop_add_fusion, %loop_dynamic_update_slice_fusion}, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/body/add" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  ROOT %tuple.4 = (s32[], s32[], s32[10]{0}, s32[10]{0}) tuple(s32[] %wrapped_add, s32[] %loop_add_fusion, s32[10]{0} %loop_dynamic_update_slice_fusion, s32[10]{0} %get-tuple-element.19)
}

%wrapped_compare_computation (param_0.10: s32[], param_1.10: s32[]) -> pred[] {
  %param_0.10 = s32[] parameter(0)
  %param_1.10 = s32[] parameter(1)
  ROOT %compare.7 = pred[] compare(s32[] %param_0.10, s32[] %param_1.10), direction=LT, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/cond/lt" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
}

%region_1.25 (arg_tuple.26: (s32[], s32[], s32[10], s32[10])) -> pred[] {
  %constant_31 = s32[] constant(10)
  %arg_tuple.26 = (s32[], s32[], s32[10]{0}, s32[10]{0}) parameter(0)
  %get-tuple-element.27 = s32[] get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %arg_tuple.26), index=0
  ROOT %wrapped_compare = pred[] fusion(s32[] %get-tuple-element.27, s32[] %constant_31), kind=kLoop, calls=%wrapped_compare_computation, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while/cond/lt" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
}

%wrapped_copy_computation (param_0.6: s32[]) -> s32[] {
  %param_0.6 = s32[] parameter(0)
  ROOT %copy.14 = s32[] copy(s32[] %param_0.6)
}

%wrapped_copy_computation.1 (param_0.7: s32[]) -> s32[] {
  %param_0.7 = s32[] parameter(0)
  ROOT %copy.15 = s32[] copy(s32[] %param_0.7)
}

%wrapped_broadcast_computation (param_0.8: s32[]) -> s32[10] {
  %param_0.8 = s32[] parameter(0)
  ROOT %broadcast.2 = s32[10]{0} broadcast(s32[] %param_0.8), dimensions={}
}

%wrapped_slice_computation (param_0.11: s32[10]) -> s32[1] {
  %param_0.11 = s32[10]{0} parameter(0)
  ROOT %slice.2 = s32[1]{0} slice(s32[10]{0} %param_0.11), slice={[3:4]}, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/slice[start_indices=(3,) limit_indices=(4,) strides=None]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=76}
}

ENTRY %main.52 (Arg_0.1: s32[], Arg_1.2: s32[10]) -> (s32[], s32[]) {
  %constant_0 = s32[] constant(0)
  %Arg_0.1 = s32[] parameter(0), sharding={replicated}
  %Arg_1.2 = s32[10]{0} parameter(1), sharding={replicated}
  %wrapped_copy = s32[] fusion(s32[] %constant_0), kind=kLoop, calls=%wrapped_copy_computation
  %wrapped_copy.1 = s32[] fusion(s32[] %Arg_0.1), kind=kLoop, calls=%wrapped_copy_computation.1
  %wrapped_broadcast = s32[10]{0} fusion(s32[] %wrapped_copy), kind=kLoop, calls=%wrapped_broadcast_computation
  %tuple.2 = (s32[], s32[], s32[10]{0}, s32[10]{0}) tuple(s32[] %wrapped_copy, s32[] %wrapped_copy.1, s32[10]{0} %wrapped_broadcast, s32[10]{0} %Arg_1.2)
  %while.0 = (s32[], s32[], s32[10]{0}, s32[10]{0}) while((s32[], s32[], s32[10]{0}, s32[10]{0}) %tuple.2), condition=%region_1.25, body=%region_0.3, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while[cond_nconsts=0 body_nconsts=1]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}, backend_config={"known_trip_count":{"n":"10"}}
  %get-tuple-element.49 = s32[] get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %while.0), index=1, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while[cond_nconsts=0 body_nconsts=1]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %get-tuple-element.3 = s32[10]{0} get-tuple-element((s32[], s32[], s32[10]{0}, s32[10]{0}) %while.0), index=2, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/while[cond_nconsts=0 body_nconsts=1]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=75}
  %wrapped_slice = s32[1]{0} fusion(s32[10]{0} %get-tuple-element.3), kind=kLoop, calls=%wrapped_slice_computation, metadata={op_name="jit(slice_hidden)/jit(main)/jit(slice_hidden)/slice[start_indices=(3,) limit_indices=(4,) strides=None]" source_file="/home/davis/src/scratch/jax/scan_extraction.py" source_line=76}
  %bitcast.51 = s32[] bitcast(s32[1]{0} %wrapped_slice)
  ROOT %tuple.51 = (s32[], s32[]) tuple(s32[] %get-tuple-element.49, s32[] %bitcast.51)
}

5 replies

jakevdp Mar 11, 2024
Maintainer

This looks like a decent solution. Note however that scan returns an array containing the outputs of each iteration, so you could also access it using something like the_state_I_want = state_vector[3]

davisyoshida Mar 11, 2024
Collaborator Author

@jakevdp The original motivation was to avoid materializing the state just to slice a single element out of it. However, I'm not confident in my understanding of what XLA can/can't fuse. In the case where this is a scan over a network, state_vector will be the activations for every layer, which will often be prohibitively large. Is there some way to get that index fused into the scan? Or maybe it happens already?

jakevdp Mar 11, 2024
Maintainer

I see - in that case your approach is probably best. I don't think XLA will fuse the indexing with the scan (though you could check by outputting the optimized HLO)

Answer selected by davisyoshida

davisyoshida Mar 11, 2024
Collaborator Author

Yeah, the HLO is in the collapsed example block of the comment. I believe it's not fused, but I'm not fluent enough to be 100% sure. Thanks!
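One rough way to probe this without reading the HLO line by line is to compile both variants and look for the stacked buffer's shape in the optimized text. The sketch below assumes float32 inputs on the default backend. `stack_then_index`, `keep_in_carry` and `results` are names introduced here. `memory_analysis()` may return `None` or omit fields on some backends, so its output is read defensively.

```python
import jax
import jax.numpy as jnp

layers, dim = 16, 128
Ws = jnp.ones((layers, dim, dim)) / dim
x = jnp.ones((dim,))

def stack_then_index(x, Ws):
    def f(carry, W):
        h = W @ carry
        return h, h
    final, all_hidden = jax.lax.scan(f, x, Ws)
    return final, all_hidden[3]

def keep_in_carry(x, Ws):
    def f2(carry, W):
        h_prev, counter, kept = carry
        h = W @ h_prev
        kept = jnp.where(counter == 3, h, kept)
        return (h, counter + 1, kept), None
    (final, _, kept), _ = jax.lax.scan(f2, (x, 0, jnp.zeros_like(x)), Ws)
    return final, kept

# Textual form of the (layers, dim) stacked-output type in HLO.
stacked_shape = f"f32[{layers},{dim}]"
results = {}
for fn in (stack_then_index, keep_in_carry):
    compiled = jax.jit(fn).lower(x, Ws).compile()
    hlo = compiled.as_text()
    # Count mentions of the stacked buffer type in the optimized HLO.
    results[fn.__name__] = hlo.count(stacked_shape)
    stats = compiled.memory_analysis()
    temp_bytes = getattr(stats, "temp_size_in_bytes", None)
    print(fn.__name__, "mentions:", results[fn.__name__], "temp bytes:", temp_bytes)
```

A nonzero count for the first function and zero for the second would be consistent with the stacked buffer being materialized only in the scan->slice version, though a text search like this is a heuristic, not proof.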

alonfnt Mar 19, 2024

I'd love a bit of explanation of the HLO for ENTRY (collapsed), if possible, @jakevdp. Just out of curiosity: it seems to be fusing get-tuple-element.3 and wrapped_slice_computation, but I presume the memory is still being used at wrapped_broadcast?

This is just because I'd like to get more insight on understanding HLO text, so feel free to ignore :)
