Efficient Merging/Pruning Methods: Multi-Modal Transformer Models #20708
Replies: 1 comment
-
I found a bug related to how I was jitting functions in my performance benchmark tests. It now looks like models run faster with compression, as expected given the current implementations.
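I won't claim this was the exact issue, but the common pitfalls when jitting benchmark code are timing the first call (which includes compilation) and not waiting for JAX's asynchronous dispatch to finish. Here is a minimal sketch of a fair timing pattern; `time_jitted`, `toy_attention`, and the shapes are illustrative stand-ins, not the actual benchmark:

```python
import time

import jax
import jax.numpy as jnp


def time_jitted(fn, *args, iters=10):
    """Time a jitted function, excluding compile time and forcing completion."""
    # Warm-up call triggers tracing/compilation, so it is excluded from the timing.
    fn(*args).block_until_ready()
    start = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
    out.block_until_ready()  # dispatch is asynchronous; wait for the result
    return (time.perf_counter() - start) / iters


# Illustrative stand-in for the attention block being benchmarked.
@jax.jit
def toy_attention(x):
    return jax.nn.softmax(x @ x.T, axis=-1) @ x


x = jnp.ones((256, 64))
print(f"mean time per call: {time_jitted(toy_attention, x):.6f} s")
```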
-
Hi Jax Community,
I am working on merging/pruning methods for a multi-modal transformer architecture in Flax. I am finding it challenging to get attention layers that use JAX-based pruning and merging methods to run as efficiently as regular attention blocks: FLOPs decrease, but model inference takes longer with the pruning/merging implementation despite the lower FLOPs.

I had the following questions and was hoping someone may be able to provide advice:

1. For my input embedding sequence I always subset the sequence based on the modality. In my architecture's `setup` method I precompute the indices of the sequence for each modality and pass these as static args to a prune/merge method. Within my prune/merge method I am currently using `jax.lax.dynamic_slice` to get my modality subsets. I have also experimented with using `jnp.take`, but the performance difference seemed negligible based on initial tests. Is there a recommended way to efficiently slice/subset arrays, given that I know the indices and can precompile my merge/prune methods with these indices as static args (for reference, in `setup` of the following Module I precompute these methods, and they are defined in the following file)? Would it make more sense to generate masks for the modalities here rather than indexing? (A sketch comparing the slice and mask options is included after the Colab link below.)

2. For pruning I am applying `jax.lax.approx_max_k`. I am trying to refactor my code so I can `vmap` this method over modality subsets, but the `k` parameter requires a static int, so I haven't been able to vmap over this arg; I also didn't want to compile a separate method for each variation of `k`. My original implementation of top_k pruning used a for loop over values; when added to my attention layer, the compiled version of this method seems to add overhead and not improve inference speed over the original attention layer implementation. (A sketch of a fixed-`k` plus masking workaround follows the slicing sketch below.)

I am going to continue debugging and will post a resolution should I find one; in the meantime, if anyone spots something obvious, it would be appreciated if they could chip in with advice.
Current Colab where I am debugging: https://colab.research.google.com/drive/1B1mg7r11d9DsPjHs1rwP4_oPXBAn7GHY#scrollTo=C3UQHvhOaqWU
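For question 1, here is a minimal sketch of the two options being compared; the modality layout `TEXT`/`IMAGE`, the shapes, and the helper names are made up for illustration. Because the start/size are static arguments, they are plain Python ints at trace time, so the slice has a fixed size; the mask variant instead keeps the full sequence shape, so the output shape does not depend on the modality:

```python
from functools import partial

import jax
import jax.numpy as jnp

# Hypothetical modality layout: (start, size) per modality along the sequence axis.
TEXT = (0, 16)
IMAGE = (16, 32)


@partial(jax.jit, static_argnums=(1, 2))
def subset_by_slice(x, start, size):
    # start/size are Python ints at trace time, so this is a fixed-size slice.
    return jax.lax.dynamic_slice_in_dim(x, start, size, axis=0)


@partial(jax.jit, static_argnums=(1, 2))
def subset_by_mask(x, start, size):
    # Mask variant: keep the full sequence length and zero out other modalities.
    positions = jnp.arange(x.shape[0])
    mask = (positions >= start) & (positions < start + size)
    return jnp.where(mask[:, None], x, 0.0)


x = jnp.ones((48, 64))                    # (seq_len, dim), made-up sizes
text_tokens = subset_by_slice(x, *TEXT)   # shape (16, 64)
image_masked = subset_by_mask(x, *IMAGE)  # shape (48, 64), image rows preserved
```

Whether the slice or the mask wins likely depends on how much downstream work the smaller sliced arrays save versus the cost of keeping the full-length masked sequence, which is what I am trying to measure in the Colab above.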
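For question 2, one possible workaround (a sketch under my own assumptions, not a definitive answer): since `k` for `jax.lax.approx_max_k` has to be a static int, pick a single static upper bound `MAX_K` shared by all modalities, and let each modality's real budget enter only through a mask, which can stay a regular traced array. The scores and per-modality budgets below are made up for illustration:

```python
import jax
import jax.numpy as jnp

MAX_K = 3  # static upper bound on k, shared by all modalities

# Hypothetical per-modality score rows, padded to a common length.
scores = jnp.array([[0.9, 0.1, 0.4, 0.7],
                    [0.2, 0.8, 0.5, 0.3],
                    [0.6, 0.6, 0.1, 0.2]])
k_per_modality = jnp.array([2, 3, 1])  # dynamic: how many tokens each modality keeps


@jax.jit
def prune_topk(scores, k_per_modality):
    # approx_max_k reduces along the last axis, so the modality axis just rides
    # along as a batch dimension; k is the same static MAX_K everywhere...
    vals, idx = jax.lax.approx_max_k(scores, MAX_K)          # (modalities, MAX_K)
    # ...and each modality's own (traced) budget only enters through a mask.
    keep = jnp.arange(MAX_K)[None, :] < k_per_modality[:, None]
    return jnp.where(keep, vals, -jnp.inf), jnp.where(keep, idx, -1)


vals, idx = prune_topk(scores, k_per_modality)  # both of shape (3, MAX_K)
```

Because `k` is the same static constant everywhere, this compiles once per input shape; the same idea should also work inside a `vmap`ped per-modality function, since the part that varies (`k_per_modality`) is an ordinary array argument rather than a static one.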