I'm trying to implement Mixtral in JAX and am running into one problem with the way the MoE block works. If you aren't familiar, the PyTorch version works like this: after self-attention, we multiply the hidden states by the router weights (plus some additional processing), which gives a mapping from each token in the sequence to its top-2 experts. We then iterate over each expert, gather the tokens routed to it, apply that expert's parameters, scatter the results back together across experts, and move on. The problem is that we can't do quite the same thing in JAX, because the number of tokens per expert is data-dependent and varies a lot, so the shapes aren't static. Does that mean we cannot implement Mixtral in JAX? The PyTorch code snippet for the MoE block will be provided in the next message.
Any suggestions on how to approach this are very welcome.
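One shape-static way to express this (a rough sketch, not the reference Mixtral implementation; the function and parameter names here are made up, and the expert MLP is simplified from Mixtral's gated MLP) is to run every expert over every token and combine the outputs with the top-2 router weights, so no data-dependent gather is needed:

```python
import jax
import jax.numpy as jnp

def dense_moe(hidden, router_w, expert_w1, expert_w2, top_k=2):
    """Shape-static MoE sketch: every expert processes every token and the
    outputs are combined with the top-k router weights. Wasteful in FLOPs,
    but there are no data-dependent shapes, so it jits cleanly.

    hidden:    (tokens, d_model)
    router_w:  (d_model, n_experts)
    expert_w1: (n_experts, d_model, d_ff)
    expert_w2: (n_experts, d_ff, d_model)
    """
    probs = jax.nn.softmax(hidden @ router_w, axis=-1)         # (tokens, n_experts)

    # Zero out everything except each token's top-k experts, then renormalize.
    top_vals, top_idx = jax.lax.top_k(probs, top_k)            # (tokens, k)
    mask = jax.nn.one_hot(top_idx, probs.shape[-1]).sum(axis=1)
    gates = probs * mask
    gates = gates / gates.sum(axis=-1, keepdims=True)

    # Simplified two-matrix expert (Mixtral's real experts are gated MLPs).
    def run_expert(w1, w2):
        return jax.nn.silu(hidden @ w1) @ w2                   # (tokens, d_model)

    expert_out = jax.vmap(run_expert)(expert_w1, expert_w2)    # (n_experts, tokens, d_model)
    return jnp.einsum('te,etd->td', gates, expert_out)
```

This trades extra FLOPs for static shapes, so it is mainly useful as a correctness baseline before trying a capacity-based or sorted/segmented dispatch.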
Edit: I've tried using a fixed number of tokens for each expert (seq_len // 8), but after a few decoder layers the hidden states seem to carry basically no information.
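For the fixed-capacity variant, one common formulation gives each expert a buffer of `capacity` slots, drops whatever overflows (those tokens then contribute only through the block's residual connection), and combines with the gate weights. As a sanity check on the numbers: with top-2 routing over 8 experts the expected load per expert is seq_len / 4, so a capacity of seq_len // 8 would drop roughly half of the routed assignments on average. A minimal sketch, with made-up names and no load balancing:

```python
import jax
import jax.numpy as jnp

def capacity_moe(hidden, gates, expert_w1, expert_w2, capacity):
    """Fixed-capacity dispatch sketch: each expert gets `capacity` slots;
    assignments beyond that are dropped, so those tokens get a zero MoE
    output and rely on the layer's residual path.

    hidden: (tokens, d_model)
    gates:  (tokens, n_experts) with top-k gate weights, zeros elsewhere
    """
    assigned = gates > 0                                        # (tokens, n_experts)
    # Position of each token inside its expert's buffer (0-based).
    position = jnp.cumsum(assigned, axis=0) - 1
    keep = assigned & (position < capacity)

    # dispatch[t, e, c] = 1 if token t occupies slot c of expert e.
    dispatch = keep[..., None] * jax.nn.one_hot(position, capacity)
    combine = gates[..., None] * dispatch                       # gate-weighted combine tensor

    expert_in = jnp.einsum('tec,td->ecd', dispatch, hidden)     # (n_experts, capacity, d_model)

    def run_expert(x, w1, w2):                                  # simplified expert MLP
        return jax.nn.silu(x @ w1) @ w2

    expert_out = jax.vmap(run_expert)(expert_in, expert_w1, expert_w2)
    return jnp.einsum('tec,ecd->td', combine, expert_out)       # dropped tokens get zeros
```

Dropped tokens produce a zero MoE output, so if the capacity sits far below the average load, a large fraction of tokens pass through each layer on the residual path alone.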