
[QUESTION] glu activation with tensor parallel in GroupedMLP #985

@Teng-xu

Description:

When training GroupedMLP with tensor parallelism (TP) enabled and gated_linear_unit activated, the activation function is applied directly to fc1_output. With a TP degree of 2, this intermediate output holds only the tensor values local to one TP rank, i.e. half of the full tensor. Applying the GLU activation to this partial output loses information, because only half of the tensor values participate in the activation.

Specifically, in the GLU function (https://github.com/NVIDIA/Megatron-LM/blob/core_v0.7.0/megatron/core/transformer/moe/experts.py#L48):
self.config.activation_func(x[0]) * x[1]

Both self.config.activation_func(x[0]) and x[1] contain only the local TP shard of the output tensor, so the result does not match training without TP.
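The mismatch can be sketched outside Megatron-LM with numpy. Assume a hypothetical ffn_hidden of 4, so the fused fc1 output has 2 * ffn_hidden = 8 columns laid out as [gate | up]; the shapes and the contiguous column split are illustrative assumptions, not Megatron-LM's exact sharding code.

```python
import numpy as np

# Hypothetical fused fc1 output: ffn_hidden = 4, so 2 * 4 = 8 columns,
# laid out as [gate | up] along the last dimension.
rng = np.random.default_rng(0)
fc1_output = rng.standard_normal((1, 8))

def silu(x):
    # SiLU as an example activation_func.
    return x / (1.0 + np.exp(-x))

# Reference (no TP): chunk the full fused output into gate and up halves,
# mirroring self.config.activation_func(x[0]) * x[1].
gate, up = np.split(fc1_output, 2, axis=-1)
reference = silu(gate) * up  # shape (1, 4)

# Naive TP=2: each rank holds a contiguous column slice of the fused output.
rank0, rank1 = np.split(fc1_output, 2, axis=-1)  # each (1, 4)

# If each rank chunks its local slice, rank 0's "gate" and "up" both come
# from the true gate half, and rank 1's both come from the true up half.
g0, u0 = np.split(rank0, 2, axis=-1)
g1, u1 = np.split(rank1, 2, axis=-1)
naive = np.concatenate([silu(g0) * u0, silu(g1) * u1], axis=-1)

# The per-rank GLU output diverges from the no-TP reference.
assert not np.allclose(reference, naive)
```

This is only meant to show why chunking a TP-sharded fused output mixes the gate and up halves; the actual layout depends on how the fc1 weight is partitioned across ranks.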

Steps to Reproduce:

  1. Enable gated_linear_unit in the GroupedMLP configuration.
  2. Train the model with Tensor Parallel (TP) enabled.
  3. Compare the intermediate outputs of the GLU activation function with and without TP enabled. (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/moe/experts.py#L176)

Expected Behavior:

The activation function should correctly handle the tensor values across all TP ranks to prevent any loss of information, ensuring consistency with results obtained without TP.
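Continuing the hypothetical numpy sketch above, one way to get this consistency is GLU-aware sharding: give each TP rank a matching shard of the gate half and the up half, so the local activation operates on corresponding values. This is an illustration of the expected invariant, not Megatron-LM's implementation.

```python
import numpy as np

# Same hypothetical setup: ffn_hidden = 4, fused fc1 output of 8 columns
# laid out as [gate | up].
rng = np.random.default_rng(0)
fc1_output = rng.standard_normal((1, 8))

def silu(x):
    return x / (1.0 + np.exp(-x))

# No-TP reference GLU output.
gate, up = np.split(fc1_output, 2, axis=-1)
reference = silu(gate) * up

# TP=2 with GLU-aware sharding: rank r holds [gate shard r | up shard r],
# so the local GLU combines corresponding gate/up values.
shard = 4 // 2  # ffn_hidden / tp_degree
outputs = []
for r in range(2):
    g_r = gate[:, r * shard:(r + 1) * shard]
    u_r = up[:, r * shard:(r + 1) * shard]
    outputs.append(silu(g_r) * u_r)

# Concatenating the per-rank outputs recovers the no-TP result exactly.
assert np.allclose(np.concatenate(outputs, axis=-1), reference)
```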

Actual Behavior:

The GLU activation function is applied to tensor values that only represent half of the full tensor due to TP, leading to inconsistent results.
