
Question on counting FLOPs for DyT #4

@Bostoncake

Description

Hi! I was reproducing your experiments on counting FLOPs for DyT and have a question about https://github.com/NUS-HPC-AI-Lab/Dynamic-Tuning/blob/d1744f0b9366f79ad9b78f586e479af34e81807a/models/vision_transformer_IN21K.py#L180C73-L180C89.

In the paper, DyT decides whether to activate each token based on the output of a sigmoid activation. During training/inference, the TokenSelect class returns a full-sized token-selection matrix that excludes some tokens from the dense computation. If I interpret this correctly, the activated tokens can be scattered sparsely across the token sequence. However, in the line referenced above, the FLOPs calculation simply treats a prefix of the token sequence as the activated tokens and forwards that prefix through the dense computation. In my view, such a convenient layout may not hold during actual fine-tuning, which calls the validity of the FLOPs calculation into question. Is that the case for DyT? Or is there some mechanism that aggregates the activated tokens into the prefix?
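To clarify what I mean by "aggregating the activated tokens into the prefix", here is a minimal sketch of such a compaction step, assuming a (B, N, D) token tensor and a (B, N) boolean mask of the kind a TokenSelect-style module might produce. The helper name `gather_active_tokens` and the pad-to-batch-maximum behavior are my own assumptions for illustration, not code from this repository.

```python
import torch

def gather_active_tokens(x: torch.Tensor, mask: torch.Tensor):
    """Move activated tokens to the front of the sequence.

    x:    (B, N, D) token embeddings
    mask: (B, N) boolean selection mask (True = token is activated)

    After the gather, a dense slice x_sorted[:, :k] covers all activated
    tokens; samples with fewer than k activated tokens carry deactivated
    (padding) tokens at the tail of the prefix.
    """
    # Stable sort keeps the relative order of the activated tokens intact.
    order = torch.sort(mask.int(), dim=1, descending=True, stable=True).indices  # (B, N)
    x_sorted = torch.gather(x, 1, order.unsqueeze(-1).expand_as(x))
    k = int(mask.sum(dim=1).max().item())  # longest activated prefix in the batch
    return x_sorted[:, :k], order[:, :k]

# Toy example: the activated tokens are scattered, not a prefix.
x = torch.randn(2, 5, 3)
mask = torch.tensor([[True, False, True, False, True],
                     [False, True, False, False, True]])
prefix, kept_idx = gather_active_tokens(x, mask)
print(prefix.shape)  # torch.Size([2, 3, 3])
print(kept_idx)      # tensor([[0, 2, 4], [1, 4, 0]])
```

Without a step like this (or a sparse/gather-based attention implementation), simply slicing the first k tokens would not correspond to the tokens that the selection mask actually activates.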

Thanks!
