Hi! I was reproducing your experiments on counting the FLOPs for DyT, and I ran into a question about https://github.com/NUS-HPC-AI-Lab/Dynamic-Tuning/blob/d1744f0b9366f79ad9b78f586e479af34e81807a/models/vision_transformer_IN21K.py#L180C73-L180C89.
In the paper, DyT decides whether to activate each token based on its sigmoid activation value. During training/inference, the TokenSelect class returns a full-sized token selection matrix that excludes some tokens from the dense computation. If I interpret this correctly, the activated tokens can be scattered anywhere in the token sequence. However, in the referenced line above, the FLOPs calculation simply treats the prefix of the token sequence as the activated tokens and forwards it through the dense computation. In my view, such a convenient layout may not exist during actual fine-tuning, which calls the validity of the FLOPs calculation into question. Is that the case for DyT? Or is there actually some mechanism that gathers the activated tokens into a contiguous prefix? (A small sketch of the distinction I mean is below.)
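To make the question concrete, here is a minimal sketch (my own illustration, not code from the repo) of the two situations I have in mind. The names `mask`, `tokens`, etc. are hypothetical; `mask` plays the role of the TokenSelect output.

```python
import torch

torch.manual_seed(0)

B, N, D = 1, 8, 4                      # batch, tokens, embedding dim
tokens = torch.randn(B, N, D)
# Activated tokens, possibly scattered across the sequence
mask = torch.tensor([[1, 0, 1, 1, 0, 0, 1, 0]], dtype=torch.bool)

# (a) What the FLOPs counting seems to assume: the activated tokens form a
#     contiguous prefix, so a dense op can simply run on tokens[:, :k].
k = int(mask.sum())
prefix = tokens[:, :k]                 # only valid if the activated tokens really are the prefix

# (b) What scattered activations would seem to require: gather the activated
#     tokens to the front, run the dense op, then scatter the results back.
idx = mask[0].nonzero(as_tuple=True)[0]   # positions of activated tokens
gathered = tokens[:, idx]                 # (B, k, D) contiguous block for the dense op
out = torch.zeros_like(tokens)
out[:, idx] = gathered                    # scatter back into the full sequence

print(prefix.shape, gathered.shape)       # both (1, k, D), but only (b) respects the mask
```

If (b) (or something equivalent) is what actually happens, then counting FLOPs over a length-k prefix seems fair; otherwise I'm not sure the prefix-based count reflects the real cost.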
Thanks!