Hi! I was reproducing your experiments on counting the FLOPs for DyT, and I ran into a question about https://github.com/NUS-HPC-AI-Lab/Dynamic-Tuning/blob/d1744f0b9366f79ad9b78f586e479af34e81807a/models/vision_transformer_IN21K.py#L180C73-L180C89.
In the paper, DyT decides whether to activate each token based on its sigmoid activation value. During training/inference, the TokenSelect class returns a full-sized token selection matrix that excludes some tokens from the dense computation. If I interpret this correctly, the activated tokens can be scattered anywhere in the token sequence. However, in the referenced line above, the FLOPs calculation simply treats the prefix of the token sequence as the activated tokens and forwards it through the dense computation. In my view, such a convenient layout may not exist during actual fine-tuning, which calls the validity of the FLOPs calculation into question. Is that the case for DyT? Or is there actually some mechanism that gathers the activated tokens into a contiguous prefix? (A small sketch of the distinction I mean is below.)
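To make the question concrete, here is a minimal sketch (my own illustration, not code from the repo) of the two situations I have in mind. The names `mask`, `tokens`, etc. are hypothetical; `mask` plays the role of the TokenSelect output.

```python
import torch

torch.manual_seed(0)

B, N, D = 1, 8, 4                      # batch, tokens, embedding dim
tokens = torch.randn(B, N, D)
# Activated tokens, possibly scattered across the sequence
mask = torch.tensor([[1, 0, 1, 1, 0, 0, 1, 0]], dtype=torch.bool)

# (a) What the FLOPs counting seems to assume: the activated tokens form a
#     contiguous prefix, so a dense op can simply run on tokens[:, :k].
k = int(mask.sum())
prefix = tokens[:, :k]                 # only valid if the activated tokens really are the prefix

# (b) What scattered activations would seem to require: gather the activated
#     tokens to the front, run the dense op, then scatter the results back.
idx = mask[0].nonzero(as_tuple=True)[0]   # positions of activated tokens
gathered = tokens[:, idx]                 # (B, k, D) contiguous block for the dense op
out = torch.zeros_like(tokens)
out[:, idx] = gathered                    # scatter back into the full sequence

print(prefix.shape, gathered.shape)       # both (1, k, D), but only (b) respects the mask
```

If (b) (or something equivalent) is what actually happens, then counting FLOPs over a length-k prefix seems fair; otherwise I'm not sure the prefix-based count reflects the real cost.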
Thanks!