Description
I have been trying to use TimeSformer and ViViT. I have managed to convert them into regression models by changing the loss function and setting the output of the MLP head to 1. However, as I understand it, a video vision transformer takes a video clip as input (broken into frames) and outputs a single value for that clip. I would like the model to output a value for each frame of the input clip, so instead of outputting 1 value it would output 32 values. Can you guide me in this regard? For reference, the sketch below shows the kind of per-frame head I have in mind.
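A minimal sketch of the idea: instead of regressing from the CLS token, keep the per-token features, pool them per frame, and apply a linear head to each frame. The token layout `(batch, 1 + num_frames * patches_per_frame, dim)` and the frame-major ordering are assumptions for illustration, not the confirmed TimeSformer/ViViT output format, so the reshape may need adjusting for the actual backbone.

```python
import torch
import torch.nn as nn

class PerFrameRegressionHead(nn.Module):
    """Maps per-frame pooled features to one regression value per frame.

    Assumes the backbone returns token features of shape
    (batch, 1 + num_frames * patches_per_frame, dim), i.e. a CLS token
    followed by spatio-temporal patch tokens in frame-major order.
    This layout is an assumption, not the actual repo API.
    """

    def __init__(self, dim: int, num_frames: int = 32):
        super().__init__()
        self.num_frames = num_frames
        self.head = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape
        patch_tokens = tokens[:, 1:, :]                       # drop the CLS token
        patches_per_frame = patch_tokens.shape[1] // self.num_frames
        # Group tokens by frame and mean-pool spatially -> (B, T, D)
        frame_feats = patch_tokens.reshape(
            b, self.num_frames, patches_per_frame, d
        ).mean(dim=2)
        # One scalar per frame -> (B, T), e.g. 32 values per clip
        return self.head(frame_feats).squeeze(-1)

# Usage with dummy features standing in for real backbone output
feats = torch.randn(2, 1 + 32 * 196, 768)  # batch 2, 32 frames, 14x14 patches, dim 768
head = PerFrameRegressionHead(dim=768, num_frames=32)
print(head(feats).shape)  # torch.Size([2, 32])
```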