Expected behavior
The paper only mentions this as "Annotators or pretrained VLMs are instructed to segment videos." Does "pretrained VLM" mean that you fine-tuned a VLM specifically for video splitting, or that you used a general-purpose VLM as is? If the latter, could you share how you prompt it?