Could the authors share the video splitting process in detail?

### System Info

-

### Reproduction

-

### Expected behavior

The paper only mentions that annotators or pretrained VLMs are instructed to segment videos. Does "pretrained VLM" mean that you fine-tuned a VLM specifically for video splitting, or that you used a general-purpose VLM as is? If the latter, could you share how you prompt it?