Fine-tuning and end-to-end inference on a video

I want to fine-tune this pre-trained model on a small set of videos for learning purposes, but I have not been able to get it working. Any help would be greatly appreciated. If possible, could you also provide a single "master" virtual environment that can run the entire caption generation pipeline in one go, without switching environments? Thanks in advance.