I want to use it in some real-world scenes. Can this deep net work well on other datasets, such as VOC? Can it be trained well on non-video datasets?