Did you try extending instance segmentation to global (image/video) classification, i.e., predicting the class of the whole image or video from its segmented instances?