Hello, author. When extracting the multi-modal features, is it necessary to separate the training-set features from the test-set features?

I ask because your code appears to extract the features of all videos into a single HDF5 file, without distinguishing between the training set and the test set. I hope you can spare some of your valuable time to answer this question.