Ensemble Performance worse than individual model? #3967
Hi guys, I'm training models with 10-fold cross-validation and then using those 10 models in an ensemble to improve performance on a test set. For training I'm taking a subset of 200 samples from BraTS2021 with 20 samples in each fold, i.e. 180 for training and 20 for validation. The ensemble seems to perform worse than an individual model. To confirm that this wasn't caused by the overlap between the training samples, I also swapped it around so that each model had 20 samples for training and 180 for validation, ensuring no overlap at all in training. Is this loss of performance to be expected? Below is a sample of the code I'm trying to use to ensemble:
Hi @yiheng-wang-nv, could you please help verify our tutorial regarding this question? Thanks in advance.
Hi @bip5, I checked the mean ensemble using the Spleen dataset (I referred to the spleen tutorial). I split the training set (32 images) into 4 parts and trained 4 models, so each model is trained on 8 images, and they all use the same validation set.
When I used `EnsembleEvaluator` with `MeanEnsembled`, the score was 0.8378031253814697, which is better than all single models. Using `VoteEnsembled` got 0.810712456703186, which is worse than the best single model.
In addition, I also tried to manually write an ensemble loop (mean aggregation over all predicted logits), and the achieved score is the same as using `EnsembleEvaluator` with `MeanEnsembled`. Therefore, I think the behavior of this engine is expected. If possible, could you please also try to do the ensemble step manually to see the results? You can refer to the attached file, which is what I used on my side.
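For readers who want to try the manual ensemble step, here is a minimal sketch of mean aggregation over predicted logits, with a majority vote over per-model labels for comparison. It is not the attached file and does not use the MONAI API: it is plain NumPy, and the function names (`mean_ensemble`, `vote_ensemble`, `dice`) and the toy logits are hypothetical, chosen only to show how the two strategies can disagree.

```python
import numpy as np

def mean_ensemble(logits_per_model):
    """Average the logits over models, then take argmax over the class axis.

    logits_per_model: list of arrays, each shaped (num_classes, ...).
    """
    stacked = np.stack(logits_per_model, axis=0)   # (models, classes, ...)
    return stacked.mean(axis=0).argmax(axis=0)     # (...)

def vote_ensemble(logits_per_model, num_classes):
    """Take each model's argmax label, then pick the most-voted class.

    Ties go to the lowest class index (an assumption of this sketch).
    """
    labels = np.stack([l.argmax(axis=0) for l in logits_per_model], axis=0)
    votes = np.stack([(labels == c).sum(axis=0) for c in range(num_classes)], axis=0)
    return votes.argmax(axis=0)

def dice(pred, target, cls=1):
    """Dice score for a single class on hard label maps."""
    p, t = pred == cls, target == cls
    denom = p.sum() + t.sum()
    return 2.0 * (p & t).sum() / denom if denom else 1.0

# Toy data (hypothetical): 3 models, 2 classes, 4 voxels.
zeros = np.zeros(4)
models = [
    np.stack([zeros, np.array([5.0, 5.0, -5.0, -5.0])]),   # confident model
    np.stack([zeros, np.array([-1.0, 1.0, -1.0, 1.0])]),   # weak model
    np.stack([zeros, np.array([-1.0, 1.0, -1.0, -1.0])]),  # weak model
]
target = np.array([1, 1, 0, 0])

mean_pred = mean_ensemble(models)
vote_pred = vote_ensemble(models, num_classes=2)
print("mean:", mean_pred.tolist(), "dice:", dice(mean_pred, target))
print("vote:", vote_pred.tolist(), "dice:", dice(vote_pred, target))
```

In this toy case the confident model outweighs two weakly wrong models when logits are averaged, but is outvoted under majority voting, so the two strategies give different Dice scores on the same predictions.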
Hi @Nic-Ma, when I checked the …