Ensemble Performance worse than individual model? #3967

bip5 · 2022-03-18T23:48:14Z

bip5
Mar 18, 2022

Hi guys,

I'm training models with 10-fold cross-validation and then using those 10 models in an ensemble to improve performance on a test set.
Unfortunately, every single model performs better than the ensemble. I've tried both the vote ensemble and the mean ensemble.

For training I'm taking a subset of 200 samples from BraTS2021 with 20 samples in each fold, i.e. 180 for training and 20 for validation. To confirm that the problem wasn't caused by the overlap between training samples, I also swapped it around so that each model had 20 samples for training and 180 for validation, ensuring no overlap at all in the training data. Is this loss of performance to be expected?
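
For reference, the two splitting schemes described above could be sketched as follows. This is only a minimal illustration under assumptions: the sample IDs are plain integers standing in for BraTS2021 cases, the folds are contiguous and unshuffled, and the names `make_folds`, `standard` and `swapped` are hypothetical rather than taken from the actual training script.

```python
def make_folds(sample_ids, n_folds):
    """Split sample_ids into n_folds contiguous, equally sized chunks."""
    fold_size = len(sample_ids) // n_folds
    return [sample_ids[k * fold_size:(k + 1) * fold_size] for k in range(n_folds)]

sample_ids = list(range(200))        # stand-in for the 200 selected cases
folds = make_folds(sample_ids, 10)   # 10 folds of 20 samples each

# Standard 10-fold CV: train on 9 folds (180 samples), validate on the held-out fold (20).
standard = [
    {"train": [s for j, f in enumerate(folds) if j != k for s in f], "val": folds[k]}
    for k in range(10)
]

# Swapped variant: train on a single fold (20), validate on the other 9 (180),
# so no two models share any training sample.
swapped = [{"train": split["val"], "val": split["train"]} for split in standard]

print(len(standard[0]["train"]), len(standard[0]["val"]))  # 180 20
print(len(swapped[0]["train"]), len(swapped[0]["val"]))    # 20 180
```

In the standard scheme any two models share 160 training samples, while in the swapped scheme their training sets are disjoint.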

Below is a sample of the code I'm trying to use to ensemble:

        from monai.engines import EnsembleEvaluator
        from monai.handlers import MeanDice, from_engine
        from monai.inferers import SlidingWindowInferer
        from monai.transforms import (
            Activationsd,
            AsDiscreted,
            Compose,
            EnsureTyped,
            VoteEnsembled,
        )

        # device, test_loader and models are defined earlier in the script (not shown)
        pred_keys = ["pred" + str(i) for i in range(10)]  # one prediction key per model

        def ensemble_evaluate(post_transforms, models):
            print(post_transforms.transforms)
            evaluator = EnsembleEvaluator(
                device=device,
                val_data_loader=test_loader,  # test dataloader - this is loading all 5 sets of data
                pred_keys=pred_keys,
                networks=models,  # models defined above
                inferer=SlidingWindowInferer(
                    roi_size=(96, 96, 96), sw_batch_size=4, overlap=0.5),
                postprocessing=post_transforms,  # chosen according to the type of ensemble
                key_val_metric={
                    "test_mean_dice": MeanDice(
                        include_background=True,
                        # takes all the preds and labels and turns them into one list each
                        output_transform=from_engine(["pred", "label"]),
                    )},
                additional_metrics={
                    "Channelwise": MeanDice(
                        include_background=True,
                        output_transform=from_engine(["pred", "label"]),
                        reduction="mean_batch")
                },
            )
            evaluator.run()

        vote_post_transforms = Compose(
            [
                EnsureTyped(keys=pred_keys),
                Activationsd(keys=pred_keys, sigmoid=True),
                # transform data into discrete before voting
                AsDiscreted(keys=pred_keys, threshold=0.3),
                VoteEnsembled(keys=pred_keys, output_key="pred"),
            ]
        )
        ensemble_evaluate(vote_post_transforms, models)

Answered by yiheng-wang-nv

Mar 21, 2022


Nic-Ma · 2022-03-20T07:30:55Z

Nic-Ma
Mar 20, 2022
Maintainer

Hi @yiheng-wang-nv ,

Could you please help verify our tutorial about this question?
https://github.com/Project-MONAI/tutorials/blob/master/modules/cross_validation_models_ensemble.ipynb
Maybe we have a bug, or is it expected behavior that an ensemble is not always useful?

Thanks in advance.

0 replies

yiheng-wang-nv · 2022-03-21T14:58:41Z

yiheng-wang-nv
Mar 21, 2022
Collaborator

Hi @bip5, I checked the mean ensemble by trying it on the Spleen dataset (I referred to the spleen tutorial).

I split the training set (32 images) into 4 parts and trained 4 models, so each model is trained on 8 images, and all of them use the same validation set.
The performance of each model is:

Metric of fold 0: 0.7774583101272583
Metric of fold 1: 0.8035825490951538
Metric of fold 2: 0.8150454163551331
Metric of fold 3: 0.7605332732200623

When I used EnsembleEvaluator with MeanEnsembled, the score was 0.8378031253814697, which is better than every single model. VoteEnsembled got 0.810712456703186, which is worse than the best single model (fold 2, 0.8150454163551331).

In addition, I also tried writing the ensemble loop manually (mean aggregation over all predicted logits), and the score is the same as with EnsembleEvaluator.

Therefore, I think the behavior of this engine is as expected. If possible, could you also try doing the ensemble step manually to see the results? You can refer to the attached file, which is what I used on my side.

spleen_fast_4_splits.ipynb.zip
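
To make the difference between the two aggregation strategies concrete, here is a minimal pure-Python sketch. It is not the code from the attached notebook: the logits are made-up numbers for 3 models and 4 voxels, the 0.5 threshold is an assumption, and `sigmoid` and `dice` are small hypothetical helpers. The numbers were chosen so that the two strategies disagree; they do not show that one is better in general.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dice(pred, label):
    """Dice score between two binary masks given as lists of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, label))
    return 2.0 * intersection / (sum(pred) + sum(label))

# Made-up logits from 3 models for 4 voxels (not data from this thread).
logits = [
    [2.0, -0.2, -1.5, 0.4],
    [1.5, -0.2, -2.0, -0.4],
    [1.0, 2.0, -1.0, -0.4],
]
label = [1, 1, 0, 0]
threshold = 0.5

# Mean ensemble: average the logits voxel by voxel, then activate and threshold once.
mean_logits = [sum(voxel) / len(logits) for voxel in zip(*logits)]
mean_mask = [int(sigmoid(z) > threshold) for z in mean_logits]

# Vote ensemble: activate and threshold each model first, then take the majority.
masks = [[int(sigmoid(z) > threshold) for z in model] for model in logits]
vote_mask = [int(sum(votes) * 2 > len(masks)) for votes in zip(*masks)]

print(mean_mask, dice(mean_mask, label))  # [1, 1, 0, 0] 1.0
print(vote_mask, dice(vote_mask, label))  # [1, 0, 0, 0], Dice about 0.67
```

On the second voxel one confident model outweighs two mildly negative ones in the mean, whereas voting discards that confidence because every model is binarised before aggregation.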

4 replies

bip5 Mar 21, 2022
Author

Hi @yiheng-wang-nv ,

Thank you for the confirmation and the example to go with it.

After some debugging I noticed the engine is reporting the dice score of the final model in the list as the dice score for all of the ensemble.

EDIT: I think I might have the source of my confusion:

print("Mean Dice:",evaluator.state.metrics['test_mean_dice'],"metric_tc:",float(evaluator.state.metrics["Channelwise"][0]),"whole tumor:",float(evaluator.state.metrics["Channelwise"][1]),"enhancing tumor:",float(evaluator.state.metrics["Channelwise"][2]))

I'm assuming the evaluator state here must be overwritten each time a model in the networks list is processed. EDIT: This wasn't the case. It's fine to access the engine metrics like this to get the right summary. The root cause was that I did not initialise a new model each time before loading the weights to pass to the engine.

Is there a way to access the information below (and the additional metrics) through a variable? Nothing seems to be output when evaluator.run() is wrapped inside a function.

INFO:ignite.engine.engine.EnsembleEvaluator:Engine run resuming from iteration 0, epoch 0 until 1 epochs
INFO:ignite.engine.engine.EnsembleEvaluator:Got new best metric of mean_dice: 0.8378031253814697
INFO:ignite.engine.engine.EnsembleEvaluator:Epoch[1] Complete. Time taken: 00:00:30
INFO:ignite.engine.engine.EnsembleEvaluator:Engine run complete. Time taken: 00:00:30

yiheng-wang-nv Mar 22, 2022
Collaborator

Hi @bip5 ,

> After some debugging I noticed the engine is reporting the dice score of the final model in the list as the dice score for all of the ensemble.

Do you mean that the score you took to be the ensemble's is actually the score of the final model? If so, is the correct ensemble score better?

> Is there a way to access below information(and additional metrics through a variable) ?

To see that output, you may need to set:

import logging, sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
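
If the goal is to keep those messages in a variable rather than only print them, one option is a small custom logging handler. The sketch below uses only the standard library and is an assumption-laden illustration: the logger name is copied from the log lines quoted above, `ListHandler` is a hypothetical helper, and the `logger.info` call merely stands in for what the engine would emit during `evaluator.run()`.

```python
import logging

class ListHandler(logging.Handler):
    """Hypothetical handler that stores formatted log messages in a list."""

    def __init__(self):
        super().__init__(level=logging.INFO)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())

# Logger name taken from the log output quoted in this thread.
logger = logging.getLogger("ignite.engine.engine.EnsembleEvaluator")
logger.setLevel(logging.INFO)
handler = ListHandler()
logger.addHandler(handler)

# Stand-in for a message the engine would log while running.
logger.info("Got new best metric of mean_dice: 0.8378031253814697")

captured = handler.messages  # the messages are now available as a plain list
print(captured)
```

The metric values themselves remain available through `evaluator.state.metrics`, as confirmed earlier in the thread, so the handler is only needed for the log text.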

bip5 Mar 22, 2022
Author

> Do you mean the score of the ensemble you thought is actually the score of the final model? Thus does the correct score perform better?

Unfortunately, I have not yet been able to access the final score. Even when I tried your manual method, the problem persisted: only the final model's score was reported as the Dice score.

I have a suspicion it is because I don't define the model each time in the loop like you did in the notebook. I'm trying it now, will keep you updated.

bip5 Mar 22, 2022
Author

> I have a suspicion it is because I don't define the model each time in the loop like you did in the notebook. I'm trying it now, will keep you updated.

I was right about this. All working now - the ensemble Dice score of 0.74 is better than the best single model, which I think is 0.72. Thank you for your help!
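
One plausible reading of this root cause is ordinary Python object aliasing: if a single model object is reused and its weights are reloaded in a loop, every entry in the list refers to the same object, which ends up holding the last checkpoint. The sketch below illustrates that with a hypothetical `TinyModel` stand-in and made-up weights; it is not the code from this thread and does not use PyTorch.

```python
class TinyModel:
    """Hypothetical stand-in for a network object that holds its weights."""

    def __init__(self):
        self.weights = {}

    def load_state_dict(self, state):
        # Loading overwrites the weights of this object in place.
        self.weights = dict(state)

checkpoints = [{"w": 0.1}, {"w": 0.2}, {"w": 0.3}]  # made-up per-fold weights

# Pitfall: one object is reused, so the list holds three references to it.
shared = TinyModel()
aliased = []
for state in checkpoints:
    shared.load_state_dict(state)
    aliased.append(shared)

# Fix: create a fresh model for each fold before loading its weights.
independent = []
for state in checkpoints:
    model = TinyModel()
    model.load_state_dict(state)
    independent.append(model)

aliased_weights = [m.weights["w"] for m in aliased]
independent_weights = [m.weights["w"] for m in independent]
print(aliased_weights)      # [0.3, 0.3, 0.3]
print(independent_weights)  # [0.1, 0.2, 0.3]
```

With the aliased list, an ensemble of identical members would simply reproduce the last model's predictions, which matches the symptom of the ensemble reporting the final model's score.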

yiheng-wang-nv · 2022-03-21T15:00:22Z

yiheng-wang-nv
Mar 21, 2022
Collaborator

Hi @Nic-Ma, when I checked the EnsembleEvaluator, there seemed to be a bug when len(networks) = 1. Let me create a separate ticket to discuss that issue.

1 reply

yiheng-wang-nv Mar 22, 2022
Collaborator

Hi @Nic-Ma, it is not a bug: according to the docstring of MeanEnsembled, in the single-model case the input needs one more dimension, E. When I added AddChanneld before this transform, the program worked fine.
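
The need for the extra E dimension can be illustrated with plain Python lists. This sketch assumes that a mean ensemble given a single input averages over its first axis; `mean_over_first_axis` is a hypothetical helper written for this illustration, not the MONAI implementation, and the two-channel prediction is made up.

```python
def mean_over_first_axis(x):
    """Average a nested list over its first axis (hypothetical helper)."""
    if isinstance(x[0], list):
        return [sum(column) / len(x) for column in zip(*x)]
    return sum(x) / len(x)

single_pred = [0.2, 0.8]  # one model's output, shape [C] with C = 2 channels

# Without an E axis the first axis is the channel axis, so the channels are averaged away.
collapsed = mean_over_first_axis(single_pred)

# With a leading E axis of length 1 (shape [E, C]) the prediction passes through unchanged.
preserved = mean_over_first_axis([single_pred])

print(collapsed)  # 0.5
print(preserved)  # [0.2, 0.8]
```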
