Adds support for normalizing cuboid data before classifying #588
Open
matham wants to merge 1 commit into brainglobe:main from
Conversation
Description
This PR depends on #493. Only the last commit in this PR is unique to this PR.
What is this PR
Why is this PR needed?
It's fairly standard when distributing pre-trained models, in particular image networks, to indicate how the input should be normalized, if at all. E.g. `mean=[0.485, 0.456, 0.406]`, `std=[0.229, 0.224, 0.225]` has been used for ImageNet-based models because those are the statistics of the ImageNet data. The reason is that the model will be more accurate if your data's statistics/input range matches the pre-trained dataset's statistics.

The pre-trained model distributed by brainglobe was trained on cells with implicit intensity statistics, so it would be nice if, when using it on other datasets, we could normalize our own data to match that distribution. But those statistics are not specified. Additionally, input data can vary wildly in intensity between channels and labeling types. So if you do transfer learning and want to provide a minimal set of new cells to train on, and you happen to fine-tune on cells from images where the overall image is bright, using the model on darker images may result in worse classification.
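As an illustration of the normalization described above (this sketch is mine, not code from this PR), per-channel normalization with the standard ImageNet statistics looks like:

```python
import numpy as np

# Standard ImageNet per-channel statistics (RGB), as quoted above.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])


def normalize_imagenet(image: np.ndarray) -> np.ndarray:
    """Normalize an (H, W, 3) float image in [0, 1] per channel,
    so its statistics match what the pre-trained model expects."""
    return (image - IMAGENET_MEAN) / IMAGENET_STD
```

An image whose pixels already sit at the ImageNet mean maps to all zeros, which is exactly the "matched statistics" condition the pre-trained weights assume.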
When training on my own cells, I found that adding such normalization helped improve accuracy as you can see in the images below.
What does this PR do?
This PR adds the option to normalize the training data and the classification data, and to save the image statistics into the yaml files when curating training data. Specifically:

- Training gains `normalize_channels: bool = False` and `normalization_down_sampling: int = 32` to control this.
- Classification gains `normalize_channels: bool = False`.
- Curation gains `normalize_channels: bool = False` and `normalization_down_sampling: int = 32`.

By default, all normalization is turned OFF.
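To picture what these parameters do, here is a minimal sketch (the function names and the single-channel `(z, y, x)` layout are my assumptions for illustration, not the PR's actual implementation): the channel statistics are estimated on a down-sampled view of the image, as `normalization_down_sampling` suggests, and then applied per channel:

```python
import numpy as np


def channel_stats(volume: np.ndarray, down_sampling: int = 32):
    """Estimate a channel's mean/std from every `down_sampling`-th voxel.

    `volume` is a single-channel (z, y, x) array; down-sampling keeps the
    estimate cheap on large images. Illustrative sketch only.
    """
    sub = volume[::down_sampling, ::down_sampling, ::down_sampling]
    return float(sub.mean()), float(sub.std())


def apply_normalization(volume: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Shift and scale one channel so its statistics roughly match the
    statistics the model was trained with."""
    return (volume - mean) / std
```

With `down_sampling=1` the estimate is exact, so a normalized channel has mean 0 and std 1; larger values trade a little accuracy for speed on full-size brains.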
References
#493.
How has this PR been tested?
I tested this on my own curated data, continuing training from the pre-trained brainglobe models. As I was getting cellfinder classification to work on our c-fos and fostrap images, I found the model wasn't performing well enough, so I added normalization, which improved the model.
I re-did those experiments systematically on our training data just to illustrate the benefits. It seems that normalization is most helpful when you have less training data and when you're using a smaller learning rate. Both make sense intuitively. For the former, if you don't have a lot of training data and it all has different statistics, the model has a harder time converging. For the same reason, a lower learning rate will more easily get stuck in local minima.
Here's the plotted train and test accuracy and loss for training on my data:
There are 4 plots, illustrating two different learning rates (`LR=1e-3` and `1e-4`) and two different sizes of the training data (`f=10%` and `50%`, meaning the fraction of the training data used for testing; this is a quick way to reduce the training data size, by using `50%` for testing). `N=n` indicates the number of lines in the plot, i.e. the number of independent replications of the training. FYI, each replication uses a different train/test split, which explains the spread of lines for a given condition.

Overall you can see how normalizing improves model performance, and this is only using training data from similar images (data augmentation was `25%`, using `resnet50_tv` with a batch size of `32` and with `--continue-training`).

On the other hand, I tried to estimate the serial2p data's statistics from the provided cubes (already a problem, because we want the statistics of the images they came from, not the extracted cubes) and saw little benefit when re-training the brainglobe model. I think this is because there is so much data in the training set that it really doesn't matter. Or the serial2p data are easier/cleaner to classify.
Is this a breaking change?
No.
Does this PR require an update to the documentation?
There are new parameters to document. Also, I think it may be worth releasing a new pre-trained brainglobe model trained on normalized inputs.
Checklist: