
Adds support for normalizing cuboid data before classifying #588

Open
matham wants to merge 1 commit into brainglobe:main from matham:caching_normal

Conversation

Contributor

@matham matham commented Feb 12, 2026

Description

This PR depends on #493. Only the last commit in this PR is unique to this PR.

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

It's fairly standard when distributing pre-trained models, particularly image networks, to indicate how the input should be normalized, if at all. E.g. mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] has been used for ImageNet-based models because those are the statistics of the ImageNet data. The reason is that the model will be more accurate if your data's statistics/input range match those of the pre-trained dataset.
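As a minimal sketch of what this kind of normalization means (plain NumPy, not code from this PR), each channel is shifted by its mean and scaled by its standard deviation:

```python
import numpy as np

def normalize(image: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Shift and scale intensities so the input matches the
    statistics the model was trained with."""
    return (image.astype(np.float32) - mean) / std

# ImageNet-style usage: each channel has its own mean/std,
# and NumPy broadcasting applies them along the channel axis.
rgb = np.random.rand(224, 224, 3).astype(np.float32)
means = np.array([0.485, 0.456, 0.406])
stds = np.array([0.229, 0.224, 0.225])
normalized = (rgb - means) / stds
```

After this transform, data drawn from the reference distribution has roughly zero mean and unit variance per channel, which is what the pre-trained weights expect.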

The pre-trained model distributed by brainglobe was trained on cells with implicit intensity statistics, but those statistics are not specified, so when using the model on other datasets we cannot normalize our own data to match that distribution. Additionally, input data can vary wildly in intensity between channels and labeling types. So if you do transfer learning and want to provide a minimal set of new cells to train on, and you happen to fine-tune on cells from images that are overall bright, using the model on darker images may result in worse classification.

When training on my own cells, I found that adding such normalization helped improve accuracy as you can see in the images below.

What does this PR do?

This PR adds the option to normalize the training data and the classification data, and to save the image statistics into the yaml files when curating training data. Specifically:

  1. During curation, it samples the z-stack of each channel, computes the mean/std, and dumps those into the "curated" yaml file. E.g. here's how the yaml file will look:

     ```yaml
     data:
     - bg_channel: 1
       cell_def: ''
       cube_dir: tests/data/integration/training/cells
       signal_channel: 0
       type: cell
       signal_mean: 241.31
       signal_std: 154.92
       bg_mean: 650.94
       bg_std: 217.90
     ```
    It adds the parameters normalize_channels: bool = False and normalization_down_sampling: int = 32 to control this.
  2. During training, we load these yaml files and, if selected, normalize the cuboids by the mean/std of the dataset from which the cuboids originated. After training, the model expects input cuboids to have a mean of 1 and std of 1. This is controlled by the parameter normalize_channels: bool = False.
  3. During classification, we similarly sample each channel in the z-stack and normalize the cuboids we want to classify by the mean/std of the corresponding channel. This is controlled by the parameters normalize_channels: bool = False and normalization_down_sampling: int = 32.

By default, all normalization is turned OFF.
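The curation/training/classification flow above can be sketched roughly as follows. Note these are hypothetical helper names for illustration, not the PR's actual API; the down-sampling factor mirrors the normalization_down_sampling parameter:

```python
import numpy as np

def channel_stats(z_stack: np.ndarray, down_sampling: int = 32):
    """Estimate a channel's mean/std from a spatially down-sampled
    view of the full z-stack (the curation/classification sampling step)."""
    sampled = z_stack[:, ::down_sampling, ::down_sampling]
    return float(sampled.mean()), float(sampled.std())

def normalize_cuboid(cuboid: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Normalize one extracted cuboid by the statistics of the
    channel it was extracted from (training/classification step)."""
    return (cuboid.astype(np.float32) - mean) / std

# Hypothetical usage: stats come from the whole stack, not the cuboid,
# so bright and dark acquisitions map onto comparable intensity ranges.
stack = np.random.randint(0, 2000, size=(100, 512, 512)).astype(np.float32)
mean, std = channel_stats(stack)
cuboid = stack[40:60, 100:150, 100:150]
normalized = normalize_cuboid(cuboid, mean, std)
```

The key design point is that the statistics are estimated from the source images (down-sampled for speed), then applied to every cuboid cut from that channel, so cuboids from differently exposed images end up on a common scale.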

References

#493.

How has this PR been tested?

I tested this on my own curated data, continuing training from the pre-trained brainglobe models. As I was getting cellfinder classification to work on our c-fos and fostrap images, I found the model wasn't performing well enough, so I added normalization, which improved the model.

I re-did those experiments systematically on our training data to illustrate the benefits. It seems normalization is most helpful when you have less training data and when you're using a smaller learning rate. Both make sense intuitively: if you don't have a lot of training data and the samples all have different statistics, the model has a harder time converging, and for the same reason a lower learning rate will more easily get stuck in local minima.

Here's the plotted train and test accuracy and loss for training on my data:

There are 4 plots, covering two learning rates (LR = 1e-3 and 1e-4) and two training-data sizes (f = 10% and f = 50%, the fraction of the training data held out for testing; this is a quick way to reduce the training set size). N=n indicates the number of lines in the plot, i.e. the number of independent replications of the training. Each replication uses a different train/test split, which explains the spread of lines for a given condition.

But overall you can see how normalizing improves model performance, and this is only using training data from similar images (25% data augmentation, the resnet50_tv network, a batch size of 32, and --continue-training).

[Plots: train_accuracy, test_accuracy, train_loss, test_loss]

On the other hand, I tried to estimate the serial2p data's statistics from the provided cubes (already problematic, because we want the statistics of the images the cubes came from, not of the extracted cubes) and saw little benefit when re-training the brainglobe model. I think this is because there's so much data in the training set that it really doesn't matter, or the serial2p data are easier/cleaner to classify.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

There are new parameters to document. Also, I think it may be worth releasing a new pre-trained brainglobe model trained on normalized inputs.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality (unit & integration)
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit
