
Adds support for normalizing cuboid data before classifying #588

Open
matham wants to merge 1 commit into brainglobe:main from matham:caching_normal

Conversation

Contributor

@matham matham commented Feb 12, 2026

Description

This PR depends on #493. Only the last commit in this PR is unique to this PR.

What is this PR

  • Bug fix
  • Addition of a new feature
  • Other

Why is this PR needed?

It's fairly standard when distributing pre-trained models, particularly image networks, to indicate how the input should be normalized, if at all. E.g. mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225] has been used for ImageNet-based models because those are the statistics of the ImageNet data. The reason is that the model will be more accurate if your data's statistics/input range match those of the pre-trained dataset.
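As a minimal sketch of what this kind of normalization means (plain NumPy, not code from this PR), each channel is shifted by its mean and scaled by its standard deviation:

```python
import numpy as np

def normalize(image: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Shift and scale intensities so the input matches the
    statistics the model was trained with."""
    return (image.astype(np.float32) - mean) / std

# ImageNet-style usage: each channel has its own mean/std,
# and NumPy broadcasting applies them along the channel axis.
rgb = np.random.rand(224, 224, 3).astype(np.float32)
means = np.array([0.485, 0.456, 0.406])
stds = np.array([0.229, 0.224, 0.225])
normalized = (rgb - means) / stds
```

After this transform, data drawn from the reference distribution has roughly zero mean and unit variance per channel, which is what the pre-trained weights expect.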

The pre-trained model distributed by brainglobe was trained on cells with implicit intensity statistics, but those statistics are not specified, so when using the model on other datasets we cannot normalize our own data to match that distribution. Additionally, input data can vary wildly in intensity between channels and labeling types. So if you do transfer learning and want to provide a minimal set of new cells to train on, and you happen to fine-tune on cells from images that are overall bright, using the model on darker images may result in worse classification.

When training on my own cells, I found that adding such normalization helped improve accuracy as you can see in the images below.

What does this PR do?

This PR adds the option to normalize the training data and the classification data, and to save the image statistics into the yaml files when curating training data. Specifically:

  1. During curation, it samples the z-stack of each channel, computes the mean/std, and dumps those into the "curated" yaml file. E.g. here's how the yaml file will look:

     ```yaml
     data:
     - bg_channel: 1
       cell_def: ''
       cube_dir: tests/data/integration/training/cells
       signal_channel: 0
       type: cell
       signal_mean: 241.31
       signal_std: 154.92
       bg_mean: 650.94
       bg_std: 217.90
     ```
    It adds the parameters normalize_channels: bool = False and normalization_down_sampling: int = 32 to control this.
  2. During training, we load these yaml files and, if selected, normalize the cuboids by the mean/std of the dataset from which the cuboids originated. After training, the model expects input cuboids to have a mean of 1 and std of 1. This is controlled by the parameter normalize_channels: bool = False.
  3. During classification, we similarly sample each channel in the z-stack and normalize the cuboids we want to classify by the mean/std of the corresponding channel. This is controlled by the parameters normalize_channels: bool = False and normalization_down_sampling: int = 32.

By default, all normalization is turned OFF.
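The curation/training/classification flow above can be sketched roughly as follows. Note these are hypothetical helper names for illustration, not the PR's actual API; the down-sampling factor mirrors the normalization_down_sampling parameter:

```python
import numpy as np

def channel_stats(z_stack: np.ndarray, down_sampling: int = 32):
    """Estimate a channel's mean/std from a spatially down-sampled
    view of the full z-stack (the curation/classification sampling step)."""
    sampled = z_stack[:, ::down_sampling, ::down_sampling]
    return float(sampled.mean()), float(sampled.std())

def normalize_cuboid(cuboid: np.ndarray, mean: float, std: float) -> np.ndarray:
    """Normalize one extracted cuboid by the statistics of the
    channel it was extracted from (training/classification step)."""
    return (cuboid.astype(np.float32) - mean) / std

# Hypothetical usage: stats come from the whole stack, not the cuboid,
# so bright and dark acquisitions map onto comparable intensity ranges.
stack = np.random.randint(0, 2000, size=(100, 512, 512)).astype(np.float32)
mean, std = channel_stats(stack)
cuboid = stack[40:60, 100:150, 100:150]
normalized = normalize_cuboid(cuboid, mean, std)
```

The key design point is that the statistics are estimated from the source images (down-sampled for speed), then applied to every cuboid cut from that channel, so cuboids from differently exposed images end up on a common scale.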

References

#493.

How has this PR been tested?

I tested this on my own curated data, continuing training from the pre-trained brainglobe models. As I was getting cellfinder classification to work on our c-fos and fostrap images, I found the model wasn't performing well enough, so I added normalization, which improved the model.

I re-did those experiments systematically on our training data to illustrate the benefits. It seems normalization is most helpful when you have less training data and when you're using a smaller learning rate. Both make sense intuitively: if you don't have a lot of training data and the samples all have different statistics, the model has a harder time converging, and for the same reason a lower learning rate will more easily get stuck in local minima.

Here's the plotted train and test accuracy and loss for training on my data:

There are 4 plots, covering two learning rates (LR = 1e-3 and 1e-4) and two training-data sizes (f = 10% and f = 50%, the fraction of the training data held out for testing; this is a quick way to reduce the training set size). N=n indicates the number of lines in the plot, i.e. the number of independent replications of the training. Each replication uses a different train/test split, which explains the spread of lines for a given condition.

But overall you can see how normalizing improves model performance, and this is only using training data from similar images (25% data augmentation, the resnet50_tv network, a batch size of 32, and --continue-training).

[Plots: train_accuracy, test_accuracy, train_loss, test_loss]

On the other hand, I tried to estimate the serial2p data's statistics from the provided cubes (already problematic, because we want the statistics of the images the cubes came from, not of the extracted cubes) and saw little benefit when re-training the brainglobe model. I think this is because there's so much data in the training set that it really doesn't matter, or the serial2p data are easier/cleaner to classify.

Is this a breaking change?

No.

Does this PR require an update to the documentation?

There are new parameters to document. Also, I think it may be worth releasing a new pre-trained brainglobe model trained on normalized inputs.

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality (unit & integration)
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit
