failure and errors running finetune_hela.py #1084

@hzqfox

Dear micro-sam developers,

I'm Ziqiang Huang, an image analyst at the Imaging Centre at EMBL Heidelberg. I'm trying to use micro-sam to segment single-channel fluorescence images of a special cell type that resembles microglia and is quite irregular in shape and size.

I created a training set of 54 2D images, each 650 by 650 pixels, converted from the original 16-bit to 8-bit. I created the corresponding masks with the micro-sam 2D annotator.
I'm using the finetune_hela.py script from the finetuning examples to train my own model. However, although I tried to fix a few places myself, it still doesn't work. It looks like some places in the code require 8-bit images, and some require all input data to be the same size. Is that correct?
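
For reference, the 16-bit to 8-bit conversion mentioned above can be sketched as follows. This is a minimal illustration that assumes a per-image min-max rescaling; the helper name `to_uint8` and the synthetic image are hypothetical stand-ins, not part of micro-sam or of the actual data.

```python
import numpy as np


def to_uint8(image):
    """Rescale an image to the full 8-bit range using its own min and max.

    Hypothetical helper: per-image min-max scaling is an assumption, other
    conversions (e.g. percentile-based) are equally possible.
    """
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        # Constant image: there is no range to rescale, so return zeros.
        return np.zeros(image.shape, dtype=np.uint8)
    scaled = (image - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Synthetic stand-in for one 16-bit, 650 x 650 input image.
rng = np.random.default_rng(0)
image16 = rng.integers(0, 65536, size=(650, 650), dtype=np.uint16)
image8 = to_uint8(image16)
print(image8.dtype, image8.shape)
```

Checking `dtype` and `shape` for every converted file is a quick way to confirm that all inputs are 8-bit and the same size.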
I changed the following in finetune_hela.py:
line 34-35: set image_dir and segmentation_dir to my input image folder and mask folder;
line 54, 56: roi = np.s_[:30, :, :] and roi = np.s_[30:, :, :]
line 83: patch_shape = (1, 650, 650)  # is this necessary?
line 84: n_objects_per_batch = 1  # The number of objects varies between images, but most images contain 1. I tried both keeping all objects and removing all but one, and neither worked.
line 124: model_type = "vit_b_lm"
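
To make the roi change concrete, here is a minimal sketch of how the two slices split the data. It assumes the 54 images are stacked along the first axis into one array; the zero-filled `stack` is only a placeholder for the real images.

```python
import numpy as np

# Placeholder for the 54 images of 650 x 650 pixels, stacked along axis 0.
n_images = 54
stack = np.zeros((n_images, 650, 650), dtype=np.uint8)

# The same slices as in the modified script.
train_roi = np.s_[:30, :, :]   # first 30 images for training
val_roi = np.s_[30:, :, :]     # remaining images for validation

train_stack = stack[train_roi]
val_stack = stack[val_roi]
print(train_stack.shape, val_stack.shape)  # (30, 650, 650) (24, 650, 650)
```

With this split, 30 images go to training and 24 to validation, and the two subsets do not overlap.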

I set up the environment following the instructions and can use the CUDA device successfully.

When I run the script, training starts without errors, but the progress report looks wrong: "current metric" is always 'nan' and "best metric" is always 'inf'.
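
The combination of 'nan' and 'inf' is consistent with how floating-point comparisons behave. The following is a minimal sketch, assuming the trainer tracks the best metric with a plain "lower is better" comparison; `update_best` is a hypothetical helper for illustration, not the actual trainer code.

```python
import math


def update_best(current, best):
    """Return (new_best, improved) using a plain 'lower is better' comparison.

    Hypothetical stand-in for the trainer's bookkeeping (an assumption).
    """
    if current < best:
        return current, True
    return best, False


best = math.inf
for current in [math.nan, math.nan, math.nan]:
    # Any comparison with nan is False, so 'best' never leaves inf.
    best, improved = update_best(current, best)
    print(f"current metric: {current}, best metric: {best}, improved: {improved}")
```

Under that assumption, a metric that is 'nan' in every epoch never counts as an improvement, which would also explain early stopping after 10 epochs.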

I pasted the more detailed console output here for your reference:

(sam) C:\Users\...\micro-sam\examples\finetuning>python -m finetune_hela

Verifying labels in 'train' dataloader:  36%|██████████████████████████████████████▉                                                                     | 18/50 [00:24<00:43,  1.36s/it]
Verifying labels in 'val' dataloader:  72%|███████████████████████████████████████████████████████████████████████████████▏                              | 36/50 [00:25<00:09,  1.44it/s]
Start fitting for 1800 iterations /  100 epochs
with 18 iterations per epoch
Training with mixed precision
Epoch 11: average [s/it]: 39.558808, current metric: nan, best metric: inf:  12%|███████▋                                                        | 216/1800 [4:52:37<16:46:31, 38.13s/it]Stopping training because there has been no improvement for 10 epochs
Finished training after 11 epochs / 216 iterations.
The best epoch is number 0.
Epoch 11: average [s/it]: 39.558808, current metric: nan, best metric: inf:  12%|███████▋                                                        | 216/1800 [5:02:29<36:58:18, 84.03s/it]
Training took 18249.626536607742 seconds (= 05:304:10 hours)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 141, in <module>
    main()
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 137, in main
    export_model(checkpoint_name, model_type)
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 113, in export_model
    export_custom_sam_model(
  File "C:\Users\...\micro-sam\micro_sam\util.py", line 508, in export_custom_sam_model
    _, state = get_sam_model(
               ^^^^^^^^^^^^^^
  File "C:\Users\...\micro-sam\micro_sam\util.py", line 376, in get_sam_model
    raise ValueError(f"Checkpoint at {checkpoint_path} could not be found.")
ValueError: Checkpoint at checkpoints\sam_silvia3\best.pt could not be found.
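
The traceback shows the export step failing because best.pt does not exist. A possible guard before exporting can be sketched as below. It assumes checkpoints live under checkpoints/<name>/ as in the error message, and that a latest.pt file may exist as a fallback (an assumption); `find_checkpoint` is a hypothetical helper, not a micro-sam function.

```python
from pathlib import Path


def find_checkpoint(checkpoint_name, root="checkpoints",
                    candidates=("best.pt", "latest.pt")):
    """Return the first existing checkpoint file, or None if there is none.

    Hypothetical helper: the folder layout follows the path in the error
    message, and the 'latest.pt' fallback is an assumption.
    """
    folder = Path(root) / checkpoint_name
    for filename in candidates:
        path = folder / filename
        if path.is_file():
            return path
    return None


checkpoint = find_checkpoint("sam_silvia3")
if checkpoint is None:
    print("No checkpoint found; skipping export.")
else:
    print(f"Checkpoint available for export: {checkpoint}")
```

Checking for the file first turns the ValueError into a readable message and makes it clear that the real problem is that no best checkpoint was ever saved.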

I attached an image and mask pair for your reference:

Image Image
