failure and errors running finetune_hela.py

Dear micro-sam developers,

I'm Ziqiang Huang, an image analyst at the Imaging Centre at EMBL Heidelberg. I'm trying to use micro-sam to segment single-channel fluorescent images of a special cell type whose morphology resembles microglia and is quite irregular in shape and size.

I created a training set of 54 2D images, each 650 by 650 pixels (originally 16-bit, converted to 8-bit). I created the corresponding masks with the micro-sam 2D annotator.
I'm using the finetune_hela.py script from the finetuning example to train my own model. However, although I tried to fix a few places myself, it still doesn't work. It looks like some places in the code require 8-bit images only? And some places need all the input data to be the same size?
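
For illustration, one common way to do a 16-bit to 8-bit conversion is percentile-based rescaling. The sketch below is not necessarily the exact conversion used for this data: `to_uint8` is a hypothetical helper (not a micro-sam function), the percentiles are arbitrary choices, and the array is a dummy stand-in for one 650 by 650 frame.

```python
import numpy as np


def to_uint8(image, lower_pct=1.0, upper_pct=99.0):
    """Rescale an image to 8-bit with percentile-based contrast stretching.

    Hypothetical helper for illustration only, not part of micro-sam.
    """
    image = image.astype(np.float32)
    lo, hi = np.percentile(image, [lower_pct, upper_pct])
    if hi <= lo:
        # Constant image: nothing to stretch, return all zeros.
        return np.zeros(image.shape, dtype=np.uint8)
    # Map [lo, hi] to [0, 1], clipping values outside that range.
    scaled = np.clip((image - lo) / (hi - lo), 0.0, 1.0)
    return (scaled * 255).round().astype(np.uint8)


# Dummy 16-bit image standing in for one 650 x 650 frame.
rng = np.random.default_rng(0)
raw = rng.integers(0, 4096, size=(650, 650), dtype=np.uint16)
converted = to_uint8(raw)
print(converted.dtype, converted.shape)
```

Applying the same function to every image keeps dtype and shape consistent across the training set.
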
I changed the following in finetune_hela.py:
- lines 34-35: set `image_dir` and `segmentation_dir` to my input image folder and mask folder
- lines 54, 56: `roi = np.s_[:30, :, :]` and `roi = np.s_[30:, :, :]`
- line 83: `patch_shape = (1, 650, 650)` (is this necessary?)
- line 84: `n_objects_per_batch = 1` (the number of objects varies between my input images, but the majority have 1; I tried both keeping all objects and removing all but one, and neither worked)
- line 124: `model_type = "vit_b_lm"`
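
To make the intended split explicit, here is a minimal sketch of what the two ROIs select, assuming the 54 images are stacked along the first axis. The `stack` array is a dummy placeholder with a small spatial size, not the real data, and the variable names are only for illustration.

```python
import numpy as np

# Dummy stack standing in for the 54 training images, assumed to be
# stacked along the first axis (8 x 8 instead of 650 x 650 to keep it small).
stack = np.zeros((54, 8, 8), dtype=np.uint8)

train_roi = np.s_[:30, :, :]  # first 30 images, used for training
val_roi = np.s_[30:, :, :]    # remaining 24 images, used for validation

train_stack = stack[train_roi]
val_stack = stack[val_roi]
print(train_stack.shape, val_stack.shape)  # (30, 8, 8) (24, 8, 8)
```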

I set up the environment correctly following the instructions and can use the CUDA device successfully.

When running the script, training initially appears to start fine, but the progress report during training looks wrong: "current metric" is always `nan` and "best metric" is always `inf`.
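
The cause of the `nan` metric is not confirmed. One check that might help narrow it down is counting the labeled objects in each mask and confirming the mask dtype. The sketch below assumes background is 0 and each object has a unique positive id; `summarize_mask` is a hypothetical helper (not part of micro-sam) and the masks are dummy arrays.

```python
import numpy as np


def summarize_mask(mask):
    """Return (number of labeled objects, whether the dtype is an integer type).

    Hypothetical helper; assumes background is 0 and every object
    carries a unique positive id.
    """
    ids = np.unique(mask)
    n_objects = int(np.count_nonzero(ids))  # unique ids excluding background 0
    is_integer = bool(np.issubdtype(mask.dtype, np.integer))
    return n_objects, is_integer


# Dummy masks: one without any object, one with two labeled objects.
empty = np.zeros((650, 650), dtype=np.uint16)
two_cells = np.zeros((650, 650), dtype=np.uint16)
two_cells[10:50, 10:50] = 1
two_cells[100:200, 300:400] = 2

for name, mask in {"empty": empty, "two_cells": two_cells}.items():
    print(name, summarize_mask(mask))
```

Running this over all 54 masks would show whether any mask is empty or stored with a non-integer dtype.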

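The full console output is pasted below for reference. After early stopping, the script also fails at the export step with a `ValueError` because the `best.pt` checkpoint could not be found: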

```bash
(sam) C:\Users\...\micro-sam\examples\finetuning>python -m finetune_hela

Verifying labels in 'train' dataloader:  36%|██████████████████████████████████████▉                                                                     | 18/50 [00:24<00:43,  1.36s/it]
Verifying labels in 'val' dataloader:  72%|███████████████████████████████████████████████████████████████████████████████▏                              | 36/50 [00:25<00:09,  1.44it/s]
Start fitting for 1800 iterations /  100 epochs
with 18 iterations per epoch
Training with mixed precision
Epoch 11: average [s/it]: 39.558808, current metric: nan, best metric: inf:  12%|███████▋                                                        | 216/1800 [4:52:37<16:46:31, 38.13s/it]Stopping training because there has been no improvement for 10 epochs
Finished training after 11 epochs / 216 iterations.
The best epoch is number 0.
Epoch 11: average [s/it]: 39.558808, current metric: nan, best metric: inf:  12%|███████▋                                                        | 216/1800 [5:02:29<36:58:18, 84.03s/it]
Training took 18249.626536607742 seconds (= 05:304:10 hours)
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 141, in <module>
    main()
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 137, in main
    export_model(checkpoint_name, model_type)
  File "C:\Users\...\micro-sam\examples\finetuning\finetune_hela.py", line 113, in export_model
    export_custom_sam_model(
  File "C:\Users\...\micro-sam\micro_sam\util.py", line 508, in export_custom_sam_model
    _, state = get_sam_model(
               ^^^^^^^^^^^^^^
  File "C:\Users\...\micro-sam\micro_sam\util.py", line 376, in get_sam_model
    raise ValueError(f"Checkpoint at {checkpoint_path} could not be found.")
ValueError: Checkpoint at checkpoints\sam_silvia3\best.pt could not be found.
```

I attached an image and mask pair for your reference:

<img width="512" height="512" alt="Image" src="https://github.com/user-attachments/assets/c2c6395a-d6db-4a10-bff0-83165111ef9b" />
<img width="512" height="512" alt="Image" src="https://github.com/user-attachments/assets/f186d74d-aad2-4cb5-a0df-27d2dd0908d6" />
