Description
Hi, and thanks for the great release. With the released paper weights, I can reproduce the reported HEST-Bench numbers on my side. However, when I train from scratch using the repo defaults (with the provided filtering), my scores are noticeably lower than in the paper. Could you help confirm which non-default training settings are required to reach the paper numbers?
Questions
- Best checkpoint metric - Did the paper select the checkpoint with the lowest val MSE (vs. highest Pearson)? The script currently picks the best checkpoint by Pearson; should I switch to val MSE for a faithful reproduction? (See the checkpoint-selection sketch after this list.)
- GPU/batch - Your paper notes an 8-GPU environment. Did training actually use all 8 GPUs with a batch size of 2 per GPU (effective 16), or a single GPU with a global batch size of 2? (See the effective-batch-size sketch after this list.)
- Species filter - For pretraining, is human-only the intended setting (my slide count matches the paper when using human-only), or should mouse slides be included as well? (A filtering sketch follows the list.)
- Validation split - The code uses utils_data/val_sample.txt, which appears to be HEST-1k-only. Since pretraining uses HEST + STImage1K4M, did you validate only on HEST-1k, or did you also hold out samples from STImage1K4M? If the latter, how were those IDs/counts constructed?
- Anything else to reproduce the weights - Beyond the defaults, are there any other required non-default settings or preprocessing steps needed to reproduce the released weights?
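
For the checkpoint-metric question, this is roughly what I mean by switching the selection criterion. It is a minimal sketch with made-up names (`evaluate` output keys, the checkpoint path), not the repo's actual code:

```python
# Hypothetical sketch only -- names and paths are placeholders, not the repo's code.
# Keep the checkpoint with the lowest validation MSE instead of the highest Pearson.
import torch
import torch.nn as nn


def maybe_save_best(model: nn.Module, val_metrics: dict, state: dict,
                    ckpt_path: str = "best_by_val_mse.pt") -> dict:
    """Save model weights whenever val MSE improves; `state` holds the running best."""
    if val_metrics["mse"] < state.get("best_mse", float("inf")):
        state["best_mse"] = val_metrics["mse"]
        torch.save(model.state_dict(), ckpt_path)
    return state
```

This would be called once per validation pass with whatever metrics the loop returns; the only point is which metric drives the comparison.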
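For the GPU/batch question, here is the arithmetic I am using to compare the two readings of "batch size = 2". The helper is hypothetical and only illustrates how I compute the effective (global) batch size under DDP:

```python
# Hypothetical helper, not from the repo: effective batch size under DDP.
import torch.distributed as dist


def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Global batch = per-GPU batch x number of DDP processes x grad accumulation."""
    world_size = dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
    return per_gpu_batch * world_size * grad_accum_steps


# With per_gpu_batch=2: 8-GPU DDP gives 16, a single GPU gives 2.
```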
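For the species-filter question, this is the kind of filter I applied on my side to get the human-only slide count. The metadata path and the "species" column name are guesses about the HEST-1k metadata layout, not the repo's actual schema:

```python
# Hypothetical filtering sketch -- the metadata path and "species" column name are
# guesses about the HEST-1k metadata layout, not the repo's actual schema.
import pandas as pd

meta = pd.read_csv("hest_metadata.csv")  # placeholder path
human_only = meta[meta["species"].str.lower() == "homo sapiens"]
print(f"{len(meta)} slides total; {len(human_only)} after the human-only filter")
```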