Description
Hi, and thanks for the great release. With the released paper weights, I can reproduce the reported HEST-Bench numbers on my side. However, when I train from scratch using the repo defaults (with the provided filtering), my scores are noticeably lower than in the paper. Could you help confirm which non-default training settings are required to reach the paper numbers?
Questions
- Best checkpoint metric - Did the paper select the checkpoint with the lowest val MSE (vs. highest Pearson)? The script currently picks the best checkpoint by Pearson; should I switch to val MSE for a faithful reproduction? (See the checkpoint-selection sketch after this list.)
- GPU/batch - Your paper notes an 8-GPU environment. Did training actually use all 8 GPUs with a batch size of 2 per GPU (effective 16), or a single GPU with a global batch size of 2? (See the effective-batch-size sketch after this list.)
- Species filter - For pretraining, is human-only the intended setting (my slide count matches the paper when using human-only), or should mouse slides be included as well? (A filtering sketch follows the list.)
- Validation split - The code uses utils_data/val_sample.txt, which appears to be HEST-1k-only. Since pretraining uses HEST + STImage1K4M, did you validate only on HEST-1k, or did you also hold out samples from STImage1K4M? If the latter, how were those IDs/counts constructed?
- Anything else to reproduce the weights - Beyond the defaults, are there any other required non-default settings or preprocessing steps needed to reproduce the released weights?
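
For the checkpoint-metric question, this is roughly what I mean by switching the selection criterion. It is a minimal sketch with made-up names (`evaluate` output keys, the checkpoint path), not the repo's actual code:

```python
# Hypothetical sketch only -- names and paths are placeholders, not the repo's code.
# Keep the checkpoint with the lowest validation MSE instead of the highest Pearson.
import torch
import torch.nn as nn


def maybe_save_best(model: nn.Module, val_metrics: dict, state: dict,
                    ckpt_path: str = "best_by_val_mse.pt") -> dict:
    """Save model weights whenever val MSE improves; `state` holds the running best."""
    if val_metrics["mse"] < state.get("best_mse", float("inf")):
        state["best_mse"] = val_metrics["mse"]
        torch.save(model.state_dict(), ckpt_path)
    return state
```

This would be called once per validation pass with whatever metrics the loop returns; the only point is which metric drives the comparison.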
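For the GPU/batch question, here is the arithmetic I am using to compare the two readings of "batch size = 2". The helper is hypothetical and only illustrates how I compute the effective (global) batch size under DDP:

```python
# Hypothetical helper, not from the repo: effective batch size under DDP.
import torch.distributed as dist


def effective_batch_size(per_gpu_batch: int, grad_accum_steps: int = 1) -> int:
    """Global batch = per-GPU batch x number of DDP processes x grad accumulation."""
    world_size = dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
    return per_gpu_batch * world_size * grad_accum_steps


# With per_gpu_batch=2: 8-GPU DDP gives 16, a single GPU gives 2.
```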
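For the species-filter question, this is the kind of filter I applied on my side to get the human-only slide count. The metadata path and the "species" column name are guesses about the HEST-1k metadata layout, not the repo's actual schema:

```python
# Hypothetical filtering sketch -- the metadata path and "species" column name are
# guesses about the HEST-1k metadata layout, not the repo's actual schema.
import pandas as pd

meta = pd.read_csv("hest_metadata.csv")  # placeholder path
human_only = meta[meta["species"].str.lower() == "homo sapiens"]
print(f"{len(meta)} slides total; {len(human_only)} after the human-only filter")
```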