
Performance reproduction issue #4

@Minsoo2022

Description


Hi, and thanks for the great release. With the released paper weights, I can reproduce the reported HEST-Bench numbers on my side. However, when I train from scratch using the repo defaults (with the provided filtering), my scores are noticeably lower than in the paper. Could you help confirm which non-default training settings are required to reach the paper numbers?

Questions

  1. Best checkpoint metric - Did the paper select the checkpoint with the lowest validation MSE, or the highest Pearson correlation? The released script picks the best checkpoint by Pearson; should I switch to val MSE for a faithful reproduction? (See the first sketch after this list.)

  2. GPU/batch - Your paper notes an 8-GPU environment. Did training actually use all 8 GPUs with batch size 2 per GPU (effective batch 16), or a single GPU with a global batch size of 2? (See the second sketch below.)

  3. Species filter - For pretraining, is human-only the intended setting (my slide count matches the paper when I use human-only), or should mouse slides be included as well? (The third sketch below shows the filter I applied.)

  4. Validation split - The code uses utils_data/val_sample.txt, which seems to contain HEST-1k samples only. Since pretraining uses HEST + STImage1K4M, did you validate only on HEST-1k, or did you also hold out samples from STImage1K4M (and if so, how were those IDs and counts constructed)? (See the fourth sketch below.)

  5. Anything else to reproduce weights - Beyond the defaults, are there any non-default settings or preprocessing steps required to reproduce the released weights?
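
For reference, here is a minimal sketch of the two checkpoint-selection rules I am comparing in question 1. The `val_history` list and its `"mse"`/`"pearson"` keys are placeholder names of my own, not the repo's actual logging format, and the numbers are made up:

```python
# Hypothetical per-epoch validation log; keys and values are illustrative only.
val_history = [
    {"epoch": 0, "mse": 0.412, "pearson": 0.31},
    {"epoch": 1, "mse": 0.387, "pearson": 0.34},
    {"epoch": 2, "mse": 0.391, "pearson": 0.36},
]

# Rule the released script appears to use: highest Pearson correlation.
best_by_pearson = max(val_history, key=lambda m: m["pearson"])

# Rule I would switch to for reproduction: lowest validation MSE.
best_by_mse = min(val_history, key=lambda m: m["mse"])

# The two rules can pick different epochs, which is why I am asking.
print(best_by_pearson["epoch"], best_by_mse["epoch"])
```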
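For question 2, this is how I understand the two possible batch regimes. The `train.py` entry point in the comment is a placeholder, and the torchrun launch is only my assumption about how the 8-GPU run was started:

```python
# Assumed multi-GPU launch (placeholder script name):
#   torchrun --nproc_per_node=8 train.py
per_gpu_batch = 2

effective_batch_8_gpu = per_gpu_batch * 8   # 8 GPUs -> 16 samples per optimizer step
effective_batch_1_gpu = per_gpu_batch * 1   # 1 GPU  ->  2 samples per optimizer step

print(effective_batch_8_gpu, effective_batch_1_gpu)
```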
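For question 3, this is the kind of filter I applied to obtain a human-only slide list. The metadata file name and the `species` column are assumptions about the HEST-1k metadata, not something I verified against your preprocessing code:

```python
import pandas as pd

# Placeholder metadata path and column name; adjust to the actual HEST-1k schema.
meta = pd.read_csv("hest_metadata.csv")
human_only = meta[meta["species"].str.lower() == "homo sapiens"]

# Slide counts before/after the filter (the human-only count matches the paper for me).
print(len(meta), len(human_only))
```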
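And for question 4, this is how I am currently reading utils_data/val_sample.txt; as far as I can tell, the IDs in it are HEST-1k samples only, which is what prompted the question about STImage1K4M:

```python
# Read the provided validation split; each non-empty line is treated as one sample ID.
with open("utils_data/val_sample.txt") as f:
    val_ids = [line.strip() for line in f if line.strip()]

print(len(val_ids), "validation IDs (all appear to be HEST-1k samples)")
```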
