I am working on reproducing your Planetarium results and have discovered a potential issue in the publicly released code.
The evaluation pipeline (evaluate.py) passes LLM outputs directly to equivalence() without any preprocessing. When I run the fine-tuned Gemma 2 models via SGLang, the raw outputs trigger Lark parsing failures, so I cannot reproduce the reported Gemma 2-2B and Gemma 2-9B results.
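For context, the failures disappear for me if I extract the PDDL s-expression from the raw output before calling equivalence(). Below is the kind of preprocessing I mean; this is purely my own sketch (the function name `extract_pddl` and the fence-stripping regex are my assumptions, not anything from your repository), intended only to illustrate the gap:

```python
import re

def extract_pddl(raw_output: str) -> str:
    """Hypothetical preprocessing sketch: strip markdown code fences,
    then pull out the first balanced (define ...) s-expression so the
    result is parseable. Not from the Planetarium repo."""
    # Drop markdown fences such as ```pddl or ``` that chat models emit.
    text = re.sub(r"```(?:\w+)?", "", raw_output)
    start = text.find("(define")
    if start == -1:
        return text.strip()  # no PDDL found; return as-is
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # Return the complete, balanced s-expression only.
                return text[start : i + 1]
    return text[start:].strip()  # unbalanced parens; best effort
```

If the reported numbers relied on a step like this, that would explain the discrepancy I am seeing.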
- Does the reported evaluation apply any preprocessing before calling equivalence()?
- Could the result differences be due to different prompting strategies or model configurations between your setup and the public repository?
- Is there a specific commit/branch that should be used to reproduce the reported results?
Any clarification on these points would help ensure the results are reproducible from the public code.