
Unable to Replicate Reported Zero-Shot and Fine-Tune Results #17

@leotompson


Congratulations on your paper being accepted at NAACL! This is a very meaningful benchmark, and I’d like to follow it. However, while attempting to replicate the zero-shot and fine-tuning results of Gemma-2-9b-it using trl.SFTTrainer, I encountered significantly lower results than those reported in the paper. I’m using the following prompt format for zero-shot replication with vLLM:

    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"

    formatted_prompts = []
    for nl, domain in zip(natural_language_texts, domain_texts):
        messages = [
            {
                "role": "user",
                "content": (
                    f"{problem_prompt} {nl} "
                    f"{domain_prompt} {domain}"
                ),
            },
        ]
        formatted_prompts.append(messages)  # one single-turn chat per example

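For reference, this is a self-contained sketch of how I construct and render those prompts. In the actual run I let vLLM apply Gemma-2's chat template itself, so the manual template string below is only my reading of that template, shown to make the final prompt shape explicit; it is not code from the paper:

```python
# Sketch of my zero-shot prompt construction (my own setup, not the
# authors' evaluation code).

def build_messages(nl: str, domain: str) -> list:
    """Assemble the single-turn chat used for one planning problem."""
    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"
    return [
        {"role": "user", "content": f"{problem_prompt} {nl} {domain_prompt} {domain}"}
    ]


def render_gemma_prompt(messages: list) -> str:
    """Flatten a single-turn chat with what I believe is Gemma-2's template.

    In practice vLLM does this internally when given the messages list;
    this function only documents my assumption about the prompt shape.
    """
    content = messages[0]["content"]
    return f"<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"
```

If this matches the format you used, the remaining difference would have to be in sampling settings or post-processing of the generated PDDL.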
Could you share the relevant code for zero-shot evaluation and fine-tuning, along with the corresponding start commands?
I’ve attempted to replicate using finetune.py, but was unsuccessful.
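For context, here is roughly how I assembled training rows for trl.SFTTrainer. The `text` column name follows trl's default convention, but the prompt/target concatenation and the end-of-turn handling are my own guesses about the intended supervision format, which may be where my replication diverges:

```python
# Sketch of my SFT data formatting (my assumptions, not the paper's
# finetune.py). Each row concatenates the zero-shot prompt with the
# gold PDDL target, terminated by Gemma-2's end-of-turn token.
EOS = "<end_of_turn>"  # Gemma-2 turn terminator, as I understand it


def to_sft_example(nl: str, domain: str, target_pddl: str) -> dict:
    """Build one trl.SFTTrainer row with a `text` field."""
    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"
    prompt = (
        f"<start_of_turn>user\n{problem_prompt} {nl} {domain_prompt} {domain}"
        f"<end_of_turn>\n<start_of_turn>model\n"
    )
    return {"text": prompt + target_pddl + EOS}
```

If finetune.py formats targets differently (e.g. masks the prompt tokens or uses a different template), that alone could explain the gap I'm seeing.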
