
Unable to Replicate Reported Zero-Shot and Fine-Tune Results #17

@leotompson


Congratulations on your paper being accepted at NAACL! This is a very meaningful benchmark, and I’d like to follow it. However, while attempting to replicate the zero-shot and fine-tuning results of Gemma-2-9b-it using trl.SFTTrainer, I encountered significantly lower results than those reported in the paper. I’m using the following prompt format for zero-shot replication with vLLM:

    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"

    formatted_prompts = []
    for nl, domain in zip(natural_language_texts, domain_texts):
        messages = [
            {
                "role": "user",
                "content": (
                    f"{problem_prompt} {nl} "
                    f"{domain_prompt} {domain}"
                ),
            },
        ]
        formatted_prompts.append(messages)  # one single-turn chat per example

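For reference, this is a self-contained sketch of how I construct and render those prompts. In the actual run I let vLLM apply Gemma-2's chat template itself, so the manual template string below is only my reading of that template, shown to make the final prompt shape explicit; it is not code from the paper:

```python
# Sketch of my zero-shot prompt construction (my own setup, not the
# authors' evaluation code).

def build_messages(nl: str, domain: str) -> list:
    """Assemble the single-turn chat used for one planning problem."""
    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"
    return [
        {"role": "user", "content": f"{problem_prompt} {nl} {domain_prompt} {domain}"}
    ]


def render_gemma_prompt(messages: list) -> str:
    """Flatten a single-turn chat with what I believe is Gemma-2's template.

    In practice vLLM does this internally when given the messages list;
    this function only documents my assumption about the prompt shape.
    """
    content = messages[0]["content"]
    return f"<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"
```

If this matches the format you used, the remaining difference would have to be in sampling settings or post-processing of the generated PDDL.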
Could you share the relevant code for zero-shot evaluation and fine-tuning, along with the corresponding start commands?
I’ve attempted to replicate using finetune.py, but was unsuccessful.
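For context, here is roughly how I assembled training rows for trl.SFTTrainer. The `text` column name follows trl's default convention, but the prompt/target concatenation and the end-of-turn handling are my own guesses about the intended supervision format, which may be where my replication diverges:

```python
# Sketch of my SFT data formatting (my assumptions, not the paper's
# finetune.py). Each row concatenates the zero-shot prompt with the
# gold PDDL target, terminated by Gemma-2's end-of-turn token.
EOS = "<end_of_turn>"  # Gemma-2 turn terminator, as I understand it


def to_sft_example(nl: str, domain: str, target_pddl: str) -> dict:
    """Build one trl.SFTTrainer row with a `text` field."""
    problem_prompt = (
        "Provide me with the complete, valid problem PDDL file that "
        "describes the following planning problem directly without further "
        "explanations or texts."
    )
    domain_prompt = "The domain for the planning problem is:"
    prompt = (
        f"<start_of_turn>user\n{problem_prompt} {nl} {domain_prompt} {domain}"
        f"<end_of_turn>\n<start_of_turn>model\n"
    )
    return {"text": prompt + target_pddl + EOS}
```

If finetune.py formats targets differently (e.g. masks the prompt tokens or uses a different template), that alone could explain the gap I'm seeing.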
