Few-shot prompt adaptation issue for GSM8K evaluation with 8-shot setting #64

@passing2961

Hi,

When running the GSM8K evaluation in an 8-shot setting, I noticed that the few-shot examples were not applied. Specifically, the current implementation takes the first branch of the following condition:

if len(demo_prompt) == 0 or (
    args.adapt_few_shot and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
):
    full_prompt = context
else:
    if args.prompt_type == "qwen25-math-cot":
        # Hotfix to support putting all demos into a single turn
        full_prompt = demo_prompt + splitter + example["question"]
        full_prompt = input_template.format(input=full_prompt)
    else:
        full_prompt = demo_prompt + splitter + context

GSM8K has numeric ground-truth answers (gt_ans) rather than multiple-choice letters ("A", "B", "C", "D", "E"), so whenever --adapt_few_shot is set, the condition above evaluates to True and the few-shot demonstrations (demo_prompt) are skipped unintentionally. As a result, the model receives only the current question context, without the intended few-shot examples.

I used the following evaluation script (LLaMA-3.2-1B-Instruct):

set -ex

PROMPT_TYPE=$1
MODEL_NAME_OR_PATH=$2
SAVE_DIR=$3
OUTPUT_DIR=${SAVE_DIR}/math_eval

SPLIT="test"
NUM_TEST_SAMPLE=-1

# English open datasets
DATA_NAME="gsm8k,math"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1 \
    --use_vllm \
    --save_outputs \
    --overwrite \
    --adapt_few_shot \
    --num_shots 8 \
    --apply_chat_template
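
To make the skipped-demos behavior easier to see, here is a minimal standalone sketch of the condition quoted earlier. The function name `skips_demos`, the constant `CHOICE_LETTERS`, and the placeholder demo string are my own illustrations, not names from the repository; only the boolean expression itself is copied from the code above.

```python
# Letters treated as multiple-choice answers in the quoted condition.
CHOICE_LETTERS = ["A", "B", "C", "D", "E"]


def skips_demos(demo_prompt: str, adapt_few_shot: bool, gt_ans: str) -> bool:
    """Hypothetical helper mirroring the original condition.

    Returns True when the code would set full_prompt = context,
    i.e. when the few-shot demonstrations are dropped.
    """
    return len(demo_prompt) == 0 or (
        adapt_few_shot and gt_ans not in CHOICE_LETTERS
    )


# Placeholder standing in for the real 8-shot demonstration text.
demos = "Question: 1 + 1 = ?\nAnswer: 2"

# Numeric GSM8K-style answer with --adapt_few_shot: demos are skipped.
print(skips_demos(demos, adapt_few_shot=True, gt_ans="18"))   # True
# Multiple-choice answer with --adapt_few_shot: demos are kept.
print(skips_demos(demos, adapt_few_shot=True, gt_ans="B"))    # False
# Numeric answer without --adapt_few_shot: demos are kept.
print(skips_demos(demos, adapt_few_shot=False, gt_ans="18"))  # False
```

Under these assumptions, the combination of `--adapt_few_shot` and a non-letter `gt_ans` is what drops the demonstrations, regardless of `--num_shots`.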

Proposed Solution

I modified the conditional logic as follows, and verified that the few-shot prompts are correctly applied for GSM8K:

if data_name != "gsm8k" and (
    len(demo_prompt) == 0 or (
        args.adapt_few_shot and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
    )
):
    full_prompt = context
else:
    if args.prompt_type == "qwen25-math-cot":
        # Hotfix to support putting all demos into a single turn
        full_prompt = demo_prompt + splitter + example["question"]
        full_prompt = input_template.format(input=full_prompt)
    else:
        full_prompt = demo_prompt + splitter + context
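
For reference, the modified condition can be checked in isolation with a small sketch like the one below. The helper name `skips_demos_patched` and the placeholder demo string are hypothetical, and I am assuming `data_name` holds a single dataset name such as "gsm8k" or "math" at this point in the code.

```python
# Letters treated as multiple-choice answers in the condition.
CHOICE_LETTERS = ["A", "B", "C", "D", "E"]


def skips_demos_patched(
    data_name: str, demo_prompt: str, adapt_few_shot: bool, gt_ans: str
) -> bool:
    """Hypothetical helper mirroring the modified condition.

    Returns True when the code would set full_prompt = context
    (demonstrations dropped), False when the else branch is taken.
    """
    return data_name != "gsm8k" and (
        len(demo_prompt) == 0
        or (adapt_few_shot and gt_ans not in CHOICE_LETTERS)
    )


# Placeholder standing in for the real 8-shot demonstration text.
demos = "Question: 1 + 1 = ?\nAnswer: 2"

# GSM8K with a numeric answer now keeps the demonstrations.
print(skips_demos_patched("gsm8k", demos, True, "18"))  # False
# Other datasets with non-letter answers still drop them.
print(skips_demos_patched("math", demos, True, "18"))   # True
# GSM8K with an empty demo_prompt also takes the else branch.
print(skips_demos_patched("gsm8k", "", True, "18"))     # False
```

Two things stand out in this sketch: datasets other than GSM8K with non-letter answers are still affected by the original behavior, and for GSM8K an empty demo_prompt now reaches the else branch, so the prompt would start with the splitter. I am not sure whether either of these matters in practice.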

Question

Would it be appropriate to officially integrate this adjustment? Or might I have missed some intended behavior regarding the few-shot logic?
