Few-shot prompt adaptation issue for GSM8K evaluation with 8-shot setting

Hi, 

When running the GSM8K evaluation in an 8-shot setting, I noticed that the few-shot examples were not actually applied. Specifically, the first branch of the following condition in the current implementation ([here](https://github.com/QwenLM/Qwen2.5-Math/blob/a45202bd16f1ec06f433442dc1152d0074773465/evaluation/utils.py#L204C5-L214C59)) is taken:

```python
if len(demo_prompt) == 0 or (
    args.adapt_few_shot and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
):
    full_prompt = context
else:
    if args.prompt_type == "qwen25-math-cot":
        # Hotfix to support putting all demos into a single turn
        full_prompt = demo_prompt + splitter + example["question"]
        full_prompt = input_template.format(input=full_prompt)
    else:
        full_prompt = demo_prompt + splitter + context
```

Since GSM8K has numeric ground-truth answers (`gt_ans`) rather than multiple-choice letters ("A", "B", "C", "D", "E"), the second clause is true whenever `--adapt_few_shot` is set. The whole condition therefore evaluates to `True`, and the few-shot demonstrations (`demo_prompt`) are skipped unintentionally. As a result, the model receives only the current question context, without the desired few-shot examples. I used the following evaluation script (model: `LLaMA-3.2-1B-Instruct`):

```bash
set -ex

PROMPT_TYPE=$1
MODEL_NAME_OR_PATH=$2
SAVE_DIR=$3
OUTPUT_DIR=${SAVE_DIR}/math_eval

SPLIT="test"
NUM_TEST_SAMPLE=-1

# English open datasets
DATA_NAME="gsm8k,math"
TOKENIZERS_PARALLELISM=false \
python3 -u math_eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --data_name ${DATA_NAME} \
    --output_dir ${OUTPUT_DIR} \
    --split ${SPLIT} \
    --prompt_type ${PROMPT_TYPE} \
    --num_test_sample ${NUM_TEST_SAMPLE} \
    --seed 0 \
    --temperature 0 \
    --n_sampling 1 \
    --top_p 1 \
    --start 0 \
    --end -1 \
    --use_vllm \
    --save_outputs \
    --overwrite \
    --adapt_few_shot \
    --num_shots 8 \
    --apply_chat_template
```
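To make the behaviour concrete, here is a minimal, self-contained sketch of the branch condition quoted above. The helper name `uses_context_only` and the sample values are hypothetical and not part of the repository; only the boolean expression mirrors the snippet.

```python
# Letters treated as multiple-choice answers in the quoted condition.
CHOICE_LETTERS = ["A", "B", "C", "D", "E"]


def uses_context_only(demo_prompt: str, adapt_few_shot: bool, gt_ans: str) -> bool:
    """Return True when the quoted condition would drop the demonstrations.

    Hypothetical helper: it isolates the boolean expression so it can be
    checked without running the full evaluation pipeline.
    """
    return len(demo_prompt) == 0 or (
        adapt_few_shot and gt_ans not in CHOICE_LETTERS
    )


demos = "Q: ...\nA: ..."  # placeholder standing in for the 8-shot demo text

# Numeric answer (GSM8K-style) with --adapt_few_shot: demos are dropped.
print(uses_context_only(demos, True, "18"))   # True
# Multiple-choice answer with --adapt_few_shot: demos are kept.
print(uses_context_only(demos, True, "B"))    # False
# Numeric answer without --adapt_few_shot: demos are kept.
print(uses_context_only(demos, False, "18"))  # False
```

Under these assumptions, dropping `--adapt_few_shot` from the script would also keep the demonstrations, although that may not be the intended way to run the GSM8K evaluation.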

## Proposed Solution
I modified the conditional logic as follows, and verified that the few-shot prompts are correctly applied for GSM8K:
```python
if data_name != "gsm8k" and (
    len(demo_prompt) == 0
    or (
        args.adapt_few_shot
        and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
    )
):
    full_prompt = context
else:
    if args.prompt_type == "qwen25-math-cot":
        # Hotfix to support putting all demos into a single turn
        full_prompt = demo_prompt + splitter + example["question"]
        full_prompt = input_template.format(input=full_prompt)
    else:
        full_prompt = demo_prompt + splitter + context
```
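One side effect of the condition above: when `data_name == "gsm8k"` and `demo_prompt` is empty (zero-shot), the outer check is `False`, so the `else` branch runs with an empty `demo_prompt`. A possible variant, sketched below under the assumption that the zero-shot path should stay unchanged, moves the dataset check inside the `adapt_few_shot` clause. The helper name `use_context_only` is hypothetical, and its parameters stand in for `data_name`, `demo_prompt`, `args.adapt_few_shot`, and `example["gt_ans"]` from the snippet.

```python
# Letters treated as multiple-choice answers in the original condition.
CHOICE_LETTERS = ["A", "B", "C", "D", "E"]


def use_context_only(data_name: str, demo_prompt: str,
                     adapt_few_shot: bool, gt_ans: str) -> bool:
    """Hypothetical variant: decide whether to skip the demonstrations."""
    # No demonstrations available: always fall back to the plain context,
    # regardless of the dataset (keeps zero-shot behaviour unchanged).
    if len(demo_prompt) == 0:
        return True
    # Only let --adapt_few_shot drop demos for datasets other than GSM8K.
    return bool(
        adapt_few_shot
        and data_name != "gsm8k"
        and gt_ans not in CHOICE_LETTERS
    )


# GSM8K, 8-shot, numeric answer: demonstrations are kept.
print(use_context_only("gsm8k", "Q: ...\nA: ...", True, "18"))  # False
# GSM8K, zero-shot: plain context is used, as before the change.
print(use_context_only("gsm8k", "", True, "18"))                # True
```

Whether GSM8K should be special-cased by name at all, or whether `adapt_few_shot` is meant to behave differently, is something the maintainers would need to confirm.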

## Question
Would it be appropriate to officially integrate this adjustment? Or might I have missed some intended behavior regarding the few-shot logic?
