Description
Hi,
When running the GSM8K evaluation experiments in an 8-shot setting, I noticed that the few-shot examples were not applied. Specifically, the current implementation takes the first branch of the following condition (in here):
    if len(demo_prompt) == 0 or (
        args.adapt_few_shot and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
    ):
        full_prompt = context
    else:
        if args.prompt_type == "qwen25-math-cot":
            # Hotfix to supportting put all demos into a single turn
            full_prompt = demo_prompt + splitter + example["question"]
            full_prompt = input_template.format(input=full_prompt)
        else:
            full_prompt = demo_prompt + splitter + context

Since GSM8K has numeric ground-truth answers (gt_ans) instead of multiple-choice letters ("A", "B", "C", "D", "E"), the above condition evaluates to True, causing the few-shot demonstrations (demo_prompt) to be skipped unintentionally. As a result, the model receives only the current question context without the desired few-shot examples.

I used the following evaluation script (LLaMA-3.2-1B-Instruct):
    set -ex
    PROMPT_TYPE=$1
    MODEL_NAME_OR_PATH=$2
    SAVE_DIR=$3
    OUTPUT_DIR=${SAVE_DIR}/math_eval
    SPLIT="test"
    NUM_TEST_SAMPLE=-1
    # English open datasets
    DATA_NAME="gsm8k,math"
    TOKENIZERS_PARALLELISM=false \
    python3 -u math_eval.py \
        --model_name_or_path ${MODEL_NAME_OR_PATH} \
        --data_name ${DATA_NAME} \
        --output_dir ${OUTPUT_DIR} \
        --split ${SPLIT} \
        --prompt_type ${PROMPT_TYPE} \
        --num_test_sample ${NUM_TEST_SAMPLE} \
        --seed 0 \
        --temperature 0 \
        --n_sampling 1 \
        --top_p 1 \
        --start 0 \
        --end -1 \
        --use_vllm \
        --save_outputs \
        --overwrite \
        --adapt_few_shot \
        --num_shots 8 \
        --apply_chat_template

Proposed Solution
I modified the conditional logic as follows, and verified that the few-shot prompts are correctly applied for GSM8K:
    if data_name != "gsm8k" and (
        len(demo_prompt) == 0 or (
            args.adapt_few_shot and example["gt_ans"] not in ["A", "B", "C", "D", "E"]
        )
    ):
        full_prompt = context
    else:
        if args.prompt_type == "qwen25-math-cot":
            # Hotfix to supportting put all demos into a single turn
            full_prompt = demo_prompt + splitter + example["question"]
            full_prompt = input_template.format(input=full_prompt)
        else:
            full_prompt = demo_prompt + splitter + context

Question
Would it be appropriate to officially integrate this adjustment? Or might I have missed some intended behavior regarding the few-shot logic?
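
For reference, here is a minimal standalone sketch that isolates just the two conditions so they can be compared without running the full evaluation. The helper names `skips_demos_original` and `skips_demos_patched`, the `CHOICE_LETTERS` constant, and the sample values are hypothetical and only for illustration; they are not part of the repository. The sketch assumes `gt_ans` is a string such as "18" for GSM8K.

```python
# Hypothetical helpers that mirror only the boolean conditions quoted above.
CHOICE_LETTERS = ["A", "B", "C", "D", "E"]


def skips_demos_original(demo_prompt, adapt_few_shot, gt_ans):
    """True when the original condition sends the example down `full_prompt = context`."""
    return len(demo_prompt) == 0 or (
        adapt_few_shot and gt_ans not in CHOICE_LETTERS
    )


def skips_demos_patched(data_name, demo_prompt, adapt_few_shot, gt_ans):
    """Same check with the proposed `data_name != "gsm8k"` guard in front."""
    return data_name != "gsm8k" and skips_demos_original(
        demo_prompt, adapt_few_shot, gt_ans
    )


if __name__ == "__main__":
    demos = "Question: ...\nAnswer: ..."  # placeholder for a non-empty demo_prompt

    # GSM8K example with --adapt_few_shot and a numeric answer (assumed to be a string).
    print(skips_demos_original(demos, True, "18"))          # True: demos are dropped
    print(skips_demos_patched("gsm8k", demos, True, "18"))  # False: demos are kept

    # A multiple-choice answer keeps the demos under the original condition.
    print(skips_demos_original(demos, True, "B"))           # False

    # Other datasets with non-letter answers are unaffected by the patch.
    print(skips_demos_patched("math", demos, True, "\\frac{1}{2}"))  # True
```

Two things stand out from this sketch, in case they matter for the intended behavior: with the patch, a GSM8K run with an empty `demo_prompt` would also take the `else` branch, and the `math` dataset in my script still has its demos skipped whenever `--adapt_few_shot` is set and the answer is not a letter.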