I am developing a knowledge question-answering benchmark for a specific domain and would like to learn from your approach so that our dataset design is rigorous and effective.
While the paper provides valuable insights, certain details about the category selection and question-processing methods are not fully spelled out. To better understand and potentially replicate your methodology, I would greatly appreciate clarification on the following points:
1. Category Selection: The paper mentions that the original 57 MMLU categories were merged into 14 broader disciplines to reduce redundancy and focus on key knowledge areas. Could you share more detail on how these 14 categories were determined? For example, were they chosen through statistical analysis (e.g., semantic clustering of questions, as in Sketch 1 below), expert judgment, the distribution of questions across sources, or a combination of these factors?
2. Question Formatting: When integrating questions from sources such as the STEM Website and TheoremQA, how were the GPT-4-Turbo prompts designed to extract concise answers from full solutions or brief answers (Sketch 2 below shows the kind of template I have in mind)? Were specific templates or NLP techniques (e.g., text summarization) used to keep the multiple-choice formatting accurate and consistent?
3. Distractor Generation: What strategies or prompts were used to generate the initial three distractors, and then the additional six during the option-augmentation phase that expanded each question from four to ten options (see Sketch 3 below)? For instance, were distractors derived from common error patterns, semantic similarity, or domain-specific knowledge to keep them plausible yet challenging?
4. Option Quality Control: Beyond expert review, were any automated tools (e.g., semantic analysis, knowledge bases) or iterative processes used to check the plausibility and difficulty of the distractors, particularly during option augmentation (Sketch 4 below is the kind of filter I am imagining)?
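
To make these questions concrete, here are minimal sketches of what I currently imagine for each step. Everything in them (model names, prompts, thresholds) is my own assumption, not something taken from the paper or this repository.

Sketch 1: grouping the 57 MMLU categories by the semantic similarity of their questions, to see whether 14 clusters emerge naturally:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model is my assumption

def category_embedding(questions: list[str]) -> np.ndarray:
    # Mean-pool the sentence embeddings of one category's questions.
    return model.encode(questions).mean(axis=0)

def cluster_categories(categories: dict[str, list[str]], k: int = 14) -> dict[int, list[str]]:
    # categories: MMLU category name -> list of its question texts.
    names = list(categories)
    embeddings = np.stack([category_embedding(categories[n]) for n in names])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
    merged: dict[int, list[str]] = {}
    for name, label in zip(names, labels):
        merged.setdefault(int(label), []).append(name)
    return merged  # cluster id -> original categories that would be merged
```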
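Sketch 2: the kind of answer-extraction template I have in mind for GPT-4-Turbo; the actual prompt and decoding settings used in MMLU-Pro may well differ:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXTRACTION_TEMPLATE = """\
You are given a question and its worked solution.
Return only the final answer as a short phrase or number, with no reasoning.

Question: {question}
Solution: {solution}
Final answer:"""

def extract_concise_answer(question: str, solution: str) -> str:
    # Ask the model to distill a full solution down to its final answer.
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,  # deterministic extraction
        messages=[{
            "role": "user",
            "content": EXTRACTION_TEMPLATE.format(question=question, solution=solution),
        }],
    )
    return response.choices[0].message.content.strip()
```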
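Sketch 3: a guessed distractor-augmentation prompt aimed at common error patterns; again, purely illustrative:

```python
from openai import OpenAI

client = OpenAI()

DISTRACTOR_TEMPLATE = """\
Question: {question}
Correct answer: {answer}
Existing options: {options}

Write {n} new incorrect options. Each one must:
- reflect a mistake or misconception a capable student could plausibly make,
- match the correct answer in topic, form, and level of detail,
- not restate the correct answer or any existing option.
Return exactly one option per line, with no numbering."""

def augment_options(question: str, answer: str, options: list[str], n: int = 6) -> list[str]:
    # Expand an item's option set (e.g., from 4 toward 10 choices).
    prompt = DISTRACTOR_TEMPLATE.format(
        question=question, answer=answer, options="; ".join(options), n=n)
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines()
            if line.strip()]
```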
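Sketch 4: an automated plausibility filter based on embedding similarity, as one example of the quality control I was asking about in point 4; the thresholds are arbitrary placeholders:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_suspect_distractors(answer: str, distractors: list[str],
                             lo: float = 0.30, hi: float = 0.95) -> list[str]:
    # Flag distractors that are near-paraphrases of the correct answer
    # (a possible unintended second correct option) or semantically
    # unrelated to it (trivially eliminable). Flagged items would be
    # routed back to expert review rather than dropped automatically.
    answer_emb = model.encode(answer, convert_to_tensor=True)
    distractor_embs = model.encode(distractors, convert_to_tensor=True)
    similarities = util.cos_sim(answer_emb, distractor_embs)[0]
    return [d for d, s in zip(distractors, similarities) if s < lo or s > hi]
```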
These details would significantly enhance my understanding of MMLU-Pro’s construction process and guide our efforts to create a domain-specific benchmark. Thank you for your time and for sharing your impactful work.