Description
Is the annotated dataset mentioned in this part publicly available? Quoting the paper:

"Dataset. We randomly sample a subset of 400 queries from the complete ALIGNBENCH dataset. To make sure each category consists of enough samples to produce reliable results, smaller categories are upsampled. To cover LLMs with a wider range of capability levels, we adopt answers from 8 LLMs, including GPT-4 (OpenAI, 2023), three versions of the ChatGLM series (Zeng et al., 2022; Du et al., 2022), Sparkdesk, Qwen-plus-v1-search (Bai et al., 2023a), InternLM-7B-Chat (Team, 2023), and Chinese-Llama2-7B-Chat, producing a total of 3200 question-answer pairs. Subsequent to the compilation of the evaluation set, the question-answer-reference triples are delivered to human annotators, tasked with assigning quality ratings to the answers according to the references. Given the inherent limitations bound to human cognition, annotators are instructed to employ a rating on a scale from 1 to 5. The scores are indicative of response quality, with higher scores indicating superior quality and greater satisfaction. In particular, a score of 1 marks irrelevant, incorrect, or potentially harmful responses."
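For reference, the upsampling step described in the quoted passage (sampling smaller categories with replacement until each reaches a reliable size) can be sketched roughly as below. This is a minimal illustration, not the authors' code; the function name, the `min_per_cat` threshold, and the toy category data are all assumptions.

```python
import random

def upsample_categories(samples_by_cat, min_per_cat):
    """Hypothetical sketch: ensure each category has at least
    `min_per_cat` queries by sampling existing queries with
    replacement (duplicates are intentional for small categories)."""
    balanced = {}
    for cat, queries in samples_by_cat.items():
        if len(queries) >= min_per_cat:
            balanced[cat] = list(queries)
        else:
            # draw extra queries with replacement to fill the gap
            extra = random.choices(queries, k=min_per_cat - len(queries))
            balanced[cat] = list(queries) + extra
    return balanced

# toy example (category names and counts are illustrative only)
cats = {"math": ["q1", "q2"], "writing": ["q3", "q4", "q5", "q6"]}
balanced = upsample_categories(cats, min_per_cat=4)
```

With 400 queries each answered by the 8 listed LLMs, this construction yields the 400 × 8 = 3200 question-answer pairs mentioned in the passage.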