Is the human-annotated dataset publicly available? #35

@lingoubb

Has the annotated dataset mentioned in this part of the paper been made public?

Dataset. We randomly sample a subset of 400 queries from the complete ALIGNBENCH dataset. To make sure each category consists of enough samples to produce reliable results, smaller categories are upsampled. To cover LLMs with a wider levels of capability, we adopt answers from 8 LLMs, including GPT-4 (OpenAI, 2023), three versions of ChatGLM series (Zeng et al., 2022; Du et al., 2022), Sparkdesk, Qwen-plus-v1-search (Bai et al., 2023a), InternLM-7B-Chat (Team, 2023) and Chinese-Llama2-7B-Chat, producing a total of 3200 question-answer pairings. Subsequent to the compilation of the evaluation set, the question-answer-reference triples are delivered to human annotators, tasked with assigning quality ratings to the answers according to the references. Given the inherent limitations bound to human cognition, annotators are instructed to employ a rating on a scale from 1 to 5. The scores are indicative of response quality, with higher scores epitomizing superior quality and profound satisfaction. In particular, a score of 1 marks irrelevant, incorrect, or potentially harmful responses.
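
For reference, the counts in the quoted passage (400 queries, 8 models, 3200 pairings, ratings from 1 to 5) can be checked with a minimal sketch like the one below. It is only an illustration of the arithmetic, not the authors' pipeline: the query IDs are hypothetical, the three ChatGLM labels are placeholders because the passage does not name the versions, and the upsampling step is left out since the passage does not say how it was done.

```python
from itertools import product

# Model names follow the quoted passage. The three ChatGLM versions are not
# named there, so these labels are placeholders, not real version names.
MODELS = [
    "GPT-4",
    "ChatGLM (version 1 of 3)",
    "ChatGLM (version 2 of 3)",
    "ChatGLM (version 3 of 3)",
    "Sparkdesk",
    "Qwen-plus-v1-search",
    "InternLM-7B-Chat",
    "Chinese-Llama2-7B-Chat",
]

# Hypothetical identifiers standing in for the 400 sampled queries.
QUERY_IDS = [f"q{i:03d}" for i in range(400)]


def build_pairings(query_ids, models):
    """Return one (query_id, model) pair for every query and every model."""
    return list(product(query_ids, models))


def is_valid_rating(score):
    """Check that a human rating is an integer on the 1-to-5 scale."""
    return (
        isinstance(score, int)
        and not isinstance(score, bool)
        and 1 <= score <= 5
    )


pairings = build_pairings(QUERY_IDS, MODELS)
print(len(pairings))  # 400 queries x 8 models
```

Each pairing would then carry one human rating, which is the annotated data this issue asks about.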
