ARC benchmark should have several "modes" per question #2352

@edoardocontente

Hi,

The ARC benchmark, implemented in deepeval/benchmarks/arc/arc.py, should support several "modes" per question, each corresponding to a different confinement instruction. In practice this means turning the attribute self.confinement_instructions into a dict, similar to the TruthfulQA implementation in deepeval/benchmarks/truthful_qa/truthful_qa.py. The reason is that, while the majority of questions label their options with letters, e.g.

'Which are two parts of the carbon cycle?

        A. freezing and thawing

        B. growth and reproduction

        C. evaporation and precipitation

        D. photosynthesis and respiration

        Answer: '

some questions label their options with numbers instead:

'Different species of carnivorous animals that share the same habitat
        in an ecosystem may

        1. become decomposers

        2. compete for food

        3. produce their own food

        4. mate with each other

        Answer: '

However, self.confinement_instructions is always "Output 'A', 'B', 'C', or 'D'. Full answer not needed.", regardless of the question. For questions with numbered options, this necessarily leads the model to err, because it tries to answer with A, B, C or D even though the options are labeled 1 to 4.
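To illustrate the idea, here is a minimal sketch of how the instruction could be chosen per question. It is not deepeval's actual code: the names `CONFINEMENT_INSTRUCTIONS`, `detect_label_style` and `confinement_instruction_for` are hypothetical, the wording of the numeric instruction is my own assumption, and the detection simply looks at the option labels in the formatted prompt.

```python
import re

# Hypothetical mapping from option-label style to instruction text.
# The "numbers" wording is an assumption, mirroring the existing letter instruction.
CONFINEMENT_INSTRUCTIONS = {
    "letters": "Output 'A', 'B', 'C', or 'D'. Full answer not needed.",
    "numbers": "Output '1', '2', '3', or '4'. Full answer not needed.",
}


def detect_label_style(prompt: str) -> str:
    """Return "numbers" if every option label is a digit, otherwise "letters"."""
    # An option line starts with optional whitespace, a single label, then ". ".
    labels = re.findall(r"^\s*([A-D1-4])\.\s", prompt, flags=re.MULTILINE)
    if labels and all(label.isdigit() for label in labels):
        return "numbers"
    return "letters"


def confinement_instruction_for(prompt: str) -> str:
    """Pick the instruction that matches the labels used in this question."""
    return CONFINEMENT_INSTRUCTIONS[detect_label_style(prompt)]
```

With something like this, the carbon cycle question above would keep the current A/B/C/D instruction, while the carnivorous animals question would get the 1/2/3/4 variant.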

Thank you!
