
Add Automatic Multi Hop / Long dependency QA generation for evaluation #21

@muthukumaranR

Description

Feature Type

New functionality

Problem Statement

Human-generated QA sets are time-consuming to produce. Automated QA generation is hard when the questions involve more complexity than single-hop QA over a targeted domain.

Proposed Solution

Advanced Reading Comprehension Dataset Annotation over Individual Earth Science Papers

Objective:
To annotate high-quality QA pairs that require deep reading over full-length earth science papers. The questions should involve synthesizing information from various content types (e.g., text snippets, text-table, text-figure) and require long-form answers.

Annotation Process:
Stage 1: Expert Pilot Annotation
1. Deep Reading & Question Formulation: Experts perform either a thorough or a skim reading of all sections of a paper p. Based on a predefined schema (https://proceedings.mlr.press/v202/lee23n/lee23n.pdf), they formulate reasoning-intensive test questions. The schema can be adapted to the earth science domain.

2. Answer Construction
    For each question q derived from a paper p, experts:

        a) Highlight all relevant content units in p that support answering q.

        b) Compose a comprehensive long-form answer a grounded in the highlighted evidence.

        c) Indicate whether the answer required additional background knowledge or inference beyond the highlighted content, or whether the question is unanswerable (a sketch of the resulting annotation record follows this list).
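
A minimal sketch of the annotation record Stage 1 could produce; the field names and types are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceSpan:
    """A highlighted content unit in the source paper."""
    section: str       # e.g. "Results" or "Table 2 caption"
    content_type: str  # "text", "table", or "figure"
    text: str          # the highlighted span itself


@dataclass
class QAAnnotation:
    """One expert-annotated question-answer pair for a single paper."""
    paper_id: str
    question: str
    answer: str                                # long-form, grounded in evidence
    evidence: List[EvidenceSpan] = field(default_factory=list)
    needs_background_knowledge: bool = False   # inference beyond the highlighted content
    answerable: bool = True                    # False if the question cannot be answered from the paper
    question_type: Optional[str] = None        # category from the adapted schema
```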

Stage 2: LLM-based auto-annotation
Using the QA taxonomy developed in Stage 1, use a large LLM to generate question-answer pairs for a large set of earth science papers.
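
A rough sketch of how Stage 2 could be wired up, assuming a generic `generate(prompt) -> str` wrapper around whatever LLM is used; the prompt wording and the JSON output contract are assumptions, not a tested recipe:

```python
import json
from typing import Callable, List

PROMPT_TEMPLATE = """You are annotating an earth science paper for reading comprehension.
Using the question taxonomy below, write {n} question-answer pairs that require
synthesizing information across sections, tables, and figures. Answers must be
long-form and grounded only in the paper text. Return a JSON list of objects
with keys: "question", "answer", "evidence_sections", "question_type".

Taxonomy:
{taxonomy}

Paper:
{paper_text}
"""


def auto_annotate(paper_text: str,
                  taxonomy: str,
                  generate: Callable[[str], str],
                  n: int = 5) -> List[dict]:
    """Generate candidate QA pairs for one paper; Stage 3 filters them."""
    prompt = PROMPT_TEMPLATE.format(n=n, taxonomy=taxonomy, paper_text=paper_text)
    raw = generate(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Model did not return valid JSON; the caller can retry or discard.
        return []
```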

Stage 3: Expert verification
Domain experts review and filter the automatically generated QA pairs from Stage 2, verifying their quality.
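
One possible way to record the expert review loop; the verdict labels and the assumption that each generated pair carries an "id" key are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verification:
    """An expert's verdict on one auto-generated QA pair."""
    qa_id: str
    verdict: str                 # "accept", "fix", or "reject" (assumed label set)
    corrected_answer: str = ""   # only filled when verdict == "fix"
    notes: str = ""


def keep_verified(qa_pairs: List[dict], verdicts: List[Verification]) -> List[dict]:
    """Retain accepted pairs and apply expert corrections to fixable ones."""
    by_id = {v.qa_id: v for v in verdicts}
    kept = []
    for qa in qa_pairs:
        v = by_id.get(qa["id"])
        if v is None or v.verdict == "reject":
            continue
        if v.verdict == "fix":
            qa = {**qa, "answer": v.corrected_answer}
        kept.append(qa)
    return kept
```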


Advanced Reading Comprehension Dataset Annotation over Multiple Earth Science Papers

Repeat the above process, this time focusing on the related work sections of each paper. Ensure that all QA pairs require synthesizing information from multiple papers.
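
For the multi-paper setting, one option is to pair each paper with a few papers cited in its related work section, so every generated question necessarily draws on more than one source. The sketch below assumes an upstream parsing step has already produced the paper texts and the related-work citation lists:

```python
from typing import Dict, List


def build_multi_paper_contexts(papers: Dict[str, str],
                               related_work_citations: Dict[str, List[str]],
                               max_cited: int = 2) -> List[dict]:
    """Pair each paper with papers cited in its related work section.

    `papers` maps paper_id -> full text; `related_work_citations` maps
    paper_id -> ids of papers cited in its related work section (both
    assumed to come from an upstream parsing step).
    """
    contexts = []
    for paper_id, cited_ids in related_work_citations.items():
        cited_available = [c for c in cited_ids if c in papers][:max_cited]
        if not cited_available:
            continue
        contexts.append({
            "anchor_paper": paper_id,
            "cited_papers": cited_available,
            "combined_text": "\n\n".join(
                papers[pid] for pid in [paper_id] + cited_available
            ),
        })
    return contexts
```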

We also need to survey existing approaches in the literature that target multi-hop / long-form QA pair generation and that support deeper zero-shot synthetic QA generation.

Alternative Solutions

User Benefits

Would speed up the evaluation of retrieval systems.

Implementation Ideas

No response

Contribution

  • I'm willing to submit a PR for this feature
  • I'm willing to test this feature
  • I'm willing to help document this feature

Additional Context

  • What does it mean to test Literature review components?
