
Add Automatic Multi Hop / Long dependency QA generation for evaluation #21

@muthukumaranR

Description

Feature Type

New functionality

Problem Statement

Human-generated QA sets are time-consuming to produce. Automated QA generation is hard when the questions involve more complexity than single-hop QA over a targeted domain.

Proposed Solution

Advanced Reading Comprehension Dataset Annotation over Individual Earth Science Papers

Objective:
To annotate high-quality QA pairs that require deep reading over full-length earth science papers. The questions should involve synthesizing information from various content types (e.g., text snippets, text-table, text-figure) and require long-form answers.

Annotation Process:
Stage 1: Expert Pilot Annotation
1. Deep Reading & Question Formulation: Experts perform either a thorough or a skim reading of all sections of a paper p. Based on a predefined schema (https://proceedings.mlr.press/v202/lee23n/lee23n.pdf), they formulate reasoning-intensive test questions. The schema can be adapted to the earth science domain.

2. Answer Construction
    For each question q derived from a paper p, experts:

        a) Highlight all relevant content units in p that support answering q.

        b) Compose a comprehensive long-form answer a grounded in the highlighted evidence.

        c) Indicate whether the answer required additional background knowledge or inference beyond the highlighted content, or whether the question is unanswerable (a sketch of the resulting annotation record follows this list).
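
A minimal sketch of the annotation record Stage 1 could produce; the field names and types are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvidenceSpan:
    """A highlighted content unit in the source paper."""
    section: str       # e.g. "Results" or "Table 2 caption"
    content_type: str  # "text", "table", or "figure"
    text: str          # the highlighted span itself


@dataclass
class QAAnnotation:
    """One expert-annotated question-answer pair for a single paper."""
    paper_id: str
    question: str
    answer: str                                # long-form, grounded in evidence
    evidence: List[EvidenceSpan] = field(default_factory=list)
    needs_background_knowledge: bool = False   # inference beyond the highlighted content
    answerable: bool = True                    # False if the question cannot be answered from the paper
    question_type: Optional[str] = None        # category from the adapted schema
```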

Stage 2: LLM-based auto-annotation
Using the QA taxonomy developed in Stage 1, use a large LLM to generate question-answer pairs for a large set of earth science papers.
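
A rough sketch of how Stage 2 could be wired up, assuming a generic `generate(prompt) -> str` wrapper around whatever LLM is used; the prompt wording and the JSON output contract are assumptions, not a tested recipe:

```python
import json
from typing import Callable, List

PROMPT_TEMPLATE = """You are annotating an earth science paper for reading comprehension.
Using the question taxonomy below, write {n} question-answer pairs that require
synthesizing information across sections, tables, and figures. Answers must be
long-form and grounded only in the paper text. Return a JSON list of objects
with keys: "question", "answer", "evidence_sections", "question_type".

Taxonomy:
{taxonomy}

Paper:
{paper_text}
"""


def auto_annotate(paper_text: str,
                  taxonomy: str,
                  generate: Callable[[str], str],
                  n: int = 5) -> List[dict]:
    """Generate candidate QA pairs for one paper; Stage 3 filters them."""
    prompt = PROMPT_TEMPLATE.format(n=n, taxonomy=taxonomy, paper_text=paper_text)
    raw = generate(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Model did not return valid JSON; the caller can retry or discard.
        return []
```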

Stage 3: Expert verification
Domain experts review and filter the automatically generated QA pairs from Stage 2, verifying their quality.
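
One possible way to record the expert review loop; the verdict labels and the assumption that each generated pair carries an "id" key are illustrative:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verification:
    """An expert's verdict on one auto-generated QA pair."""
    qa_id: str
    verdict: str                 # "accept", "fix", or "reject" (assumed label set)
    corrected_answer: str = ""   # only filled when verdict == "fix"
    notes: str = ""


def keep_verified(qa_pairs: List[dict], verdicts: List[Verification]) -> List[dict]:
    """Retain accepted pairs and apply expert corrections to fixable ones."""
    by_id = {v.qa_id: v for v in verdicts}
    kept = []
    for qa in qa_pairs:
        v = by_id.get(qa["id"])
        if v is None or v.verdict == "reject":
            continue
        if v.verdict == "fix":
            qa = {**qa, "answer": v.corrected_answer}
        kept.append(qa)
    return kept
```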


Advanced Reading Comprehension Dataset Annotation over Multiple Earth Science Papers

Repeat the above process, this time focusing on the related work sections of each paper. Ensure that all QA pairs require synthesizing information from multiple papers.
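
For the multi-paper setting, one option is to pair each paper with a few papers cited in its related work section, so every generated question necessarily draws on more than one source. The sketch below assumes an upstream parsing step has already produced the paper texts and the related-work citation lists:

```python
from typing import Dict, List


def build_multi_paper_contexts(papers: Dict[str, str],
                               related_work_citations: Dict[str, List[str]],
                               max_cited: int = 2) -> List[dict]:
    """Pair each paper with papers cited in its related work section.

    `papers` maps paper_id -> full text; `related_work_citations` maps
    paper_id -> ids of papers cited in its related work section (both
    assumed to come from an upstream parsing step).
    """
    contexts = []
    for paper_id, cited_ids in related_work_citations.items():
        cited_available = [c for c in cited_ids if c in papers][:max_cited]
        if not cited_available:
            continue
        contexts.append({
            "anchor_paper": paper_id,
            "cited_papers": cited_available,
            "combined_text": "\n\n".join(
                papers[pid] for pid in [paper_id] + cited_available
            ),
        })
    return contexts
```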

We also need to survey existing approaches in the literature that target multi-hop / long-form QA pair generation and that support deeper zero-shot synthetic QA generation.

Alternative Solutions

User Benefits

Would speed up the evaluation of retrieval systems.

Implementation Ideas

No response

Contribution

  • I'm willing to submit a PR for this feature
  • I'm willing to test this feature
  • I'm willing to help document this feature

Additional Context

  • What does it mean to test Literature review components?
