Description
- I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
What is unclear to you? What would you like to know?
TL;DR
- I've been running into several issues with RAGAs in my domain.
- So I modified the prompts to address these issues.
- Things seem to have improved overall. Is there a way to demonstrate this objectively?
- Or should I just be satisfied for now, since the issues I encountered seem to have disappeared?
Situation Details
Hello, I've been finding RAGAs, the amazing RAG system evaluation tool, very useful.
I've been applying RAG to hundreds of document files. To evaluate this system, I (1) generated a knowledge graph and (2) ran scenario generation, as suggested in the official RAGAs documentation.
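For reference, the two steps I ran look roughly like this (a minimal sketch following the documented flow; generator_llm, generator_embeddings, and docs stand in for my own models and loaded documents, and the exact API may differ slightly between RAGAs versions):

from ragas.testset import TestsetGenerator
from ragas.testset.graph import KnowledgeGraph, Node, NodeType
from ragas.testset.transforms import apply_transforms, default_transforms

# (1) Build a knowledge graph from the loaded documents.
kg = KnowledgeGraph()
for doc in docs:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata},
        )
    )
transforms = default_transforms(documents=docs, llm=generator_llm, embedding_model=generator_embeddings)
apply_transforms(kg, transforms)

# (2) Generate test scenarios from the knowledge graph.
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, knowledge_graph=kg)
testset = generator.generate(testset_size=50)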
When I ran this process in my domain, I encountered two issues.
First, I detected excessive extraction of article or clause numbers during the knowledge graph generation process.
Since the documents I work with are primarily regulatory documents, phrases like "Article 0" and "Section 0" appear repeatedly. RAGAs' NERExtractor identified these as entities, which created relationships between semantically unrelated documents and produced giant components in the knowledge graph.
"entities": [
"2012. 10. 5.",
"2020. 5. 1.",
"2021. 1. 1.",
"2023. 10. 17.",
"1989. 2. 20.",
"2013. 1. 7.",
"1997. 3. 17.",
"2025. 4. 18.",
"1998. 9. 30.",
"1990. 2. 16."
] "entities": [
"Chapter 5",
"Chapter 6",
"Article 61",
"Article 62",
"Article 63",
"Article 64",
"Article 65",
"Article 66",
"Article 67",
"Article 68"
]
Second, I found that the testset generation process often produced questions that were too easy or too simple. Questions generated by synthesizers such as SingleHopSpecificQuerySynthesizer or MultiHopSpecificQuerySynthesizer were sometimes vague or generic (e.g., "What is the metropolitan mayor involved in?").
What I tried to solve these problems
To address the first problem, naive entity extraction, I combined prompt engineering with regular-expression filtering: I made the instructions in the NERPrompt class more stringent and added a regex-based entity filter to NERExtractor.
import re
import typing as t
from dataclasses import dataclass

# Import paths follow the ragas layout I am using; adjust them to your version.
from ragas.prompt import PydanticPrompt
from ragas.testset.graph import Node
from ragas.testset.transforms.extractors.llm_based import (
    NERExtractor,
    NERPrompt,
    NEROutput,
    TextWithExtractionLimit,
)


class BetterNERPrompt(NERPrompt):
    instruction: str = (
        "Extract the named entities from the given text, limiting the output to the top entities. "
        "Focus on identifying specific names of people, organizations, locations, laws, regulations, and other proper nouns. "
        "Broadly exclude common nouns, dates, monetary values, and document-specific references like article or clause numbers (e.g., 'Chapter 0', 'Article 0', and all similar patterns). "
        "Ensure the number of entities does not exceed the specified maximum."
    )
    examples: t.List[t.Tuple[TextWithExtractionLimit, NEROutput]] = [
        (
            TextWithExtractionLimit(
                text="""Elon Musk, the CEO of Tesla and SpaceX, announced plans to expand operations to new locations in Europe and Asia.
                This expansion is expected to create thousands of jobs, particularly in cities like Berlin and Shanghai.""",
                max_num=10,
            ),
            NEROutput(
                entities=[
                    "Elon Musk",
                    "Tesla",
                    "SpaceX",
                    "Europe",
                    "Asia",
                    "Berlin",
                    "Shanghai",
                ]
            ),
        ),
    ]
@dataclass
class BetterNERExtractor(NERExtractor):
    """
    Extracts named entities from the given text.

    Attributes
    ----------
    property_name : str
        The name of the property to extract. Defaults to "entities".
    prompt : NERPrompt
        The prompt used for extraction.
    """

    prompt: PydanticPrompt[TextWithExtractionLimit, NEROutput] = BetterNERPrompt()

    async def extract(self, node: Node) -> t.Tuple[str, t.List[str]]:
        node_text = node.get_property("page_content")
        if node_text is None:
            return self.property_name, []
        chunks = self.split_text_by_token_limit(node_text, self.max_token_limit)
        entities = []
        for chunk in chunks:
            result = await self.prompt.generate(
                self.llm,
                data=TextWithExtractionLimit(text=chunk, max_num=self.max_num_entities),
            )
            entities.extend(result.entities)
        # Post-processing to filter out unwanted patterns
        # (alternatives joined with "|" so any one of them can match).
        unwanted_patterns = re.compile(
            r"^Chapter \d"
            r"|^Article \d"
            r"|^\d{4}\.\s*\d{1,2}\.\s*\d{1,2}\.?$"  # e.g., 2012. 10. 5.
        )
        filtered_entities = [
            entity for entity in entities if not unwanted_patterns.match(entity.strip())
        ]
        return self.property_name, list(set(filtered_entities))

Second, to reduce the generation of overly simplistic QA testsets, I also used prompt engineering. Specifically, I replaced the query_answer_generation_prompt in SingleHopSpecificQuerySynthesizer with the prompt below.
custom_single_hop_instruction = (
"""You are a QuerySynthesizer specialized in generating evaluation questions for RAG (Retrieval-Augmented Generation) systems.
Your goal is to create high-quality, context-faithful questions from the provided document.
These questions will be used to evaluate whether a RAG system can retrieve and answer accurately.
## Instructions:
1. **Context-faithful**: Only create questions that can be answered entirely using the given document.
- If the answer requires guessing, common sense outside the document, or personal opinion, discard the question.
- The answer must be explicitly present or directly inferable from the text.
2. **Relevance**:
- Focus on operational procedures, definitions, restrictions, approval processes, responsible roles, and mandatory requirements stated in the document.
- Avoid trivial questions with obvious answers (e.g., “What is the title of the document?”).
- Avoid overly broad questions that require summarizing the whole document.
3. **Question types**:
- Include a mix of **factual lookup** questions and **context comprehension** questions.
Example:
- Lookup: “Who chairs the research network security review committee?”
- Comprehension: “Why must the research network be physically separated from other networks?”
4. **Clarity**:
- Write each question so it is clear, unambiguous, and concise.
- Avoid double-barreled questions (asking two things at once).
5. **Diversity**:
- Cover different sections of the document so questions are not clustered in one part.
- Include policy requirements, exceptions, responsibilities, and process details.
6. **Output format**:
- Be sure to follow the language stated in your persona.
## Self-Check before output:
For each generated question, ask yourself:
- “Can I answer this using only the provided document?”
- “Is this relevant to operational details or compliance rules in the text?”
- “Does this avoid triviality and ambiguity?”
The answers to all of the above questions must be Yes."""
)
custom_multi_hop_instruction = (
"""You are a MultiHopSpecificQuerySynthesizer for evaluating RAG systems.
Your task is to produce ONE high-quality, context-faithful multi-hop question and its answer.
## Inputs
- persona: a short description of the role/perspective (e.g., security officer, network admin).
- themes: an array of short phrases extracted/generated from the context that justify multi-hop composition.
- style: brief stylistic preference for the question wording.
- length: desired question length guideline (short/medium).
- context_segments: multiple text chunks, each tagged as <1-hop>, <2-hop>, <3-hop>, ...
## Objectives
1) Multi-hop requirement:
- The question MUST require combining at least TWO distinct context segments (e.g., <1-hop> + <2-hop>).
- The two (or more) sub-facts must be indispensable to answer correctly (no single segment should suffice).
2) Specificity requirement:
- Make the question concrete: include exact entities/roles, conditions, deadlines, versions, or approval authorities found in the text.
- Avoid vague “why”/opinion questions unless the explanation is explicitly stated in the context.
3) Theme incorporation:
- Explicitly incorporate AT LEAST ONE theme phrase in the question text.
- Prefer exact phrase match; if awkward, use a minimally altered paraphrase while preserving meaning.
4) Faithfulness:
- The answer MUST be constructed ONLY from the provided segments (verbatim or directly inferable).
- Do NOT import outside knowledge or unstated assumptions.
5) Clarity:
- Single, unambiguous question. No double-barreled prompts.
- Use the persona to shape what is asked (scope and angle), not to add new facts.
## Construction Steps
a) Identify two or more segments that connect via a concrete relation (e.g., role→approval chain, condition→exception, requirement→deadline).
b) Draft a question that forces the combination of those segments and embeds a theme phrase.
c) Compile the answer strictly from the supporting segments.
d) Extract minimal supporting evidence (short snippets) from each used segment.
## Self-Check (apply before output)
- Can this be answered ONLY by combining ≥2 segments?
- Does the question explicitly include at least one theme phrase?
- Are roles/conditions/numbers/dates stated precisely as in the context?
- Is the answer free of speculation or outside knowledge?
The answers to all of the above questions must be Yes."""
)
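For completeness, this is roughly how the pieces above get wired into the generation run (a sketch only; generator_llm, generator_embeddings, and kg come from the setup sketch earlier, query_answer_generation_prompt is the attribute name in my RAGAs version, and I'm assuming the multi-hop synthesizer exposes an analogous attribute and that the import paths match your version):

from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import (
    MultiHopSpecificQuerySynthesizer,
    SingleHopSpecificQuerySynthesizer,
)

# The custom extractor replaces the stock NERExtractor when the knowledge graph is built,
# e.g. apply_transforms(kg, [BetterNERExtractor(llm=generator_llm), ...other transforms...]).

single_hop = SingleHopSpecificQuerySynthesizer(llm=generator_llm)
single_hop.query_answer_generation_prompt.instruction = custom_single_hop_instruction

multi_hop = MultiHopSpecificQuerySynthesizer(llm=generator_llm)
# Assumption: the multi-hop synthesizer is patched the same way.
multi_hop.query_answer_generation_prompt.instruction = custom_multi_hop_instruction

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings, knowledge_graph=kg)
testset = generator.generate(
    testset_size=50,
    query_distribution=[(single_hop, 0.5), (multi_hop, 0.5)],
)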
Wrap-up
The approach to both problems was quite effective, and I'm currently using this modified version of the code. However, I'm curious whether these somewhat arbitrary changes are objectively appropriate.
How can I objectively show that one knowledge graph, or one test set, is of higher quality than another? Is there a suitable methodology or theory for explaining this?
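For concreteness, the kind of before/after check I currently have in mind for the graph side looks like the sketch below (it assumes the knowledge graphs were saved with KnowledgeGraph.save() before and after the change, that relationships expose source/target nodes, and that entities live in each node's "entities" property; kg_before.json and kg_after.json are hypothetical file names):

from collections import Counter

import networkx as nx
from ragas.testset.graph import KnowledgeGraph


def graph_stats(path: str) -> None:
    kg = KnowledgeGraph.load(path)

    # Rebuild the graph structure in networkx to measure connectivity.
    g = nx.Graph()
    g.add_nodes_from(node.id for node in kg.nodes)
    g.add_edges_from((rel.source.id, rel.target.id) for rel in kg.relationships)

    component_sizes = sorted((len(c) for c in nx.connected_components(g)), reverse=True)
    entity_counts = Counter(
        entity for node in kg.nodes for entity in (node.properties.get("entities") or [])
    )

    print(f"{path}: {len(kg.nodes)} nodes, {len(kg.relationships)} relationships")
    print(f"  largest component sizes: {component_sizes[:5]}")
    print(f"  most frequent entities: {entity_counts.most_common(10)}")


graph_stats("kg_before.json")
graph_stats("kg_after.json")

Component sizes and entity-frequency skew at least give me numbers to compare before and after the change, but I'm not sure whether that amounts to a principled quality measure.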