Review the generated data in `evals/ground_truth.jsonl` after running that script, removing any question/answer pairs that don't seem like realistic user input.
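
If you prefer to scan the file programmatically rather than opening it in an editor, the minimal sketch below prints each pair for review. The `question` and `truth` field names are assumptions and may not match the generated file exactly.

```python
import json

# Minimal review sketch: print each generated question/answer pair so that
# unrealistic ones can be spotted and removed by hand.
# NOTE: the "question" and "truth" field names are assumptions and may differ
# from what the generation script actually writes.
with open("evals/ground_truth.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        print(f"--- entry {line_number} ---")
        print("Q:", record.get("question"))
        print("A:", record.get("truth"))
```
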
## Run bulk evaluation

Review the configuration in `evals/eval_config.json` to ensure that everything is set up correctly. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
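
As a quick sanity check before launching a long evaluation run, you can print the configuration that will be used. The sketch below assumes a `requested_metrics` key, following the ai-rag-chat-evaluator convention; the actual key names in this repository's config may differ.

```python
import json

# Quick sanity check of the evaluation configuration before starting a long run.
# NOTE: the "requested_metrics" key name follows the ai-rag-chat-evaluator
# convention and is an assumption here; adjust it if this config differs.
with open("evals/eval_config.json", encoding="utf-8") as f:
    config = json.load(f)

print("Requested metrics:", config.get("requested_metrics", "<not set>"))
print(json.dumps(config, indent=2))
```
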
For more details about how to run the chat API locally, see [Local Development with IntelliJ](local-development-intellij.md#running-the-spring-boot-chat-api-locally).

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, the TPM capacity of the evaluation model, and the number of GPT metrics requested.

> [!IMPORTANT]
> Ground truth data is generated using a knowledge graph created from the same search index used by the RAG flow. The approach is based on the [RAGAS evaluation framework](https://docs.ragas.io/en/stable/). To learn more about the data generation approach, see [Testset Generation for RAG](https://docs.ragas.io/en/stable/concepts/test_data_generation/rag/).

## Review the evaluation results

The evaluation script will output a summary of the evaluation results inside the `evals/results` directory.

The evaluation uses the following default metrics (as configured in `evaluate_config.json`), with results available in the `summary.json` file:

* **gpt_groundedness**: Measures how well the answer is grounded in the retrieved context. Returns a pass rate and mean rating (1-5 scale).
* **gpt_relevance**: Evaluates the relevance of the answer to the user's question. Returns a pass rate and mean rating (1-5 scale).
* **answer_length**: Tracks the length of generated answers in characters (mean, max, min values).
* **latency**: Measures response time in seconds for each question (mean, max, min values).
* **citations_matched**: Counts how many answers include properly matched citations from the source documents.
* **any_citation**: Tracks whether answers include any citations at all.

> [!IMPORTANT]
> **gpt_groundedness** and **gpt_relevance** are built-in metrics provided by the [Azure AI Evaluation SDK](https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/develop/evaluate-sdk).
>
> **answer_length**, **latency**, **citations_matched** and **any_citation** are custom metrics defined in [evaluate.py](../../evals/evaluate.py) or taken from the [ai-rag-chat-evaluator project](https://github.com/Azure-Samples/ai-rag-chat-evaluator/blob/main/src/evaltools/eval/evaluate_metrics/code_metrics.py).
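
If you want to compare runs programmatically rather than opening each file, the sketch below walks the run folders under `evals/results` and prints the contents of each `summary.json`; the exact structure of that file (metric name mapped to values such as a pass rate or a mean rating) is an assumption based on the metric descriptions above.

```python
import json
from pathlib import Path

# Sketch: print the metrics recorded in summary.json for every evaluation run
# found under evals/results.
# NOTE: the structure of summary.json (metric name mapped to values such as a
# pass rate or a mean rating) is assumed from the metric descriptions above
# and may differ slightly in practice.
results_dir = Path("evals/results")
for summary_path in sorted(results_dir.glob("*/summary.json")):
    print(f"=== {summary_path.parent.name} ===")
    with summary_path.open(encoding="utf-8") as f:
        summary = json.load(f)
    for metric_name, values in summary.items():
        print(f"  {metric_name}: {values}")
```
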
You can see a summary of results across all evaluation runs by running the following command: