Commit ca39b1b

Fix experimental docs navigation, fix broken tutorials, improve examples for better user understanding (#2156)
This PR improves the experimental documentation with better organization, clearer navigation, and fixes to code examples throughout the tutorials.

### 🔄 Navigation & Structure Improvements

- **Renamed "Explanation" → "Core Concepts"** for clearer terminology
- **Reordered tutorials** to follow a logical learning progression:
  1. Prompt → RAG → Workflow → Agent (simple to complex)
- **Reordered core concepts** to match tutorial flow:
  1. Metrics → Datasets → Experimentation
- **Fixed mkdocs.yml path** from `src` to `ragas/src` for proper documentation generation

### 📝 Content & Code Fixes

- **Standardized API usage** across all examples (sketched below):
  - Changed `result` → `value` in `MetricResult` objects
  - Fixed `values` → `allowed_values` in metric definitions
  - Updated response handling in RAG examples
- **Enhanced tutorial clarity**:
  - Added setup instructions and prerequisites
  - Improved code explanations and comments
  - Added Quick Start sections for immediate testing
  - Fixed module import paths (`rag_evals` → `rag_eval`, etc.)
- **Improved error handling** in example code with proper API key validation
- Added experiment result printing for better debugging
- Improved error messages and user guidance

### 🎯 Impact

These changes make the experimental documentation more accessible to new users while providing a smoother learning experience that progresses from simple prompt evaluation to complex agent workflows. All examples have been tested and verified to work with the current API structure.
1 parent 922e4b7 commit ca39b1b
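For reference, the renamed metric API used throughout the updated examples looks roughly like this (a minimal sketch; the metric name and sample strings are illustrative, not taken from the diff):

```python
from ragas_experimental.metrics import discrete_metric
from ragas_experimental.metrics.result import MetricResult


# Previously: @discrete_metric(name=..., values=[...]) and MetricResult(result=...)
@discrete_metric(name="exact_match", allowed_values=["pass", "fail"])
def exact_match(prediction: str, actual: str):
    """Return 'pass' when the prediction matches the expected output exactly."""
    verdict = "pass" if prediction == actual else "fail"
    return MetricResult(value=verdict, reason=f"prediction={prediction!r}, actual={actual!r}")
```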

15 files changed (+108 −46 lines)
File renamed without changes.
File renamed without changes.

docs/experimental/explanation/index.md renamed to docs/experimental/core_concepts/index.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# 📚 Explanation
+# 📚 Core Concepts

 1. [Metrics](metrics.md)
 2. [Datasets and Experiment Results](datasets.md)
File renamed without changes.

docs/experimental/index.md

Lines changed: 2 additions & 2 deletions
@@ -15,11 +15,11 @@ The goal of Ragas Experimental is to evolve Ragas into a general-purpose evaluat

 [:octicons-arrow-right-24: Tutorials](tutorials/index.md)

-- 📚 **Explanation**
+- 📚 **Core Concepts**

 A deeper dive into the principles of evaluation and how Ragas Experimental supports evaluation-driven development for AI applications.

-[:octicons-arrow-right-24: Explanation](explanation/index.md)
+[:octicons-arrow-right-24: Core Concepts](core_concepts/index.md)

 </div>


docs/experimental/tutorials/agent.md

Lines changed: 7 additions & 8 deletions
@@ -23,7 +23,7 @@ We will start by testing our simple agent that can solve mathematical expression
 python -m ragas_examples.agent_evals.agent
 ```

-Next, we will write down a few sample expressions and expected outputs for our agent. Then convert them to a CSV file.
+Next, we will create a few sample expressions and expected outputs for our agent, then convert them to a CSV file.

 ```python
 import pandas as pd
@@ -38,7 +38,7 @@ df = pd.DataFrame(dataset)
 df.to_csv("datasets/test_dataset.csv", index=False)
 ```

-To evaluate the performance of our agent, we will define a non llm metric that compares if our agent's output is within a certain tolerance of the expected output and outputs 1/0 based on it.
+To evaluate the performance of our agent, we will define a non-LLM metric that compares if our agent's output is within a certain tolerance of the expected output and returns 1/0 based on the comparison.

 ```python
 from ragas_experimental.metrics import numeric_metric
@@ -50,7 +50,7 @@ def correctness_metric(prediction: float, actual: float):
     if isinstance(prediction, str) and "ERROR" in prediction:
         return 0.0
     result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
-    return MetricResult(result=result, reason=f"Prediction: {prediction}, Actual: {actual}")
+    return MetricResult(value=result, reason=f"Prediction: {prediction}, Actual: {actual}")
 ```

 Next, we will write the experiment loop that will run our agent on the test dataset and evaluate it using the metric, and store the results in a CSV file.
@@ -74,23 +74,22 @@ async def run_experiment(row):
         "expected_answer": expected_answer,
         "prediction": prediction.get("result"),
         "log_file": prediction.get("log_file"),
-        "correctness": correctness.result
+        "correctness": correctness.value
     }
 ```

 Now whenever you make a change to your agent, you can run the experiment and see how it affects the performance of your agent.

 ## Running the example end to end

-1. Setup your OpenAI API key
-
+1. Set up your OpenAI API key
 ```bash
 export OPENAI_API_KEY="your_api_key_here"
 ```
-2. Run the evaluation

+2. Run the evaluation
 ```bash
 python -m ragas_examples.agent_evals.evals
 ```

-Viola! You have successfully evaluated an AI agent using Ragas. You can now view the results by opening the `experiments/experiment_name.csv` file.
+Voilà! You have successfully evaluated an AI agent using Ragas. You can now view the results by opening the `experiments/experiment_name.csv` file.
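Assembled from the hunks above, the updated tolerance check for the agent tutorial reads roughly as follows; the surrounding `numeric_metric` decorator is omitted because its arguments are not shown in this diff:

```python
from ragas_experimental.metrics.result import MetricResult


# In the tutorial this function is wrapped with the numeric_metric decorator imported above.
def correctness_metric(prediction: float, actual: float):
    """Score 1.0 when the agent's answer is within 1e-5 of the expected answer, else 0.0."""
    if isinstance(prediction, str) and "ERROR" in prediction:
        return 0.0
    result = 1.0 if abs(prediction - actual) < 1e-5 else 0.0
    return MetricResult(value=result, reason=f"Prediction: {prediction}, Actual: {actual}")
```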

docs/experimental/tutorials/prompt.md

Lines changed: 23 additions & 7 deletions
@@ -11,11 +11,24 @@ flowchart LR

 We will start by testing a simple prompt that classifies movie reviews as positive or negative.

+First, make sure you have installed ragas examples and setup your OpenAI API key:
+
+```bash
+pip install ragas_experimental[examples]
+export OPENAI_API_KEY = "your_openai_api_key"
+```
+
+Now test the prompt:
+
 ```bash
 python -m ragas_examples.prompt_evals.prompt
 ```

-Next, we will write down few sample inputs and expected outputs for our prompt. Then convert them to a a csv file
+This will test the input `"The movie was fantastic and I loved every moment of it!"` and should output `"positive"`.
+
+> **💡 Quick Start**: If you want to see the complete evaluation in action, you can jump straight to the [end-to-end command](#running-the-example-end-to-end) that runs everything and generates the CSV results automatically.
+
+Next, we will write down few sample inputs and expected outputs for our prompt. Then convert them to a CSV file.

 ```python
 import pandas as pd
@@ -33,10 +46,10 @@ Now we need to have a way to measure the performance of our prompt in this task.
 from ragas_experimental.metrics import discrete_metric
 from ragas_experimental.metrics.result import MetricResult

-@discrete_metric(name="accuracy", values=["pass", "fail"])
+@discrete_metric(name="accuracy", allowed_values=["pass", "fail"])
 def my_metric(prediction: str, actual: str):
     """Calculate accuracy of the prediction."""
-    return MetricResult(result="pass", reason="") if prediction == actual else MetricResult(result="fail", reason="")
+    return MetricResult(value="pass", reason="") if prediction == actual else MetricResult(value="fail", reason="")
 ```

 Next, we will write the experiment loop that will run our prompt on the test dataset and evaluate it using the metric, and store the results in a csv file.
@@ -67,16 +80,19 @@ Now whenever you make a change to your prompt, you can run the experiment and se
 ## Running the example end to end

 1. Setup your OpenAI API key
-
 ```bash
 export OPENAI_API_KEY = "your_openai_api_key"
 ```
-
 2. Run the evaluation
-
 ```bash
 python -m ragas_examples.prompt_evals.evals
 ```

-Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file.
+This will:
+
+- Create the test dataset with sample movie reviews
+- Run the sentiment classification prompt on each sample
+- Evaluate the results using the accuracy metric
+- Export everything to a CSV file with the results

+Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file.
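The dataset-creation snippet that the first hunk leads into is outside this diff; a minimal sketch of what it might look like, assuming `text`/`label` columns and the `datasets/test_dataset.csv` path used in the agent tutorial:

```python
import pandas as pd

# Hypothetical samples; only the first review appears in this diff, the rest are illustrative.
dataset = [
    {"text": "The movie was fantastic and I loved every moment of it!", "label": "positive"},
    {"text": "A dull plot and wooden acting made this one hard to finish.", "label": "negative"},
]

df = pd.DataFrame(dataset)
df.to_csv("datasets/test_dataset.csv", index=False)  # path assumed, mirroring the agent tutorial
```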

docs/experimental/tutorials/rag.md

Lines changed: 6 additions & 9 deletions
@@ -41,7 +41,7 @@ from ragas_experimental.metrics import DiscreteMetric
 my_metric = DiscreteMetric(
     name="correctness",
     prompt = "Check if the response contains points mentioned from the grading notes and return 'pass' or 'fail'.\nResponse: {response} Grading Notes: {grading_notes}",
-    values=["pass", "fail"],
+    allowed_values=["pass", "fail"],
 )
 ```

@@ -60,8 +60,8 @@ async def run_experiment(row):

     experiment_view = {
         **row,
-        "response": response,
-        "score": score.result,
+        "response": response.get("answer", ""),
+        "score": score.value,
         "log_file": response.get("logs", " "),
     }
     return experiment_view
@@ -72,15 +72,12 @@ Now whenever you make a change to your RAG pipeline, you can run the experiment
 ## Running the example end to end

 1. Setup your OpenAI API key
-
 ```bash
-export OPENAI_API_KEY = "your_openai_api_key"
+export OPENAI_API_KEY="your_openai_api_key"
 ```
-
 2. Run the evaluation
-
 ```bash
-python -m ragas_examples.rag_evals.evals
+python -m ragas_examples.rag_eval.evals
 ```

-Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file
+Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file.
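The switch to `response.get("answer", "")` implies the RAG helper now returns a dict rather than a bare string; a minimal sketch of the shape the updated experiment view expects (the sample values are illustrative, and the helper itself is not part of this diff):

```python
# Shape implied by the updated experiment loop in rag.md.
response = {
    "answer": "Ragas Experimental is a toolkit for evaluating AI applications.",  # illustrative
    "logs": "logs/run_001.log",  # illustrative
}

experiment_view_fields = {
    "response": response.get("answer", ""),
    "log_file": response.get("logs", " "),
}
```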

docs/experimental/tutorials/workflow.md

Lines changed: 6 additions & 5 deletions
@@ -42,13 +42,14 @@ from ragas_experimental.metrics import DiscreteMetric
 my_metric = DiscreteMetric(
     name="response_quality",
     prompt="Evaluate the response based on the pass criteria: {pass_criteria}. Does the response meet the criteria? Return 'pass' or 'fail'.\nResponse: {response}",
-    values=["pass", "fail"],
+    allowed_values=["pass", "fail"],
 )
 ```

 Next, we will write the evaluation experiment loop that will run our workflow on the test dataset and evaluate it using the metric, and store the results in a CSV file.

 ```python
+from ragas_experimental import experiment

 @experiment()
 async def run_experiment(row):
@@ -65,7 +66,7 @@ async def run_experiment(row):
     experiment_view = {
         **row,
         "response": response.get("response_template", " "),
-        "score": score.result,
+        "score": score.value,
         "score_reason": score.reason,
     }
     return experiment_view
@@ -75,13 +76,13 @@ Now whenever you make a change to your workflow, you can run the experiment and

 ## Running the example end to end
 1. Setup your OpenAI API key
-
 ```bash
 export OPENAI_API_KEY="your_openai_api_key"
 ```

+2. Run the experiment
 ```bash
-python -m ragas_examples.workflow_evals.evals
+python -m ragas_examples.workflow_eval.evals
 ```

-Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file
+Voila! You have successfully run your first evaluation using Ragas. You can now inspect the results by opening the `experiments/experiment_name.csv` file.
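With the newly added `from ragas_experimental import experiment` import, the pieces of this tutorial's experiment loop fit together roughly as below; the workflow call and the metric score are stubbed out because they are defined outside this diff:

```python
from ragas_experimental import experiment
from ragas_experimental.metrics.result import MetricResult


async def run_workflow(row: dict) -> dict:
    # Stub standing in for the tutorial's real workflow call (not part of this diff).
    return {"response_template": f"Handled: {row.get('query', '')}"}


@experiment()
async def run_experiment(row):
    response = await run_workflow(row)
    # Stub score; in the tutorial this comes from evaluating my_metric on the response.
    score = MetricResult(value="pass", reason="stubbed for illustration")
    experiment_view = {
        **row,
        "response": response.get("response_template", " "),
        "score": score.value,
        "score_reason": score.reason,
    }
    return experiment_view
```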

experimental/ragas_examples/agent_evals/evals.py

Lines changed: 2 additions & 1 deletion
@@ -62,7 +62,8 @@ async def run_experiment(row):

 async def main():
     dataset = load_dataset()
-    _ = await run_experiment.arun(dataset)
+    experiment_result = await run_experiment.arun(dataset)
+    print("Experiment_result: ", experiment_result)


 if __name__ == "__main__":
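For context, `main()` in `evals.py` is an async function, so it is presumably driven by `asyncio.run` under the `__main__` guard shown here; the body of that guard is outside this diff, so the call below is an assumption:

```python
import asyncio

if __name__ == "__main__":
    asyncio.run(main())  # assumed; the guarded body is not shown in this hunk
```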
