
Commit a288656

mrtj, shahules786, jokokojote, and smodlich (Sebastian Modlich) authored
More robust JSON prompting and parsing (#807)
This PR implements a langchain-style, hopefully more robust output parsing as discussed in #761.

---------

Co-authored-by: Shahules786 <[email protected]>
Co-authored-by: Felix Rothe <[email protected]>
Co-authored-by: Sebastian Modlich <[email protected]>
Co-authored-by: Sebastian Modlich <[email protected]>
Co-authored-by: Wang Jian <[email protected]>
Co-authored-by: Daniel Camejo <[email protected]>
1 parent 036ac97 commit a288656

25 files changed, +956 −732 lines changed
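To illustrate the general idea behind "langchain-style" output parsing, here is a minimal sketch of the pattern only — not the code merged in this commit. The helper name and its specific salvage steps are hypothetical.

```python
import json
import re
from typing import Any, Optional


def parse_json_markdown(text: str) -> Optional[Any]:
    """Hypothetical helper: salvage a JSON object from raw LLM output.

    Handles common failure modes: fenced ```json blocks, surrounding prose,
    and trailing commas. Returns None when nothing parseable is recovered.
    """
    # Prefer the contents of a fenced ``` block when one is present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fence.group(1) if fence else text

    # Narrow down to the outermost {...} span.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        return None
    candidate = candidate[start : end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Last-ditch cleanup: drop trailing commas before } or ].
        cleaned = re.sub(r",\s*([}\]])", r"\1", candidate)
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            return None


raw = 'Sure, here is the JSON:\n```json\n{"question": "Where was the last Olympics held?",}\n```'
print(parse_json_markdown(raw))  # {'question': 'Where was the last Olympics held?'}
```

The point of the pattern is to treat model output as noisy text to be salvaged rather than as guaranteed-valid JSON.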
docs/concepts/prompts.md

Lines changed: 24 additions & 24 deletions

@@ -2,7 +2,7 @@
 
 Prompts play a crucial role in any language model-based framework and warrant more consideration than mere strings. A well-crafted prompt should include a clear task instruction, articulated in straightforward natural language, comprehensible to any language model. The objective is to compose prompts that are generalizable and do not overly specialize to a specific state of the language model. It's widely recognized that language models exhibit higher accuracy in few-shot scenarios as opposed to zero-shot contexts. To capitalize on this advantage, it is advisable to accompany each prompt with relevant examples.
 
-Prompts in ragas are defined using the `Prompt` class. Each prompt defined using this class will contain.
+Prompts in ragas are defined using the `Prompt` class. Each prompt defined using this class will contain.
 
 - `name`: a name given to the prompt. Used to save and identify the prompt.
 - `instruction`: The natural language description of the task to be carried out by the LLM
@@ -33,7 +33,7 @@ qa_prompt = Prompt(
     ],
     input_keys=["answer", "context"],
     output_key="output",
-    output_type="JSON",
+    output_type="json",
 )
 ```
 
@@ -46,64 +46,64 @@ This will create a Prompt class object with the given instruction, examples, and
 Prompt objects have the following methods that can be used when evaluating or formatting a prompt object.
 
 - `to_string(self)`
-
+
 This method will generate a prompt string from the given object. This string can be directly used as a formatted string with the metrics in the evaluation task.
-
+
 ```{code-block} python
 print(qa_prompt.to_string())
 ```
-
+
 ```
 Generate a question for the given answer
-
+
 answer: "The last Olympics was held in Tokyo, Japan."
 context: "The last Olympics was held in Tokyo, Japan. It is held every 4 years"
 output: {{"question": "Where was the last Olympics held?"}}
-
+
 answer: "It can change its skin color based on the temperature of its environment."
 context: "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment."
 output: {{"question": "What unique ability does the newly discovered species of frog have?"}}
-
+
 answer: {answer}
 context: {context}
 output:
 ```
-
+
 - `format(self, **kwargs)`
-
+
 This method will use the parameters passed as keyword arguments to format the prompt object and return a Langchain `PromptValue` object that can be directly used in the evaluation tasks.
-
+
 ```{code-block} python
 qa_prompt.format(answer="This is an answer", context="This is a context")
 ```
-
+
 ```{code-block} python
 PromptValue(prompt_str='Generate a question for the given answer\n\nanswer: "The last Olympics was held in Tokyo, Japan."\ncontext: "The last Olympics was held in Tokyo, Japan. It is held every 4 years"\noutput: {"question": "Where was the last Olympics held?"}\n\nanswer: "It can change its skin color based on the temperature of its environment."\ncontext: "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment."\noutput: {"question": "What unique ability does the newly discovered species of frog have?"}\n\nanswer: This is an answer\ncontext: This is a context\noutput: \n')
-
+
 ```
-
+
 
 - `save(self, cache_dir)`
-
-This method will save the prompt to the given cache_dir (default `~/.cache`) directory using the value in the `name` variable.
-
+
+This method will save the prompt to the given cache_dir (default `~/.cache`) directory using the value in the `name` variable.
+
 ```{code-block} python
 qa_prompt.save()
 ```
-
+
 The prompts are saved in JSON format to `~/.cache/ragas` by default. One can change this by setting the `RAGAS_CACHE_HOME` environment variable to the desired path. In this example, the prompt will be saved in `~/.cache/ragas/english/question_generation.json`
-
+
 - `_load(self, language, name, cache_dir)`
-
-This method will load the appropriate prompt from the saved directory.
-
+
+This method will load the appropriate prompt from the saved directory.
+
 ```{code-block} python
 from ragas.utils import RAGAS_CACHE_HOME
 Prompt._load(name="question_generation",language="english",cache_dir=RAGAS_CACHE_HOME)
 ```
-
+
 ```{code-block} python
 Prompt(name='question_generation', instruction='Generate a question for the given answer', examples=[{'answer': 'The last Olympics was held in Tokyo, Japan.', 'context': 'The last Olympics was held in Tokyo, Japan. It is held every 4 years', 'output': {'question': 'Where was the last Olympics held?'}}, {'answer': 'It can change its skin color based on the temperature of its environment.', 'context': 'A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.', 'output': {'question': 'What unique ability does the newly discovered species of frog have?'}}], input_keys=['answer', 'context'], output_key='output', output_type='JSON')
 ```
-
+
 The prompt was loaded from `.cache/ragas/english/question_generation.json`
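For reference, the `qa_prompt` whose tail appears in the second hunk above can be reconstructed from the `to_string()` and `_load()` outputs shown in this file; only the import path is an assumption.

```python
from ragas.llms.prompt import Prompt  # assumed import path

# reconstructed from the to_string()/_load() outputs shown in the diff above
qa_prompt = Prompt(
    name="question_generation",
    instruction="Generate a question for the given answer",
    examples=[
        {
            "answer": "The last Olympics was held in Tokyo, Japan.",
            "context": "The last Olympics was held in Tokyo, Japan. It is held every 4 years",
            "output": {"question": "Where was the last Olympics held?"},
        },
        {
            "answer": "It can change its skin color based on the temperature of its environment.",
            "context": "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.",
            "output": {"question": "What unique ability does the newly discovered species of frog have?"},
        },
    ],
    input_keys=["answer", "context"],
    output_key="output",
    output_type="json",
)
print(qa_prompt.to_string())
```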

docs/getstarted/evaluation.md

Lines changed: 8 additions & 12 deletions

@@ -3,10 +3,6 @@
 
 Once your test set is ready (whether you've created your own or used the [synthetic test set generation module](get-started-testset-generation)), it's time to evaluate your RAG pipeline. This guide assists you in setting up Ragas as quickly as possible, enabling you to focus on enhancing your Retrieval Augmented Generation pipelines while this library ensures that your modifications are improving the entire pipeline.
 
-<p align="left">
-<img src="../_static/imgs/ragas_workflow_white.png" alt="test-outputs" width="800" height="600" />
-</p>
-
 This guide utilizes OpenAI for running some metrics, so ensure you have your OpenAI key ready and available in your environment.
 
 ```python
@@ -21,26 +17,26 @@ Let's begin with the data.
 
 ## The Data
 
-For this tutorial, we'll use an example dataset that we created using example in [data preparation](./prepare_data.ipynb). The dataset contains the following columns:
+For this tutorial, we'll use an example dataset from one of the baselines we created for the [Amnesty QA](https://huggingface.co/datasets/explodinggradients/amnesty_qa) dataset. The dataset contains the following columns:
 
 - question: `list[str]` - These are the questions your RAG pipeline will be evaluated on.
 - context: `list[list[str]]` - The contexts which were passed into the LLM to answer the question.
-- ground_truth: `str` - The ground truth answer to the questions.
-- answer: `str` - The answer generated by the RAG pipeline.
+- ground_truth: `list[str]` - The ground truth answer to the questions.
 
 An ideal test data set should contain samples that closely mirror your real-world use case.
 
 ```{code-block} python
 :caption: import sample dataset
 from datasets import load_dataset
 
-dataset = load_dataset("explodinggradients/prompt-engineering-guide-papers","test_data")
-dataset["test"]
+# loading the V2 dataset
+amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
+amnesty_qa
 ```
 
 :::{seealso}
-See [test set generation](./testset_generation.md) to learn how to generate your own `Question/Ground_Truth` pairs for evaluation.
-See [dataset preparation](./prepare_data.ipynb) to learn how to prepare your own dataset for evaluation.
+See [test set generation](./testset_generation.md) to learn how to generate your own `Question/Context/Ground_Truth` triplets for evaluation.
+See [preparing your own dataset](/docs/howtos/applications/data_preparation.md) to learn how to prepare your own dataset for evaluation.
 :::
 
 ## Metrics
@@ -81,7 +77,7 @@ Running the evaluation is as simple as calling `evaluate` on the `Dataset` with
 from ragas import evaluate
 
 result = evaluate(
-    dataset["eval"],
+    amnesty_qa["eval"],
     metrics=[
         context_precision,
         faithfulness,
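Pieced together from the snippets above, the updated quickstart flow reads roughly as follows. This is only a sketch: the `ragas.metrics` import path and the final `print(result)` are assumptions rather than lines from the diff, and an OpenAI key must be set in the environment for the call to run.

```python
from datasets import load_dataset

from ragas import evaluate
from ragas.metrics import context_precision, faithfulness  # assumed import path

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")

# run the two metrics shown in the diff over the evaluation split
result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
    ],
)
print(result)  # assumed: a dict-like summary of the metric scores
```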

docs/getstarted/index.md

Lines changed: 1 addition & 10 deletions

@@ -6,7 +6,6 @@
 :hidden:
 install.md
 testset_generation.md
-prepare_data.md
 evaluation.md
 monitoring.md
 :::
@@ -27,15 +26,7 @@ Let's get started!
 :link: get-started-testset-generation
 :link-type: ref
 
-Learn how to generate high quality and diverse `Question/Ground_Truth` pairs to get started.
-:::
-
-:::{card} Prepare data for evaluation
-:link: dataset-preparation
-:link-type: ref
-
-Learn how to prepare a complete test dataset for evaluating using ragas metrics.
-
+Learn how to generate `Question/Context/Ground_Truth` triplets to get started.
 :::
 
 :::{card} Evaluate Using Your Testset

docs/getstarted/prepare_data.ipynb

Lines changed: 21 additions & 18 deletions

@@ -182,7 +182,7 @@
 ],
 "source": [
 "eval_dataset = load_dataset(\"explodinggradients/prompt-engineering-guide-papers\")\n",
-"eval_dataset = eval_dataset['test'].to_pandas()\n",
+"eval_dataset = eval_dataset[\"test\"].to_pandas()\n",
 "eval_dataset.head()"
 ]
 },
@@ -244,6 +244,7 @@
 "outputs": [],
 "source": [
 "import os\n",
+"\n",
 "PATH = \"./prompt-engineering-guide-papers\"\n",
 "os.environ[\"OPENAI_API_KEY\"] = \"your-open-ai-key\""
 ]
@@ -266,30 +267,32 @@
 "\n",
 "def build_query_engine(documents):\n",
 " vector_index = VectorStoreIndex.from_documents(\n",
-" documents, service_context=ServiceContext.from_defaults(chunk_size=512),\n",
+" documents,\n",
+" service_context=ServiceContext.from_defaults(chunk_size=512),\n",
 " )\n",
 "\n",
 " query_engine = vector_index.as_query_engine(similarity_top_k=3)\n",
 " return query_engine\n",
 "\n",
+"\n",
 "# Function to evaluate as Llama index does not support async evaluation for HFInference API\n",
 "def generate_responses(query_engine, test_questions, test_answers):\n",
-" responses = [query_engine.query(q) for q in test_questions]\n",
+" responses = [query_engine.query(q) for q in test_questions]\n",
 "\n",
-" answers = []\n",
-" contexts = []\n",
-" for r in responses:\n",
-" answers.append(r.response)\n",
-" contexts.append([c.node.get_content() for c in r.source_nodes])\n",
-" dataset_dict = {\n",
+" answers = []\n",
+" contexts = []\n",
+" for r in responses:\n",
+" answers.append(r.response)\n",
+" contexts.append([c.node.get_content() for c in r.source_nodes])\n",
+" dataset_dict = {\n",
 " \"question\": test_questions,\n",
 " \"answer\": answers,\n",
 " \"contexts\": contexts,\n",
-" }\n",
-" if test_answers is not None:\n",
-" dataset_dict[\"ground_truth\"] = test_answers\n",
-" ds = Dataset.from_dict(dataset_dict)\n",
-" return ds"
+" }\n",
+" if test_answers is not None:\n",
+" dataset_dict[\"ground_truth\"] = test_answers\n",
+" ds = Dataset.from_dict(dataset_dict)\n",
+" return ds"
 ]
 },
 {
@@ -299,8 +302,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"reader = SimpleDirectoryReader(PATH,num_files_limit=30, required_exts=[\".pdf\"])\n",
-"documents = reader.load_data()\n"
+"reader = SimpleDirectoryReader(PATH, num_files_limit=30, required_exts=[\".pdf\"])\n",
+"documents = reader.load_data()"
 ]
 },
 {
@@ -310,8 +313,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"test_questions = eval_dataset['question'].values.tolist()\n",
-"test_answers = eval_dataset['ground_truth'].values.tolist()"
+"test_questions = eval_dataset[\"question\"].values.tolist()\n",
+"test_answers = eval_dataset[\"ground_truth\"].values.tolist()"
 ]
 },
 {

docs/getstarted/testset_generation.md

Lines changed: 3 additions & 13 deletions

@@ -1,12 +1,7 @@
 (get-started-testset-generation)=
-# Generate synthetic test data
-The first roadblock to evaluating your RAG system is the lack of a test set. This tutorial guides you in creating a synthetic Question and Ground truth pairs for assessing your RAG pipeline. The key idea here that to synthesize a test set, we need to generate a set of questions and their corresponding ground truths. The ground truths are the expected answers to the questions.
-
-
-Once we have the Question/Ground truth pairs we can feed questions into your RAG to get the contexts and answers. We can then evaluate the RAG pipeline using any metrics of your choice.
-
-For this purpose, we will utilize OpenAI models. Ensure that your OpenAI API key is readily accessible within your environment.
+# Generate a Synthetic Test Set
 
+This tutorial guides you in creating a synthetic evaluation dataset for assessing your RAG pipeline. For this purpose, we will utilize OpenAI models. Ensure that your OpenAI API key is readily accessible within your environment.
 
 ```{code-block} python
 import os
@@ -71,9 +66,4 @@ testset.to_pandas()
 ```
 <p align="left">
 <img src="../_static/imgs/testset_output.png" alt="test-outputs" width="800" height="600" />
-</p>
-
-Now you have a synthetic test set ready for evaluation, which contains `question` and `ground_truth`.
-
-Next you can input these into your RAG to collect `contexts` and `answers` and evaluate your RAG pipeline using any metrics of your choice. Let's do a simple example using llama-index here. Check the [prepare your evaluation data](./prepare_data.ipynb) to know how.
-
+</p>
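For orientation, the test-set generation call this getting-started page leads into looks roughly like the one reformatted in the azure-openai notebook further down this diff. The sketch below assumes plain OpenAI models; the `TestsetGenerator` import path and the `ChatOpenAI`/`OpenAIEmbeddings` setup are assumptions rather than lines from this commit.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # assumed model setup

from ragas.testset.evolutions import simple, reasoning, multi_context
from ragas.testset.generator import TestsetGenerator  # assumed import path

# load a folder of documents to generate questions from
loader = DirectoryLoader("./papers/", use_multithreading=True, silent_errors=True)
documents = loader.load()
for document in documents:
    document.metadata["filename"] = document.metadata["source"]

generator = TestsetGenerator.from_langchain(
    generator_llm=ChatOpenAI(model="gpt-3.5-turbo"),
    critic_llm=ChatOpenAI(model="gpt-4"),
    embeddings=OpenAIEmbeddings(),
)
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
testset.to_pandas()
```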

docs/howtos/applications/custom_prompts.md

Lines changed: 2 additions & 2 deletions

@@ -4,7 +4,7 @@ This is a tutorial notebook that shows how to create and use custom prompts with
 
 **Dataset**
 
-Here I’m using a dataset from HuggingFace.
+Here I’m using a dataset from HuggingFace.
 
 ```{code-block} python
@@ -53,7 +53,7 @@ long_form_answer_prompt_new = Prompt(
     ],
     input_keys=["question", "answer"],
     output_key="statements",
-    output_type="JSON",
+    output_type="json",
 )
 ```

docs/howtos/customisations/azure-openai.ipynb

Lines changed: 14 additions & 6 deletions

@@ -453,11 +453,13 @@
 "from ragas.testset.evolutions import simple, reasoning, multi_context\n",
 "\n",
 "\n",
-"loader = DirectoryLoader(\"./2023-llm-papers/\", use_multithreading=True, silent_errors=True,sample_size=1)\n",
+"loader = DirectoryLoader(\n",
+" \"./2023-llm-papers/\", use_multithreading=True, silent_errors=True, sample_size=1\n",
+")\n",
 "documents = loader.load()\n",
 "\n",
 "for document in documents:\n",
-" document.metadata['filename'] = document.metadata['source']"
+" document.metadata[\"filename\"] = document.metadata[\"source\"]"
 ]
 },
 {
@@ -475,11 +477,17 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"generator = TestsetGenerator.from_langchain(generator_llm=azure_model,critic_llm=azure_model,embeddings=azure_embeddings)\n",
+"generator = TestsetGenerator.from_langchain(\n",
+" generator_llm=azure_model, critic_llm=azure_model, embeddings=azure_embeddings\n",
+")\n",
 "\n",
-"testset = generator.generate_with_langchain_docs(documents, test_size=10, \n",
-" raise_exceptions=False, with_debugging_logs=False,\n",
-" distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}) "
+"testset = generator.generate_with_langchain_docs(\n",
+" documents,\n",
+" test_size=10,\n",
+" raise_exceptions=False,\n",
+" with_debugging_logs=False,\n",
+" distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},\n",
+")"
 ]
 },
 {

docs/howtos/integrations/index.md

Lines changed: 0 additions & 1 deletion

@@ -12,7 +12,6 @@ langsmith.ipynb
 ragas-arize.ipynb
 langfuse.ipynb
 athina.ipynb
-openlayer.ipynb
 zeno.ipynb
 tonic-validate.ipynb
 ragas_haystack.ipynb

0 commit comments
