Added testset generation for bedrock (#626)

VpkPrasanna · shahules786 · web-flow · commit 402dc7eb3f6d · 2024-02-18T12:58:31.000-08:00
Testset generation using bedrock model and embeddings

---------

Co-authored-by: Shahules786 <Shahules786@gmail.com>
diff --git a/docs/howtos/customisations/aws-bedrock.ipynb b/docs/howtos/customisations/aws-bedrock.ipynb
@@ -9,7 +9,10 @@
     "\n",
     "Amazon Bedrock is a fully managed service that makes FMs from leading AI startups and Amazon available via an API, so you can choose from a wide range of FMs to find the model that is best suited for your use case.\n",
     "\n",
-    "This tutorial will show you how to use Amazon Bedrock endpoints and LangChain."
+    "This tutorial will show you how to use Amazon Bedrock with Ragas.\n",
+    "\n",
+    "1. [Metrics](#metrics)\n",
+    "2. [Testset generation](#test-data-generation)"
    ]
   },
   {
@@ -22,6 +25,14 @@
     ":::"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "f466494a",
+   "metadata": {},
+   "source": [
+    "## Metrics"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "e54b5e01",
@@ -330,6 +341,143 @@
     "df.head()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b133aff0",
+   "metadata": {},
+   "source": [
+    "## Test Data Generation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4c7192f2",
+   "metadata": {},
+   "source": [
+    "Load the documents using your preferred document loader."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "529266ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import UnstructuredURLLoader\n",
+    "\n",
+    "urls = [\n",
+    "    \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-8-2023\",\n",
+    "    \"https://www.understandingwar.org/backgrounder/russian-offensive-campaign-assessment-february-9-2023\",\n",
+    "]\n",
+    "loader = UnstructuredURLLoader(urls=urls)\n",
+    "documents = loader.load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "87587749",
+   "metadata": {},
+   "source": [
+    "Now we have the documents in the form of LangChain `Document` objects.\n",
+    "The next step is to wrap the embedding and LLM models in the Ragas wrapper classes."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1d5eaed2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.llms import LangchainLLMWrapper\n",
+    "from ragas.embeddings.base import LangchainEmbeddingsWrapper\n",
+    "\n",
+    "bedrock_model = LangchainLLMWrapper(bedrock_model)\n",
+    "bedrock_embeddings = LangchainEmbeddingsWrapper(bedrock_embeddings)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d7d17468",
+   "metadata": {},
+   "source": [
+    "The next step is to create chunks from the documents and store them in an `InMemoryDocumentStore`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e717c13",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.testset.extractor import KeyphraseExtractor\n",
+    "from langchain.text_splitter import TokenTextSplitter\n",
+    "from ragas.testset.docstore import InMemoryDocumentStore\n",
+    "\n",
+    "splitter = TokenTextSplitter(chunk_size=1000, chunk_overlap=100)\n",
+    "keyphrase_extractor = KeyphraseExtractor(llm=bedrock_model)\n",
+    "\n",
+    "docstore = InMemoryDocumentStore(\n",
+    "    splitter=splitter,\n",
+    "    embeddings=bedrock_embeddings,\n",
+    "    extractor=keyphrase_extractor,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7773f4b5",
+   "metadata": {},
+   "source": [
+    "Initialize `TestsetGenerator` with the required arguments and generate the test data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "495ff805",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ragas.testset import TestsetGenerator\n",
+    "from ragas.testset.evolutions import simple, reasoning, multi_context\n",
+    "\n",
+    "test_generator = TestsetGenerator(\n",
+    "    generator_llm=bedrock_model,\n",
+    "    critic_llm=bedrock_model,\n",
+    "    embeddings=bedrock_embeddings,\n",
+    "    docstore=docstore,\n",
+    ")\n",
+    "\n",
+    "distributions = {simple: 0.5, reasoning: 0.25, multi_context: 0.25}\n",
+    "\n",
+    "# use test_generator.generate_with_llamaindex_docs if you use llama-index as document loader\n",
+    "testset = test_generator.generate_with_langchain_docs(\n",
+    "    documents=documents, test_size=10, distributions=distributions\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8a80046b",
+   "metadata": {},
+   "source": [
+    "Export the results to a pandas DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b4633c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_df = testset.to_pandas()\n",
+    "test_df.head()"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "f668fce1",