100 | 100 | "from unstructured.partition.html import partition_html\n",
101 | 101 | "from unstructured.chunking.title import chunk_by_title\n",
102 | 102 | "\n",
103 |     | - "import os\n",
104 |     | - "co = cohere.Client(os.environ[\"COHERE_API_KEY\"])"
    | 103 | + "co = cohere.Client(\"COHERE_API_KEY\")"
105 | 104 | ]
106 | 105 | },
107 | 106 | {
330 | 329 | },
331 | 330 | {
332 | 331 | "cell_type": "code",
333 |     | - "execution_count": 28,
    | 332 | + "execution_count": 53,
334 | 333 | "metadata": {},
335 | 334 | "outputs": [
336 | 335 | {
410 | 409 | " response = co.chat(message=message, search_queries_only=True)\n",
411 | 410 | "\n",
412 | 411 | " # If there are search queries, retrieve documents and respond\n",
    | 412 | + " preamble_override = \"You only answer questions using the documents you have been provided with.\"\n",
    | 413 | + " \n",
413 | 414 | " if response.search_queries:\n",
414 | 415 | " print(\"Retrieving information...\")\n",
415 | 416 | "\n",
416 | 417 | " documents = self.retrieve_docs(response)\n",
417 | 418 | "\n",
418 | 419 | " response = co.chat(\n",
419 | 420 | " message=message,\n",
    | 421 | + " preamble_override=preamble_override,\n",
420 | 422 | " documents=documents,\n",
421 | 423 | " conversation_id=self.conversation_id,\n",
422 | 424 | " stream=True,\n",

428 | 430 | " # If there is no search query, directly respond\n",
429 | 431 | " else:\n",
430 | 432 | " response = co.chat(\n",
431 |     | - " message=message, \n",
    | 433 | + " message=message,\n",
    | 434 | + " preamble_override=preamble_override,\n",
432 | 435 | " conversation_id=self.conversation_id, \n",
433 | 436 | " stream=True\n",
434 | 437 | " )\n",
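The preamble_override argument added above swaps out the model's default system preamble, which is how this change constrains answers to the retrieved documents. A minimal sketch of the same idea in isolation, assuming the client co created earlier and this (v4-era) Cohere Python SDK where co.chat accepts preamble_override and a documents list; the document contents here are illustrative:

```python
preamble = "You only answer questions using the documents you have been provided with."

# Hypothetical single-turn call showing the parameter on its own
response = co.chat(
    message="What is a sentence embedding?",
    preamble_override=preamble,
    documents=[
        {"title": "Text Embeddings",
         "text": "A sentence embedding associates a vector with every sentence."},
    ],
)
print(response.text)
```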
475 | 478 | },
476 | 479 | {
477 | 480 | "cell_type": "code",
478 |     | - "execution_count": 37,
    | 481 | + "execution_count": 50,
479 | 482 | "metadata": {},
480 | 483 | "outputs": [
481 | 484 | {
719 | 722 | },
720 | 723 | {
721 | 724 | "cell_type": "code",
722 |     | - "execution_count": 38,
    | 725 | + "execution_count": 56,
723 | 726 | "metadata": {},
724 | 727 | "outputs": [
725 | 728 | {
762 | 765 | "name": "stdout",
763 | 766 | "output_type": "stream",
764 | 767 | "text": [
765 |     | - "User: embedding\n",
    | 768 | + "User: hello\n",
    | 769 | + "Chatbot:\n",
    | 770 | + "Hello to you as well! How can I help you today?\n",
    | 771 | + "\n",
    | 772 | + "Let me know if there's something specific you would like to discuss or any questions you have and I'll do my best to assist you.\n",
    | 773 | + "----------------------------------------------------------------------------------------------------\n",
    | 774 | + "\n",
    | 775 | + "User: what is the difference between word and sentence embeddings\n",
766 | 776 | "Chatbot:\n",
767 | 777 | "Retrieving information...\n",
768 |     | - "Embedding is a way to locate each sentence in space, in a way that similar sentences are located close by. It associates each sentence with a particular list of numbers (a vector). Word and sentence embeddings are the bread and butter of language models. In Chapter 9 of the Cohere documentation, there is an example of trying to fit a French sentence (\"Bonjour, comment ça va?\") into an embedding and the struggle it will have to understand that it should be close to the English sentence (\"Hello, how are you?\"). Cohere has trained a large multilingual model to unify many languages into one and be able to understand text in all those languages.\n",
    | 778 | + "Word embeddings and sentence embeddings are the fundamental components of LLMs and transform language into computer-readable numbers. \n",
    | 779 | + "\n",
    | 780 | + "Word embeddings associate words with lists of numbers (vectors) in a way that similar words are grouped close together. Sentence embeddings do the same thing but for sentences. It associates vectors to every sentence.\n",
769 | 781 | "\n",
770 | 782 | "CITATIONS:\n",
771 |     | - "{'start': 22, 'end': 51, 'text': 'locate each sentence in space', 'document_ids': ['doc_1']}\n",
772 |     | - "{'start': 67, 'end': 106, 'text': 'similar sentences are located close by.', 'document_ids': ['doc_1']}\n",
773 |     | - "{'start': 110, 'end': 168, 'text': 'associates each sentence with a particular list of numbers', 'document_ids': ['doc_1']}\n",
774 |     | - "{'start': 169, 'end': 179, 'text': '(a vector)', 'document_ids': ['doc_1']}\n",
775 |     | - "{'start': 181, 'end': 254, 'text': 'Word and sentence embeddings are the bread and butter of language models.', 'document_ids': ['doc_2']}\n",
776 |     | - "{'start': 258, 'end': 295, 'text': 'Chapter 9 of the Cohere documentation', 'document_ids': ['doc_0']}\n",
777 |     | - "{'start': 320, 'end': 351, 'text': 'trying to fit a French sentence', 'document_ids': ['doc_0']}\n",
778 |     | - "{'start': 402, 'end': 489, 'text': 'the struggle it will have to understand that it should be close to the English sentence', 'document_ids': ['doc_0']}\n",
779 |     | - "{'start': 515, 'end': 593, 'text': 'Cohere has trained a large multilingual model to unify many languages into one', 'document_ids': ['doc_0']}\n",
780 |     | - "{'start': 598, 'end': 648, 'text': 'be able to understand text in all those languages.', 'document_ids': ['doc_0']}\n",
    | 783 | + "{'start': 0, 'end': 15, 'text': 'Word embeddings', 'document_ids': ['doc_0', 'doc_1', 'doc_2']}\n",
    | 784 | + "{'start': 20, 'end': 39, 'text': 'sentence embeddings', 'document_ids': ['doc_0', 'doc_1', 'doc_2']}\n",
    | 785 | + "{'start': 48, 'end': 78, 'text': 'fundamental components of LLMs', 'document_ids': ['doc_2']}\n",
    | 786 | + "{'start': 83, 'end': 133, 'text': 'transform language into computer-readable numbers.', 'document_ids': ['doc_2']}\n",
    | 787 | + "{'start': 136, 'end': 255, 'text': 'Word embeddings associate words with lists of numbers (vectors) in a way that similar words are grouped close together.', 'document_ids': ['doc_0']}\n",
    | 788 | + "{'start': 256, 'end': 312, 'text': 'Sentence embeddings do the same thing but for sentences.', 'document_ids': ['doc_0', 'doc_1']}\n",
    | 789 | + "{'start': 316, 'end': 353, 'text': 'associates vectors to every sentence.', 'document_ids': ['doc_0', 'doc_1']}\n",
781 | 790 | "\n",
782 | 791 | "\n",
783 | 792 | "DOCUMENTS:\n",
784 |     | - "{'id': 'doc_1', 'text': 'In the previous chapter, we learned that sentence embeddings are the bread and butter of language models, as they associate each sentence with a particular list of numbers (a vector), in a way that similar sentences give similar vectors. We can think of embeddings as a way to locate each sentence in space (a high dimensional space, but a space nonetheless), in a way that similar sentences are located close by. Once we have each sentence somewhere in space, it’s natural to think of distances betw', 'title': 'Similarity Between Words and Sentences', 'url': 'https://docs.cohere.com/docs/similarity-between-words-and-sentences'}\n",
785 |     | - "{'id': 'doc_2', 'text': 'Text Embeddings\\n\\nWord and sentence embeddings are the bread and butter of language models. This chapter shows a very simple introduction to what they are.', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}\n",
786 |     | - "{'id': 'doc_0', 'text': 'Most word and sentence embeddings are dependent on the language that the model is trained on. If you were to try to fit the French sentence “Bonjour, comment ça va?” (meaning: hello, how are you?) in the embedding from the previous section, it will struggle to understand that it should be close to the sentence “Hello, how are you?” in English. For the purpose of unifying many languages into one, and being able to understand text in all these languages, Cohere has trained a large multilingual mod', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}\n",
    | 793 | + "{'id': 'doc_0', 'text': 'In the previous chapters, you learned about word and sentence embeddings and similarity between words and sentences. In short, a word embedding is a way to associate words with lists of numbers (vectors) in such a way that similar words are associated with numbers that are close by, and dissimilar words with numbers that are far away from each other. A sentence embedding does the same thing, but associating a vector to every sentence. Similarity is a way to measure how similar two words (or sent', 'title': 'The Attention Mechanism', 'url': 'https://docs.cohere.com/docs/the-attention-mechanism'}\n",
    | 794 | + "{'id': 'doc_1', 'text': 'This is where sentence embeddings come into play. A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers, in a coherent way. By coherent, I mean that it satisfies similar properties as a word embedding. For instance, similar sentences are assigned to similar vectors, different sentences are assigned to different vectors, and most importantly, each of the coordinates of the vector identifies some (whether clear or obscure) property of', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}\n",
    | 795 | + "{'id': 'doc_2', 'text': 'Conclusion\\n\\nWord and sentence embeddings are the bread and butter of LLMs. They are the basic building block of most language models, since they translate human speak (words) into computer speak (numbers) in a way that captures many relations between words, semantics, and nuances of the language, into equations regarding the corresponding numbers.', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}\n",
    | 796 | + "\n",
    | 797 | + "----------------------------------------------------------------------------------------------------\n",
    | 798 | + "\n",
    | 799 | + "User: continue\n",
    | 800 | + "Chatbot:\n",
    | 801 | + "In some models, the same underlying architecture is used for both word and sentence embeddings. Word embeddings generate vector representations for individual words, while sentence embeddings generate vector representations for entire sentences or phrases.\n",
    | 802 | + "\n",
    | 803 | + "Here's a simple analogy: \n",
    | 804 | + "\n",
    | 805 | + "Word embeddings are like individuals creating unique fingerprints, identifying unique characteristics. Sentence embeddings are like creating a unique fingerprint for each sentence or phrase. \n",
    | 806 | + "\n",
    | 807 | + "Although the concept is similar, the processes are different as word embeddings focus on individual words, while sentence embeddings focus on the entire sentence and capture the overall meaning or context.\n",
    | 808 | + "----------------------------------------------------------------------------------------------------\n",
    | 809 | + "\n",
    | 810 | + "User: what do you know about graph neural networks\n",
    | 811 | + "Chatbot:\n",
    | 812 | + "Retrieving information...\n",
    | 813 | + "I cannot find any specific information on Graph Neural Networks, however, I can provide some information on Transformer Models which are another type of neural network. \n",
    | 814 | + "\n",
    | 815 | + " Transformer models are a type of neural network that utilizes attention mechanisms to parse input sequences into multiple layers, with each layer assigning attention weights to the previous layer. Introduced in the paper \"Attention is All You Need\", they have become one of the key components in many NLP applications and are highly effective due to their ability to handle long-range dependencies and capture contextual information.\n",
787 | 816 | "\n",
    | 817 | + "Would you like me to provide more information on Transformer Models or explain other types of neural networks?\n",
788 | 818 | "----------------------------------------------------------------------------------------------------\n",
789 | 819 | "\n",
790 | 820 | "Ending chat.\n"
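A note on the transcript above: each CITATIONS entry's 'start'/'end' values are character offsets into the generated reply, so the cited span can be recovered by slicing. A minimal sketch using the first citations from the transcript (variable names are illustrative):

```python
# Reply text and citations taken from the transcript above
text = (
    "Word embeddings and sentence embeddings are the fundamental components "
    "of LLMs and transform language into computer-readable numbers."
)
citations = [
    {"start": 0, "end": 15, "text": "Word embeddings", "document_ids": ["doc_0", "doc_1", "doc_2"]},
    {"start": 20, "end": 39, "text": "sentence embeddings", "document_ids": ["doc_0", "doc_1", "doc_2"]},
    {"start": 48, "end": 78, "text": "fundamental components of LLMs", "document_ids": ["doc_2"]},
]

# Recover each cited span by slicing the reply with its offsets
for c in citations:
    assert text[c["start"]:c["end"]] == c["text"]
    print(f"{c['text']!r} <- {c['document_ids']}")
```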
799 | 829 | "app = App(chatbot)\n",
800 | 830 | "\n",
801 | 831 | "# Run the chatbot\n",
802 |     | - "app.run()\n"
    | 832 | + "app.run()"
803 | 833 | ]
804 | 834 | },
805 | 835 | {
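The App.run loop that produces the "User:"/"Chatbot:" transcript above is not part of this hunk. A minimal sketch of what such a loop might look like, inferred from the transcript's prompts, separators, and "Ending chat." message; the Chatbot interface (a respond method that prints the streamed reply) is an assumption, not the notebook's actual class:

```python
class App:
    def __init__(self, chatbot):
        self.chatbot = chatbot

    def run(self):
        # Prompt until the user types "quit", mirroring the transcript's "Ending chat."
        while True:
            message = input("User: ")
            if message.lower() == "quit":
                print("Ending chat.")
                break
            print("Chatbot:")
            self.chatbot.respond(message)  # hypothetical method that streams the reply
            print("-" * 100)
```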