Commit 302ab1b

[Docs] - Colbert Intro, Example, and Notebook (#401)
* scaffold
* Created using Colab
* push-notebook
* intro
* content
* struct
* more-content
* astra-advantages
* cleanup
* cleanup
* adc
* vector-index
* comparison
* rank
1 parent 8bbb1a6 commit 302ab1b

File tree

6 files changed: +3015 -1 lines changed

docs/modules/ROOT/nav.adoc

Lines changed: 5 additions & 0 deletions
@@ -13,6 +13,10 @@
 * xref:default-architecture:retrieval.adoc[]
 * xref:default-architecture:generation.adoc[]

+.ColBERT
+* xref:colbert:index.adoc[]
+* xref:examples:colbert.adoc[]
+
 .Introduction to RAG
 * xref:intro-to-rag:index.adoc[]
 * xref:intro-to-rag:indexing.adoc[]
@@ -25,6 +29,7 @@

 .RAGStack Examples
 * xref:examples:index.adoc[]
+* xref:examples:colbert.adoc[]
 * xref:examples:langchain_multimodal_gemini.adoc[]
 * xref:examples:nvidia_embeddings.adoc[]
 * xref:examples:hotels-app.adoc[]

docs/modules/ROOT/pages/packages.adoc

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ Additional LLamaIndex packages should work out of the box, although you need to

 The `colbert` module provides a vanilla implementation for ColBERT retrieval. It is not tied to any specific framework and can be used with any of the RAGStack packages.

-If you want to use ColBERT with LangChain or LLamaIndex, you can use the following the extras:
+If you want to use ColBERT with LangChain or LLamaIndex, you can use the following extras:

 . `ragstack-ai-langchain[colbert]`
 . `ragstack-ai-llamaindex[colbert]`

docs/modules/colbert/pages/index.adoc

Lines changed: 87 additions & 0 deletions

= Introduction to ColBERT

ColBERT stands for "Contextualized Late Interaction over BERT".

"Contextualized Late Interaction" describes a unique method of interacting with the BERT language model, introduced by Stanford University researchers in the https://arxiv.org/abs/2004.12832[ColBERT paper]{external-link-icon}.

ColBERT is a machine learning retrieval model that improves the computational efficiency and contextual depth of information retrieval tasks.

*TL;DR:*

1. BERT embeds text chunks as matrices of token-level vectors, enabling much deeper context matching than a single vector embedding per chunk (see the sketch after this list).
2. BERT manages this additional depth by pre-processing documents and queries into uniform lengths with the https://huggingface.co/learn/nlp-course/en/chapter6/6[WordPiece]{external-link-icon} tokenizer, which is ideal for batch processing on GPUs.
3. "Contextualized Late Interaction" first retrieves the top-k chunks with the highest similarity scores to the query tokens.
The top-k chunks are then re-ranked: every query token is compared to every token in each chunk, and the chunks are ordered by their aggregate similarity scores.
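
For illustration only, the following minimal sketch uses the Hugging Face `transformers` library directly (not the RAGStack API) to show how a chunk becomes a matrix of token-level vectors padded to a uniform length. Note that ColBERT additionally projects each token vector down to 128 dimensions.

[source,python]
----
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

chunk = "Arctic plants have developed unique adaptations to endure the extreme climate."

# WordPiece-tokenize, then pad/truncate to a fixed length so chunks can be batched.
inputs = tokenizer(chunk, return_tensors="pt", padding="max_length", max_length=32, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token, not one vector per chunk.
token_matrix = outputs.last_hidden_state[0]
print(token_matrix.shape)  # torch.Size([32, 768])
----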

See the xref:examples:colbert.adoc[ColBERT example code] to get started using ColBERT with RAGStack and Astra DB.

== RAGStack-ai-colbert packages

`ragstack-ai-colbert` contains the implementation of ColBERT retrieval.

The `colbert` module provides a vanilla implementation for ColBERT retrieval. It is not tied to any specific framework and can be used with any of the RAGStack packages.

To use ColBERT with LangChain or LLamaIndex, install ColBERT as an extra:

* `ragstack-ai-langchain[colbert]`
* `ragstack-ai-llamaindex[colbert]`
== How is ColBERT different from RAG?

In common RAG usage, a standard embedding model represents each chunk as a single dense vector embedding.
A cosine similarity search matches the query embedding against the document embeddings, and the top-k results are returned.
This is fast and straightforward, but much of the token-level context is lost when a whole chunk is collapsed into one vector.

In the ColBERT model, each chunk is represented as a list of token-level embedding vectors, often described as a multi-vector or "late interaction" representation.
This per-token "bag of words" within a chunk offers far deeper context than a single vector per chunk.
Document embeddings are pre-computed and indexed with a uniform length to facilitate batch processing.

ColBERT queries are performed in two stages:

1. The query is embedded into token-level vectors, and an Approximate Nearest Neighbor (ANN) search compares every query token vector to the indexed chunk token vectors.
Recall that the BERT context chunks have embeddings for each token, so this is a token-by-token comparison.
The closest matches are returned as the top-k chunks.
2. Contextualized Late Interaction ranks the top-k chunks by a fine-grained similarity score.
For each query token embedding, the score function takes the maximum dot product between that query token vector and all of the token embeddings in the chunk. The sum of these maximum scores across all query tokens is the overall similarity score of that chunk (a minimal sketch of this "MaxSim" scoring follows below).

A vector index in the database significantly improves the speed of this comparison.
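
The stage-2 scoring can be illustrated with a short, self-contained sketch. It uses NumPy and random vectors rather than the `ragstack-ai-colbert` API, so treat it as an illustration of the idea only.

[source,python]
----
import numpy as np

def maxsim_score(query_embeddings: np.ndarray, chunk_embeddings: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the maximum
    dot product against all chunk tokens, then sum over the query tokens."""
    similarity = query_embeddings @ chunk_embeddings.T  # shape: (query tokens, chunk tokens)
    return float(similarity.max(axis=1).sum())

# Toy data: 8 query tokens and 150 chunk tokens, each a 128-dimensional vector,
# matching the 128 dimensions produced by ColBERT v2.0.
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(8, 128))
chunk_tokens = rng.normal(size=(150, 128))

print(maxsim_score(query_tokens, chunk_tokens))
----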

== ColBERT, RAGStack, and Astra DB

The https://huggingface.co/colbert-ir/colbertv2.0[ColBERT v2.0]{external-link-icon} library transforms a text chunk into a matrix of token-level embeddings. The output is a 128-dimensional vector for each token in a chunk. This results in a two-dimensional matrix, which doesn't align with the current LangChain interface that outputs a list of floats.

To solve this problem, the `ragstack-ai-colbert` packages and extras include new classes for mapping token-level embedding vectors to the Astra DB vector database.

The https://github.com/datastax/ragstack-ai/blob/main/libs/colbert/ragstack_colbert/cassandra_vector_store.py#L20C7-L20C27[CassandraVectorStore]{external-link-icon} class extends the `BaseVectorStore` class to store and retrieve vector embeddings generated by ColBERT.

The Contextualized Late Interaction retrieval logic is defined in the https://github.com/datastax/ragstack-ai/blob/main/libs/colbert/ragstack_colbert/colbert_retriever.py[ColbertRetriever]{external-link-icon} class, which asynchronously retrieves and scores chunks.

Together, these classes enable ColBERT to be used with Astra DB and the RAGStack ecosystem. But why expend all this effort to use ColBERT with Astra DB?

== Advantages of ColBERT on Astra DB

Our testing with ColBERT has shown that it delivers significantly better recall than single-vector encodings, but this comes at the cost of a much larger dataset size.

For a dataset with 10 million passages (which is ~25% of the English-language Wikipedia), the OpenAI-v3-small model requires 61.44GB, while the ColBERT model requires 768GB.
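
These figures line up with a rough back-of-the-envelope calculation; as an assumption for illustration, take 4-byte floats, 1,536 dimensions for OpenAI-v3-small, and roughly 150 tokens per passage for ColBERT:

[source,plain]
----
OpenAI-v3-small: 10M passages x 1,536 dims x 4 bytes             = 61.44 GB
ColBERT v2.0:    10M passages x ~150 tokens x 128 dims x 4 bytes ≈ 768 GB
----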

Most vector indexes can't scale to this size due to problems with *index segmentation* and *memory footprint*.
Astra DB incorporates new vector indexing techniques to address these issues.

*Index segmentation* problems degrade query time. Most vector databases can't index data larger than available RAM in a single physical index, so larger logical indexes are created by splitting the dataset into memory-sized segments. The problem with this approach is that searching within a segment is a logarithmic-time operation in the number of vectors, while searching multiple segments and combining their results is linear in the number of segments. So, as your dataset grows past the maximum size of a single segment, query time quickly degrades.
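
As a very rough illustration (assuming per-segment search cost grows with log2 of the segment size, and ignoring constants and the cost of merging result lists):

[source,plain]
----
1 segment   of 100M vectors:  ~log2(1e8)      ≈ 27 units of work
10 segments of 10M vectors:   ~10 x log2(1e7) ≈ 230 units of work
----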

Astra DB has larger-than-memory index construction that allows over an order of magnitude more vectors in a single index segment.

*Memory footprint* problems are expensive. Most vector databases require memory proportional to the number of vectors to serve requests. This is done with either the original full-resolution vectors or quantized (compressed) vectors. But even 16x compression (which must be paired with appropriate reranking to avoid destroying recall) requires 48GB of RAM dedicated to just the compressed vectors of a ColBERT dataset of 10M passages (about 1.5B vectors). Adding other indexes, graph-index edge caching, and row caching can easily require expensive 128GB server instances.

Astra DB has fused Asymmetric Distance Computation (ADC) graph traversal that reduces the in-memory footprint of a vector index to near zero.

On top of these improvements, Astra DB preserves the non-blocking index structure and synchronous, real-time index updates that are the hallmark of Astra's non-vector indexes.

DataStax has open-sourced the underlying index technology as https://github.com/jbellis/jvector/[JVector]{external-link-icon}.

For more on the challenges of vector indexing at scale, see:

* https://stackoverflow.com/questions/2703432/what-are-segments-in-lucene[Segments in Lucene]{external-link-icon}
* https://thenewstack.io/why-vector-size-matters/[Why Vector Size Matters]{external-link-icon}

docs/modules/examples/pages/colbert.adoc

Lines changed: 153 additions & 0 deletions

= ColBERT in RAGStack with Astra

image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/RAGStackColBERT.ipynb"]

Use ColBERT, Astra DB, and RAGStack to:

. Create ColBERT embeddings
. Index embeddings on Astra
. Retrieve embeddings with RAGStack and Astra
. Use the LangChain ColBERT retriever plugin

== Prerequisites

Install the `ragstack-ai-colbert` package:
[source,bash]
----
pip install ragstack-ai-colbert
----

== Prepare data and create embeddings

. Prepare documents for chunking.
+
[source,python]
----
arctic_botany_dict = {
"Introduction to Arctic Botany": "Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.",
"Arctic Plant Adaptations": "Plants in the Arctic have developed unique adaptations to endure the extreme climate. Perennial growth, antifreeze proteins, and a short growth cycle are among the evolutionary solutions. These adaptations not only allow the plants to survive but also to reproduce in short summer months. Arctic plants often have small, dark leaves to absorb maximum sunlight, and some species grow in cushion or mat forms to resist cold winds. Understanding these adaptations provides insights into the resilience of Arctic flora.",
"The Tundra Biome": "The Arctic tundra is a vast, treeless biome where the subsoil is permanently frozen. Here, the vegetation is predominantly composed of dwarf shrubs, grasses, mosses, and lichens. The tundra supports a surprisingly rich biodiversity, adapted to its cold, dry, and windy conditions. The biome plays a crucial role in the Earth's climate system, acting as a carbon sink. However, it's sensitive to climate change, with thawing permafrost and shifting vegetation patterns.",
"Arctic Plant Biodiversity": "Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.",
"Climate Change and Arctic Flora": "Climate change poses a significant threat to Arctic botany, with rising temperatures, melting permafrost, and changing precipitation patterns. These changes can lead to shifts in plant distribution, phenology, and the composition of the Arctic flora. Some species may thrive, while others could face extinction. This dynamic is critical to understanding future Arctic ecosystems and their global impact, including feedback loops that may exacerbate global warming.",
"Research and Conservation in the Arctic": "Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.",
"Traditional Knowledge and Arctic Botany": "Indigenous peoples of the Arctic have a deep connection with the land and its plant life. Traditional knowledge, passed down through generations, includes the uses of plants for nutrition, healing, and materials. This body of knowledge is invaluable for both conservation and understanding the ecological relationships in Arctic ecosystems. Integrating traditional knowledge with scientific research enriches our comprehension of Arctic botany and enhances conservation strategies.",
"Future Directions in Arctic Botanical Studies": "The future of Arctic botany lies in interdisciplinary research, combining traditional knowledge with modern scientific techniques. As the Arctic undergoes rapid changes, understanding the ecological, cultural, and climatic dimensions of Arctic flora becomes increasingly important. Future research will need to address the challenges of climate change, explore the potential for Arctic plants in biotechnology, and continue to conserve this unique biome. The resilience of Arctic flora offers lessons in adaptation and survival relevant to global challenges."
}
arctic_botany_chunks = list(arctic_botany_dict.values())
----
+
. Start the ColBERT configuration and create embeddings.
+
[source,python]
----
from ragstack_colbert import ColbertEmbeddingModel, ChunkData

# Initialize the default ColBERT embedding model
colbert = ColbertEmbeddingModel()

chunks = [ChunkData(text=text, metadata={}) for text in arctic_botany_chunks]

embedded_chunks = colbert.embed_chunks(chunks=chunks, doc_id="arctic botany")
----
+
. Examine the embeddings.
+
[source,python]
----
assert len(embedded_chunks) == 8
----

== Create a vector store in Astra

. Ingest embeddings and create a vector store in Astra.
+
[source,python]
----
from ragstack_colbert import CassandraVectorStore
from getpass import getpass
import cassio

keyspace = "default_keyspace"
database_id = getpass("Enter your Astra Database Id:")
astra_token = getpass("Enter your Astra Token:")

cassio.init(token=astra_token, database_id=database_id, keyspace=keyspace)
session = cassio.config.resolve_session()

db = CassandraVectorStore(
    keyspace=keyspace,
    table_name="colbert_embeddings4",
    session=session,
)
----
+
. Create an index in Astra.
+
[source,python]
----
db.put_chunks(chunks=embedded_chunks, delete_existing=True)
----

== Retrieve embeddings from the Astra index

Create a RAGStack retriever and ask questions against the indexed embeddings.
The library handles:

* Embedding the query tokens
* Generating candidate documents using an Astra ANN search
* Max similarity scoring
* Ranking

[source,python]
----
import logging
import nest_asyncio
nest_asyncio.apply()

logging.getLogger('cassandra').setLevel(logging.ERROR)  # workaround to suppress logs
from ragstack_colbert import ColbertRetriever

retriever = ColbertRetriever(
    vector_store=db, embedding_model=colbert
)

answers = retriever.retrieve("What's artic botany", k=2)
for answer in answers:
    print(f"Rank: {answer.rank} Score: {answer.score} Text: {answer.data.text}\n")
----

.Result
[source,plain]
----
Rank: 1 Score: 35.225955963134766 Text: Arctic botany is the study of plant life in the Arctic, a region characterized by extreme cold, permafrost, and minimal sunlight for much of the year. Despite these harsh conditions, a diverse range of flora thrives here, adapted to survive with minimal water, low temperatures, and high light levels during the summer. This introduction aims to shed light on the resilience and adaptation of Arctic plants, setting the stage for a deeper dive into the unique botanical ecosystem of the Arctic.
Rank: 2 Score: 29.655662536621094 Text: Research in Arctic botany is crucial for understanding the intricate balance of this ecosystem and the impacts of climate change. Scientists conduct studies on plant physiology, genetics, and ecosystem dynamics. Conservation efforts are focused on protecting the Arctic's unique biodiversity through protected areas, sustainable management practices, and international cooperation. These efforts aim to preserve the Arctic flora for future generations and maintain its role in the global climate system.
----

== LangChain retriever

Alternatively, use the ColBERT extra with your RAGStack package to retrieve documents.

. Install the RAGStack LangChain package with the ColBERT extra.
+
[source,bash]
----
pip install ragstack-ai-langchain[colbert]
----
+
. Run the LangChain retriever against the indexed embeddings.
+
[source,python]
----
from ragstack_langchain.colbert import ColbertLCRetriever
lc_retriever = ColbertLCRetriever(retriever, k=2)
docs = lc_retriever.get_relevant_documents("what kind fish lives shallow coral reefs atlantic, india ocean, red sea, gulf of mexico, pacific, and arctic ocean")
print(f"first answer: {docs[0].page_content}")
----
+
.Result
[source,plain]
----
first answer: Despite the challenging environment, the Arctic boasts a significant variety of plant species, each adapted to its niche. From the colorful blooms of Arctic poppies to the hardy dwarf willows, these plants form a complex ecosystem. The biodiversity of Arctic flora is vital for local wildlife, providing food and habitat. This diversity also has implications for Arctic peoples, who depend on certain plant species for food, medicine, and materials.
----
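
To go one step further, the retriever can be dropped into a LangChain chain. The following is a minimal sketch, not part of the original notebook: it assumes the `langchain-openai` package is installed and `OPENAI_API_KEY` is set, and the model name is only an example.

[source,python]
----
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": lc_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-3.5-turbo")
    | StrOutputParser()
)

print(chain.invoke("How do Arctic plants adapt to extreme cold?"))
----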

docs/modules/examples/pages/index.adoc

Lines changed: 4 additions & 0 deletions
@@ -71,6 +71,10 @@ a| image::https://colab.research.google.com/assets/colab-badge.svg[align="left",
 |===
 | Description | Colab | Documentation

+| Create ColBERT embeddings, index embeddings on Astra, and retrieve embeddings with RAGStack.
+a| image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/RAGStackColBERT.ipynb"]
+| xref:colbert.adoc[]
+
 | Implement a generative Q&A over your own documentation with {db-serverless} Search, OpenAI, and CassIO.
 a| image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/QA_with_cassio.ipynb"]
 | xref:qa-with-cassio.adoc[]