|
2 | 2 |
|
3 | 3 | Hybrid Knowledge Store combining vector similarity and edges between chunks.
|
4 | 4 |
|
5 |
| -## Documentation |
| 5 | +## Usage |
6 | 6 |
|
7 |
| -[DataStax RAGStack Documentation](https://docs.datastax.com/en/ragstack/docs/index.html) |
| 7 | +1. Pre-process your documents to populate `metadata` information. |
| 8 | +1. Create a Hybrid `KnowledgeStore` and add your LangChain `Document`s. |
| 9 | +1. Retrieve documents from the `KnowledgeStore`. |
8 | 10 |
|
9 |
| -[Quickstart](https://docs.datastax.com/en/ragstack/docs/quickstart.html) |
| 11 | +### Populate Metadata |
10 | 12 |
|
11 |
| -[Examples](https://docs.datastax.com/en/ragstack/docs/examples/index.html) |
| 13 | +The Knowledge Store makes use of the following metadata fields on each `Document`: |
| 14 | + |
| 15 | +- `content_id`: If assigned, this specifies the unique ID of the `Document`. |
| 16 | + If not assigned, one will be generated. |
| 17 | + Set this if you may re-ingest the same document, so that it is overwritten rather than duplicated. |
| 18 | +- `parent_content_id`: If this `Document` is a chunk of a larger document, you may reference the parent content here. |
| 19 | +- `keywords`: A list of strings representing keywords present in this `Document`. |
| 20 | +- `hrefs`: A list of strings containing the URLs which this `Document` links to. |
| 21 | +- `urls`: A list of strings containing the URLs associated with this `Document`. |
| 22 | + If one webpage is divided into multiple chunks, each chunk's `Document` would have the same URL. |
| 23 | + One webpage may have multiple URLs if it is available in multiple ways. |
| 24 | + |
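As a rough illustration of the fields above, the sketch below builds two chunks of one hypothetical web page. `Doc` is a stand-in dataclass, not LangChain's `Document` (which exposes the same `page_content` and `metadata` attributes), and every ID and URL, including the `#0`/`#1` suffix scheme for chunk IDs, is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (assumption: only these two attributes matter here)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


page_url = "https://example.com/guide"  # hypothetical page

chunks = [
    Doc(
        "Install the package and configure credentials.",
        {
            "content_id": f"{page_url}#0",        # stable ID, so re-ingesting overwrites
            "parent_content_id": page_url,        # both chunks come from the same page
            "keywords": ["install", "credentials"],
            "hrefs": ["https://example.com/config"],  # pages this chunk links to
            "urls": [page_url],                   # URL(s) this chunk is served from
        },
    ),
    Doc(
        "Run the first query against the store.",
        {
            "content_id": f"{page_url}#1",
            "parent_content_id": page_url,
            "keywords": ["query", "store"],
            "hrefs": [],
            "urls": [page_url],
        },
    ),
]
```

Both chunks share the same `urls` entry, as described above, while each keeps its own `content_id`.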
| 25 | +#### Keywords |
| 26 | + |
| 27 | +To link documents with common keywords, assign the `keywords` metadata of each `Document`. |
| 28 | + |
| 29 | +There are various ways to assign keywords to each `Document`, such as TF-IDF across the documents. |
| 30 | +One easy option is to use [KeyBERT](https://maartengr.github.io/KeyBERT/index.html). |
| 31 | + |
| 32 | +Once installed with `pip install keybert`, you can add keywords to each `Document` in a list `documents` as follows: |
| 33 | + |
| 34 | +```python |
| 35 | +from keybert import KeyBERT |
| 36 | + |
| 37 | +kw_model = KeyBERT() |
| 38 | +keywords = kw_model.extract_keywords([doc.page_content for doc in documents], |
| 39 | +                                     stop_words='english') |
| 40 | + |
| 41 | +for (doc, kws) in zip(documents, keywords): |
| 42 | +    doc.metadata["keywords"] = [kw for (kw, _score) in kws] |
| 43 | +``` |
| 44 | + |
| 45 | +KeyBERT scores each keyword by its similarity to the document, so rather than taking all the top keywords, you could also keep only those whose `_score` exceeds a chosen threshold. |
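A minimal sketch of such filtering, assuming the extractor returns `(keyword, score)` pairs in which a higher score means a closer match (KeyBERT reports cosine similarity this way). The helper name `filter_keywords`, the `min_score` default, and the sample pairs are illustrative only.

```python
def filter_keywords(scored_keywords, min_score=0.4):
    """Keep only keywords whose similarity score is at least min_score."""
    return [kw for kw, score in scored_keywords if score >= min_score]


# Hypothetical extractor output for one document.
sample = [("graph", 0.62), ("vector", 0.48), ("store", 0.31)]
selected = filter_keywords(sample)
```

A suitable threshold depends on the embedding model and the corpus, so it is worth inspecting a few documents before fixing a value.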
| 46 | + |
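The TF-IDF option mentioned earlier can be sketched with only the standard library. This is a simplified variant (regex tokenization, relative term frequency multiplied by log inverse document frequency), not what any particular library computes; `tokenize`, `tfidf_keywords`, and `top_n` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(texts, top_n=3):
    """Return up to top_n terms per text, ranked by a simple TF-IDF score."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)

    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        # Terms present in every text score 0 and are dropped.
        results.append([t for t in ranked[:top_n] if scores[t] > 0])
    return results


texts = [
    "cassandra stores vectors and edges",
    "keywords link documents and chunks",
]
keyword_lists = tfidf_keywords(texts)
```

Each resulting list could then be assigned to the matching document's `metadata["keywords"]`.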
| 47 | +#### Hyperlinks |
| 48 | + |
| 49 | +To capture hyperlinks, populate the `hrefs` and `urls` metadata fields of each `Document`. |
| 50 | + |
| 51 | +```python |
| 52 | +import re |
| 53 | +link_re = re.compile(r'href="([^"]+)') |
| 54 | +for doc in documents: |
| 55 | +    doc.metadata["content_id"] = doc.metadata["source"] |
| 56 | +    doc.metadata["hrefs"] = link_re.findall(doc.page_content) |
| 57 | +    doc.metadata["urls"] = [doc.metadata["source"]] |
| 58 | +``` |
| 59 | + |
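The regular expression above returns `href` values exactly as they appear in the HTML, so a relative link such as `../faq` will not equal any document's `urls` entry. Assuming links are matched by comparing `hrefs` against `urls` (an assumption about how documents get connected, not something stated above), resolving each link against the page URL first is a reasonable precaution. `absolute_hrefs` is a hypothetical helper.

```python
import re
from urllib.parse import urldefrag, urljoin

LINK_RE = re.compile(r'href="([^"]+)')


def absolute_hrefs(html, base_url):
    """Return de-duplicated absolute link targets found in html, without fragments."""
    seen = []
    for raw in LINK_RE.findall(html):
        # Resolve relative links, then drop any "#fragment" suffix.
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        if absolute and absolute not in seen:
            seen.append(absolute)
    return seen


html = '<a href="../faq">FAQ</a> <a href="https://example.com/faq#top">Top</a>'
links = absolute_hrefs(html, "https://example.com/docs/intro")
```

In the loop above, this would replace the direct `findall` call, with `doc.metadata["source"]` passed as `base_url`.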
| 60 | +### Store |
| 61 | + |
| 62 | +```python |
| 63 | +import cassio |
| 64 | +from langchain_openai import OpenAIEmbeddings |
| 65 | +from ragstack_knowledge_store import KnowledgeStore |
| 66 | + |
| 67 | +cassio.init(auto=True) |
| 68 | + |
| 69 | +knowledge_store = KnowledgeStore(embeddings=OpenAIEmbeddings()) |
| 70 | + |
| 71 | +# Store the documents |
| 72 | +knowledge_store.add_documents(documents) |
| 73 | +``` |
| 74 | + |
| 75 | +### Retrieve |
| 76 | + |
| 77 | +```python |
| 78 | +from langchain_openai import ChatOpenAI |
| 79 | + |
| 80 | +llm = ChatOpenAI(model="gpt-4o") |
| 81 | + |
| 82 | +# Retrieve and generate using the relevant snippets of the stored documents. |
| 83 | +from langchain_core.runnables import RunnablePassthrough |
| 84 | +from langchain_core.output_parsers import StrOutputParser |
| 85 | +from langchain_core.prompts import ChatPromptTemplate |
| 86 | + |
| 87 | +# Depth 0 - don't traverse edges; equivalent to vector-only search. |
| 88 | +# Depth 1 - vector search plus 1 level of edges. |
| 89 | +retriever = knowledge_store.as_retriever(k=4, depth=1) |
| 90 | + |
| 91 | +template = """You are a helpful technical support bot. You should provide complete answers explaining the options the user has available to address their problem. Answer the question based only on the following context: |
| 92 | +{context} |
| 93 | + |
| 94 | +Question: {question} |
| 95 | +""" |
| 96 | +prompt = ChatPromptTemplate.from_template(template) |
| 97 | + |
| 98 | +def format_docs(docs): |
| 99 | +    formatted = "\n\n".join(f"From {doc.metadata['content_id']}: {doc.page_content}" for doc in docs) |
| 100 | +    return formatted |
| 101 | + |
| 102 | + |
| 103 | +rag_chain = ( |
| 104 | +    {"context": retriever | format_docs, "question": RunnablePassthrough()} |
| 105 | +    | prompt |
| 106 | +    | llm |
| 107 | +    | StrOutputParser() |
| 108 | +) |
| 109 | +``` |
| 110 | + |
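To make the `depth` comments in the retriever setup concrete, here is a toy model of depth-limited expansion over an in-memory adjacency map. It only illustrates the idea (start from the vector-search hits, then follow edges for up to `depth` hops); it is not the `KnowledgeStore` implementation, and `expand`, `edges`, and the IDs are invented for the example.

```python
def expand(seed_ids, edges, depth):
    """Return seed_ids plus every node reachable within `depth` hops, in discovery order."""
    found = list(seed_ids)
    frontier = list(seed_ids)
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in found:
                    found.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return found


# Hypothetical link graph: a -> b -> c
edges = {"a": ["b"], "b": ["c"], "c": []}

depth0 = expand(["a"], edges, depth=0)  # seeds only, like vector-only search
depth1 = expand(["a"], edges, depth=1)  # seeds plus directly linked documents
```

Under this model, raising `depth` trades a larger, more loosely related context for better coverage of linked material.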
| 111 | +## Development |
| 112 | + |
| 113 | +```shell |
| 114 | +poetry install --with=dev |
| 115 | + |
| 116 | +# Run Tests |
| 117 | +poetry run pytest |
| 118 | +``` |