
Commit 236884a

epinzur and NadirJ authored

added unstructured docs (#302)

* added unstructured docs
* Update langchain-unstructured-astra.adoc: changed AstraDB to Astra DB

Co-authored-by: Nadir J <[email protected]>

1 parent 48bc55e commit 236884a

File tree

4 files changed, +326 −5 lines changed


docs/modules/ROOT/nav.adoc

Lines changed: 1 addition & 0 deletions

@@ -28,6 +28,7 @@
 * xref:examples:langchain-evaluation.adoc[]
 * xref:examples:advanced-rag.adoc[]
 * xref:examples:flare.adoc[]
+* xref:examples:langchain-unstructured-astra.adoc[]
 * xref:examples:llama-astra.adoc[]
 * xref:examples:llama-parse-astra.adoc[]
 * xref:examples:qa-with-cassio.adoc[]

docs/modules/examples/pages/index.adoc

Lines changed: 4 additions & 0 deletions

@@ -43,6 +43,10 @@
 a| image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/FLARE.ipynb"]
 | xref:flare.adoc[]

+| Build a simple RAG pipeline using Unstructured and Astra DB.
+a| image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/langchain-unstructured-astra.ipynb"]
+| xref:langchain-unstructured-astra.adoc[]
+
 |===

 [[llama-astra]]
docs/modules/examples/pages/langchain-unstructured-astra.adoc

Lines changed: 316 additions & 0 deletions

= RAG with Unstructured and Astra DB

image::https://colab.research.google.com/assets/colab-badge.svg[align="left",link="https://colab.research.google.com/github/datastax/ragstack-ai/blob/main/examples/notebooks/langchain-unstructured-astra.ipynb"]

Build a RAG pipeline with RAGStack, Astra DB, and Unstructured.

This example demonstrates loading and parsing a PDF document with Unstructured into an Astra DB vector store, then querying the index with LangChain.

== Prerequisites

You will need a vector-enabled Astra database.

* Create an https://docs.datastax.com/en/astra-serverless/docs/getting-started/create-db-choices.html[Astra vector database].
* Within your database, create an https://docs.datastax.com/en/astra-serverless/docs/manage/org/manage-tokens.html[Astra DB Access Token] with Database Administrator permissions.
* Get your Astra DB API endpoint:
** `+https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com+`
* Create an https://unstructured.io[Unstructured] API key.

Install the following dependencies:

[source,bash]
----
pip install ragstack-ai python-dotenv
----

See the https://docs.datastax.com/en/ragstack/docs/prerequisites.html[Prerequisites] page for more details.

== Set up your environment

Create a `.env` file in your application with the following environment variables:

[source,bash]
----
UNSTRUCTURED_API_KEY=...
ASTRA_DB_API_ENDPOINT=https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
OPENAI_API_KEY=sk-...
----

If you're using Google Colab, you'll be prompted for these values in the Colab environment.
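Once the variables are loaded (the pipeline below calls `load_dotenv()`), you can fail fast before making any API calls. This is a minimal sketch, not part of the original notebook; the variable names match the `.env` file above:

[source,python]
----
import os

# The four variables this example's .env file defines.
REQUIRED_VARS = [
    "UNSTRUCTURED_API_KEY",
    "ASTRA_DB_API_ENDPOINT",
    "ASTRA_DB_APPLICATION_TOKEN",
    "OPENAI_API_KEY",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
----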

== Create RAG pipeline

. Import dependencies and load environment variables.
+
[source,python]
----
import os
import requests

from dotenv import load_dotenv
from langchain_community.vectorstores import AstraDB
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

from langchain_community.document_loaders import (
    unstructured,
    UnstructuredAPIFileLoader,
)

from langchain_openai import (
    ChatOpenAI,
    OpenAIEmbeddings,
)

load_dotenv()
----
+
. Download the example PDF. This example focuses on pages 9 and 10 of a PDF about attention mechanisms in transformer model architectures. The original paper is available here: https://arxiv.org/pdf/1706.03762.pdf
+
[source,python]
----
url = "https://raw.githubusercontent.com/datastax/ragstack-ai/48bc55e7dc4de6a8b79fcebcedd242dc1254dd63/examples/notebooks/resources/attention_pages_9_10.pdf"
file_path = "./attention_pages_9_10.pdf"

response = requests.get(url)
if response.status_code == 200:
    with open(file_path, "wb") as file:
        file.write(response.content)
    print("Download complete.")
else:
    print("Error downloading the file.")
----
+
. Parse the downloaded PDF with Unstructured into elements for indexing. Choose either _Simple Parsing_ or _Advanced Parsing_:
+
**Simple Parsing:**
+
This works well if your document doesn't contain any complex formatting or tables.
+
[source,python]
----
loader = UnstructuredAPIFileLoader(
    file_path="./attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
)
simple_docs = loader.load()

print(len(simple_docs))
print(simple_docs[0].page_content[0:400])
----
+
By default, the parser returns one document per PDF file. The sample output of the document contents shows the first table's description, followed by the start of a poorly formatted table.
+
**Advanced Parsing:**
+
By changing the processing strategy and response mode, we can get more detailed document structure. Unstructured can break the document into elements of different types, which can be helpful for improving your RAG system.
+
For example, the `Table` element type includes the table formatted as simple HTML, which can help the LLM answer questions from the table data, and we could exclude elements of type `Footer` from our vector store.
+
A list of all the element types can be found here: https://unstructured-io.github.io/unstructured/introduction/overview.html#id1
+
[source,python]
----
elements = unstructured.get_elements_from_api(
    file_path="./attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    strategy="hi_res",  # default "auto"
    pdf_infer_table_structure=True,
)

print(len(elements))
tables = [el for el in elements if el.category == "Table"]
print(tables[1].metadata.text_as_html)
----
+
In the Advanced Parsing mode, we now get 27 elements instead of a single document, and table structure is available as HTML.
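+
To see exactly which element types the advanced parse produced, you can tally categories. The sketch below uses a stand-in list of category names; with real output, tally `el.category` for each element in `elements`:
+
[source,python]
----
from collections import Counter

# Stand-in for [el.category for el in elements]; replace with your parsed output.
categories = ["Title", "NarrativeText", "Table", "NarrativeText", "Footer", "Table"]

counts = Counter(categories)
print(counts.most_common())
----
+
Element types you plan to exclude, such as `Footer`, show up directly in this tally.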
+
See the Colab notebook linked at the top of this page for a more detailed investigation into the benefits of using the Advanced Parsing mode.
+
. Create an Astra DB vector store instance.
+
[source,python]
----
astra_db_store = AstraDB(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)
----
+
. Create LangChain documents by chunking the text after `Table` elements and before `Title` elements, skipping `Header` and `Footer` elements. Use the HTML output format for table data. Insert the documents into Astra DB.
+
[source,python]
----
documents = []
current_doc = None

for el in elements:
    if el.category in ["Header", "Footer"]:
        continue  # skip these
    if el.category == "Title":
        # Start a new chunk at each title.
        if current_doc is not None:
            documents.append(current_doc)
        current_doc = None
    if not current_doc:
        current_doc = Document(page_content="", metadata=el.metadata.to_dict())
    current_doc.page_content += el.metadata.text_as_html if el.category == "Table" else el.text
    if el.category == "Table":
        # Close the chunk after a table so its HTML stays self-contained.
        documents.append(current_doc)
        current_doc = None

# Append any trailing chunk.
if current_doc is not None:
    documents.append(current_doc)

astra_db_store.add_documents(documents)
----
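+
The chunking rules above can be exercised in isolation with a small pure function. This is a sketch for illustration; the `(category, text)` tuples stand in for Unstructured elements:
+
[source,python]
----
def group_elements(elements):
    """Split chunks before Title elements and after Table elements,
    skipping Header and Footer elements entirely."""
    chunks, current = [], None
    for category, text in elements:
        if category in ("Header", "Footer"):
            continue
        if category == "Title" and current is not None:
            chunks.append(current)
            current = None
        if current is None:
            current = ""
        current += text
        if category == "Table":
            chunks.append(current)
            current = None
    if current is not None:
        chunks.append(current)
    return chunks

sample = [
    ("Title", "Results. "),
    ("NarrativeText", "Scores below. "),
    ("Table", "<table></table>"),
    ("Footer", "page 9"),
    ("Title", "Conclusion. "),
    ("NarrativeText", "Attention works."),
]
print(group_elements(sample))
# ['Results. Scores below. <table></table>', 'Conclusion. Attention works.']
----
+
Each table lands at the end of its own chunk with its HTML intact, and footers never reach the vector store.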
. Build a RAG pipeline using the populated Astra DB vector store.
+
[source,python]
----
prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)
----

== Execute queries

. Ask a question that should be answered by the text of the document. This query should return a relevant response.
+
[source,python]
----
response_1 = chain.invoke("What does reducing the attention key size do?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_1)
----
+
. Ask a question that can be answered from the table data. This highlights the power of using Unstructured.
+
[source,python]
----
response_2 = chain.invoke("For the transformer English constituency parsing results, what was the 'WSJ 23 F1' value for 'Dyer et al. (2016) [5]'?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_2)
----
+
. Ask a question with an expected lack of context. This query should return `I don't know. The context does not provide any information about George Washington's birthdate.` because your document does not contain information about George Washington.
+
[source,python]
----
response_3 = chain.invoke("When was George Washington born?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_3)
----

== Complete code (Advanced Parsing)

.Python
[%collapsible%open]
====
[source,python]
----
import os
import requests

from dotenv import load_dotenv
from langchain_community.document_loaders import unstructured
from langchain_community.vectorstores import AstraDB
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

from langchain_openai import (
    ChatOpenAI,
    OpenAIEmbeddings,
)

load_dotenv()

url = "https://raw.githubusercontent.com/datastax/ragstack-ai/48bc55e7dc4de6a8b79fcebcedd242dc1254dd63/examples/notebooks/resources/attention_pages_9_10.pdf"
file_path = "./attention_pages_9_10.pdf"

response = requests.get(url)
if response.status_code == 200:
    with open(file_path, "wb") as file:
        file.write(response.content)
    print("Download complete.")
else:
    print("Error downloading the file.")
    exit(1)

elements = unstructured.get_elements_from_api(
    file_path="./attention_pages_9_10.pdf",
    api_key=os.getenv("UNSTRUCTURED_API_KEY"),
    strategy="hi_res",  # default "auto"
    pdf_infer_table_structure=True,
)

astra_db_store = AstraDB(
    collection_name="langchain_unstructured",
    embedding=OpenAIEmbeddings(),
    token=os.getenv("ASTRA_DB_APPLICATION_TOKEN"),
    api_endpoint=os.getenv("ASTRA_DB_API_ENDPOINT"),
)

documents = []
current_doc = None

for el in elements:
    if el.category in ["Header", "Footer"]:
        continue  # skip these
    if el.category == "Title":
        if current_doc is not None:
            documents.append(current_doc)
        current_doc = None
    if not current_doc:
        current_doc = Document(page_content="", metadata=el.metadata.to_dict())
    current_doc.page_content += el.metadata.text_as_html if el.category == "Table" else el.text
    if el.category == "Table":
        documents.append(current_doc)
        current_doc = None

if current_doc is not None:
    documents.append(current_doc)

astra_db_store.add_documents(documents)

prompt = """
Answer the question based only on the supplied context. If you don't know the answer, say "I don't know".
Context: {context}
Question: {question}
Your answer:
"""

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", streaming=False, temperature=0)

chain = (
    {"context": astra_db_store.as_retriever(), "question": RunnablePassthrough()}
    | PromptTemplate.from_template(prompt)
    | llm
    | StrOutputParser()
)

response_1 = chain.invoke("What does reducing the attention key size do?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_1)

response_2 = chain.invoke("For the transformer English constituency parsing results, what was the 'WSJ 23 F1' value for 'Dyer et al. (2016) [5]'?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_2)

response_3 = chain.invoke("When was George Washington born?")
print("\n***********New Unstructured Basic Query Engine***********")
print(response_3)
----
====

docs/modules/examples/pages/llama-parse-astra.adoc

Lines changed: 5 additions & 5 deletions

@@ -30,7 +30,7 @@ Create a `.env` file in your application with the following environment variable
 [source,bash]
 ----
 LLAMA_CLOUD_API_KEY=llx-...
-ASTRA_DB_API_ENDPOINT=https://bbe07f45-8ab4-4d81-aa7d-7f58dbed3ead-us-east-1.apps.astra.datastax.com
+ASTRA_DB_API_ENDPOINT=https://<ASTRA_DB_ID>-<ASTRA_DB_REGION>.apps.astra.datastax.com
 ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
 OPENAI_API_KEY=sk-...
 ----
@@ -141,9 +141,9 @@ This query should return `The context does not provide information about the col
 [source,python]
 ----
 query = "What is the color of the sky?"
-response_1 = query_engine.query(query)
+response_2 = query_engine.query(query)
 print("\n***********New LlamaParse+ Basic Query Engine***********")
-print(response_1)
+print(response_2)
 ----

 == Complete code
@@ -222,9 +222,9 @@ print(response_1)

 # Query for an example with expected lack of context
 query = "What is the color of the sky?"
-response_1 = query_engine.query(query)
+response_2 = query_engine.query(query)
 print("\n***********New LlamaParse+ Basic Query Engine***********")
-print(response_1)
+print(response_2)
 ----
 ====