
Commit 7776f1a

Merge pull request #62 from marklogic/feature/langchain-example
Example integration with langchain
2 parents 7080987 + 3205914 commit 7776f1a


14 files changed: +501 -0 lines changed


examples/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1 +1,2 @@
 .ipynb_checkpoints
+.env

examples/langchain/.gitignore

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
docker
.gradle
build

examples/langchain/README.md

Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
# Example langchain retriever

This project demonstrates one approach for implementing a
[langchain retriever](https://python.langchain.com/docs/modules/data_connection/)
that allows for
[Retrieval Augmented Generation (RAG)](https://python.langchain.com/docs/use_cases/question_answering/)
to be supported via MarkLogic and the MarkLogic Python Client. This example uses the same data as in
[the langchain RAG quickstart guide](https://python.langchain.com/docs/use_cases/question_answering/quickstart),
but with the data having first been loaded into MarkLogic.

**This is only intended as an example** of how easily a langchain retriever can be developed
using the MarkLogic Python Client. The queries in this example are simple and naturally
do not have any knowledge of how your data is modeled in MarkLogic. You are encouraged to use
this as an example for developing your own retriever, where you can build a query based on a
question submitted to langchain that fully leverages the indexes and data models in your MarkLogic
application. Additionally, please see the
[langchain documentation on splitting text](https://python.langchain.com/docs/modules/data_connection/document_transformers/). You may need to restructure your data so that you have a larger number of
smaller documents in your database so that you do not exceed the limit that langchain imposes on how
much data a retriever can return.

# Setup

To try out this project, use [docker-compose](https://docs.docker.com/compose/) to instantiate a new MarkLogic
instance with port 8003 available (you can use your own MarkLogic instance too, just be sure that port 8003
is available):

    docker-compose up -d --build

Then deploy a small REST API application to MarkLogic, which includes a basic non-admin MarkLogic user
named `langchain-user`:

    ./gradlew -i mlDeploy

Next, create a new Python virtual environment - [pyenv](https://github.com/pyenv/pyenv) is recommended for this -
and install the
[langchain example dependencies](https://python.langchain.com/docs/use_cases/question_answering/quickstart#dependencies),
along with the MarkLogic Python Client:

    pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client

Then run the following Python program to load text data from the langchain quickstart guide
into two different collections in the `langchain-test-content` database:

    python load_data.py

Create a ".env" file to hold your OpenAI API key:

    echo "OPENAI_API_KEY=<your key here>" > .env

# Testing the retriever

You are now ready to test the example retriever. Run the following to ask a question with the
results augmented via the `marklogic_retriever.py` module in this project; the OpenAI API key
is read from the `.env` file you created above:

    python ask.py "What is task decomposition?" posts

The retriever uses a [cts.similarQuery](https://docs.marklogic.com/cts.similarQuery) to select from the documents
loaded via `load_data.py`. It defaults to a page length of 10. You can change this by providing a command line
argument - e.g.:

    python ask.py "What is task decomposition?" posts 15

Example of a question for the "sotu" (State of the Union speech) collection:

    python ask.py "What are economic sanctions?" sotu 20

To use a word query instead of a similar query, along with a set of drop words, specify "word" as the 4th argument:

    python ask.py "What are economic sanctions?" sotu 20 word
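
The `load_data.py` program referenced in the README above is part of this commit but is not shown in this view. Below is a rough, hypothetical sketch of the kind of loading it performs for the "posts" collection, assuming the MarkLogic Python Client can be used like a requests Session against the MarkLogic REST API; the document URIs, chunk sizes, and loader choices here are illustrative and not taken from the actual file:

# Hypothetical loader sketch; the committed load_data.py may differ.
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from marklogic import Client

client = Client("http://localhost:8003", digest=("langchain-user", "password"))

# Fetch the blog post used by the langchain RAG quickstart guide.
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header"))
    ),
)
docs = loader.load()

# Split the post into small chunks so that a page of retriever results stays
# well under the amount of data langchain allows a retriever to return.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(docs)

# Write each chunk as a small text document in the "posts" collection.
# Assumes langchain-user is allowed to insert documents; add permission
# parameters if your user requires them.
for i, split in enumerate(splits):
    client.put(
        "/v1/documents",
        params={"uri": f"/post/chunk-{i}.txt", "collection": "posts"},
        headers={"Content-Type": "text/plain"},
        data=split.page_content.encode("utf-8"),
    )

Splitting into roughly 1000-character chunks follows the README's advice about keeping documents small; the actual project also loads a second "sotu" collection, which is omitted from this sketch.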

examples/langchain/ask.py

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
# Based on example at
# https://python.langchain.com/docs/use_cases/question_answering/quickstart .

import sys
from dotenv import load_dotenv
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from marklogic import Client
from marklogic_retriever import MarkLogicRetriever


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


question = sys.argv[1]

retriever = MarkLogicRetriever.create(
    Client("http://localhost:8003", digest=("langchain-user", "password"))
)
retriever.collections = [sys.argv[2]]
retriever.max_results = int(sys.argv[3]) if len(sys.argv) > 3 else 10
if len(sys.argv) > 4:
    retriever.query_type = sys.argv[4]

load_dotenv()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
print(rag_chain.invoke(question))
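
The `MarkLogicRetriever` imported above lives in `marklogic_retriever.py`, which is included in this commit but not rendered in this view. Below is a rough, hypothetical sketch of how such a retriever can be built on the MarkLogic Python Client. Two assumptions to note: it issues a plain word search via the REST `/v1/search` endpoint rather than the `cts.similarQuery` the README describes, and it treats the Client as a requests Session; the class name and field defaults are illustrative:

# Hypothetical retriever sketch; the committed marklogic_retriever.py may differ.
from typing import List

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from marklogic import Client


class SketchMarkLogicRetriever(BaseRetriever):
    client: Client
    collections: List[str] = []
    max_results: int = 10
    query_type: str = "similar"  # kept for parity with ask.py; unused in this sketch

    class Config:
        arbitrary_types_allowed = True

    @classmethod
    def create(cls, client: Client):
        return cls(client=client)

    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        # Find matching documents in the configured collections via a word search.
        response = self.client.get(
            "/v1/search",
            params={
                "q": query,
                "collection": self.collections,
                "pageLength": self.max_results,
                "format": "json",
            },
        )
        uris = [result["uri"] for result in response.json().get("results", [])]

        # Read each matching document and hand its text to langchain.
        docs = []
        for uri in uris:
            content = self.client.get("/v1/documents", params={"uri": uri}).text
            docs.append(Document(page_content=content, metadata={"uri": uri}))
        return docs

Because a langchain retriever is a Runnable, an instance of this class can be piped directly into the `rag_chain` above via `retriever | format_docs`.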

examples/langchain/build.gradle

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
plugins {
    id "net.saliman.properties" version "1.5.2"
    id "com.marklogic.ml-gradle" version "4.6.0"
}

examples/langchain/docker-compose.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
version: '3.8'
name: marklogic_langchain

services:

  marklogic:
    image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0"
    platform: linux/amd64
    environment:
      - MARKLOGIC_INIT=true
      - MARKLOGIC_ADMIN_USERNAME=admin
      - MARKLOGIC_ADMIN_PASSWORD=admin
    volumes:
      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
    ports:
      - "8000-8003:8000-8003"

examples/langchain/gradle.properties

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
mlAppName=langchain-test
mlRestPort=8003
mlUsername=admin
mlPassword=admin

examples/langchain/gradle/wrapper/gradle-wrapper.jar

59.3 KB
Binary file not shown.

examples/langchain/gradle/wrapper/gradle-wrapper.properties

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
#Tue Mar 22 14:27:38 EDT 2016
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.4-bin.zip

examples/langchain/gradlew

Lines changed: 160 additions & 0 deletions
Some generated files are not rendered by default.
