Commit 85f488c

Added lab 5 instructions
1 parent e78fbb5 commit 85f488c

2 files changed

+312
-20
lines changed

labs/05-vector_bd.md

Lines changed: 0 additions & 20 deletions
This file was deleted.

labs/05-vector_db.md

Lines changed: 312 additions & 0 deletions
@@ -0,0 +1,312 @@
# Lab 5: Vector DB

In this lab, we will use a vector store to retrieve recommendations from the resumes.

When a user applies, we will put their resume in the vector store.

The user can then click a recommend button in the UI, and the API will retrieve a recommended resume from the database, even if that person has not applied directly for the job.

There are two parts to the lab: ingestion and retrieval. In this lab we will do ingestion. The next lab covers retrieval.

We will use the **Qdrant** vector DB, mainly because it is easy to set up.

Documentation on Qdrant with Langchain: https://docs.langchain.com/oss/python/integrations/vectorstores/qdrant

## Setup pre-work

Update `requirements.txt`:

```
langchain-qdrant==1.1.0   # Langchain with Qdrant vector DB
pypdf==6.4.0              # PDF to Text
```

Then install the dependencies.

Mac:
```
> source ./.venv/bin/activate
> pip uninstall PyPDF2
> pip install -r requirements.txt
```

Windows:
```
> .\.venv\Scripts\activate
> pip uninstall PyPDF2
> pip install -r requirements.txt
```

Update the import in `converter.py` so that it uses `pypdf` (which replaces the `PyPDF2` package we just uninstalled):

```python
from pypdf import PdfReader
```

Now, let us create a Qdrant DB.

Put this code in a script and run it once:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from config import settings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
# Local, file-based Qdrant storage in the folder "qdrant_store"
client = QdrantClient(path="qdrant_store")
# size must match the embedding model: text-embedding-3-large produces 3072-dimensional vectors
client.create_collection(collection_name="resumes", vectors_config=VectorParams(size=3072, distance=Distance.COSINE))
client.close()
```

This will create a folder `qdrant_store`, which will contain our vector DB files.

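To get a feel for what `Distance.COSINE` measures, here is a minimal sketch in plain Python. It uses toy 3-dimensional vectors instead of the 3072-dimensional ones from `text-embedding-3-large`. The helper `cosine_similarity` and the sample vectors are made up for illustration; Qdrant does this comparison internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        # Same idea as the collection's fixed `size`: dimensions must match
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
resume_a = [2.0, 0.0, 2.0]   # points in the same direction as the query
resume_b = [0.0, 3.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query, resume_a))  # close to 1: very similar
print(cosine_similarity(query, resume_b))  # 0: unrelated
```

Only the direction of the vectors matters, not their length, which is why `resume_a` scores as a near-perfect match even though it is twice as long as the query.
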
Now let's implement the feature. We will write pytest tests as we go along.

## High Level Approach

1. Create a new test file `test/test_recommendation.py`
1. Get the test resumes and put them in `test/resumes`
1. Use this command to run the tests: `pytest .\test\test_recommendation.py` (use whichever slash suits your OS)
1. In `ai.py`, create the function below. It opens the `resumes` collection we created above and returns it as a Qdrant vector store

   ```python
   def get_vector_store():
       embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
       vector_store = QdrantVectorStore.from_existing_collection(embedding=embeddings, collection_name="resumes", path="qdrant_store")
       return vector_store
   ```
1. While testing, however, we would like to use an **in memory** vector store and not this one (Why?). So in `ai.py`, create a function `def inmemory_vector_store()` which configures Qdrant and Langchain with an in memory store and yields it. Close the client after yielding, otherwise you will get errors
1. In `conftest.py`, create a pytest fixture `vector_store` which returns the in memory store from `ai.py` (use the syntax `yield from inmemory_vector_store()`)
1. Write a test `test_should_embed_text_and_add_to_vector_db`. The test uses the above fixture

   ```python
   def test_should_embed_text_and_add_to_vector_db(vector_store):
       ingest_resume("Siddharta\nSiddharta is an AI trainer", "siddharta.pdf", 1, vector_store)
       retriever = vector_store.as_retriever(search_kwargs={"k": 1})
       result = retriever.invoke("I am looking for an AI trainer")
       assert len(result) == 1
       assert "Siddharta" in result[0].page_content
       assert result[0].metadata['_id'] == 1
   ```

1. Create a function `ingest_resume` in `ai.py` and make this test pass. Put the filename in the metadata field `url` and use the resume id that is passed in as the document id. Parameters are:
   - resume text (string)
   - filename
   - resume id
   - vector store object

1. Now we need to create the background task that will do this operation. Write this test

   ```python
   def test_background_task(vector_store):
       filename = "test/resumes/ProfileAndrewNg.pdf"
       with open(filename, "rb") as f:
           content = f.read()
       ingest_resume_for_recommendataions(content, filename, resume_id=1, vector_store=vector_store)
       retriever = vector_store.as_retriever(search_kwargs={"k": 1})
       result = retriever.invoke("I am looking for an AI trainer")
       assert "Andrew" in result[0].page_content
   ```

1. Create the function `ingest_resume_for_recommendataions` (spelled exactly as in the test above) in `main.py` and make the test pass. It should use the `ingest_resume` function we created earlier. Parameters are:
   - resume content (PDF format bytes, NOT string)
   - filename
   - resume id
   - vector store

1. Now let us test the ingestion functionality end to end. Write a test `def test_retrieval_quality(vector_store)`
   - It should ingest all the resumes in `test/resumes` by calling the background task above
   - Then query the retriever with each of the following texts and verify that the right resume is returned
      - "I am looking for an expert in AI" --> Andrew
      - "I am looking for an expert in Linux" --> Linus
      - "I am looking for an expert in Javascript" --> Yuxi
      - "I am looking for a generalist who can work in python and typescript" --> Koudai
      - "I am looking for a data journalist" --> Simon

1. Now we are going to test the API. This is the code for it. It is a little complex, so read it carefully

   ```python
   def test_job_application_api(db_session, vector_store, client):
       job_board = JobBoard(slug="test", logo_url="http://example.com")
       db_session.add(job_board)
       db_session.commit()
       db_session.refresh(job_board)
       job_post = JobPost(title="AI Engineer",
                          description="Need an AI Engineer",
                          job_board_id=job_board.id)
       db_session.add(job_post)
       db_session.commit()
       db_session.refresh(job_post)
       filename = "test/resumes/ProfileAndrewNg.pdf"
       post_data = {
           "first_name": "Siddharta",
           "last_name": "Govindaraj",
           "email": "siddharta@gmail.com",
           "job_post_id": job_post.id
       }
       with open(filename, "rb") as f:
           response = client.post("/api/job-applications", data=post_data, files={"resume": ("ProfileAndrewNg.pdf", f, "application/pdf")})
       assert response.status_code == 200
       retriever = vector_store.as_retriever(search_kwargs={"k": 1})
       result = retriever.invoke("I am looking for an expert in AI")
       assert "Andrew" in result[0].page_content
       assert result[0].metadata["_id"] == job_post.id
   ```

1. Before the test above can pass, we need to do a few things. Make all the required changes and make the test pass
   - In `main.py`, find the relevant API endpoint and schedule the background task that we created before
   - The background task needs the vector store as a parameter. Add it as a dependency to the endpoint function (refer to how the db has been configured). It should be configured with the *real* vector store
   - In the test case, we want to use the in memory vector store, so we will need to override the real vector store with the in memory one. Again, refer to how we are replacing the real db session with the testcontainer one

## Hints

### How can I create the in memory vector store?

<details>
<summary>Answer</summary>

```python
def inmemory_vector_store():
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
    client = QdrantClient(":memory:")
    client.create_collection(collection_name="resumes", vectors_config=VectorParams(size=3072, distance=Distance.COSINE))
    vector_store = QdrantVectorStore(client=client, collection_name="resumes", embedding=embeddings)
    try:
        yield vector_store
    finally:
        # Runs after the consumer is done with the store, even if a test failed
        client.close()
```
</details>

183+
### How do I create the vector store fixture
184+
185+
<details>
186+
<summary>Answer</summary>
187+
188+
```python
189+
@pytest.fixture(scope="function")
190+
def vector_store():
191+
yield from inmemory_vector_store()
192+
```
193+
</details>
194+
195+
### What should I do in ingest_resume?
196+
197+
<details>
198+
<summary>Hint</summary>
199+
200+
Create a langchain `Document` object with the given parameters and use `add_documents` function on the store to add it. Set a field `url` in the metadata
201+
202+
Note that we are adding a single document, but `add_documents` requires a list. Pass a list of one document.
203+
`add_documents` also takes a list of `ids`
204+
</details>
205+
206+
<details>
207+
<summary>Answer</summary>
208+
209+
```python
210+
def ingest_resume(resume_text, resume_url, resume_id, vector_store):
211+
doc = Document(page_content=resume_text, metadata={"url": resume_url})
212+
vector_store.add_documents(documents=[doc], ids=[resume_id])
213+
```
214+
</details>
215+
216+
### How do I create the background task?
217+
218+
<details>
219+
<summary>Hint</summary>
220+
221+
Use the function `extract_text_from_pdf_bytes` in `converter.py`. Then use `ingest_resume` above
222+
</details>
223+
224+
<details>
225+
<summary>Answer</summary>
226+
227+
```python
228+
def ingest_resume_for_recommendataions(resume_content, resume_url, resume_id, vector_store):
229+
resume_raw_text = extract_text_from_pdf_bytes(resume_content)
230+
ingest_resume(resume_raw_text, resume_url, resume_id, vector_store)
231+
```
232+
</details>
233+
234+
### How do I test retrieval quality?
235+
236+
<details>
237+
<summary>Answer</summary>
238+
239+
```python
240+
def test_retrieval_quality(vector_store):
241+
for id, filename in enumerate(os.listdir("test/resumes")):
242+
with open(f"test/resumes/{filename}", "rb") as f:
243+
content = f.read()
244+
ingest_resume_for_recommendataions(content, filename, resume_id=id, vector_store=vector_store)
245+
retriever = vector_store.as_retriever(search_kwargs={"k": 1})
246+
247+
result = retriever.invoke("I am looking for an expert in AI")
248+
assert "Andrew" in result[0].page_content
249+
result = retriever.invoke("I am looking for an expert in Linux")
250+
assert "Linus" in result[0].page_content
251+
result = retriever.invoke("I am looking for an expert in Javascript")
252+
assert "Yuxi" in result[0].page_content
253+
result = retriever.invoke("I am looking for a generalist who can work in python and typescript")
254+
assert "Koudai" in result[0].page_content
255+
result = retriever.invoke("I am looking for a data journalist")
256+
assert "Simon" in result[0].page_content
257+
```
258+
</details>
259+
260+
### How do I specify vector store as a dependency?
261+
262+
<details>
263+
<summary>Answer</summary>
264+
265+
Update the parameters for `api_create_new_job_application` like this
266+
267+
```python
268+
async def api_create_new_job_application(
269+
job_application_form: Annotated[JobApplicationForm, Form()],
270+
background_tasks: BackgroundTasks,
271+
db: Session = Depends(get_db),
272+
vector_store = Depends(get_vector_store)):
273+
```
274+
</details>
275+
276+
### How do I schedule the background task?
277+
278+
<details>
279+
<summary>Answer</summary>
280+
281+
```python
282+
background_tasks.add_task(ingest_resume_for_recommendataions, resume_content,
283+
file_url, new_job_application.id, vector_store)
284+
```
285+
</details>
286+
287+
### How do I override the real vector store with the in memory one during the api test?
288+
289+
<details>
290+
<summary>Answer</summary>
291+
292+
Update the `client` fixture in `conftest.py`
293+
294+
```python
295+
@pytest.fixture(scope="function")
296+
def client(db_session, vector_store):
297+
def override_get_db():
298+
yield db_session
299+
300+
def override_vector_store():
301+
yield vector_store
302+
303+
app.dependency_overrides[get_db] = override_get_db
304+
app.dependency_overrides[get_vector_store] = override_vector_store
305+
306+
try:
307+
with TestClient(app) as test_client:
308+
yield test_client
309+
finally:
310+
app.dependency_overrides.clear()
311+
```
312+
</details>
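
Conceptually, `app.dependency_overrides` is a dict keyed by the original dependency function. The sketch below is a simplified model of that lookup, not FastAPI's actual implementation; `resolve` and the string return values are hypothetical placeholders.

```python
def get_vector_store():
    return "real store on disk"     # placeholder for the real dependency


dependency_overrides = {}           # plays the role of app.dependency_overrides


def resolve(dependency):
    """Call the override if one is registered, otherwise the original function."""
    provider = dependency_overrides.get(dependency, dependency)
    return provider()


before = resolve(get_vector_store)                   # the real dependency
dependency_overrides[get_vector_store] = lambda: "in memory store"
during = resolve(get_vector_store)                   # the override wins
dependency_overrides.clear()
after = resolve(get_vector_store)                    # back to the real dependency
```

This is why the fixture clears the overrides in `finally`: a leftover entry would keep redirecting `get_vector_store` in later tests.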
