# Lab 5: Vector DB

In this lab, we will use a vector store to retrieve resume recommendations.

When a user applies for a job, we will put their resume in the vector store.

Then the user can click a recommend button in the UI, and the API will retrieve a recommended resume from the database, even if that person has not applied directly for the job.

There are two parts to the feature -- ingestion and retrieval. In this lab we will do ingestion; in the next lab, retrieval.

We will use the **Qdrant** vector DB, mainly because it is easy to set up.

Documentation on Qdrant with LangChain: https://docs.langchain.com/oss/python/integrations/vectorstores/qdrant

## Setup pre-work

Update `requirements.txt`:

```
langchain-qdrant==1.1.0 # LangChain with Qdrant vector DB
pypdf==6.4.0 # PDF to text
```

Then install the dependencies.

Mac:
```
> source ./.venv/bin/activate
> pip uninstall PyPDF2
> pip install -r requirements.txt
```

Windows:
```
> .\.venv\Scripts\activate
> pip uninstall PyPDF2
> pip install -r requirements.txt
```

Update `converter.py` as follows:

```python
from pypdf import PdfReader
```

Now, let us create a Qdrant DB.

Put this code in a script and run it:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams
from config import settings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
client = QdrantClient(path="qdrant_store")
client.create_collection(collection_name="resumes", vectors_config=VectorParams(size=3072, distance=Distance.COSINE))
client.close()
```

This will create a folder `qdrant_store` containing our vector DB files.
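
Two details in that `VectorParams` are worth understanding: `size=3072` matches the output dimension of the `text-embedding-3-large` model, and `Distance.COSINE` tells Qdrant to rank vectors by cosine distance (1 minus cosine similarity), which depends only on the angle between vectors, not their length. A plain-Python illustration of the metric:

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)


# Parallel vectors have distance 0, orthogonal vectors have distance 1,
# regardless of magnitude
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

This is why a short resume and a long resume about the same topic can still land close together in the store.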

Now let's implement the feature. We will write pytest tests as we go along.

## High Level Approach

1. Create a new test file `test/test_recommendation.py`
1. Get the test resumes and put them in `test/resumes`
1. Use this command to run the tests: `pytest .\test\test_recommendation.py` (use whichever slash suits your OS)
1. In `ai.py` create the function below. It returns a Qdrant vector store

```python
def get_vector_store():
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
    vector_store = QdrantVectorStore.from_existing_collection(embedding=embeddings, collection_name="resumes", path="qdrant_store")
    return vector_store
```
1. But while testing, we would like to use an **in memory** vector store and not this one (Why?). So in `ai.py` create a function `def inmemory_vector_store()` which will configure Qdrant and LangChain with an in memory store and yield it. Close the client after yielding, else you will get errors
1. In `conftest.py`, create a pytest fixture `vector_store` which will return the in memory store from `ai.py` (use the syntax `yield from inmemory_vector_store()`)
1. Write a test `test_should_embed_text_and_add_to_vector_db`. The test uses the above fixture

```python
def test_should_embed_text_and_add_to_vector_db(vector_store):
    ingest_resume("Siddharta\nSiddharta is an AI trainer", "siddharta.pdf", 1, vector_store)
    retriever = vector_store.as_retriever(search_kwargs={"k": 1})
    result = retriever.invoke("I am looking for an AI trainer")
    assert len(result) == 1
    assert "Siddharta" in result[0].page_content
    assert result[0].metadata['_id'] == 1
```

1. Create a function `ingest_resume` in `ai.py` and make this test pass. Put the filename in the metadata field `url` and set the `id` field to the id that is passed in. The parameters are:
    - resume text (string)
    - filename
    - resume id
    - vector store object

1. Now we need to create the background task that will do this operation. Write this test

```python
def test_background_task(vector_store):
    filename = "test/resumes/ProfileAndrewNg.pdf"
    with open(filename, "rb") as f:
        content = f.read()
    ingest_resume_for_recommendations(content, filename, resume_id=1, vector_store=vector_store)
    retriever = vector_store.as_retriever(search_kwargs={"k": 1})
    result = retriever.invoke("I am looking for an AI trainer")
    assert "Andrew" in result[0].page_content
```

1. Create the function `ingest_resume_for_recommendations` in `main.py` and make the test pass. It should use the `ingest_resume` function we created above. The parameters are:
    - resume content (PDF format bytes, NOT string)
    - filename
    - resume id
    - vector store

1. Now let us test the ingestion functionality. Write a test `def test_retrieval_quality(vector_store)`
    - It should ingest all the resumes in `test/resumes` by calling the background task above
    - Then run retrieval with the following queries and verify the right resume is returned
        - "I am looking for an expert in AI" --> Andrew
        - "I am looking for an expert in Linux" --> Linus
        - "I am looking for an expert in Javascript" --> Yuxi
        - "I am looking for a generalist who can work in python and typescript" --> Koudai
        - "I am looking for a data journalist" --> Simon

1. Now we are going to test the API. This is the code for it. It is a little complex, so read it carefully

```python
def test_job_application_api(db_session, vector_store, client):
    job_board = JobBoard(slug="test", logo_url="http://example.com")
    db_session.add(job_board)
    db_session.commit()
    db_session.refresh(job_board)
    job_post = JobPost(title="AI Engineer",
                       description="Need an AI Engineer",
                       job_board_id=job_board.id)
    db_session.add(job_post)
    db_session.commit()
    db_session.refresh(job_post)
    filename = "test/resumes/ProfileAndrewNg.pdf"
    post_data = {
        "first_name": "Siddharta",
        "last_name": "Govindaraj",
        "email": "siddharta@gmail.com",
        "job_post_id": job_post.id
    }
    with open(filename, "rb") as f:
        response = client.post("/api/job-applications", data=post_data, files={"resume": ("ProfileAndrewNg.pdf", f, "application/pdf")})
    assert response.status_code == 200
    retriever = vector_store.as_retriever(search_kwargs={"k": 1})
    result = retriever.invoke("I am looking for an expert in AI")
    assert "Andrew" in result[0].page_content
    assert result[0].metadata["_id"] == job_post.id
```

1. Before the test above can pass, we need to do a few things. Make all the required changes and make the test pass
    - In `main.py`, find the relevant API endpoint and schedule the background task that we created before
    - The background task needs the vector store as a parameter. Add it as a dependency to the endpoint function (refer to how the db has been configured). It should be configured with the *real* vector store
    - In the test case, we want to use the in memory vector store, so we will need to override the real vector store with the in memory one. Again, refer to how we are replacing the real db session with the testcontainer one

## Hints

### How can I create the in memory vector store?

<details>
<summary>Answer</summary>

```python
def inmemory_vector_store():
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large", api_key=settings.OPENAI_API_KEY)
    client = QdrantClient(":memory:")
    client.create_collection(collection_name="resumes", vectors_config=VectorParams(size=3072, distance=Distance.COSINE))
    vector_store = QdrantVectorStore(client=client, collection_name="resumes", embedding=embeddings)
    try:
        yield vector_store
    finally:
        client.close()
```
</details>

### How do I create the vector store fixture?

<details>
<summary>Answer</summary>

```python
@pytest.fixture(scope="function")
def vector_store():
    yield from inmemory_vector_store()
```
</details>
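
The fixture works because `yield from` delegates to the `inmemory_vector_store()` generator: the test receives the yielded store, and when the test finishes, the generator's `finally` block still runs, so `client.close()` is guaranteed to happen. A stdlib-only illustration of this pattern:

```python
events = []


def resource():
    """Generator that yields a resource and cleans up afterwards."""
    events.append("open")
    try:
        yield "connection"
    finally:
        events.append("close")


def fixture():
    # Delegates to resource(): the caller sees the yielded value,
    # and resource()'s finally-block runs when iteration ends
    yield from resource()


for conn in fixture():
    events.append(f"use {conn}")

print(events)  # ['open', 'use connection', 'close']
```

This is exactly how pytest drives yield-style fixtures: everything before the `yield` is setup, everything after (including `finally`) is teardown.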

### What should I do in ingest_resume?

<details>
<summary>Hint</summary>

Create a LangChain `Document` object with the given parameters and use the `add_documents` function on the store to add it. Set a field `url` in the metadata.

Note that we are adding a single document, but `add_documents` requires a list. Pass a list of one document.
`add_documents` also takes a list of `ids`.
</details>

<details>
<summary>Answer</summary>

```python
def ingest_resume(resume_text, resume_url, resume_id, vector_store):
    doc = Document(page_content=resume_text, metadata={"url": resume_url})
    vector_store.add_documents(documents=[doc], ids=[resume_id])
```
</details>

### How do I create the background task?

<details>
<summary>Hint</summary>

Use the function `extract_text_from_pdf_bytes` in `converter.py`. Then use `ingest_resume` above.
</details>

<details>
<summary>Answer</summary>

```python
def ingest_resume_for_recommendations(resume_content, resume_url, resume_id, vector_store):
    resume_raw_text = extract_text_from_pdf_bytes(resume_content)
    ingest_resume(resume_raw_text, resume_url, resume_id, vector_store)
```
</details>

### How do I test retrieval quality?

<details>
<summary>Answer</summary>

```python
def test_retrieval_quality(vector_store):
    for id, filename in enumerate(os.listdir("test/resumes")):
        with open(f"test/resumes/{filename}", "rb") as f:
            content = f.read()
        ingest_resume_for_recommendations(content, filename, resume_id=id, vector_store=vector_store)
    retriever = vector_store.as_retriever(search_kwargs={"k": 1})

    result = retriever.invoke("I am looking for an expert in AI")
    assert "Andrew" in result[0].page_content
    result = retriever.invoke("I am looking for an expert in Linux")
    assert "Linus" in result[0].page_content
    result = retriever.invoke("I am looking for an expert in Javascript")
    assert "Yuxi" in result[0].page_content
    result = retriever.invoke("I am looking for a generalist who can work in python and typescript")
    assert "Koudai" in result[0].page_content
    result = retriever.invoke("I am looking for a data journalist")
    assert "Simon" in result[0].page_content
```
</details>

### How do I specify the vector store as a dependency?

<details>
<summary>Answer</summary>

Update the parameters of `api_create_new_job_application` like this:

```python
async def api_create_new_job_application(
    job_application_form: Annotated[JobApplicationForm, Form()],
    background_tasks: BackgroundTasks,
    db: Session = Depends(get_db),
    vector_store = Depends(get_vector_store)):
```
</details>

### How do I schedule the background task?

<details>
<summary>Answer</summary>

```python
    background_tasks.add_task(ingest_resume_for_recommendations, resume_content,
                              file_url, new_job_application.id, vector_store)
```
</details>

### How do I override the real vector store with the in memory one during the API test?

<details>
<summary>Answer</summary>

Update the `client` fixture in `conftest.py`:

```python
@pytest.fixture(scope="function")
def client(db_session, vector_store):
    def override_get_db():
        yield db_session

    def override_vector_store():
        yield vector_store

    app.dependency_overrides[get_db] = override_get_db
    app.dependency_overrides[get_vector_store] = override_vector_store

    try:
        with TestClient(app) as test_client:
            yield test_client
    finally:
        app.dependency_overrides.clear()
```
</details>