# 📊 Embedding Model & PDF Chunking Analysis

## 🔍 Where is the EmbeddingModel Used?

### 1. **VectorStoreFactory.java** (Line 26)
```java
@RequiredArgsConstructor
public class VectorStoreFactory {
    private final QdrantClient qdrantClient;
    private final TopicConfig topicConfig;
    private final EmbeddingModel embeddingModel; // ← Injected by Spring
}
```

**Purpose:** The `EmbeddingModel` is injected into the factory and used to create vector stores.

---

### 2. **Creating QdrantVectorStore** (VectorStoreFactory.java, Lines 46-48)
```java
QdrantVectorStore vectorStore = QdrantVectorStore.builder(qdrantClient, embeddingModel)
        .collectionName(collectionName)
        .build();
```

**What happens here:**
- The `embeddingModel` is passed to `QdrantVectorStore.builder()`
- This embedding model is used **internally** by Spring AI to convert text into vectors

---

### 3. **When Embeddings are Generated** (Automatic)

#### A) **During Document Upload** (TopicDocumentService.java, Line 107)
```java
VectorStore topicVectorStore = vectorStoreFactory.getVectorStore(topic);
topicVectorStore.add(splitDocuments); // ← Embeddings generated HERE!
```

**Process:**
1. You upload a PDF
2. The PDF is chunked (see below)
3. `vectorStore.add()` is called
4. **Spring AI automatically calls the EmbeddingModel** to convert each text chunk into a vector
5. Vectors are stored in Qdrant along with their metadata

#### B) **During Search Queries** (TopicRagService.java, Line 53)
```java
List<Document> relevantDocs = topicVectorStore.similaritySearch(searchRequest);
```

**Process:**
1. User sends a query: "What is penetration testing?"
2. **Spring AI automatically calls the EmbeddingModel** to convert the query into a vector
3. Qdrant searches for similar vectors (cosine similarity)
4. Returns the most relevant document chunks

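For reference, a `searchRequest` like the one above is typically built with Spring AI's `SearchRequest` builder. A minimal sketch, assuming a recent Spring AI version; the query text, `topK`, and threshold values are illustrative, not the project's actual settings:

```java
import org.springframework.ai.vectorstore.SearchRequest;

// Hypothetical example - tune topK/threshold to your needs.
SearchRequest searchRequest = SearchRequest.builder()
        .query("What is penetration testing?")
        .topK(5)                  // return the 5 most similar chunks
        .similarityThreshold(0.7) // drop weak matches
        .build();
```
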
---

## ✂️ Where PDFs Get Chunked

### Location: **TopicDocumentService.java**

#### **Step 1: Read PDF** (Lines 73-75)
```java
ByteArrayResource resource = new ByteArrayResource(pdfBytes);
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(resource);
List<Document> documents = pdfReader.get(); // ← Reads PDF page by page
```
- Uses **Spring AI's PagePdfDocumentReader**
- Extracts text from each PDF page
- Creates one `Document` object per page

---

#### **Step 2: Split into Chunks** (Lines 76-77)
```java
TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> splitDocuments = splitter.split(documents); // ← CHUNKING HAPPENS HERE!
```

**What TokenTextSplitter does:**
- Takes the page-level documents
- Splits each one into **smaller chunks** based on token count
- Default configuration (customizable; see the sketch after the example below):
  - **Chunk size:** 800 tokens (roughly 3,000 characters)
  - **Minimum chunk size:** 350 characters
  - **No overlap:** unlike some splitters, `TokenTextSplitter` does not overlap adjacent chunks

**Example:**
```
Original PDF (3 pages → 3 page-level Documents):
├── Page 1: 2000 tokens
├── Page 2: 1800 tokens
└── Page 3: 1500 tokens

After TokenTextSplitter (each page Document is split independently):
├── Chunk 0: 800 tokens (Page 1)
├── Chunk 1: 800 tokens (Page 1)
├── Chunk 2: 400 tokens (Page 1, remainder)
├── Chunk 3: 800 tokens (Page 2)
├── Chunk 4: 800 tokens (Page 2)
├── Chunk 5: 200 tokens (Page 2, remainder)
├── Chunk 6: 800 tokens (Page 3)
└── Chunk 7: 700 tokens (Page 3, remainder)

Total: 8 chunks; chunks never span pages because each page-level Document is split on its own
```
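
If the defaults don't fit your documents, the splitter can be tuned. A minimal sketch, assuming Spring AI's five-argument `TokenTextSplitter` constructor; the values are illustrative, not the project's settings:

```java
import java.util.List;
import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Hypothetical tuning - pick values that match your content.
TokenTextSplitter splitter = new TokenTextSplitter(
        512,   // defaultChunkSize: target tokens per chunk
        350,   // minChunkSizeChars: avoid cutting below this many characters
        5,     // minChunkLengthToEmbed: drop chunks shorter than this
        10000, // maxNumChunks: safety cap per document
        true   // keepSeparator: keep line separators in chunk text
);
List<Document> splitDocuments = splitter.split(documents);
```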

---

#### **Step 3: Enrich Metadata** (Lines 79-103)
```java
splitDocuments.forEach(doc -> {
    Map<String, Object> metadata = doc.getMetadata();
    metadata.put("docId", docId);
    metadata.put("filename", filename);
    metadata.put("topic", topic); // ← Topic for routing
    metadata.put("chunkIndex", chunkIndex.getAndIncrement());
    metadata.put("title", pdfMetadata.get("title"));
    metadata.put("author", pdfMetadata.get("author"));
    metadata.put("publishingYear", pdfMetadata.get("publishingYear"));
});
```

Each chunk now has:
- Document ID
- Original filename
- Topic (pentesting, iot, etc.)
- Chunk index (for ordering)
- PDF metadata (title, author, year)

---

#### **Step 4: Convert to Vectors & Store** (Lines 106-107)
```java
VectorStore topicVectorStore = vectorStoreFactory.getVectorStore(topic);
topicVectorStore.add(splitDocuments); // ← Embeddings created & stored in Qdrant
```

**Behind the scenes:**
```
For each chunk:
1. Text: "Penetration testing involves..."
2. EmbeddingModel.embed(text) → [0.234, -0.123, 0.456, ..., 0.789] (768 dimensions for nomic-embed-text)
3. Store in Qdrant:
   - Vector: [0.234, -0.123, ...]
   - Metadata: {filename: "pentest.pdf", topic: "pentesting", chunkIndex: 0}
```
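
To make step 2 concrete, this is roughly the call that `vectorStore.add()` issues per chunk. A sketch, assuming a recent Spring AI version where `EmbeddingModel.embed(String)` returns a `float[]`; you never write this yourself in this codebase:

```java
// Illustrative only - Spring AI performs this inside vectorStore.add().
float[] vector = embeddingModel.embed("Penetration testing involves...");
System.out.println("Dimensions: " + vector.length); // e.g. 768 for nomic-embed-text
```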

---

## 🔄 Complete Flow Diagram

```
PDF Upload (pentest.pdf, 50 pages)
    ↓
[1] PagePdfDocumentReader.get()
    → Extract text from 50 pages
    → Create 50 Document objects
    ↓
[2] TokenTextSplitter.split()
    → Split 50 pages into ~200 chunks (depending on content)
    → Each chunk: up to ~800 tokens, no overlap
    ↓
[3] Enrich with metadata
    → Add docId, filename, topic, chunkIndex, author, year
    ↓
[4] vectorStore.add(splitDocuments)
    → For each of the ~200 chunks:
       a) EmbeddingModel converts text → vector (e.g. 768 dimensions)
       b) Store vector + metadata in Qdrant collection
    ↓
[5] Indexed! Ready for RAG queries
```
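
Qdrant handles the similarity ranking internally; purely for intuition, this is what a cosine-similarity score between two embedding vectors computes (a sketch, not code from the project):

```java
// Cosine similarity: 1.0 = same direction, ~0.0 = unrelated, -1.0 = opposite.
static double cosineSimilarity(float[] a, float[] b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```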

---

## 🎯 RAG Query Flow

```
User Query: "What are the phases of penetration testing?"
    ↓
[1] EmbeddingModel.embed(query)
    → Convert query to vector: [0.123, -0.456, ...]
    ↓
[2] Qdrant similarity search in "pentesting" collection
    → Find top 5 most similar vectors (cosine similarity)
    → Return corresponding text chunks + metadata
    ↓
[3] Build context from retrieved chunks
    → Combine chunk texts with metadata
    ↓
[4] Send to Ollama LLM
    → Prompt: "Answer based on this context: [chunks]... Question: [query]"
    ↓
[5] LLM generates answer
    → Uses retrieved chunks as knowledge base
    → Returns synthesized answer
```
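
Steps [3] and [4] in code form: a minimal sketch assuming a Spring AI `ChatClient` wired to Ollama. The `chatClient` bean, prompt wording, and separator are illustrative assumptions, not taken from TopicRagService:

```java
import java.util.stream.Collectors;
import org.springframework.ai.document.Document;

// Hypothetical RAG assembly mirroring steps [3]-[5] above.
String context = relevantDocs.stream()
        .map(Document::getText) // getContent() in older Spring AI versions
        .collect(Collectors.joining("\n---\n"));

String answer = chatClient.prompt()
        .system("Answer based only on this context:\n" + context)
        .user("What are the phases of penetration testing?")
        .call()
        .content();
```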

---

## 📦 Key Components Summary

| Component | Purpose | Location |
|-----------|---------|----------|
| **EmbeddingModel** | Converts text → vectors | Injected by Spring AI |
| **PagePdfDocumentReader** | Extracts text from PDF | TopicDocumentService.java:73-75 |
| **TokenTextSplitter** | Chunks text into smaller pieces | TopicDocumentService.java:76-77 |
| **QdrantVectorStore** | Stores vectors + metadata | VectorStoreFactory.java:46-48 |
| **VectorStore.add()** | Triggers embedding generation | TopicDocumentService.java:107 |
| **VectorStore.similaritySearch()** | Queries with embeddings | TopicRagService.java:53 |

---

## 🔧 Where is the Embedding Model Configured?

### **Answer: It's Auto-Configured by Spring AI!** ✨

You don't explicitly configure the `EmbeddingModel` bean; Spring Boot creates it automatically via **auto-configuration**.

### **How It Works:**

#### 1. **Maven Dependency** (pom.xml)
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
```
This starter includes:
- `OllamaEmbeddingModel` implementation
- `OllamaChatModel` implementation
- Auto-configuration classes

#### 2. **Configuration in application.yaml**
```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434   # ← Ollama server URL
      chat:
        options:
          model: llama2                  # ← Default model for chat
```

**Important:** The embedding model in Ollama uses **the same base-url** but typically a **different model** optimized for embeddings.

#### 3. **Spring AI Auto-Configuration**
When your application starts:
```
Spring Boot detects spring-ai-starter-model-ollama
    ↓
Auto-configures an OllamaEmbeddingModel bean
    ↓
Uses the configuration from spring.ai.ollama.base-url
    ↓
Falls back to a default embedding model (version-dependent, e.g. "nomic-embed-text" or "mxbai-embed-large")
    ↓
Bean is injected into VectorStoreFactory
```

#### 4. **Which Embedding Model is Actually Used?**

Unless configured otherwise, Spring AI's Ollama starter falls back to a default embedding model, typically one of:
- **`nomic-embed-text`** (most common, 768 dimensions)
- **`mxbai-embed-large`** (alternative, 1024 dimensions)
- **`all-minilm`** (smaller, 384 dimensions)

To check which embedding models are available locally:
```bash
# List locally installed models
ollama list

# Pull a specific embedding model if needed
ollama pull nomic-embed-text
```

### **How to Change the Embedding Model?**

#### Option 1: **Add explicit configuration in application.yaml** (Recommended)
```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama2            # For chat/text generation
      embedding:
        options:
          model: nomic-embed-text  # For embeddings (explicit)
```

#### Option 2: **Create a custom @Bean** (Advanced)
Create a new config class (builder method names vary slightly between Spring AI versions; this follows the 1.0.x API):
```java
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.ollama.OllamaEmbeddingModel;
import org.springframework.ai.ollama.api.OllamaApi;
import org.springframework.ai.ollama.api.OllamaOptions;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EmbeddingConfig {

    @Bean
    public EmbeddingModel embeddingModel() {
        OllamaApi ollamaApi = OllamaApi.builder()
                .baseUrl("http://localhost:11434")
                .build();
        return OllamaEmbeddingModel.builder()
                .ollamaApi(ollamaApi)
                .defaultOptions(OllamaOptions.builder()
                        .model("nomic-embed-text") // Specify the embedding model
                        .build())
                .build();
    }
}
```

### **Current Setup in Your Project:**

✅ **Dependency:** `spring-ai-starter-model-ollama` in pom.xml
✅ **Base URL:** `http://localhost:11434` in application.yaml
✅ **Auto-configured:** `EmbeddingModel` bean created automatically
✅ **Injected:** Into `VectorStoreFactory` via `@RequiredArgsConstructor`
⚠️ **Embedding Model:** Uses Ollama's default (likely `nomic-embed-text`)

### **To Verify What's Being Used:**

Enable debug logging for `org.springframework.ai` and watch the startup and request logs for the embedding model name, or run `ollama ps` while a request is in flight to see which model Ollama has loaded.

Or make a test API call to see the model in action! A quick way to do that from inside the app is sketched below.
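
A minimal sketch of such a check, assuming the auto-configured `EmbeddingModel` bean (`dimensions()` is part of Spring AI's `EmbeddingModel` interface; the runner itself is a hypothetical addition, not part of the project):

```java
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class EmbeddingSmokeTest {

    // Prints the embedding dimensionality at startup (768 suggests nomic-embed-text).
    @Bean
    CommandLineRunner verifyEmbeddingModel(EmbeddingModel embeddingModel) {
        return args -> System.out.println(
                "Embedding dimensions: " + embeddingModel.dimensions());
    }
}
```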

---

## 💡 Key Takeaways

1. **You never directly call the embedding model**; Spring AI handles it internally
2. **Embeddings are generated in two places:**
   - When storing documents (`vectorStore.add()`)
   - When searching (`similaritySearch()`)
3. **Chunking happens in TopicDocumentService** using `TokenTextSplitter`
4. **Each chunk becomes a vector** stored in Qdrant
5. **Metadata travels with vectors** for context in answers

This is a clean, well-architected RAG system! 🚀