Commit b6135ea (parent ce87ab1)
Author: beaglebyte

Add embedding model, application.yml
Add frontend to static served by spring documentations

7 files changed: +933 -1 lines changed

# 📊 Embedding Model & PDF Chunking Analysis

## 🔍 Where is the EmbeddingModel Used?

### 1. **VectorStoreFactory.java** (Line 26)
```java
@RequiredArgsConstructor
public class VectorStoreFactory {
    private final QdrantClient qdrantClient;
    private final TopicConfig topicConfig;
    private final EmbeddingModel embeddingModel; // ← Injected by Spring
}
```

**Purpose:** The `EmbeddingModel` is injected into the factory and used to create vector stores.

---
### 2. **Creating QdrantVectorStore** (VectorStoreFactory.java, Lines 46-48)
```java
QdrantVectorStore vectorStore = QdrantVectorStore.builder(qdrantClient, embeddingModel)
        .collectionName(collectionName)
        .build();
```

**What happens here:**
- The `embeddingModel` is passed to `QdrantVectorStore.builder()`
- This embedding model is used **internally** by Spring AI to convert text into vectors

---
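To make the "used internally" part concrete, here is a toy plain-Java sketch of the same relationship. The `EmbeddingModel` interface and `InMemoryVectorStore` class below are simplified stand-ins invented for illustration, not Spring AI's real types: the point is only that the store calls the injected model itself on every `add()`.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for an embedding model: text in, vector out
interface EmbeddingModel {
    float[] embed(String text);
}

// Toy vector store: calls the injected model on every add(), which is
// the same pattern QdrantVectorStore follows internally
class InMemoryVectorStore {
    private final EmbeddingModel embeddingModel;
    private final List<float[]> vectors = new ArrayList<>();
    private final List<String> texts = new ArrayList<>();

    InMemoryVectorStore(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    void add(List<String> chunks) {
        for (String chunk : chunks) {
            vectors.add(embeddingModel.embed(chunk)); // ← embedding happens here
            texts.add(chunk);
        }
    }

    int size() {
        return vectors.size();
    }
}

public class VectorStoreSketch {
    public static void main(String[] args) {
        // Fake model: folds character codes into a tiny 4-dimensional vector
        EmbeddingModel model = text -> {
            float[] v = new float[4];
            for (int i = 0; i < text.length(); i++) v[i % 4] += text.charAt(i);
            return v;
        };
        InMemoryVectorStore store = new InMemoryVectorStore(model);
        store.add(List.of("chunk one", "chunk two"));
        System.out.println(store.size()); // 2
    }
}
```

The calling code never touches the model directly; it only hands chunks to the store, which mirrors how the project's services use `vectorStore.add()`.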
### 3. **When Embeddings Are Generated** (Automatic)

#### A) **During Document Upload** (TopicDocumentService.java, Line 107)
```java
VectorStore topicVectorStore = vectorStoreFactory.getVectorStore(topic);
topicVectorStore.add(splitDocuments); // ← Embeddings generated HERE!
```

**Process:**
1. You upload a PDF
2. The PDF is chunked (see below)
3. `vectorStore.add()` is called
4. **Spring AI automatically calls the EmbeddingModel** to convert each text chunk into a vector
5. Vectors are stored in Qdrant with metadata

#### B) **During Search Queries** (TopicRagService.java, Line 53)
```java
List<Document> relevantDocs = topicVectorStore.similaritySearch(searchRequest);
```

**Process:**
1. A user sends a query: "What is penetration testing?"
2. **Spring AI automatically calls the EmbeddingModel** to convert the query into a vector
3. Qdrant searches for similar vectors (cosine similarity)
4. The most relevant document chunks are returned

---
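The similarity step in [3] can be illustrated with a minimal cosine-similarity ranking in plain Java. This is a conceptual sketch of what Qdrant does for each query vector, not Qdrant's actual implementation (which uses approximate nearest-neighbour indexes rather than a linear scan):

```java
public class CosineSketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|)
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Index of the stored vector most similar to the query vector
    public static int mostSimilar(double[] query, double[][] stored) {
        int best = 0;
        for (int i = 1; i < stored.length; i++) {
            if (cosine(query, stored[i]) > cosine(query, stored[best])) best = i;
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] stored = {
            {1.0, 0.0, 0.0},   // chunk 0's embedding
            {0.0, 1.0, 0.0},   // chunk 1's embedding
            {0.7, 0.7, 0.0},   // chunk 2's embedding
        };
        double[] query = {0.9, 0.1, 0.0}; // points almost the same way as chunk 0
        System.out.println(mostSimilar(query, stored)); // 0
    }
}
```

Because cosine similarity compares direction rather than magnitude, chunks whose embeddings "point the same way" as the query embedding rank highest, regardless of text length.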
## ✂️ Where PDFs Get Chunked

### Location: **TopicDocumentService.java**

#### **Step 1: Read PDF** (Lines 73-75)
```java
ByteArrayResource resource = new ByteArrayResource(pdfBytes);
PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(resource);
List<Document> documents = pdfReader.get(); // ← Reads the PDF page by page
```
- Uses **Spring AI's `PagePdfDocumentReader`**
- Extracts text from each PDF page
- Creates one `Document` object per page

---
#### **Step 2: Split into Chunks** (Lines 76-77)
```java
TokenTextSplitter splitter = new TokenTextSplitter();
List<Document> splitDocuments = splitter.split(documents); // ← CHUNKING HAPPENS HERE!
```

**What TokenTextSplitter does:**
- Takes the page-level documents
- Splits them into **smaller chunks** based on token count
- Default configuration (you can customize):
  - **Chunk size:** ~800 tokens (~3000 characters)
  - **Chunk overlap:** ~400 tokens (prevents losing context at boundaries)

**Example:**
```
Original PDF (3 pages):
├── Page 1: 2000 tokens
├── Page 2: 1800 tokens
└── Page 3: 1500 tokens

After TokenTextSplitter:
├── Chunk 0: 800 tokens (from Page 1, start)
├── Chunk 1: 800 tokens (from Page 1, middle) [400-token overlap with Chunk 0]
├── Chunk 2: 800 tokens (from Page 1 end + Page 2 start)
├── Chunk 3: 800 tokens (from Page 2)
├── Chunk 4: 800 tokens (from Page 2 end + Page 3 start)
└── Chunk 5: 700 tokens (from Page 3, end)

Total: 6 chunks with overlaps for context preservation
```

---
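The overlap behaviour in the example above can be sketched as a sliding window over tokens in plain Java. This is a conceptual illustration only: the chunk size, the overlap, and treating whitespace-separated words as "tokens" are simplifying assumptions, not Spring AI's actual tokenizer or splitter internals.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkSketch {

    // Sliding window: each chunk starts (size - overlap) tokens after the
    // previous one, so consecutive chunks share `overlap` tokens
    public static List<List<String>> split(List<String> tokens, int size, int overlap) {
        if (size <= overlap) throw new IllegalArgumentException("size must exceed overlap");
        List<List<String>> chunks = new ArrayList<>();
        int step = size - overlap;
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + size, tokens.size());
            chunks.add(tokens.subList(start, end));
            if (end == tokens.size()) break; // last (possibly shorter) chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 20; i++) tokens.add("t" + i);

        // size 8, overlap 4 → windows [0..8), [4..12), [8..16), [12..20)
        List<List<String>> chunks = split(tokens, 8, 4);
        System.out.println(chunks.size()); // 4
    }
}
```

The half-size overlap means any sentence cut at a chunk boundary appears whole in the neighbouring chunk, which is exactly the "context preservation" the example describes.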
#### **Step 3: Enrich Metadata** (Lines 79-103)
```java
splitDocuments.forEach(doc -> {
    Map<String, Object> metadata = doc.getMetadata();
    metadata.put("docId", docId);
    metadata.put("filename", filename);
    metadata.put("topic", topic); // ← Topic for routing
    metadata.put("chunkIndex", chunkIndex.getAndIncrement());
    metadata.put("title", pdfMetadata.get("title"));
    metadata.put("author", pdfMetadata.get("author"));
    metadata.put("publishingYear", pdfMetadata.get("publishingYear"));
});
```

Each chunk now has:
- Document ID
- Original filename
- Topic (pentesting, iot, etc.)
- Chunk index (for ordering)
- PDF metadata (title, author, year)

---
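What this metadata buys you can be shown with a small plain-Java sketch: once every chunk carries a `topic` key, selecting only the chunks for one topic is a simple filter. The `Chunk` record and `byTopic` helper below are hypothetical illustrations (mirroring Spring AI's `Document` text-plus-metadata shape), not code from the project:

```java
import java.util.List;
import java.util.Map;

public class MetadataSketch {

    // A chunk is just text plus its metadata map, mirroring the
    // text + metadata shape of Spring AI's Document
    record Chunk(String text, Map<String, Object> metadata) {}

    // Keep only chunks tagged with the requested topic: the effect
    // that topic-based routing achieves
    public static List<Chunk> byTopic(List<Chunk> chunks, String topic) {
        return chunks.stream()
                .filter(c -> topic.equals(c.metadata().get("topic")))
                .toList();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = List.of(
            new Chunk("SQL injection basics", Map.of("topic", "pentesting", "chunkIndex", 0)),
            new Chunk("MQTT broker setup",    Map.of("topic", "iot",        "chunkIndex", 0))
        );
        System.out.println(byTopic(chunks, "pentesting").size()); // 1
    }
}
```

In the real system this selection happens by routing each topic to its own Qdrant collection via `VectorStoreFactory`, so the filter runs on the database side rather than in application code.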
#### **Step 4: Convert to Vectors & Store** (Lines 106-107)
```java
VectorStore topicVectorStore = vectorStoreFactory.getVectorStore(topic);
topicVectorStore.add(splitDocuments); // ← Embeddings created & stored in Qdrant
```

**Behind the scenes:**
```
For each chunk:
  1. Text: "Penetration testing involves..."
  2. EmbeddingModel.embed(text) → [0.234, -0.123, 0.456, ..., 0.789] (768 dimensions)
  3. Store in Qdrant:
     - Vector: [0.234, -0.123, ...]
     - Metadata: {filename: "pentest.pdf", topic: "pentesting", chunkIndex: 0}
```

---
## 🔄 Complete Flow Diagram

```
PDF Upload (pentest.pdf, 50 pages)

[1] PagePdfDocumentReader.get()
    → Extract text from 50 pages
    → Create 50 Document objects

[2] TokenTextSplitter.split()
    → Split 50 pages into ~200 chunks (depending on content)
    → Each chunk: ~800 tokens with 400-token overlap

[3] Enrich with metadata
    → Add docId, filename, topic, chunkIndex, author, year

[4] vectorStore.add(splitDocuments)
    → For each of the 200 chunks:
       a) EmbeddingModel converts text → vector (768 dimensions)
       b) Store vector + metadata in the Qdrant collection

[5] Indexed! Ready for RAG queries
```

---
## 🎯 RAG Query Flow

```
User Query: "What are the phases of penetration testing?"

[1] EmbeddingModel.embed(query)
    → Convert the query to a vector: [0.123, -0.456, ...]

[2] Qdrant similarity search in the "pentesting" collection
    → Find the top 5 most similar vectors (cosine similarity)
    → Return the corresponding text chunks + metadata

[3] Build context from the retrieved chunks
    → Combine chunk texts with metadata

[4] Send to the Ollama LLM
    → Prompt: "Answer based on this context: [chunks]... Question: [query]"

[5] LLM generates the answer
    → Uses the retrieved chunks as its knowledge base
    → Returns a synthesized answer
```

---
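Steps [3] and [4] above, stitching the retrieved chunks into the prompt, can be sketched in plain Java. The template wording here is illustrative (it echoes the diagram, not the project's actual prompt), and `buildPrompt` is a hypothetical helper:

```java
import java.util.List;

public class PromptSketch {

    // Concatenate retrieved chunks into a context block, then append the question
    public static String buildPrompt(List<String> chunks, String question) {
        StringBuilder sb = new StringBuilder("Answer based on this context:\n");
        for (int i = 0; i < chunks.size(); i++) {
            sb.append("[chunk ").append(i).append("] ").append(chunks.get(i)).append('\n');
        }
        sb.append("\nQuestion: ").append(question);
        return sb.toString();
    }

    public static void main(String[] args) {
        String prompt = buildPrompt(
            List.of("Recon is the first phase.", "Exploitation follows scanning."),
            "What are the phases of penetration testing?");
        System.out.println(prompt);
    }
}
```

Grounding the LLM in retrieved chunks this way is what makes the answer come from the uploaded PDFs rather than from the model's training data alone.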
## 📦 Key Components Summary

| Component | Purpose | Location |
|-----------|---------|----------|
| **EmbeddingModel** | Converts text ↔ vectors | Injected by Spring AI |
| **PagePdfDocumentReader** | Extracts text from PDF | TopicDocumentService.java:73-75 |
| **TokenTextSplitter** | Chunks text into smaller pieces | TopicDocumentService.java:76-77 |
| **QdrantVectorStore** | Stores vectors + metadata | VectorStoreFactory.java:46-48 |
| **VectorStore.add()** | Triggers embedding generation | TopicDocumentService.java:107 |
| **VectorStore.similaritySearch()** | Queries with embeddings | TopicRagService.java:53 |

---
## 🔧 Where is the Embedding Model Configured?

### **Answer: It's Auto-Configured by Spring AI!**

You don't explicitly configure the `EmbeddingModel` bean - Spring Boot creates it automatically via **auto-configuration**.

### **How It Works:**

#### 1. **Maven Dependency** (pom.xml)
```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
```
This starter includes:
- The `OllamaEmbeddingModel` implementation
- The `OllamaChatModel` implementation
- Auto-configuration classes
#### 2. **Configuration in application.yaml**
```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434  # ← Ollama server URL
      model: llama2                     # ← Default model for chat
```

**Important:** The embedding model in Ollama uses **the same base-url** but typically a **different model** optimized for embeddings.
#### 3. **Spring AI Auto-Configuration**
When your application starts:
```
Spring Boot detects spring-ai-starter-model-ollama

Auto-configures the OllamaEmbeddingModel bean

Uses configuration from spring.ai.ollama.base-url

By default, uses model: "nomic-embed-text" or "mxbai-embed-large"

Bean is injected into VectorStoreFactory
```
#### 4. **Which Embedding Model is Actually Used?**

Spring AI's Ollama starter uses **Ollama's default embedding model**, typically:
- **`nomic-embed-text`** (most common, 768 dimensions)
- **`mxbai-embed-large`** (alternative, 1024 dimensions)
- **`all-minilm`** (smaller, 384 dimensions)

To check which models your Ollama instance has:
```bash
# List locally available models
ollama list

# Pull a specific embedding model if needed
ollama pull nomic-embed-text
```
### **How to Change the Embedding Model?**

#### Option 1: **Add explicit configuration in application.yaml** (Recommended)
```yaml
spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        model: llama2             # For chat/text generation
      embedding:
        model: nomic-embed-text   # For embeddings (explicit)
```
#### Option 2: **Create a custom @Bean** (Advanced)
Create a new config class:
```java
@Configuration
public class EmbeddingConfig {

    @Bean
    public EmbeddingModel embeddingModel() {
        // Note: the exact builder method names vary between Spring AI
        // versions; check the API of the version you have on the classpath
        return OllamaEmbeddingModel.builder()
                .withBaseUrl("http://localhost:11434")
                .withModel("nomic-embed-text") // Specify the embedding model
                .build();
    }
}
```
### **Current Setup in Your Project:**

✅ **Dependency:** `spring-ai-starter-model-ollama` in pom.xml
✅ **Base URL:** `http://localhost:11434` in application.yaml
✅ **Auto-configured:** `EmbeddingModel` bean created automatically
✅ **Injected:** Into `VectorStoreFactory` via `@RequiredArgsConstructor`
⚠️ **Embedding Model:** Uses Ollama's default (likely `nomic-embed-text`)

### **To Verify What's Being Used:**

Check your application logs at startup for a line similar to:
```
DEBUG org.springframework.ai.ollama - Using Ollama embedding model: nomic-embed-text
```

Or make a test API call to see the model in action!

---
## 💡 Key Takeaways

1. **You never directly call the embedding model** - Spring AI handles it internally
2. **Embeddings are generated at two points:**
   - When storing documents (`vectorStore.add()`)
   - When searching (`similaritySearch()`)
3. **Chunking happens in TopicDocumentService** using `TokenTextSplitter`
4. **Each chunk becomes a vector** stored in Qdrant
5. **Metadata travels with vectors** for context in answers

This is a clean, well-architected RAG system! 🚀