
Commit 5300034

committed

GH-1831: Add auto-truncation support strategies when batching documents

Fixes: #1831

- Document auto-truncation configuration with high token limits
- Add integration tests for auto-truncation behavior
- Include Spring Boot and manual configuration examples
- Test large documents and batching scenarios

Enables proper use of embedding model auto-truncation while avoiding batching strategy exceptions.

Signed-off-by: Soby Chacko <[email protected]>

1 parent 08814be commit 5300034

File tree

3 files changed (+336, -0 lines)


spring-ai-docs/src/main/antora/modules/ROOT/pages/api/vectordbs.adoc

Lines changed: 98 additions & 0 deletions

@@ -236,6 +236,104 @@ TokenCountBatchingStrategy strategy = new TokenCountBatchingStrategy(
);
----
=== Working with Auto-Truncation

Some embedding models, such as Vertex AI text embedding, support an `auto_truncate` feature.
When enabled, this feature allows the embedding model to silently truncate text that exceeds the maximum input size and continue processing.
When disabled, the model throws an explicit error for input exceeding the limits.

When using auto-truncation with the batching strategy, you need a different configuration approach to avoid the exceptions that occur when a single document exceeds the expected token limit.

==== Configuration for Auto-Truncation

When enabling auto-truncation, configure your batching strategy with a much higher input token count than the model's actual maximum.
This prevents the batching strategy from throwing exceptions and allows the embedding model to handle truncation internally.

Here is an example configuration that uses Vertex AI with auto-truncation and a custom `BatchingStrategy`, then wires them into the `PgVectorStore`:

[source,java]
----
@Configuration
public class AutoTruncationEmbeddingConfig {

	@Bean
	public VertexAiTextEmbeddingModel vertexAiEmbeddingModel(
			VertexAiEmbeddingConnectionDetails connectionDetails) {

		VertexAiTextEmbeddingOptions options = VertexAiTextEmbeddingOptions.builder()
			.model(VertexAiTextEmbeddingOptions.DEFAULT_MODEL_NAME)
			.autoTruncate(true) // Enable auto-truncation
			.build();

		return new VertexAiTextEmbeddingModel(connectionDetails, options);
	}

	@Bean
	public BatchingStrategy batchingStrategy() {
		// Set a much higher token count than the model actually supports
		// (e.g., 132,900 when Vertex AI supports only up to 20,000)
		return new TokenCountBatchingStrategy(
				EncodingType.CL100K_BASE,
				132900, // Artificially high limit
				0.1 // 10% reserve
		);
	}

	@Bean
	public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel,
			BatchingStrategy batchingStrategy) {
		return PgVectorStore.builder(jdbcTemplate, embeddingModel)
			.batchingStrategy(batchingStrategy) // use the custom strategy
			// other properties omitted here
			.build();
	}
}
----

In this configuration:

1. The embedding model has auto-truncation enabled, allowing it to handle oversized inputs gracefully.
2. The batching strategy uses an artificially high token limit (132,900) that is much larger than the actual model limit (20,000).
3. The vector store uses the configured embedding model and the custom `BatchingStrategy` bean.

==== Why This Works

This approach works because:

1. The `TokenCountBatchingStrategy` checks if any single document exceeds the configured maximum and throws an `IllegalArgumentException` if it does.
2. By setting a very high limit in the batching strategy, we ensure that this check never fails.
3. Documents or batches that exceed the actual model limit are handled by the embedding model's auto-truncation feature.
4. The embedding model silently truncates excess tokens and continues processing.

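To make the first two points concrete, here is a self-contained sketch of the kind of pre-flight check described above. It is illustrative only, not the Spring AI source: the real `TokenCountBatchingStrategy` counts tokens with a jtokkit encoder, while this sketch approximates tokens as whitespace-separated words.

```java
import java.util.List;

// Hypothetical, simplified sketch of a pre-flight token check.
// Not the Spring AI implementation; word count stands in for a real tokenizer.
public class TokenLimitCheckSketch {

	static int estimateTokens(String text) {
		// Crude stand-in for a real token counter (illustration only)
		return text.isBlank() ? 0 : text.trim().split("\\s+").length;
	}

	static void check(List<String> documents, int maxInputTokenCount) {
		for (String doc : documents) {
			int tokens = estimateTokens(doc);
			// A single oversized document fails fast instead of being
			// sent to the embedding model
			if (tokens > maxInputTokenCount) {
				throw new IllegalArgumentException(
						"Tokens in a single document exceed the maximum: " + tokens + " > " + maxInputTokenCount);
			}
		}
	}

	public static void main(String[] args) {
		List<String> docs = List.of("short doc", "word ".repeat(30_000).trim());
		boolean lowLimitFailed = false;
		try {
			check(docs, 20_000); // realistic model limit: the check throws
		}
		catch (IllegalArgumentException ex) {
			lowLimitFailed = true;
		}
		check(docs, 132_900); // artificially high limit: passes, model truncates later
		System.out.println(lowLimitFailed); // true
	}
}
```

With the realistic 20,000-token limit the check rejects the large document; with the artificially high limit it passes, leaving truncation to the model.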
==== Best Practices

When using auto-truncation:

- Set the batching strategy's max input token count to be at least 5-10x larger than the model's actual limit
- Monitor your logs for truncation warnings from the embedding model
- Consider the implications of silent truncation on your embedding quality
- Test with sample documents to ensure truncated embeddings still meet your requirements

CAUTION: While auto-truncation prevents errors, it can result in incomplete embeddings. Important information at the end of long documents may be lost.
Consider document chunking strategies if preserving all content is critical.
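If truncation loss is unacceptable, documents can be split under the token budget before they reach the vector store. The sketch below is a hypothetical word-count-based splitter for illustration only; a real pipeline would count tokens with the same jtokkit encoding the batching strategy uses (Spring AI also ships a `TokenTextSplitter` document transformer for this purpose).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split oversized text into chunks under a token budget
// so no content is silently truncated. Word count approximates token count.
public class ChunkingSketch {

	static List<String> chunkByTokenBudget(String text, int maxTokensPerChunk) {
		String[] words = text.trim().split("\\s+");
		List<String> chunks = new ArrayList<>();
		StringBuilder current = new StringBuilder();
		int count = 0;
		for (String word : words) {
			// Close the current chunk once the budget is reached
			if (count == maxTokensPerChunk) {
				chunks.add(current.toString());
				current.setLength(0);
				count = 0;
			}
			if (count > 0) {
				current.append(' ');
			}
			current.append(word);
			count++;
		}
		if (count > 0) {
			chunks.add(current.toString());
		}
		return chunks;
	}

	public static void main(String[] args) {
		// ~25,000 "tokens": would exceed a 20,000-token model limit in one piece
		String large = "token ".repeat(25_000).trim();
		List<String> chunks = chunkByTokenBudget(large, 20_000);
		System.out.println(chunks.size()); // 2
	}
}
```

Each chunk is then embedded in full, trading one truncated embedding for several complete ones.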
==== Spring Boot Auto-Configuration

If you are using Spring Boot auto-configuration, you must provide a custom `BatchingStrategy` bean to override the default one that comes with Spring AI:

[source,java]
----
@Bean
public BatchingStrategy customBatchingStrategy() {
	// This bean will override the default BatchingStrategy
	return new TokenCountBatchingStrategy(
			EncodingType.CL100K_BASE,
			132900, // Much higher than the model's actual limit
			0.1
	);
}
----

The presence of this bean in your application context will automatically replace the default batching strategy used by all vector stores.
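One detail worth checking is the third constructor argument: the 0.1 reserve keeps head room for token-count estimation variance, so the strategy's effective budget sits below the configured maximum. Assuming the reserve is applied as `max * (1 - reserve)` (an assumption about the implementation, not confirmed here), the numbers work out as follows:

```java
// Sketch of the assumed reserve arithmetic (assumption, not the Spring AI source)
public class ReserveBudgetSketch {

	public static void main(String[] args) {
		int configuredMax = 132_900; // artificially high limit from the example above
		double reservePercentage = 0.1; // 10% reserve

		// Assumed formula: effective budget = configuredMax * (1 - reservePercentage)
		int effectiveBudget = (int) Math.round(configuredMax * (1 - reservePercentage));

		System.out.println(effectiveBudget); // 119610
	}

}
```

Even after the reserve, 119,610 remains far above the real 20,000-token model limit, so the configuration still leaves truncation entirely to the model.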
=== Custom Implementation

While `TokenCountBatchingStrategy` provides a robust default implementation, you can customize the batching strategy to fit your specific needs.
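As a starting point, the sketch below shows the shape of a custom strategy that batches by a fixed document count instead of token count. The `Doc` record and static `batch` method are self-contained stand-ins, not Spring AI classes; a real implementation would implement the `BatchingStrategy` interface (assumed here, from the usage above, to expose a single `List<List<Document>> batch(List<Document>)` method) against Spring AI's `Document` type.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a custom batching strategy that groups documents
// by fixed count. Doc is an illustrative stand-in for Spring AI's Document.
public class FixedCountBatchingSketch {

	record Doc(String id, String text) {
	}

	static List<List<Doc>> batch(List<Doc> documents, int batchSize) {
		List<List<Doc>> batches = new ArrayList<>();
		for (int i = 0; i < documents.size(); i += batchSize) {
			// Each sublist becomes one embedding call
			batches.add(documents.subList(i, Math.min(i + batchSize, documents.size())));
		}
		return batches;
	}

	public static void main(String[] args) {
		List<Doc> docs = new ArrayList<>();
		for (int i = 0; i < 7; i++) {
			docs.add(new Doc("doc-" + i, "content " + i));
		}
		List<List<Doc>> batches = batch(docs, 3); // groups of 3, 3, 1
		System.out.println(batches.size()); // 3
	}
}
```

A token-aware variant would replace the fixed count with a running token total, as `TokenCountBatchingStrategy` does.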

vector-stores/spring-ai-pgvector-store/pom.xml

Lines changed: 7 additions & 0 deletions

@@ -77,6 +77,13 @@
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.springframework.ai</groupId>
			<artifactId>spring-ai-vertex-ai-embedding</artifactId>
			<version>${project.parent.version}</version>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.springframework.ai</groupId>
PgVectorStoreAutoTruncationIT.java (new file)

Lines changed: 231 additions & 0 deletions

@@ -0,0 +1,231 @@
/*
 * Copyright 2025-2025 the original author or authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * https://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.springframework.ai.vectorstore.pgvector;

import java.util.ArrayList;
import java.util.List;

import javax.sql.DataSource;

import com.knuddels.jtokkit.api.EncodingType;
import com.zaxxer.hikari.HikariDataSource;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.condition.EnabledIfEnvironmentVariable;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;

import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.BatchingStrategy;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.TokenCountBatchingStrategy;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vertexai.embedding.VertexAiEmbeddingConnectionDetails;
import org.springframework.ai.vertexai.embedding.text.VertexAiTextEmbeddingModel;
import org.springframework.ai.vertexai.embedding.text.VertexAiTextEmbeddingOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.SpringBootConfiguration;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.boot.autoconfigure.jdbc.DataSourceAutoConfiguration;
import org.springframework.boot.autoconfigure.jdbc.DataSourceProperties;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.boot.test.context.runner.ApplicationContextRunner;
import org.springframework.context.ApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Primary;
import org.springframework.jdbc.core.JdbcTemplate;

import static org.assertj.core.api.Assertions.assertThat;
import static org.junit.Assert.assertThrows;
import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;

/**
 * Integration tests for PgVectorStore with auto-truncation enabled. Tests the behavior
 * when using artificially high token limits with Vertex AI's auto-truncation feature.
 *
 * @author Soby Chacko
 */
@Testcontainers
@EnabledIfEnvironmentVariable(named = "VERTEX_AI_GEMINI_PROJECT_ID", matches = ".*")
@EnabledIfEnvironmentVariable(named = "VERTEX_AI_GEMINI_LOCATION", matches = ".*")
public class PgVectorStoreAutoTruncationIT {

	private static final int ARTIFICIAL_TOKEN_LIMIT = 132_900;

	@Container
	@SuppressWarnings("resource")
	static PostgreSQLContainer<?> postgresContainer = new PostgreSQLContainer<>(PgVectorImage.DEFAULT_IMAGE)
		.withUsername("postgres")
		.withPassword("postgres");

	private final ApplicationContextRunner contextRunner = new ApplicationContextRunner()
		.withUserConfiguration(PgVectorStoreAutoTruncationIT.TestApplication.class)
		.withPropertyValues("test.spring.ai.vectorstore.pgvector.distanceType=COSINE_DISTANCE",

				// JdbcTemplate configuration
				String.format("app.datasource.url=jdbc:postgresql://%s:%d/%s", postgresContainer.getHost(),
						postgresContainer.getMappedPort(5432), "postgres"),
				"app.datasource.username=postgres", "app.datasource.password=postgres",
				"app.datasource.type=com.zaxxer.hikari.HikariDataSource");

	private static void dropTable(ApplicationContext context) {
		JdbcTemplate jdbcTemplate = context.getBean(JdbcTemplate.class);
		jdbcTemplate.execute("DROP TABLE IF EXISTS vector_store");
	}

	@Test
	public void testAutoTruncationWithLargeDocument() {
		this.contextRunner.run(context -> {
			VectorStore vectorStore = context.getBean(VectorStore.class);

			// Test with a document that exceeds normal token limits but is within our
			// artificially high limit
			String largeContent = "This is a test document. ".repeat(5000); // ~25,000 tokens
			Document largeDocument = new Document(largeContent);
			largeDocument.getMetadata().put("test", "auto-truncation");

			// This should not throw an exception due to our high token limit in
			// BatchingStrategy
			assertDoesNotThrow(() -> vectorStore.add(List.of(largeDocument)));

			// Verify the document was stored
			List<Document> results = vectorStore
				.similaritySearch(SearchRequest.builder().query("test document").topK(1).build());

			assertThat(results).hasSize(1);
			Document resultDoc = results.get(0);
			assertThat(resultDoc.getMetadata()).containsEntry("test", "auto-truncation");

			// Test with multiple large documents to ensure batching still works
			List<Document> largeDocs = new ArrayList<>();
			for (int i = 0; i < 5; i++) {
				Document doc = new Document("Large content " + i + " ".repeat(4000));
				doc.getMetadata().put("batch", String.valueOf(i));
				largeDocs.add(doc);
			}

			assertDoesNotThrow(() -> vectorStore.add(largeDocs));

			// Verify all documents were processed
			List<Document> batchResults = vectorStore
				.similaritySearch(SearchRequest.builder().query("Large content").topK(5).build());

			assertThat(batchResults).hasSizeGreaterThanOrEqualTo(5);

			// Clean up
			vectorStore.delete(List.of(largeDocument.getId()));
			largeDocs.forEach(doc -> vectorStore.delete(List.of(doc.getId())));

			dropTable(context);
		});
	}

	@Test
	public void testExceedingArtificialLimit() {
		this.contextRunner.run(context -> {
			BatchingStrategy batchingStrategy = context.getBean(BatchingStrategy.class);

			// Create a document that exceeds even our artificially high limit
			String massiveContent = "word ".repeat(150000); // ~150,000 tokens (exceeds 132,900)
			Document massiveDocument = new Document(massiveContent);

			// This should throw an exception as it exceeds our configured limit
			assertThrows(IllegalArgumentException.class, () -> {
				batchingStrategy.batch(List.of(massiveDocument));
			});

			dropTable(context);
		});
	}

	@SpringBootConfiguration
	@EnableAutoConfiguration(exclude = { DataSourceAutoConfiguration.class })
	public static class TestApplication {

		@Value("${test.spring.ai.vectorstore.pgvector.distanceType}")
		PgVectorStore.PgDistanceType distanceType;

		@Value("${test.spring.ai.vectorstore.pgvector.initializeSchema:true}")
		boolean initializeSchema;

		@Value("${test.spring.ai.vectorstore.pgvector.idType:UUID}")
		PgVectorStore.PgIdType idType;

		@Bean
		public VectorStore vectorStore(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel,
				BatchingStrategy batchingStrategy) {
			return PgVectorStore.builder(jdbcTemplate, embeddingModel)
				.dimensions(PgVectorStore.INVALID_EMBEDDING_DIMENSION)
				.batchingStrategy(batchingStrategy)
				.idType(this.idType)
				.distanceType(this.distanceType)
				.initializeSchema(this.initializeSchema)
				.indexType(PgVectorStore.PgIndexType.HNSW)
				.removeExistingVectorStoreTable(true)
				.build();
		}

		@Bean
		public JdbcTemplate myJdbcTemplate(DataSource dataSource) {
			return new JdbcTemplate(dataSource);
		}

		@Bean
		@Primary
		@ConfigurationProperties("app.datasource")
		public DataSourceProperties dataSourceProperties() {
			return new DataSourceProperties();
		}

		@Bean
		public HikariDataSource dataSource(DataSourceProperties dataSourceProperties) {
			return dataSourceProperties.initializeDataSourceBuilder().type(HikariDataSource.class).build();
		}

		@Bean
		public VertexAiTextEmbeddingModel vertexAiEmbeddingModel(VertexAiEmbeddingConnectionDetails connectionDetails) {
			VertexAiTextEmbeddingOptions options = VertexAiTextEmbeddingOptions.builder()
				.model(VertexAiTextEmbeddingOptions.DEFAULT_MODEL_NAME)
				// Although this might be the default in Vertex, we are explicitly setting
				// this to true to ensure that auto truncate is turned on, as this is
				// crucial for the verifications in this test suite.
				.autoTruncate(true)
				.build();

			return new VertexAiTextEmbeddingModel(connectionDetails, options);
		}

		@Bean
		public VertexAiEmbeddingConnectionDetails connectionDetails() {
			return VertexAiEmbeddingConnectionDetails.builder()
				.projectId(System.getenv("VERTEX_AI_GEMINI_PROJECT_ID"))
				.location(System.getenv("VERTEX_AI_GEMINI_LOCATION"))
				.build();
		}

		@Bean
		BatchingStrategy pgVectorStoreBatchingStrategy() {
			return new TokenCountBatchingStrategy(EncodingType.CL100K_BASE, ARTIFICIAL_TOKEN_LIMIT, 0.1);
		}

	}

}
