
Commit 653124b

Authored by x22x22, with DarkLight1337 and maxdebayser

[Frontend] Add chunked processing to handle long inputs in embedding models (#22280)

Signed-off-by: x22x22 <[email protected]>
Signed-off-by: Kdump <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

1 parent 0b1bdac, commit 653124b

File tree: 6 files changed, +1603 -3 lines changed

Lines changed: 186 additions & 0 deletions
@@ -0,0 +1,186 @@

# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to handle long text embedding that exceeds the model's maximum context length.

## 🚀 Quick Start

### Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```

### Test Long Text Embedding

Run the comprehensive test client:

```bash
python client.py
```

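You can also send a request directly with the OpenAI Python client. The snippet below is a minimal sketch, assuming the default `PORT`, `API_KEY`, and `MODEL_NAME` values listed in the configuration table further down; `client.py` may do more than this.

```python
# Minimal sketch of a direct embeddings request against the server started by
# service.sh; assumes the default PORT (31090), API_KEY ("EMPTY"), and
# MODEL_NAME ("intfloat/multilingual-e5-large") shown in the table below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

# A long input is sent as-is; the server chunks it internally when it
# exceeds the model's native maximum length.
long_text = "vLLM chunked processing example. " * 10_000

response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=long_text,
)
print(len(response.data[0].embedding))  # embedding dimensionality is unchanged
```
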
## 📁 Files

| File | Description |
|------|-------------|
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are passed through the `--override-pooler-config` argument:

```json
{
  "pooling_type": "auto",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

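If you launch the server by hand instead of via `service.sh`, the config above has to reach the CLI as a single JSON string. The sketch below is one hypothetical way to build that argument; the exact command `service.sh` runs may differ.

```python
# Hypothetical helper that serializes the pooler config above into the single
# JSON string expected by the --override-pooler-config flag.
import json
import shlex

pooler_config = {
    "pooling_type": "auto",
    "normalize": True,
    "enable_chunked_processing": True,
    "max_embed_len": 3072000,
}

# Assumed launch command; adjust model, port, and GPU count as needed.
cmd = (
    "vllm serve intfloat/multilingual-e5-large "
    f"--override-pooler-config {shlex.quote(json.dumps(pooler_config))}"
)
print(cmd)
```
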
!!! note
    `pooling_type` sets the model's own pooling strategy for processing within each chunk. The cross-chunk aggregation automatically uses MEAN strategy when input exceeds the model's native maximum length.

#### Chunked Processing Behavior

Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:

| Component | Behavior | Description |
|-----------|----------|-------------|
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | Optimal | All chunks processed for complete semantic coverage |

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (only affects within-chunk pooling, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without requiring environment variables
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Unified Processing**: All chunks are processed separately through the model using its configured pooling strategy
4. **MEAN Aggregation**: When input exceeds the model's native length, results are combined using token-count-weighted averaging across all chunks (sketched below)
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing

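The weighted MEAN step from item 4 can be illustrated with a short standalone sketch. This is not vLLM's internal code, just the arithmetic: each chunk embedding is weighted by the number of tokens that produced it, the weighted sum is divided by the total token count, and the result is renormalized.

```python
# Illustrative sketch of cross-chunk MEAN aggregation (not vLLM's actual code):
# each chunk embedding is weighted by its token count, summed, then normalized.
import numpy as np

def aggregate_chunks(chunk_embeddings: list[np.ndarray],
                     chunk_token_counts: list[int]) -> np.ndarray:
    weights = np.array(chunk_token_counts, dtype=np.float64)
    stacked = np.stack(chunk_embeddings)                   # (num_chunks, dim)
    pooled = (weights[:, None] * stacked).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)                 # "normalize": true in the config

# Example: three chunks of different lengths, same embedding dimensionality.
chunks = [np.random.rand(1024) for _ in range(3)]
print(aggregate_chunks(chunks, [4096, 4096, 1500]).shape)  # (1024,), unchanged
```
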
### Input Length Handling

- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds max_position_embeddings**: Chunked processing is automatically triggered
- **Exceeds max_embed_len**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`

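Read as a decision procedure, the rules above amount to a simple three-way branch. The function below only illustrates that ordering (hypothetical names, not vLLM internals):

```python
# Illustration of the validation order described above (hypothetical helper,
# not vLLM's actual implementation).
def classify_input(num_tokens: int, max_position_embeddings: int, max_embed_len: int) -> str:
    if num_tokens > max_embed_len:
        return "reject: exceeds max_embed_len"           # clear error returned to the client
    if num_tokens > max_position_embeddings:
        return "accept: chunked processing triggered"    # split into chunks, then MEAN-aggregate
    return "accept: single-pass processing"              # fits in the model's native window

print(classify_input(150_000, 4096, 3_072_000))  # accept: chunked processing triggered
```
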
### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation

## 📊 Performance Characteristics

### Chunked Processing Performance

| Aspect | Behavior | Performance |
|--------|----------|-------------|
| **Chunk Processing** | All chunks processed with native pooling | Scales with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
| **Semantic Quality** | Complete text coverage | Optimal for long documents |

## 🧪 Test Cases

The test client demonstrates:

- **Short text**: Normal processing (baseline)
- **Medium text**: Single-chunk processing
- **Long text**: Multi-chunk processing with aggregation
- **Very long text**: Processing with many chunks
- **Extreme long text**: Document-level processing (100K+ tokens)
- **Batch processing**: Mixed-length inputs in one request
- **Consistency**: Reproducible results across runs

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```log
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` is set in the pooler config

2. **Input exceeds max_embed_len**:

    ```log
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    ```

    **Solution**: Increase `max_embed_len` in the pooler config or reduce the input length

3. **Memory errors**:

    ```log
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce the chunk size by lowering the model's `max_position_embeddings`, or spread the load across more GPUs

4. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```log
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```

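The chunk count in the second log line follows from ceiling division of the input length by the chunk size, assuming chunks are filled up to `max_chunk_size`:

```python
# Sanity check for the log line above: 150000 tokens with max_chunk_size 4096
# yields ceil(150000 / 4096) = 37 chunks.
import math

input_tokens, max_chunk_size = 150_000, 4096
print(math.ceil(input_tokens / max_chunk_size))  # 37
```
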
## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing
4. Submit a PR with test cases and documentation updates

## 🆕 Enhanced Features

### max_embed_len Parameter

The new `max_embed_len` parameter provides:

- **Simplified Configuration**: No need for the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len`, up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work
