# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to handle long text embedding that exceeds the model's maximum context length.

## 🚀 Quick Start

### Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```

### Test Long Text Embedding

Run the comprehensive test client:

```bash
python client.py
```
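
You can also exercise the endpoint directly. Below is a minimal sketch against the OpenAI-compatible `/v1/embeddings` API, assuming the `service.sh` defaults from this directory (port `31090`, API key `EMPTY`, `intfloat/multilingual-e5-large`); adjust these to match your configuration:

```python
# Minimal sketch: embed a long text via a server started by service.sh.
# Assumes default port 31090, API key "EMPTY", and the default model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

# Long enough that tokenization exceeds the model's native context window,
# so chunked processing kicks in transparently on the server side.
long_text = "vLLM chunked processing keeps full semantic coverage. " * 2000

response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=long_text,
)
print(f"Embedding dimension: {len(response.data[0].embedding)}")
```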

## 📁 Files

| File | Description |
|------|-------------|
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are set via the `--override-pooler-config` flag:

```json
{
  "pooling_type": "auto",
  "normalize": true,
  "enable_chunked_processing": true,
  "max_embed_len": 3072000
}
```

!!! note
    `pooling_type` sets the model's own pooling strategy for processing within each chunk. Cross-chunk aggregation automatically uses the MEAN strategy when the input exceeds the model's native maximum length.

#### Chunked Processing Behavior

Chunked processing uses **MEAN aggregation** to combine chunks when the input exceeds the model's native maximum length (a sketch of the arithmetic follows the table):

| Component | Behavior | Description |
|-----------|----------|-------------|
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | Optimal | All chunks processed for complete semantic coverage |
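
The cross-chunk step amounts to a token-count-weighted mean. The helper below is an illustrative sketch of that arithmetic, not vLLM's internal API; the names are hypothetical:

```python
import numpy as np

def combine_chunks(
    chunk_embeddings: list[np.ndarray],  # one vector per chunk, shape (dim,)
    chunk_token_counts: list[int],       # number of tokens behind each vector
) -> np.ndarray:
    """Token-count-weighted MEAN across chunk embeddings (illustrative)."""
    weights = np.asarray(chunk_token_counts, dtype=np.float64)
    stacked = np.stack(chunk_embeddings)               # (num_chunks, dim)
    pooled = (weights[:, None] * stacked).sum(axis=0) / weights.sum()
    # Mirror the pooler config's "normalize": true setting.
    return pooled / np.linalg.norm(pooled)
```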

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (multiple models are supported) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (only affects within-chunk pooling, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without requiring environment variable overrides
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity (see the sketch after this list)
3. **Unified Processing**: All chunks are processed separately through the model using its configured pooling strategy
4. **MEAN Aggregation**: When the input exceeds the model's native length, results are combined by token-count-weighted averaging across all chunks
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
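
As a sketch of step 2, splitting can be viewed as a fixed-size partition of the token sequence; the helper below is hypothetical and simplified, not vLLM's internal implementation:

```python
def split_into_chunks(token_ids: list[int], max_chunk_size: int) -> list[list[int]]:
    """Partition a token sequence into chunks of at most max_chunk_size tokens."""
    return [
        token_ids[i : i + max_chunk_size]
        for i in range(0, len(token_ids), max_chunk_size)
    ]

# 150000 tokens with max_position_embeddings=4096 yields 37 chunks,
# matching the server log shown under Debug Information below.
assert len(split_into_chunks(list(range(150_000)), 4096)) == 37
```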

### Input Length Handling

- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds max_position_embeddings**: Chunked processing is triggered automatically (sketched below)
- **Exceeds max_embed_len**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
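
The decision logic above fits in a few lines; this is an illustrative sketch, not vLLM's actual validation code:

```python
def needs_chunking(
    num_tokens: int, max_position_embeddings: int, max_embed_len: int
) -> bool:
    """Return True when chunked processing should be triggered (illustrative)."""
    if num_tokens > max_embed_len:
        # Rejected with a clear error; no environment variable is involved.
        raise ValueError(
            f"Input of {num_tokens} tokens exceeds max_embed_len {max_embed_len}"
        )
    # Accepted; chunk only when past the model's native context window.
    return num_tokens > max_position_embeddings
```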

### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation

## 📊 Performance Characteristics

### Chunked Processing Performance

| Aspect | Behavior | Performance |
|--------|----------|-------------|
| **Chunk Processing** | All chunks processed with native pooling | Scales with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
| **Semantic Quality** | Complete text coverage | Optimal for long documents |

## 🧪 Test Cases

The test client demonstrates:

- ✅ **Short text**: Normal processing (baseline)
- ✅ **Medium text**: Single-chunk processing
- ✅ **Long text**: Multi-chunk processing with aggregation
- ✅ **Very long text**: Many-chunk processing
- ✅ **Extreme long text**: Document-level processing (100K+ tokens)
- ✅ **Batch processing**: Mixed-length inputs in one request
- ✅ **Consistency**: Reproducible results across runs (see the sketch below)
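
A hedged sketch of the consistency check, along the lines of what `client.py` does (the exact details may differ); endpoint and model follow the `service.sh` defaults:

```python
import numpy as np
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")
MODEL = "intfloat/multilingual-e5-large"
text = "Reproducibility check for chunked embedding. " * 2000

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    x, y = np.asarray(a), np.asarray(b)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

emb1 = client.embeddings.create(model=MODEL, input=text).data[0].embedding
emb2 = client.embeddings.create(model=MODEL, input=text).data[0].embedding
print("cosine similarity across runs:", cosine(emb1, emb2))  # expect ≈ 1.0
```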

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

    ```log
    ValueError: This model's maximum position embeddings length is 4096 tokens...
    ```

    **Solution**: Ensure `enable_chunked_processing: true` is set in the pooler config

2. **Input exceeds max_embed_len**:

    ```log
    ValueError: This model's maximum embedding input length is 3072000 tokens...
    ```

    **Solution**: Increase `max_embed_len` in the pooler config or reduce the input length

3. **Memory errors**:

    ```log
    RuntimeError: CUDA out of memory
    ```

    **Solution**: Reduce the chunk size by lowering the model's `max_position_embeddings`, or spread the model across more GPUs

4. **Slow processing**:

    **Expected**: Long text takes more time due to multiple inference calls

### Debug Information

Server logs show chunked processing activity:

```log
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality against single-chunk processing
4. Submit a PR with test cases and documentation updates

## 🆕 Enhanced Features

### max_embed_len Parameter

The new `max_embed_len` parameter provides:

- **Simplified Configuration**: No need for the `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len`, up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work