# End-to-End Test Suite

This directory contains end-to-end (E2E) tests for llama.cpp, extending beyond unit-focused API testing to validate complete user workflows and component integration.

## Overview

The E2E test suite provides comprehensive coverage of:

1. **Pipeline Workflows** - Complete model download, loading, and inference workflows
2. **Tool Integration** - CLI tool testing (llama-cli, llama-bench)
3. **Multimodal Workflows** - Vision + text processing coordination
4. **Concurrent Scenarios** - Multi-user simulation and parallel request handling

## Test Files

### test_pipeline_workflows.py

Tests complete pipeline workflows from model acquisition to inference:

- **Model Download & Loading**: Validates HuggingFace model download and loading
- **State Transitions**: Tracks server state progression (INITIAL → LOADING_MODEL → READY → GENERATING)
- **Context Management**: Tests extended inference sessions with context preservation
- **KV Cache Behavior**: Validates cache utilization during workflows
- **Streaming Pipeline**: Tests streaming inference through the complete pipeline
- **Embedding Models**: Validates embedding model pipelines

**Example:**
```bash
./tests.sh e2e/test_pipeline_workflows.py::test_basic_pipeline_workflow
```

### test_tool_integration.py

Tests CLI tool integration and coordination:

- **llama-cli Execution**: Basic and advanced CLI usage patterns
- **llama-bench Testing**: Performance benchmark execution
- **Embedding Generation**: CLI-based embedding workflows
- **Parameter Validation**: Error handling and validation
- **Server/CLI Coordination**: Resource sharing between tools

**Example:**
```bash
./tests.sh e2e/test_tool_integration.py::test_cli_basic_execution
```

### test_multimodal_workflows.py

Tests multimodal (vision + text) processing:

- **Model Loading**: Multimodal model initialization with vision projection
- **Image Processing**: Image input handling with text completion
- **Context Preservation**: Cross-modal context management
- **Sequential Requests**: Mixed text-only and multimodal requests
- **Streaming**: Multimodal streaming responses
- **Error Handling**: Invalid input handling

**Example:**
```bash
./tests.sh e2e/test_multimodal_workflows.py::test_multimodal_chat_with_image
```

### test_concurrent_scenarios.py

Tests concurrent request handling and real-world scenarios:

- **Concurrent Requests**: Multiple simultaneous completion/chat requests
- **Multi-turn Conversations**: Context preservation across conversation turns
- **Slot Management**: Request queuing and slot allocation under load
- **Streaming Concurrency**: Multiple streaming sessions
- **LoRA Switching**: Adapter loading/switching during active sessions
- **Mixed Workloads**: Different request types running concurrently

**Example:**
```bash
./tests.sh e2e/test_concurrent_scenarios.py::test_concurrent_completion_requests
```

## Framework Extensions

### PipelineTestProcess Class

The `PipelineTestProcess` class extends `ServerProcess` with E2E testing capabilities:

```python
from utils import PipelineTestProcess

# Create pipeline test instance
pipeline = PipelineTestProcess()

# Test complete pipeline workflow
results = pipeline.test_full_pipeline({
    "model_hf_repo": "ggml-org/models",
    "model_hf_file": "tinyllamas/stories260K.gguf",
    "n_ctx": 512,
})

# Run CLI commands (model_path is the path to a local GGUF file)
result = pipeline.run_cli_command(["-m", model_path, "-p", "Hello", "-n", "16"])

# Run benchmarks
bench_results = pipeline.run_bench_command(model_path, ["-p", "8", "-n", "8"])
```
**Key Methods** (see the sketch after this list):

- `test_full_pipeline(model_config)` - Execute complete pipeline workflow
- `run_cli_command(args, input_text, timeout)` - Execute llama-cli
- `run_bench_command(model_path, args, timeout)` - Execute llama-bench
- `test_context_management(prompts, max_context)` - Test context handling
- `validate_kv_cache_behavior(context_size, tokens)` - Validate cache usage
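
A short sketch of the context-management helpers, assuming the signatures listed above; the return values and the need to configure the server beforehand are assumptions, not confirmed behavior:

```python
from utils import PipelineTestProcess

pipeline = PipelineTestProcess()
# Assumes the server has been configured and started as in the example above.

# Feed a sequence of prompts and track how context accumulates;
# max_context should mirror the server's configured n_ctx.
ctx_results = pipeline.test_context_management(
    ["Once upon a time", "The story continued", "Finally,"],
    max_context=512,
)

# Check that KV cache utilization is consistent with the configured
# context size and the number of tokens generated.
cache_ok = pipeline.validate_kv_cache_behavior(context_size=512, tokens=64)
```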

### Test Fixtures

New pytest fixtures in `conftest.py` (a combined usage sketch follows this list):

- **`pipeline_process`** - PipelineTestProcess instance with automatic cleanup
- **`e2e_small_model_config`** - Small model config for fast E2E tests
- **`e2e_embedding_model_config`** - Embedding model configuration
- **`e2e_multimodal_model_config`** - Multimodal model configuration
- **`concurrent_test_prompts`** - Prompts for concurrent testing
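
A minimal sketch of combining these fixtures in one test; the config-application pattern follows the example in "Writing New E2E Tests" below:

```python
def test_with_fixtures(pipeline_process, e2e_small_model_config, concurrent_test_prompts):
    # Apply the small-model config to the server process
    for key, value in e2e_small_model_config.items():
        if hasattr(pipeline_process, key):
            setattr(pipeline_process, key, value)
    pipeline_process.start()

    # Fire one completion per prompt from the shared fixture
    for prompt in concurrent_test_prompts:
        res = pipeline_process.make_request("POST", "/completion", data={
            "prompt": prompt,
            "n_predict": 8,
        })
        assert res.status_code == 200
```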

## Running E2E Tests

### Run All E2E Tests

```bash
./tests.sh e2e/
```

### Run Specific Test File

```bash
./tests.sh e2e/test_pipeline_workflows.py
```

### Run Single Test

```bash
./tests.sh e2e/test_pipeline_workflows.py::test_basic_pipeline_workflow
```

### Run with Verbose Output

```bash
DEBUG=1 ./tests.sh e2e/ -s -v
```

### Run Slow Tests

Some tests are marked as slow and only run when the `SLOW_TESTS` environment variable is set:

```bash
SLOW_TESTS=1 ./tests.sh e2e/
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLAMA_CLI_BIN_PATH` | Path to llama-cli binary | `../../../build/bin/llama-cli` |
| `LLAMA_BENCH_BIN_PATH` | Path to llama-bench binary | `../../../build/bin/llama-bench` |
| `LLAMA_CACHE` | Model cache directory | `tmp` |
| `SLOW_TESTS` | Enable slow tests | `0` |
| `DEBUG` | Enable verbose output | `0` |

### Model Selection

E2E tests use smaller models for CI compatibility:

- **Text Generation**: tinyllama (stories260K.gguf) - Fast, small footprint
- **Embeddings**: bert-bge-small - Efficient embedding generation
- **Multimodal**: tinygemma3 - Compact vision+text model

For local testing with larger models, modify the fixture configurations in `conftest.py`.
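
For instance, an override might look like this (a sketch; the exact fixture shape is an assumption, so match whatever `conftest.py` actually defines):

```python
import pytest

@pytest.fixture
def e2e_small_model_config():
    # Point at a different model for local runs; the keys mirror
    # the pipeline example earlier in this README.
    return {
        "model_hf_repo": "ggml-org/models",              # replace with your repo
        "model_hf_file": "tinyllamas/stories260K.gguf",  # replace with your model
        "n_ctx": 2048,
    }
```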

## Writing New E2E Tests

### Example Test Structure

```python
def test_my_e2e_workflow(pipeline_process, e2e_small_model_config):
    """
    Test description here.

    Validates:
    - Point 1
    - Point 2
    """
    # Configure pipeline
    for key, value in e2e_small_model_config.items():
        if hasattr(pipeline_process, key):
            setattr(pipeline_process, key, value)

    # Start server
    pipeline_process.start()

    # Test workflow
    res = pipeline_process.make_request("POST", "/completion", data={
        "prompt": "Test",
        "n_predict": 8,
    })

    # Assertions
    assert res.status_code == 200
    assert "content" in res.body
```

### Best Practices

1. **Use Fixtures**: Leverage existing fixtures for model configs and test data
2. **Small Models**: Use small models for fast execution in CI
3. **Resource Cleanup**: Fixtures handle cleanup automatically
4. **Test Isolation**: Each test should be independent
5. **Descriptive Names**: Use clear, descriptive test names
6. **Documentation**: Include docstrings explaining what is validated
7. **Slow Tests**: Mark expensive tests with `@pytest.mark.skipif(not is_slow_test_allowed(), reason=...)` (pytest requires a `reason` when `skipif` is given a boolean condition); see the sketch after this list
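
For example, a gated slow test might look like this (a sketch; it assumes `is_slow_test_allowed()` is importable from `utils`):

```python
import pytest
from utils import is_slow_test_allowed

@pytest.mark.skipif(not is_slow_test_allowed(), reason="slow test, set SLOW_TESTS=1 to run")
def test_expensive_scenario(pipeline_process):
    """Expensive workflow that only runs when SLOW_TESTS=1."""
    ...
```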

## CI Integration

E2E tests are designed to run in CI environments with:

- 4 vCPU GitHub runners
- Limited memory footprint
- Fast model downloads from HuggingFace
- Reasonable timeout configurations

Tests automatically skip slow scenarios unless `SLOW_TESTS=1` is set.

## Troubleshooting

### Tests Timeout

- Increase the startup timeout in the test: `pipeline_process.start(timeout_seconds=120)`
- Use smaller models in CI
- Check network connectivity for model downloads

### Model Download Issues

- Set `LLAMA_CACHE` to a persistent directory
- Pre-download models before running tests
- Check HuggingFace availability

### CLI Tool Not Found

- Ensure binaries are built: `cmake --build build --target llama-cli llama-bench`
- Set `LLAMA_CLI_BIN_PATH` and `LLAMA_BENCH_BIN_PATH`
- Check binary permissions

### Concurrent Test Failures

- Increase `n_slots` for higher concurrency (see the sketch below)
- Adjust timing expectations for slower systems
- Enable `server_continuous_batching` for better scheduling
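
A sketch of those knobs, assuming they are plain attributes on the process object like the other config keys used in this README:

```python
# Allow four requests in flight and enable continuous batching so
# queued requests are scheduled as slots free up.
pipeline_process.n_slots = 4
pipeline_process.server_continuous_batching = True
pipeline_process.start(timeout_seconds=120)
```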

## Contributing

When adding new E2E tests:

1. Place tests in the appropriate file based on category
2. Use existing fixtures when possible
3. Add new fixtures to `conftest.py` if needed
4. Update this README with new test descriptions
5. Ensure tests pass in the CI environment
6. Document special requirements or configurations

## Related Documentation

- [Main Test README](../README.md) - General testing documentation
- [Server Documentation](../../README.md) - llama-server documentation
- [Contributing Guide](../../../../CONTRIBUTING.md) - Project contribution guidelines