
Commit 33c264d

Xunzhuo and hmellor authored
Update _posts/2025-10-25-semantic-router-modular.md
Co-authored-by: Harry Mellor <[email protected]>
1 parent e8a9ef8 commit 33c264d

File tree

1 file changed: +1 / -1 lines changed


_posts/2025-10-25-semantic-router-modular.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ The benefits of this architecture vary by workload:
 - Single vs multi-task classification: LoRA provides minimal benefit since there's no base model sharing. Traditional fine-tuned models may be faster. LoRA shows clear advantages when performing multiple classifications on the same input. Since the base model runs once and only LoRA adapters execute for each task, the overhead is substantially reduced compared to running separate full models. The actual speedup depends on the ratio of base model computation to adapter computation.
 - Long-context inputs: Qwen3-Embedding enables routing decisions on documents up to 32K tokens without truncation, extending beyond ModernBERT's 8K limit for very long documents. With Flash Attention 2 enabled on compatible GPUs, the performance advantage becomes more substantial as context length increases.
 - Multilingual routing: Models can now handle routing decisions for languages where ModernBERT has limited training data.
-- High concurrency: OnceLock eliminates lock contention, allowing throughput to scale with CPU cores for classification operations.
+- High concurrency: `OnceLock` eliminates lock contention, allowing throughput to scale with CPU cores for classification operations.
 - GPU acceleration: When Flash Attention 2 is enabled, attention operations run 3-4× faster, with the speedup becoming more pronounced at longer sequence lengths. This makes GPU deployment particularly advantageous for high-throughput scenarios.

 ## Future Directions
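
The one-line change above only wraps `OnceLock` in backticks, but the claim in that bullet (lock-free reads after one-time initialization, so throughput scales with CPU cores) is worth illustrating. The sketch below is not the project's actual code; the `Classifier` type and `classify` function are hypothetical stand-ins showing how Rust's `std::sync::OnceLock` is typically used for this pattern.

```rust
use std::sync::OnceLock;

// Hypothetical classifier handle; the real project's type will differ.
struct Classifier {
    model_path: String,
}

impl Classifier {
    fn load(path: &str) -> Classifier {
        // Expensive one-time setup (e.g. loading model weights) happens here.
        Classifier { model_path: path.to_string() }
    }

    fn classify(&self, text: &str) -> usize {
        // Placeholder: a real implementation would run model inference.
        text.len() % 2
    }
}

// Shared, lazily-initialized classifier. After the first get_or_init call,
// readers perform a plain atomic load with no lock, so concurrent
// classification threads do not contend with each other.
static CLASSIFIER: OnceLock<Classifier> = OnceLock::new();

fn classify(text: &str) -> usize {
    CLASSIFIER
        .get_or_init(|| Classifier::load("models/classifier"))
        .classify(text)
}

fn main() {
    // Threads call classify() concurrently; only the first caller pays the
    // initialization cost, and later calls are contention-free reads.
    let handles: Vec<_> = (0..4)
        .map(|i| std::thread::spawn(move || classify(&format!("request {i}"))))
        .collect();
    for h in handles {
        println!("class = {}", h.join().unwrap());
    }
}
```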
