@rootfs rootfs commented Sep 28, 2025

  • Restructure codebase into modular layers (core/, ffi/, model_architectures/, classifiers/)
  • Add unified error handling and configuration loading systems
  • Implement dual-path architecture for traditional and LoRA models
  • Add comprehensive FFI layer with memory safety

Maintains backward compatibility while enabling future model integrations.

refactor: Implement modular candle-binding architecture 

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #485

Release Notes: Yes/No

@rootfs rootfs requested a review from Xunzhuo as a code owner September 28, 2025 12:40
netlify bot commented Sep 28, 2025

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit 9adab4e
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/68fb90afaa796b0008b456a5
😎 Deploy Preview https://deploy-preview-266--vllm-semantic-router.netlify.app

github-actions bot commented Sep 28, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 candle-binding

Owners: @rootfs
Files changed:

  • candle-binding/src/classifiers/lora/intent_lora.rs
  • candle-binding/src/classifiers/lora/intent_lora_test.rs
  • candle-binding/src/classifiers/lora/mod.rs
  • candle-binding/src/classifiers/lora/parallel_engine.rs
  • candle-binding/src/classifiers/lora/parallel_engine_test.rs
  • candle-binding/src/classifiers/lora/pii_lora.rs
  • candle-binding/src/classifiers/lora/pii_lora_test.rs
  • candle-binding/src/classifiers/lora/security_lora.rs
  • candle-binding/src/classifiers/lora/security_lora_test.rs
  • candle-binding/src/classifiers/lora/token_lora.rs
  • candle-binding/src/classifiers/lora/token_lora_test.rs
  • candle-binding/src/classifiers/mod.rs
  • candle-binding/src/classifiers/traditional/batch_processor.rs
  • candle-binding/src/classifiers/traditional/batch_processor_test.rs
  • candle-binding/src/classifiers/traditional/mod.rs
  • candle-binding/src/classifiers/traditional/modernbert_classifier.rs
  • candle-binding/src/classifiers/traditional/modernbert_classifier_test.rs
  • candle-binding/src/classifiers/unified.rs
  • candle-binding/src/classifiers/unified_test.rs
  • candle-binding/src/core/config_loader.rs
  • candle-binding/src/core/config_loader_test.rs
  • candle-binding/src/core/mod.rs
  • candle-binding/src/core/similarity.rs
  • candle-binding/src/core/similarity_test.rs
  • candle-binding/src/core/tokenization.rs
  • candle-binding/src/core/tokenization_test.rs
  • candle-binding/src/core/unified_error.rs
  • candle-binding/src/core/unified_error_test.rs
  • candle-binding/src/ffi/classify.rs
  • candle-binding/src/ffi/classify_test.rs
  • candle-binding/src/ffi/embedding.rs
  • candle-binding/src/ffi/embedding_test.rs
  • candle-binding/src/ffi/init.rs
  • candle-binding/src/ffi/init_test.rs
  • candle-binding/src/ffi/memory.rs
  • candle-binding/src/ffi/memory_safety.rs
  • candle-binding/src/ffi/memory_safety_test.rs
  • candle-binding/src/ffi/mod.rs
  • candle-binding/src/ffi/oncelock_concurrent_test.rs
  • candle-binding/src/ffi/similarity.rs
  • candle-binding/src/ffi/state_manager.rs
  • candle-binding/src/ffi/state_manager_test.rs
  • candle-binding/src/ffi/tokenization.rs
  • candle-binding/src/ffi/types.rs
  • candle-binding/src/ffi/validation.rs
  • candle-binding/src/ffi/validation_test.rs
  • candle-binding/src/model_architectures/config.rs
  • candle-binding/src/model_architectures/embedding/dense_layers.rs
  • candle-binding/src/model_architectures/embedding/dense_layers_test.rs
  • candle-binding/src/model_architectures/embedding/gemma3_model.rs
  • candle-binding/src/model_architectures/embedding/gemma3_model_test.rs
  • candle-binding/src/model_architectures/embedding/gemma_embedding.rs
  • candle-binding/src/model_architectures/embedding/gemma_embedding_test.rs
  • candle-binding/src/model_architectures/embedding/mod.rs
  • candle-binding/src/model_architectures/embedding/pooling.rs
  • candle-binding/src/model_architectures/embedding/pooling_test.rs
  • candle-binding/src/model_architectures/embedding/qwen3_embedding.rs
  • candle-binding/src/model_architectures/embedding/qwen3_embedding_test.rs
  • candle-binding/src/model_architectures/lora/bert_lora.rs
  • candle-binding/src/model_architectures/lora/bert_lora_test.rs
  • candle-binding/src/model_architectures/lora/lora_adapter.rs
  • candle-binding/src/model_architectures/lora/lora_adapter_test.rs
  • candle-binding/src/model_architectures/lora/mod.rs
  • candle-binding/src/model_architectures/mod.rs
  • candle-binding/src/model_architectures/model_factory.rs
  • candle-binding/src/model_architectures/model_factory_test.rs
  • candle-binding/src/model_architectures/routing.rs
  • candle-binding/src/model_architectures/routing_test.rs
  • candle-binding/src/model_architectures/traditional/base_model.rs
  • candle-binding/src/model_architectures/traditional/base_model_test.rs
  • candle-binding/src/model_architectures/traditional/bert.rs
  • candle-binding/src/model_architectures/traditional/bert_test.rs
  • candle-binding/src/model_architectures/traditional/mod.rs
  • candle-binding/src/model_architectures/traditional/modernbert.rs
  • candle-binding/src/model_architectures/traditional/modernbert_test.rs
  • candle-binding/src/model_architectures/traits.rs
  • candle-binding/src/model_architectures/unified_interface.rs
  • candle-binding/src/model_architectures/unified_interface_test.rs
  • candle-binding/src/test_fixtures.rs
  • candle-binding/src/utils/memory.rs
  • candle-binding/src/utils/mod.rs
  • candle-binding/test_data/gemma_reference_outputs.json
  • candle-binding/test_data/qwen3_reference_outputs.json
  • candle-binding/Cargo.lock
  • candle-binding/Cargo.toml
  • candle-binding/semantic-router.go
  • candle-binding/semantic-router_test.go
  • candle-binding/src/lib.rs

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • scripts/generate_gemma_reference.py
  • scripts/generate_qwen3_reference.py
  • .github/workflows/test-and-build.yml

📁 config

Owners: @rootfs
Files changed:

  • config/config.yaml

📁 deploy

Owners: @rootfs, @Xunzhuo
Files changed:

  • deploy/kubernetes/config.yaml
  • deploy/openshift/config-openshift.yaml

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/cmd/main.go
  • src/semantic-router/pkg/api/server.go
  • src/semantic-router/pkg/apis/vllm.ai/v1alpha1/filter_types.go
  • src/semantic-router/pkg/cache/cache_factory.go
  • src/semantic-router/pkg/cache/cache_interface.go
  • src/semantic-router/pkg/cache/cache_test.go
  • src/semantic-router/pkg/cache/inmemory_cache.go
  • src/semantic-router/pkg/config/config.go
  • src/semantic-router/pkg/config/config_test.go
  • src/semantic-router/pkg/extproc/caching_test.go
  • src/semantic-router/pkg/extproc/router.go
  • src/semantic-router/pkg/extproc/test_utils_test.go

📁 tools

Owners: @yuluo-yx, @rootfs, @Xunzhuo
Files changed:

  • tools/make/build-run-test.mk
  • tools/make/common.mk
  • tools/make/golang.mk
  • tools/make/models.mk
  • tools/make/rust.mk

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

rootfs commented Sep 28, 2025

@OneZero-Y @Xunzhuo Let's have the following resolved before merging

  • Add more candle unit tests
  • Verify API accuracy
  • Ensure semantic-router uses the right binding API
  • Remove legacy comments and code

@rootfs rootfs added this to the v0.1 milestone Sep 28, 2025
rootfs commented Sep 30, 2025

@OneZero-Y now that we're working on the feature branch, how about using this branch for both the refactoring and the new embedding models?

@rootfs rootfs mentioned this pull request Oct 5, 2025
@OneZero-Y

@rootfs OK, I'll move the embedding model work forward on this branch.

rootfs commented Oct 9, 2025

@OneZero-Y that's great! I'll switch to this work as soon as I can.

Comment on lines 89 to 104
let handles = vec![
    self.spawn_intent_task(texts_owned.clone(), Arc::clone(&intent_results)),
    self.spawn_pii_task(texts_owned.clone(), Arc::clone(&pii_results)),
    self.spawn_security_task(texts_owned, Arc::clone(&security_results)),
];

// Wait for all threads to complete
for handle in handles {
    handle.join().map_err(|_| {
        let unified_err = concurrency_error(
            "thread join",
            "Failed to join parallel classification thread",
        );
        candle_core::Error::from(unified_err)
    })?;
}

This could be simplified a bit. Something like

let intent_handle = thread::spawn(|| intent_task(texts)); // slice is fine, no need to own the data.
let pii_handle = ... same
let security_handle = ... same

let intent_results = intent_handle.join()?; // map_err omitted
let pii_results = pii_handle.join()?;
let security_results = security_handle.join()?;
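
A note on the sketch above: `std::thread::spawn` requires `'static` captures, so borrowing a slice only compiles with scoped threads. The following is a minimal sketch, assuming three hypothetical task functions (`intent_task`, `pii_task`, `security_task`, with made-up bodies) in place of the PR's real classifiers. It uses `std::thread::scope` from the standard library to run them on borrowed input and join each handle.

```rust
use std::thread;

// Hypothetical stand-ins for the three classifiers; the bodies are placeholders.
fn intent_task(texts: &[&str]) -> Vec<String> {
    texts.iter().map(|t| format!("intent:{}", t.len())).collect()
}

fn pii_task(texts: &[&str]) -> Vec<bool> {
    texts.iter().map(|t| t.chars().any(|c| c.is_ascii_digit())).collect()
}

fn security_task(texts: &[&str]) -> Vec<bool> {
    texts.iter().map(|t| t.to_lowercase().contains("ignore previous")).collect()
}

/// Runs the three tasks concurrently on borrowed input.
/// `thread::scope` guarantees that every spawned thread has finished before it
/// returns, which is what makes borrowing `texts` legal here.
fn classify_all(texts: &[&str]) -> Result<(Vec<String>, Vec<bool>, Vec<bool>), String> {
    let (intent, pii, security) = thread::scope(|s| {
        let intent = s.spawn(|| intent_task(texts));
        let pii = s.spawn(|| pii_task(texts));
        let security = s.spawn(|| security_task(texts));
        // `join` returns Err only if the thread panicked.
        (intent.join(), pii.join(), security.join())
    });

    Ok((
        intent.map_err(|_| "intent thread panicked".to_string())?,
        pii.map_err(|_| "pii thread panicked".to_string())?,
        security.map_err(|_| "security thread panicked".to_string())?,
    ))
}
```

Each task returns its results directly through its join handle, so no shared `Arc` result buffers are needed in this shape.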

Since we're on the topic of threads - you may like some of the abstractions that the rayon crate provides

Collaborator Author

thank you @ivarflakstad

Comment on lines 162 to 174
pub fn parallel_detect(&self, texts: &[&str]) -> Result<Vec<PIIResult>> {
    let mut results = Vec::new();
    for text in texts {
        results.push(self.detect_pii(text)?);
    }
    Ok(results)
}

If you want this to be in parallel you could do something like

// add `use rayon::prelude::*;` at top of file
// `?` cannot be used inside the closure; collecting into a `Result` propagates the error instead.
texts.par_iter().map(|text| self.detect_pii(text)).collect()
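
For reference, an iterator of `Result` items can be collected straight into a single `Result<Vec<_>, _>`, and rayon's parallel iterators support the same collection target. The sketch below shows the pattern with a plain standard-library iterator so that it runs without extra crates. `PiiResult` and `detect_pii` here are hypothetical stand-ins, not the PR's types.

```rust
#[derive(Debug, PartialEq)]
struct PiiResult {
    has_pii: bool,
}

// Hypothetical detector: rejects empty input, flags text that contains a digit.
fn detect_pii(text: &str) -> Result<PiiResult, String> {
    if text.is_empty() {
        return Err("empty input".to_string());
    }
    Ok(PiiResult {
        has_pii: text.chars().any(|c| c.is_ascii_digit()),
    })
}

/// Maps each text to a `Result` and lets `collect` fold them into one
/// `Result<Vec<_>, _>`. No `?` is needed (or allowed) inside the closure.
fn detect_all(texts: &[&str]) -> Result<Vec<PiiResult>, String> {
    texts.iter().map(|text| detect_pii(text)).collect()
}
```

With rayon, swapping `iter()` for `par_iter()` should keep the same shape, provided the detector can be shared across threads.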

Though I'm starting to suspect that what you actually want, for the long term, is an async runtime.

Collaborator Author

@ivarflakstad thanks for looking into this. On a separate note, for async to run most efficiently, would you help check whether the locking is done the right way?

Sure :)
Are you thinking about any specific locks in particular? (The PR is fairly large 😉)

Collaborator Author

thank you @ivarflakstad

classify_text currently runs under a lock, which could cost us performance. Would you share your ideas? Thanks
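
One common way to shorten such a critical section, sketched here under assumptions, is to keep the classifier behind an `Arc`, clone the `Arc` while holding the lock, and run inference after the lock is released. `Classifier` and `EngineSlot` are hypothetical names for illustration and do not reflect the PR's actual types.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical classifier; `classify_text` only needs `&self`.
struct Classifier {
    labels: Vec<&'static str>,
}

impl Classifier {
    // Placeholder logic: picks a label from the text length.
    fn classify_text(&self, text: &str) -> &'static str {
        self.labels[text.len() % self.labels.len()]
    }
}

// Shared slot that can be initialized (or replaced) at runtime.
struct EngineSlot {
    inner: Mutex<Option<Arc<Classifier>>>,
}

impl EngineSlot {
    fn new() -> Self {
        Self { inner: Mutex::new(None) }
    }

    fn set(&self, classifier: Classifier) {
        *self.inner.lock().unwrap() = Some(Arc::new(classifier));
    }

    fn classify(&self, text: &str) -> Option<&'static str> {
        // The lock is held only for this statement: cloning the Arc is a
        // reference-count bump, and the temporary guard is dropped at the `;`.
        let engine: Arc<Classifier> = self.inner.lock().unwrap().clone()?;
        // No lock is held here, so slow inference does not block other callers.
        Some(engine.classify_text(text))
    }
}
```

The trade-off is that a caller may finish a request on the previous classifier if `set` runs concurrently, which is usually acceptable for read-mostly model state.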

Sorry for the delay.
I'd definitely look into using OnceCell instead of lazy_static.

But it depends. Will you actually be updating these static values at runtime? More than once?

If yes, then at the very least you want to use RwLock instead of Mutex because I doubt you're planning to write to the value as much as you read it.

If OnceCell doesn't cut it, perhaps you'll want to give OnceLock, LazyCell, or LazyLock a try :)
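
To make the options concrete, here is a minimal sketch using only the standard library: `std::sync::OnceLock` (the thread-safe counterpart of `OnceCell`, usable in a `static`) for a value that is written once, and `RwLock` for a value that is read often and written rarely. `Model`, `MODEL`, and `THRESHOLD` are hypothetical names chosen for illustration.

```rust
use std::sync::{OnceLock, RwLock};

// Hypothetical model handle.
struct Model {
    name: String,
}

// Write-once global: initialized a single time, then read without taking a lock.
static MODEL: OnceLock<Model> = OnceLock::new();

/// Returns true if this call performed the initialization.
fn init_model(name: &str) -> bool {
    // `set` returns Err if a value was already stored; the first caller wins.
    MODEL.set(Model { name: name.to_string() }).is_ok()
}

fn model_name() -> Option<&'static str> {
    MODEL.get().map(|m| m.name.as_str())
}

// A value that does change at runtime: many readers, an occasional writer.
static THRESHOLD: RwLock<f32> = RwLock::new(0.5);

fn threshold() -> f32 {
    *THRESHOLD.read().unwrap()
}

fn set_threshold(value: f32) {
    *THRESHOLD.write().unwrap() = value;
}
```

If the statics are never updated after initialization, the `OnceLock` path alone may be enough, and readers never contend on a lock.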

Collaborator Author

thank you @ivarflakstad

can you review #528?

@OneZero-Y

@rootfs
Flash Attention 2 Testing Issues
The self.scale → self.scaling fix has been verified by compilation (CUDA 12.9; this requires upgrading candle from 0.8.4 to 0.9.1) and by code analysis.

Flash Attention 2 enabled (feature flag active)
   Status: Flash Attention 2 fully integrated (2-3x faster for long sequences)
   Performance: Optimized for 8K-32K token sequences
✅ Qwen3EmbeddingModel loaded successfully:
   - Model: ../models/Qwen3-Embedding-0.6B
   - Layers: 28
   - Hidden size: 1024
   - Attention heads: 16
   - KV heads (GQA): 8
   - Max seq length: 32768
   - RoPE theta: 1000000
   - Padding side: Left (CRITICAL: must be Left)
 ✅ Qwen3-Embedding-0.6B loaded successfully in 22.65s

However, runtime testing is blocked by my incompatible local GPU (GT 730, compute capability 3.5, below the 8.0 required for Flash Attention 2). Could you help verify on a compatible GPU if one is available?

🔍 Attempting to detect CUDA device...
✅ Device::new_cuda(0) succeeded: Cuda(CudaDevice(DeviceId(1)))
✅ Using CUDA GPU for testing
❌ Error: DriverError(CUDA_ERROR_UNSUPPORTED_PTX_VERSION, "the provided PTX was compiled with an unsupported toolchain.")
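
A test could skip the Flash Attention 2 path up front when the GPU is too old. The sketch below only shows the comparison step, taking the CC 8.0 minimum from the comment above. `ComputeCapability`, `parse_cc`, and `supports_flash_attn_2` are hypothetical helpers, and how the capability string is obtained from the driver is left out.

```rust
/// Compute capability as (major, minor), e.g. major 3 and minor 5 for CC 3.5.
/// The derived ordering compares `major` first, then `minor`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ComputeCapability {
    major: u32,
    minor: u32,
}

// Minimum taken from the discussion above (CC 8.0 for Flash Attention 2).
const FLASH_ATTN_2_MIN: ComputeCapability = ComputeCapability { major: 8, minor: 0 };

/// Parses strings such as "3.5" or "8.9"; returns None on malformed input.
fn parse_cc(s: &str) -> Option<ComputeCapability> {
    let (major, minor) = s.trim().split_once('.')?;
    Some(ComputeCapability {
        major: major.parse().ok()?,
        minor: minor.parse().ok()?,
    })
}

fn supports_flash_attn_2(cc: ComputeCapability) -> bool {
    cc >= FLASH_ATTN_2_MIN
}
```

With this check, a GT 730 (CC 3.5) would be skipped, while an L4 (CC 8.9) would run the Flash Attention 2 tests.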

rootfs commented Oct 21, 2025

@OneZero-Y sure, I'm using an L4 and the unit tests passed on my end. Let me run them again and post the test log.

rootfs commented Oct 21, 2025

@OneZero-Y here are my local test results using PR #489
#489 (comment)

rootfs and others added 9 commits October 23, 2025 12:52
Signed-off-by: Huamin Chen <[email protected]>
…ntention (#516)

- Remove duplicate UNIFIED_CLASSIFIER global state
- Optimize PARALLEL_LORA_ENGINE lock contention by using Arc clone

Signed-off-by: OneZero-Y <[email protected]>
* Update test description from Math to General (#483)

Signed-off-by: carlory <[email protected]>

* feat: add HuggingChat support (#477)

* add chat ui to dashboard and docker compose & refactor dashboard/backend/

Signed-off-by: JaredforReal <[email protected]>

* try fix network error

Signed-off-by: JaredforReal <[email protected]>

* more

---------

Signed-off-by: JaredforReal <[email protected]>
Co-authored-by: bitliu <[email protected]>

* project: 2025 Q4 roadmap (#487)

* project: q4 roadmap

* project: q4 roadmap

* project: q4 roadmap

* more

* more

* more

* more

* feat: add shelleck precommit hook (#488)

* feat: add shelleck precommit hook

Signed-off-by: yuluo-yx <[email protected]>

* feat: add shelleck precommit hook

Signed-off-by: yuluo-yx <[email protected]>

* feat: add shelleck precommit hook

Signed-off-by: yuluo-yx <[email protected]>

---------

Signed-off-by: yuluo-yx <[email protected]>

* project: add q4 roadmap news (#495)

* fix missing shellcheck in pre-commit image (#497)

Signed-off-by: carlory <[email protected]>

* infra: update tools (#501)

Signed-off-by: yuluo-yx <[email protected]>

* feat(demo): enhance OpenShift demo scripts with improved UX (#478)

- Reduce model selection test to 4 categories (2×Model-A, 2×Model-B)
- Add new "Classification Examples" option calling curl-examples.sh
- Update reasoning examples to avoid cache hits from previous tests
- Remove benign examples from PII and Jailbreak tests (show only attacks)
- Enhance live-semantic-router-logs.sh with better color visibility:
  - Fix duplicate "WITH SCORE" text in classification output
  - Fix CACHE HIT background color extending over timestamp
  - Distinguish reasoning enabled vs disabled messages
  - Remove redundant "(standard routing)" text
  - Add background colors for Model-A/Model-B routing display

These improvements make the live demo clearer and more impactful for
presentations and demonstrations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Yossi Ovadia <[email protected]>
Co-authored-by: Claude <[email protected]>

* fix: fix precommit Argument list too long error (#502)

Signed-off-by: yuluo-yx <[email protected]>

* feat: enforce milvus dial timeout if set (#503)

Signed-off-by: cryo <[email protected]>

* Add IETF draft publication: Multi-Provider Extensions for Agentic AI Inference APIs (#506)

* Initial plan

* Add new IETF draft publication for Multi-Provider Extensions for Agentic AI Inference APIs

Co-authored-by: rootfs <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: rootfs <[email protected]>

* Allow semantic cache similarity threshold to be set at the category level (#493)

* Initial plan

* Add category-level cache settings: enabled and similarity_threshold

Co-authored-by: rootfs <[email protected]>

* Add comprehensive tests for category-level cache settings

Co-authored-by: rootfs <[email protected]>

* Update config files and documentation for category-level cache settings

- Updated 7 config YAML files (development, production, testing, e2e, and 3 recipes) with commented examples of category-level cache settings
- Added comprehensive documentation section explaining category-level cache configuration
- Updated semantic cache overview and in-memory cache docs with category-level examples
- Added best practices for threshold selection and privacy considerations

Co-authored-by: rootfs <[email protected]>

* Remove duplicate code in FindSimilar functions

Refactored FindSimilar() to delegate to FindSimilarWithThreshold() with default threshold instead of duplicating the entire implementation. This eliminates 226 lines of duplicate code across inmemory_cache.go and milvus_cache.go.

Co-authored-by: rootfs <[email protected]>

* Update src/semantic-router/pkg/extproc/request_handler.go

Co-authored-by: Copilot <[email protected]>

* Revert changes from unsigned commit ae39fe2

Restored the classificationText empty check that was removed in the previous commit.

Co-authored-by: rootfs <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: rootfs <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>
Co-authored-by: Copilot <[email protected]>

* Allow jailbreak detection and threshold to be configured at the category level (#508)

* Initial plan

* Add category-level jailbreak detection configuration

Co-authored-by: Xunzhuo <[email protected]>

* Add documentation for category-level jailbreak settings

Co-authored-by: Xunzhuo <[email protected]>

* Update documentation for category-level jailbreak detection

- Add category-level jailbreak configuration to jailbreak-protection.md
- Update category configuration docs with jailbreak_enabled parameter
- Add security-focused configuration example
- Update global configuration docs with category override notes
- Update README to mention fine-grained security control

Co-authored-by: Xunzhuo <[email protected]>

* Add category-level jailbreak threshold configuration

- Add JailbreakThreshold field to Category struct
- Add GetJailbreakThresholdForCategory helper method
- Create CheckForJailbreakWithThreshold and AnalyzeContentForJailbreakWithThreshold methods
- Update performSecurityChecks to use category-specific threshold
- Add 5 comprehensive tests for threshold configuration
- Update example configs with threshold tuning examples
- Update documentation with threshold configuration and tuning guidelines
- Add threshold tuning guide with recommendations for different category types

Co-authored-by: Xunzhuo <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>

* Allow PII detection threshold to be set at the category level (#510)

* Initial plan

* Add category-level PII threshold support

Co-authored-by: Xunzhuo <[email protected]>

* Update documentation with API integration notes

Co-authored-by: Xunzhuo <[email protected]>

* Fix markdown linting issues

Co-authored-by: Xunzhuo <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>

* Fix: The caller information points to the wrapper function instead of the actual call location (#518)

Signed-off-by: carlory <[email protected]>

* feat: Implement hybrid cache that use in-memory index and milvus based doc store (#504)

* feat: add HNSW index to inmemory semantic cache and implement hybrid cache that use in-memory index and milvus based doc store

Signed-off-by: Huamin Chen <[email protected]>

* chore: run go mod tidy to clean up module dependencies

Signed-off-by: Huamin Chen <[email protected]>

* conditionally build candle cuda support

Signed-off-by: Huamin Chen <[email protected]>

* rebuild index upon restart

Signed-off-by: Huamin Chen <[email protected]>

* precommit fix

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* disable cuda build on ci

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: carlory <[email protected]>
Signed-off-by: JaredforReal <[email protected]>
Signed-off-by: yuluo-yx <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
Signed-off-by: cryo <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: 杨朱 · Kiki <[email protected]>
Co-authored-by: Jared <[email protected]>
Co-authored-by: bitliu <[email protected]>
Co-authored-by: shown <[email protected]>
Co-authored-by: Yossi Ovadia <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: cryo <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: rootfs <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>

- Updated 7 config YAML files (development, production, testing, e2e, and 3 recipes) with commented examples of category-level cache settings
- Added comprehensive documentation section explaining category-level cache configuration
- Updated semantic cache overview and in-memory cache docs with category-level examples
- Added best practices for threshold selection and privacy considerations

Co-authored-by: rootfs <[email protected]>

* Remove duplicate code in FindSimilar functions

Refactored FindSimilar() to delegate to FindSimilarWithThreshold() with default threshold instead of duplicating the entire implementation. This eliminates 226 lines of duplicate code across inmemory_cache.go and milvus_cache.go.

Co-authored-by: rootfs <[email protected]>

* Update src/semantic-router/pkg/extproc/request_handler.go

Co-authored-by: Copilot <[email protected]>

* Revert changes from unsigned commit ae39fe2

Restored the classificationText empty check that was removed in the previous commit.

Co-authored-by: rootfs <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: rootfs <[email protected]>
Co-authored-by: Huamin Chen <[email protected]>
Co-authored-by: Copilot <[email protected]>

* Allow jailbreak detection and threshold to be configured at the category level (#508)

* Initial plan

* Add category-level jailbreak detection configuration

Co-authored-by: Xunzhuo <[email protected]>

* Add documentation for category-level jailbreak settings

Co-authored-by: Xunzhuo <[email protected]>

* Update documentation for category-level jailbreak detection

- Add category-level jailbreak configuration to jailbreak-protection.md
- Update category configuration docs with jailbreak_enabled parameter
- Add security-focused configuration example
- Update global configuration docs with category override notes
- Update README to mention fine-grained security control

Co-authored-by: Xunzhuo <[email protected]>

* Add category-level jailbreak threshold configuration

- Add JailbreakThreshold field to Category struct
- Add GetJailbreakThresholdForCategory helper method
- Create CheckForJailbreakWithThreshold and AnalyzeContentForJailbreakWithThreshold methods
- Update performSecurityChecks to use category-specific threshold
- Add 5 comprehensive tests for threshold configuration
- Update example configs with threshold tuning examples
- Update documentation with threshold configuration and tuning guidelines
- Add threshold tuning guide with recommendations for different category types

Co-authored-by: Xunzhuo <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>

* Allow PII detection threshold to be set at the category level (#510)

* Initial plan

* Add category-level PII threshold support

Co-authored-by: Xunzhuo <[email protected]>

* Update documentation with API integration notes

Co-authored-by: Xunzhuo <[email protected]>

* Fix markdown linting issues

Co-authored-by: Xunzhuo <[email protected]>

---------

Co-authored-by: copilot-swe-agent[bot] <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>

* Fix: The caller information points to the wrapper function instead of the actual call location (#518)

Signed-off-by: carlory <[email protected]>

* feat: Implement hybrid cache that uses an in-memory index and a Milvus-based doc store (#504)

* feat: add HNSW index to in-memory semantic cache and implement hybrid cache that uses an in-memory index and a Milvus-based doc store

Signed-off-by: Huamin Chen <[email protected]>

* chore: run go mod tidy to clean up module dependencies

Signed-off-by: Huamin Chen <[email protected]>

* conditionally build candle cuda support

Signed-off-by: Huamin Chen <[email protected]>

* rebuild index upon restart

Signed-off-by: Huamin Chen <[email protected]>

* precommit fix

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* fix precommit

Signed-off-by: Huamin Chen <[email protected]>

* disable cuda build on ci

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

* review feedback

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: Huamin Chen <[email protected]>

* merge main to feat branch

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: carlory <[email protected]>
Signed-off-by: JaredforReal <[email protected]>
Signed-off-by: yuluo-yx <[email protected]>
Signed-off-by: Yossi Ovadia <[email protected]>
Signed-off-by: cryo <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
Co-authored-by: 杨朱 · Kiki <[email protected]>
Co-authored-by: Jared <[email protected]>
Co-authored-by: bitliu <[email protected]>
Co-authored-by: shown <[email protected]>
Co-authored-by: Yossi Ovadia <[email protected]>
Co-authored-by: Claude <[email protected]>
Co-authored-by: cryo <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: rootfs <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Xunzhuo <[email protected]>
@rootfs rootfs force-pushed the feat-candle-refactoring branch from 7d84e64 to 3230c35 Compare October 23, 2025 16:57
* chore: fix unit test

Signed-off-by: Huamin Chen <[email protected]>

* fix go vet

Signed-off-by: Huamin Chen <[email protected]>

* fix ci

Signed-off-by: Huamin Chen <[email protected]>

* fix ci

Signed-off-by: Huamin Chen <[email protected]>

* split test-binding to two stages on ci

Signed-off-by: Huamin Chen <[email protected]>

* ignore test failure due to embeddinggemma restriction

Signed-off-by: Huamin Chen <[email protected]>

* reorder ci test sequences to avoid missing models

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: Huamin Chen <[email protected]>
rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 23, 2025
rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 23, 2025
rootfs added a commit to rootfs/semantic-router.bak that referenced this pull request Oct 24, 2025
@Xunzhuo
Member

Xunzhuo commented Oct 24, 2025

🚀🔥

Xunzhuo previously approved these changes Oct 24, 2025
@Xunzhuo Xunzhuo changed the title [WIP] refactor: Implement modular candle-binding architecture (#254) refactor: Implement modular candle-binding architecture (#254) Oct 24, 2025
@rootfs
Collaborator Author

rootfs commented Oct 24, 2025

@OneZero-Y are you on the semantic router Slack, or can you reach me by email at [email protected]? Let's coauthor a blog post on this progress. Thanks.

…reads based on review (#528)

* refactor: Replace lazy_static with OnceLock for zero-cost concurrent reads based on review #266 (comment)

Signed-off-by: Huamin Chen <[email protected]>

* update tests

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: Huamin Chen <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
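
For readers unfamiliar with the pattern named in the commit above, here is a minimal sketch of a `std::sync::OnceLock` global that is written once and then read concurrently. It is not code from this PR: the `Classifier` struct and the `init_classifier`, `classifier`, and `label_count` functions are hypothetical names chosen for illustration only.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a loaded model; not a type taken from this PR.
#[derive(Debug)]
pub struct Classifier {
    pub labels: Vec<String>,
}

/// Process-wide slot: written at most once, then read without taking a lock.
static CLASSIFIER: OnceLock<Classifier> = OnceLock::new();

/// Store the classifier. A second call fails instead of replacing the value.
pub fn init_classifier(labels: Vec<String>) -> Result<(), String> {
    CLASSIFIER
        .set(Classifier { labels })
        .map_err(|_| "classifier already initialized".to_string())
}

/// Borrow the classifier if it has been initialized.
/// The returned reference is `'static`, so it can be shared across threads.
pub fn classifier() -> Option<&'static Classifier> {
    CLASSIFIER.get()
}

/// Example read path: number of labels, or `None` before initialization.
pub fn label_count() -> Option<usize> {
    classifier().map(|c| c.labels.len())
}
```

Once `init_classifier` has succeeded, every thread can call `classifier()` and hold a shared `&'static Classifier` with no guard object. A `lazy_static!` global that wraps its value in a `Mutex` or `RwLock` has to acquire that lock on each read, which is presumably the overhead the commit title refers to.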
* chore: fix lint error

Signed-off-by: Huamin Chen <[email protected]>

* chore: fix lint error

Signed-off-by: Huamin Chen <[email protected]>

---------

Signed-off-by: Huamin Chen <[email protected]>
@rootfs
Collaborator Author

rootfs commented Oct 24, 2025

Thank you all for the great work on this significant milestone. I am merging this into the main branch to trigger more CI. I will follow up in the next few days on issues and enhancements.

@rootfs rootfs merged commit 7d55463 into main Oct 24, 2025
20 of 21 checks passed
@OneZero-Y
Contributor

@rootfs I left you a message on Slack.

@rootfs
Collaborator Author

rootfs commented Oct 25, 2025

@OneZero-Y @ivarflakstad the blog post PR is here: vllm-project/vllm-project.github.io#104

@Xunzhuo Xunzhuo deleted the feat-candle-refactoring branch October 26, 2025 05:02
Development

Successfully merging this pull request may close these issues.

Refactor: Implement modular candle-binding architecture

10 participants