[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh#5863
[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh#5863AndreiCautisanu wants to merge 5 commits intomainfrom
Conversation
…add remote refresh - Registry-first lookup in getLlmProvider() with enum fallback - New getStructuredOutputStrategy() on LlmProviderFactory encapsulating registry-based structured output resolution - Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id - Remote YAML fetch via HttpClient with 30s timeout and URL scheme validation - ScheduledExecutorService refreshes registry from CDN on configurable interval - 3-tier merge: classpath defaults → remote CDN → local customer override - Remote fetch failure is non-fatal at every level (logs warning, keeps previous) - remoteEnabled defaults to false — zero behavior change for existing deployments Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Outdated
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Show resolved
Hide resolved
apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java
Show resolved
Hide resolved
ldaugusto
left a comment
There was a problem hiding this comment.
Besides the reusage of our infrastructured mentioned in the comments below, I believe we could have tests mocking the remote return for loadRemoteResource(): non-200 responses, malformed YAML, or any other problem you can think of.
|
|
||
| private final LlmModelRegistryConfig config; | ||
| // volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020) | ||
| private final HttpClient httpClient; |
There was a problem hiding this comment.
This remote fetch uses java.net.http.HttpClient (standard JDK), but the rest of the backend standardizes on autoinjected JAX-RS Client; check examples in WorkspaceNameService, OllamaService, and RemoteAuthService.
It's a better choice by consistency, lifecycle management, no need for the threadpool, testability and so.
There was a problem hiding this comment.
Commit 710c806 addressed this comment by replacing the java.net.http.HttpClient usage with an injected jakarta.ws.rs.client.Client for remote registry fetches. The new constructor wiring now takes the shared JAX-RS client and loadRemoteRegistry() performs the request through that Client, satisfying the consistency request.
There was a problem hiding this comment.
@ldaugusto added the JAX-RS Client, let me know if implementation is ok
| private final LlmModelRegistryConfig config; | ||
| // volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020) | ||
| private final HttpClient httpClient; | ||
| private final ScheduledExecutorService scheduler; |
There was a problem hiding this comment.
Similarly, for reexecuting stuff at some frequency, you could reuse the same pattern from TraceThreadsClosingJob, DailyUsageReportJob, and so on.
There was a problem hiding this comment.
Commit 710c806 addressed this comment by moving the periodic reload out of the service and into the new LlmModelRegistryRefreshJob, which follows the same Dropwizard job/@On pattern used by TraceThreadsClosingJob and DailyUsageReportJob and simply delegates to registryService.reload() on its schedule.
There was a problem hiding this comment.
Updated, thanks for pointing out
- Replace java.net.http.HttpClient with JAX-RS Client (injected via Guice) for consistency with WorkspaceNameService, OllamaService, RemoteAuthService - Replace ScheduledExecutorService with LlmModelRegistryRefreshJob Quartz job matching TraceThreadsClosingJob, DailyUsageReportJob patterns - Add remote fetch tests: successful merge, non-200 fallback, malformed YAML fallback, disabled remote skips fetch Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…test Quartz @on cron replaced the ScheduledExecutorService, making the configurable interval unused. Removed to avoid misleading operators. Strengthened disabled-remote test to verify mock client is never called. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Show resolved
Hide resolved
...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java
Show resolved
Hide resolved
Backend Tests - Integration Group 122 tests 0 ✅ 13m 43s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 72 tests 0 ✅ 27m 31s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 152 tests 0 ✅ 13m 43s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 112 tests 0 ✅ 13m 46s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 43 tests 0 ✅ 13m 43s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 52 tests 0 ✅ 13m 46s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 11 tests 0 ✅ 6m 56s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 101 tests 0 ✅ 6m 57s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 21 tests 0 ✅ 5m 12s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 132 tests 0 ✅ 13m 54s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 63 tests 0 ✅ 18m 57s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 31 tests 0 ✅ 6m 52s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 163 tests 0 ✅ 19m 1s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 149 tests 6 ✅ 19m 8s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 82 tests 0 ✅ 13m 54s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Backend Tests - Integration Group 92 tests 0 ✅ 13m 53s ⏱️ For more details on these errors, see this check. Results for commit b25a462. |
Details
Wires the YAML-based LLM model registry (shipped in OPIK-5019) into the backend's provider routing and structured output detection. After this, new models added to the YAML are automatically routable — no Java enum changes needed.
Also adds remote YAML fetch from a configurable CDN URL with scheduled periodic refresh, enabling new models to reach running deployments without a redeploy. Remote fetch is disabled by default (
remoteEnabled: false), so no behavior change for existing deployments.Key changes:
getLlmProvider()with enum fallback (backward compatible)getStructuredOutputStrategy()onLlmProviderFactory— encapsulates registry-based structured output resolution, simplifying all 3 online scoring callersfindModel()disambiguates VertexAI/Gemini by qualifiedName vs bare idHttpClientwith 30s timeout, URL scheme validation (SSRF defense)ScheduledExecutorServicerefreshes registry on configurable interval (daemon thread)Rollout context (OPIK-4866)
This is step 2 of the externalization initiative. The CDN infrastructure is being set up by DevOps in OPIK-5241 (in parallel). Once the CDN URL is available, we can set
remoteEnabled: trueand add the URL into the config.Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
Unit tests (570 passing):
Tests cover:
findModel(): qualifiedName match → VertexAI, bare id → OpenAI, bare id with qualifiedName → empty (disambiguation), nonexistent → emptygetStructuredOutputStrategy(): registry model withstructuredOutput=true→ ToolCallingStrategy,structuredOutput=false→ InstructionStrategyLlmProviderFactoryTestcases pass through the new registry-first pathLlmModelRegistryServiceTestcases pass (9 existing + 4 new findModel tests)Quality checks:
Local Docker testing:
GET /api/v1/private/llm/modelsreturns 525 models across 5 providers with correctsnake_caseserializationDocumentation
N/A — backend-only, no user-visible documentation changes needed. Self-hosted configuration docs will be updated when the full externalization is complete (OPIK-5022).