[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh by AndreiCautisanu · Pull Request #5863 · comet-ml/opik

AndreiCautisanu · 2026-03-25T18:07:44Z

Details

Wires the YAML-based LLM model registry (shipped in OPIK-5019) into the backend's provider routing and structured output detection. After this, new models added to the YAML are automatically routable — no Java enum changes needed.
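A minimal sketch of the registry-first lookup with enum fallback described above. All type and model names here (`ProviderSketch`, `LegacyModelSketch`, `RegistryFirstLookupSketch`, `claude-new`) are hypothetical stand-ins for illustration and are not the actual Opik classes.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-ins; the real Opik types and names may differ.
enum ProviderSketch { OPEN_AI, ANTHROPIC }

// Hypothetical legacy enum that hard-codes the model-to-provider mapping.
enum LegacyModelSketch {
    GPT_4O("gpt-4o", ProviderSketch.OPEN_AI);

    final String modelName;
    final ProviderSketch provider;

    LegacyModelSketch(String modelName, ProviderSketch provider) {
        this.modelName = modelName;
        this.provider = provider;
    }

    static Optional<ProviderSketch> providerOf(String modelName) {
        for (LegacyModelSketch m : values()) {
            if (m.modelName.equals(modelName)) {
                return Optional.of(m.provider);
            }
        }
        return Optional.empty();
    }
}

final class RegistryFirstLookupSketch {
    // Registry contents as they might look after parsing the YAML (model name -> provider).
    private final Map<String, ProviderSketch> registry;

    RegistryFirstLookupSketch(Map<String, ProviderSketch> registry) {
        this.registry = Map.copyOf(registry);
    }

    // Consult the registry first; fall back to the enum only when the registry has no entry.
    ProviderSketch getLlmProvider(String modelName) {
        return Optional.ofNullable(registry.get(modelName))
                .or(() -> LegacyModelSketch.providerOf(modelName))
                .orElseThrow(() -> new IllegalArgumentException("Unknown model: " + modelName));
    }
}
```

With this shape, a model that exists only in the YAML resolves without touching the enum, while models known only to the enum keep working as before.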

Also adds remote YAML fetch from a configurable CDN URL with scheduled periodic refresh, enabling new models to reach running deployments without a redeploy. Remote fetch is disabled by default (remoteEnabled: false), so no behavior change for existing deployments.
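Because the CDN URL comes from configuration, a scheme check is one way to guard the fetch. The sketch below assumes only `http` and `https` are accepted and that a disabled flag short-circuits everything; the class name, the allowed scheme set, and the example URLs are assumptions, not the PR's actual code.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class RemoteUrlGuardSketch {
    // Assumption: only these schemes are allowed for the remote registry URL.
    private static final Set<String> ALLOWED_SCHEMES = Set.of("http", "https");

    // Returns the URI to fetch, or empty when remote fetch is disabled or the URL is rejected.
    static Optional<URI> resolve(boolean remoteEnabled, String remoteUrl) {
        if (!remoteEnabled || remoteUrl == null || remoteUrl.isBlank()) {
            return Optional.empty();
        }
        try {
            URI uri = new URI(remoteUrl.trim());
            String scheme = uri.getScheme();
            boolean schemeOk = scheme != null
                    && ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
            // Reject non-HTTP schemes (file:, ftp:, jar:, ...) and URLs without a host.
            if (!schemeOk || uri.getHost() == null) {
                return Optional.empty();
            }
            return Optional.of(uri);
        } catch (URISyntaxException e) {
            return Optional.empty();
        }
    }
}
```

Returning an empty `Optional` instead of throwing keeps a misconfigured URL from affecting startup, in line with the "disabled by default, no behavior change" goal.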

Key changes:

- Registry-first lookup in getLlmProvider() with enum fallback (backward compatible)
- New getStructuredOutputStrategy() on LlmProviderFactory, which encapsulates registry-based structured output resolution and simplifies all 3 online scoring callers
- Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
- Remote YAML fetch via HttpClient with 30s timeout, URL scheme validation (SSRF defense)
- ScheduledExecutorService refreshes registry on configurable interval (daemon thread)
- 3-tier merge: classpath defaults → remote CDN → local customer override
- Remote fetch failure is non-fatal at every level (logs warning, keeps previous registry)
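The 3-tier merge can be sketched roughly as follows. This assumes a key-by-key override where each later tier wins, and it flattens registry entries to plain strings for brevity; the real merge granularity and types in LlmModelRegistryService may differ, and `RegistryMergeSketch` is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class RegistryMergeSketch {
    // Later tiers override earlier ones key by key: classpath defaults -> remote CDN -> local override.
    // An absent remote tier (fetch disabled or failed) is simply skipped.
    static Map<String, String> merge(Map<String, String> classpathDefaults,
                                     Optional<Map<String, String>> remote,
                                     Map<String, String> localOverride) {
        Map<String, String> merged = new LinkedHashMap<>(classpathDefaults);
        remote.ifPresent(merged::putAll);
        merged.putAll(localOverride);
        return merged;
    }
}
```

Modeling the remote tier as an `Optional` makes the non-fatal failure case explicit: when the fetch yields nothing, the result is just defaults plus the local override.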

Rollout context (OPIK-4866)

This is step 2 of the externalization initiative. The CDN infrastructure is being set up by DevOps in OPIK-5241 (in parallel). Once the CDN URL is available, we can set remoteEnabled: true and add the URL into the config.

Change checklist

User facing
Documentation update

Issues

OPIK-5020
Part of OPIK-4866

AI-WATERMARK

AI-WATERMARK: yes

If yes:
- Tools: Claude Code
- Model(s): Claude Opus 4.6
- Scope: Full implementation with human guidance on design decisions
- Human verification: Code review (2x self-review iterations), local Docker testing, curl endpoint verification, Playground UI validation

Testing

Unit tests (570 passing):

cd apps/opik-backend && mvn test -Dtest="LlmProviderFactoryTest,LlmModelRegistryServiceTest"

Tests cover:

findModel(): qualifiedName match → VertexAI, bare id → OpenAI, bare id with qualifiedName → empty (disambiguation), nonexistent → empty
getStructuredOutputStrategy(): registry model with structuredOutput=true → ToolCallingStrategy, structuredOutput=false → InstructionStrategy
All 555 existing LlmProviderFactoryTest cases pass through the new registry-first path
All 13 LlmModelRegistryServiceTest cases pass (9 existing + 4 new findModel tests)
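One reading of the two-pass findModel() behavior that matches the four test cases listed above is sketched here. The record, the provider strings, and the model names are hypothetical; the actual implementation may structure its passes differently.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical registry entry; qualifiedName is null when the model has no provider-qualified name.
record ModelEntrySketch(String id, String qualifiedName, String provider) {}

final class FindModelSketch {
    static Optional<ModelEntrySketch> findModel(List<ModelEntrySketch> models, String name) {
        // Pass 1: exact match on qualifiedName.
        for (ModelEntrySketch m : models) {
            if (name.equals(m.qualifiedName())) {
                return Optional.of(m);
            }
        }
        // Pass 2: match on bare id, but only for entries that have no qualifiedName.
        // An entry that declares a qualifiedName must be addressed by it, which is
        // what keeps a bare id from resolving to the wrong provider.
        for (ModelEntrySketch m : models) {
            if (m.qualifiedName() == null && name.equals(m.id())) {
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }
}
```

Under these assumptions, looking up an entry's qualifiedName resolves to that entry, a bare id resolves only for entries without a qualifiedName, and everything else comes back empty.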

Quality checks:

mvn compile -DskipTests  # passes
mvn spotless:check        # passes

Local Docker testing:

Verified GET /api/v1/private/llm/models returns 525 models across 5 providers with correct snake_case serialization
Sent messages in Playground and scored traces using various models — full routing path works end-to-end through registry-first lookup

Documentation

N/A — backend-only, no user-visible documentation changes needed. Self-hosted configuration docs will be updated when the full externalization is complete (OPIK-5022).

…add remote refresh

- Registry-first lookup in getLlmProvider() with enum fallback
- New getStructuredOutputStrategy() on LlmProviderFactory encapsulating registry-based structured output resolution
- Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
- Remote YAML fetch via HttpClient with 30s timeout and URL scheme validation
- ScheduledExecutorService refreshes registry from CDN on configurable interval
- 3-tier merge: classpath defaults → remote CDN → local customer override
- Remote fetch failure is non-fatal at every level (logs warning, keeps previous)
- remoteEnabled defaults to false: zero behavior change for existing deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java

apps/opik-backend/config.yml

… remote

ldaugusto

Besides reusing our existing infrastructure, as mentioned in the comments below, I believe we should have tests that mock the remote response for loadRemoteResource(): non-200 responses, malformed YAML, or any other failure you can think of.
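As a rough illustration of what such tests would exercise, here is a sketch of reload logic with a pluggable fetcher, so a test can simulate a failing remote. It is not the real loadRemoteResource(); `ReloadFallbackSketch` and the `Supplier`-based fetcher are assumptions made to keep the example free of HTTP and mocking libraries.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class ReloadFallbackSketch {
    private final AtomicReference<Map<String, String>> current;
    // Stand-in for the remote fetch; expected to throw on non-200 responses or malformed YAML.
    private final Supplier<Map<String, String>> remoteFetcher;

    ReloadFallbackSketch(Map<String, String> initial, Supplier<Map<String, String>> remoteFetcher) {
        this.current = new AtomicReference<>(Map.copyOf(initial));
        this.remoteFetcher = remoteFetcher;
    }

    // Returns true when the registry was replaced; on any failure the previous registry is kept.
    boolean reload() {
        try {
            current.set(Map.copyOf(remoteFetcher.get()));
            return true;
        } catch (RuntimeException e) {
            System.err.println("Remote registry fetch failed, keeping previous registry: " + e.getMessage());
            return false;
        }
    }

    Map<String, String> registry() {
        return current.get();
    }
}
```

A test can then pass a fetcher that throws (for example to mimic an HTTP 503 or a parse error) and assert that registry() is unchanged after reload().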

ldaugusto · 2026-03-26T11:04:52Z

apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java


    private final LlmModelRegistryConfig config;
-    // volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020)
+    private final HttpClient httpClient;


This remote fetch uses java.net.http.HttpClient (standard JDK), but the rest of the backend standardizes on an auto-injected JAX-RS Client; see the examples in WorkspaceNameService, OllamaService, and RemoteAuthService.

It's the better choice for consistency, lifecycle management, and testability, and it removes the need for a dedicated thread pool.

Commit 710c806 addressed this comment by replacing the java.net.http.HttpClient usage with an injected jakarta.ws.rs.client.Client for remote registry fetches. The new constructor wiring now takes the shared JAX-RS client and loadRemoteRegistry() performs the request through that Client, satisfying the consistency request.

@ldaugusto I added the JAX-RS Client; let me know if the implementation is OK.

ldaugusto · 2026-03-26T11:20:23Z

apps/opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryService.java

    private final LlmModelRegistryConfig config;
-    // volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020)
+    private final HttpClient httpClient;
+    private final ScheduledExecutorService scheduler;


Similarly, for re-executing work at a fixed frequency, you could reuse the same pattern as TraceThreadsClosingJob, DailyUsageReportJob, and so on.

Commit 710c806 addressed this comment by moving the periodic reload out of the service and into the new LlmModelRegistryRefreshJob, which follows the same Dropwizard job/@On pattern used by TraceThreadsClosingJob and DailyUsageReportJob and simply delegates to registryService.reload() on its schedule.

Updated, thanks for pointing that out.

- Replace java.net.http.HttpClient with JAX-RS Client (injected via Guice) for consistency with WorkspaceNameService, OllamaService, RemoteAuthService
- Replace ScheduledExecutorService with LlmModelRegistryRefreshJob Quartz job matching TraceThreadsClosingJob, DailyUsageReportJob patterns
- Add remote fetch tests: successful merge, non-200 fallback, malformed YAML fallback, disabled remote skips fetch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


…test

Quartz @On cron replaced the ScheduledExecutorService, making the configurable interval unused. Removed to avoid misleading operators. Strengthened disabled-remote test to verify mock client is never called.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

...opik-backend/src/main/java/com/comet/opik/infrastructure/llm/LlmModelRegistryRefreshJob.java

github-actions · 2026-03-27T16:09:29Z