[OPIK-5020] [BE] feat: wire model registry into provider routing and add remote refresh #5863

Open
AndreiCautisanu wants to merge 5 commits into main from andreicautisanu/OPIK-5020-wire-registry-routing
Conversation

@AndreiCautisanu AndreiCautisanu commented Mar 25, 2026

Details

Wires the YAML-based LLM model registry (shipped in OPIK-5019) into the backend's provider routing and structured output detection. After this, new models added to the YAML are automatically routable — no Java enum changes needed.
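A minimal sketch of what registry-first routing with an enum fallback could look like. The class, enum, and model names here are hypothetical stand-ins, not the PR's actual code; the registry is reduced to a plain map from model name to provider.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the compiled-in model enum that predates the YAML registry.
enum LegacyModel {
    GPT_4O("gpt-4o", "OPEN_AI"),
    CLAUDE_3_OPUS("claude-3-opus", "ANTHROPIC");

    final String id;
    final String provider;

    LegacyModel(String id, String provider) {
        this.id = id;
        this.provider = provider;
    }

    static Optional<String> providerFor(String model) {
        for (LegacyModel m : values()) {
            if (m.id.equals(model)) {
                return Optional.of(m.provider);
            }
        }
        return Optional.empty();
    }
}

// Hypothetical resolver: consults the registry first, then the enum.
final class ProviderResolver {
    // Model name -> provider, as it might look after loading the YAML (assumption).
    private final Map<String, String> registry;

    ProviderResolver(Map<String, String> registry) {
        this.registry = Map.copyOf(registry);
    }

    // Registry entries win; the enum is consulted only when the registry has no match.
    Optional<String> resolve(String model) {
        return Optional.ofNullable(registry.get(model))
                .or(() -> LegacyModel.providerFor(model));
    }
}
```

With this shape, a model that exists only in the YAML resolves through the registry, while a model known only to the enum still resolves through the fallback, which is what keeps the change backward compatible.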

Also adds remote YAML fetch from a configurable CDN URL with scheduled periodic refresh, enabling new models to reach running deployments without a redeploy. Remote fetch is disabled by default (remoteEnabled: false), so no behavior change for existing deployments.
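As an illustration only, the gating described above might be sketched as follows. `RemoteRegistryUrl` is a hypothetical helper, and the set of allowed schemes is an assumption; the PR does not state which schemes it accepts.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: decides whether a remote fetch should be attempted at all.
final class RemoteRegistryUrl {
    // Assumption: only web schemes are acceptable; file:, jar:, etc. are rejected.
    private static final Set<String> ALLOWED_SCHEMES = Set.of("https", "http");

    private RemoteRegistryUrl() {
    }

    // Returns the validated URI, or empty when remote fetch is disabled or the URL is unusable.
    static Optional<URI> resolve(boolean remoteEnabled, String rawUrl) {
        if (!remoteEnabled || rawUrl == null || rawUrl.isBlank()) {
            return Optional.empty();
        }
        URI uri;
        try {
            uri = URI.create(rawUrl.trim());
        } catch (IllegalArgumentException e) {
            return Optional.empty(); // syntactically invalid URL
        }
        String scheme = uri.getScheme();
        if (scheme == null
                || !ALLOWED_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT))
                || uri.getHost() == null) {
            return Optional.empty();
        }
        return Optional.of(uri);
    }
}
```

Returning an empty `Optional` instead of throwing matches the stated goal that a disabled or misconfigured remote source leaves existing deployments unaffected.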

Key changes:

  • Registry-first lookup in getLlmProvider() with enum fallback (backward compatible)
  • New getStructuredOutputStrategy() on LlmProviderFactory — encapsulates registry-based structured output resolution, simplifying all 3 online scoring callers
  • Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
  • Remote YAML fetch via HttpClient with 30s timeout, URL scheme validation (SSRF defense)
  • ScheduledExecutorService refreshes registry on configurable interval (daemon thread)
  • 3-tier merge: classpath defaults → remote CDN → local customer override
  • Remote fetch failure is non-fatal at every level (logs warning, keeps previous registry)
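The 3-tier merge and the non-fatal refresh from the list above can be sketched roughly as below. This is a simplified model under stated assumptions: `RefreshableRegistry` is a hypothetical name, each layer is flattened to a map from model name to a string value, and the merge is key-by-key with later layers winning. The real merge granularity is not shown in this PR description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical registry holder: classpath defaults -> remote CDN -> local customer override.
final class RefreshableRegistry {
    private final Map<String, String> classpathDefaults;
    private final Map<String, String> localOverride;
    // volatile so a scheduled reload publishes the new snapshot to request threads.
    private volatile Map<String, String> current;

    RefreshableRegistry(Map<String, String> classpathDefaults, Map<String, String> localOverride) {
        this.classpathDefaults = Map.copyOf(classpathDefaults);
        this.localOverride = Map.copyOf(localOverride);
        this.current = merge(List.of(this.classpathDefaults, this.localOverride));
    }

    // Later layers override earlier ones, key by key (assumed merge rule).
    private static Map<String, String> merge(List<Map<String, String>> layers) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> layer : layers) {
            merged.putAll(layer);
        }
        return Map.copyOf(merged);
    }

    // Returns true when the registry was replaced, false when the previous one was kept.
    boolean reload(Supplier<Map<String, String>> remoteFetch) {
        try {
            Map<String, String> remote = remoteFetch.get();
            current = merge(List.of(classpathDefaults, remote, localOverride));
            return true;
        } catch (RuntimeException e) {
            // Non-fatal: a real service would log a warning here; the old snapshot stays in place.
            return false;
        }
    }

    Map<String, String> snapshot() {
        return current;
    }
}
```

Because the new snapshot is built fully before the single assignment to `current`, readers see either the previous registry or the new one, never a partially merged state.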

Rollout context (OPIK-4866)

This is step 2 of the externalization initiative. The CDN infrastructure is being set up by DevOps in OPIK-5241 (in parallel). Once the CDN URL is available, we can set remoteEnabled: true and add the URL into the config.

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5020
  • Part of OPIK-4866

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: Full implementation with human guidance on design decisions
    • Human verification: Code review (2x self-review iterations), local Docker testing, curl endpoint verification, Playground UI validation

Testing

Unit tests (570 passing):

cd apps/opik-backend && mvn test -Dtest="LlmProviderFactoryTest,LlmModelRegistryServiceTest"

Tests cover:

  • findModel(): qualifiedName match → VertexAI, bare id → OpenAI, bare id with qualifiedName → empty (disambiguation), nonexistent → empty
  • getStructuredOutputStrategy(): registry model with structuredOutput=true → ToolCallingStrategy, structuredOutput=false → InstructionStrategy
  • All 555 existing LlmProviderFactoryTest cases pass through the new registry-first path
  • All 13 LlmModelRegistryServiceTest cases pass (9 existing + 4 new findModel tests)
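One way to read the findModel() cases listed above is as a two-pass lookup: match on qualifiedName first, then on bare id only for entries that have no qualifiedName. The sketch below follows that reading; `ModelEntry`, `ModelLookup`, and the model names are hypothetical, and the real implementation may differ.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical registry entry; qualifiedName is null for providers that use bare ids.
record ModelEntry(String id, String qualifiedName, String provider) {
}

final class ModelLookup {
    private final List<ModelEntry> models;

    ModelLookup(List<ModelEntry> models) {
        this.models = List.copyOf(models);
    }

    Optional<ModelEntry> findModel(String name) {
        // Pass 1: exact match on the qualified name (e.g. a provider-prefixed model).
        for (ModelEntry m : models) {
            if (name.equals(m.qualifiedName())) {
                return Optional.of(m);
            }
        }
        // Pass 2: bare id, but only for entries without a qualified name, so that
        // a qualified entry cannot be reached through its ambiguous bare id.
        for (ModelEntry m : models) {
            if (m.qualifiedName() == null && name.equals(m.id())) {
                return Optional.of(m);
            }
        }
        return Optional.empty();
    }
}
```

Under this reading, looking up the bare id of an entry that carries a qualifiedName returns empty, which corresponds to the "bare id with qualifiedName → empty" test case.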

Quality checks:

mvn compile -DskipTests  # passes
mvn spotless:check        # passes

Local Docker testing:

  • Verified GET /api/v1/private/llm/models returns 525 models across 5 providers with correct snake_case serialization
  • Sent messages in Playground and scored traces using various models — full routing path works end-to-end through registry-first lookup

Documentation

N/A — backend-only, no user-visible documentation changes needed. Self-hosted configuration docs will be updated when the full externalization is complete (OPIK-5022).

…add remote refresh

- Registry-first lookup in getLlmProvider() with enum fallback
- New getStructuredOutputStrategy() on LlmProviderFactory encapsulating
  registry-based structured output resolution
- Two-pass findModel() disambiguates VertexAI/Gemini by qualifiedName vs bare id
- Remote YAML fetch via HttpClient with 30s timeout and URL scheme validation
- ScheduledExecutorService refreshes registry from CDN on configurable interval
- 3-tier merge: classpath defaults → remote CDN → local customer override
- Remote fetch failure is non-fatal at every level (logs warning, keeps previous)
- remoteEnabled defaults to false — zero behavior change for existing deployments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the java, Backend, and tests labels Mar 25, 2026
@AndreiCautisanu AndreiCautisanu marked this pull request as ready for review March 25, 2026 18:35
@AndreiCautisanu AndreiCautisanu requested a review from a team as a code owner March 25, 2026 18:35

@ldaugusto ldaugusto left a comment


Besides reusing our existing infrastructure, as mentioned in the comments below, I believe we should have tests that mock the remote response for loadRemoteResource(): non-200 responses, malformed YAML, or any other failure you can think of.


private final LlmModelRegistryConfig config;
// volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020)
private final HttpClient httpClient;

This remote fetch uses java.net.http.HttpClient (standard JDK), but the rest of the backend standardizes on an auto-injected JAX-RS Client; see the examples in WorkspaceNameService, OllamaService, and RemoteAuthService.

It's the better choice for consistency, lifecycle management, and testability, and it removes the need for a separate thread pool.


Commit 710c806 addressed this comment by replacing the java.net.http.HttpClient usage with an injected jakarta.ws.rs.client.Client for remote registry fetches. The new constructor wiring now takes the shared JAX-RS client and loadRemoteRegistry() performs the request through that Client, satisfying the consistency request.

Contributor Author

@ldaugusto I added the JAX-RS Client; let me know if the implementation is OK.

private final LlmModelRegistryConfig config;
// volatile: reload() will be called from a scheduler thread (remote YAML refresh, OPIK-5020)
private final HttpClient httpClient;
private final ScheduledExecutorService scheduler;

Similarly, for running a task at a fixed frequency, you could reuse the same pattern as TraceThreadsClosingJob, DailyUsageReportJob, and so on.


Commit 710c806 addressed this comment by moving the periodic reload out of the service and into the new LlmModelRegistryRefreshJob, which follows the same Dropwizard job/@On pattern used by TraceThreadsClosingJob and DailyUsageReportJob and simply delegates to registryService.reload() on its schedule.

Contributor Author

Updated, thanks for pointing that out.

Andrei Căutișanu and others added 2 commits March 26, 2026 16:53
- Replace java.net.http.HttpClient with JAX-RS Client (injected via Guice)
  for consistency with WorkspaceNameService, OllamaService, RemoteAuthService
- Replace ScheduledExecutorService with LlmModelRegistryRefreshJob Quartz job
  matching TraceThreadsClosingJob, DailyUsageReportJob patterns
- Add remote fetch tests: successful merge, non-200 fallback, malformed YAML
  fallback, disabled remote skips fetch

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…test

Quartz @On cron replaced the ScheduledExecutorService, making the
configurable interval unused. Removed to avoid misleading operators.
Strengthened disabled-remote test to verify mock client is never called.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions

Backend Tests - Integration Group 12

2 tests   0 ✅  13m 43s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 7

2 tests   0 ✅  27m 31s ⏱️
3 suites  0 💤
3 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 15

2 tests   0 ✅  13m 43s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 11

2 tests   0 ✅  13m 46s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 4

3 tests   0 ✅  13m 43s ⏱️
3 suites  1 💤
3 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 5

2 tests   0 ✅  13m 46s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 1

1 tests   0 ✅  6m 56s ⏱️
1 suites  0 💤
1 files    0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 10

1 tests   0 ✅  6m 57s ⏱️
1 suites  0 💤
1 files    0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 2

1 tests   0 ✅  5m 12s ⏱️
1 suites  0 💤
1 files    0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 13

2 tests   0 ✅  13m 54s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 6

3 tests   0 ✅  18m 57s ⏱️
3 suites  0 💤
3 files    0 ❌  3 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 3

1 tests   0 ✅  6m 52s ⏱️
1 suites  0 💤
1 files    0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 16

3 tests   0 ✅  19m 1s ⏱️
3 suites  0 💤
3 files    0 ❌  3 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 14

9 tests   6 ✅  19m 8s ⏱️
5 suites  0 💤
5 files    0 ❌  3 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 8

2 tests   0 ✅  13m 54s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.

@github-actions

Backend Tests - Integration Group 9

2 tests   0 ✅  13m 53s ⏱️
2 suites  0 💤
2 files    0 ❌  2 🔥

For more details on these errors, see this check.

Results for commit b25a462.
