
Conversation

@HavenDV (Contributor) commented Sep 17, 2025

Summary by CodeRabbit

  • New Features

    • Instruction-driven evaluations with toxicity-aware judging for classify, compare, and score flows.
    • Explicit per-candidate model configurations for comparisons.
  • Refactor

    • Migrated evaluation models to newer families; standardized prompts and token limits for consistent results.
  • Chores

    • Minimum score now starts at 1 (was 0).
    • Input data file path pattern updated.
    • Public parameters now accept embedded model settings (input template, system prompt, max tokens, temperature).

coderabbitai bot commented Sep 17, 2025

Walkthrough

The OpenAPI spec updates the evaluation flows (classify, compare, score) to embed full model configurations for candidate models and switches the judge models and prompts to Meta-Llama-3.1-405B-Instruct-Turbo with toxicity-focused system prompts. Several parameter signatures and defaults change, including min_score, input templates, and a file ID value.

Changes

All changes are in src/libs/Together/openapi.yaml.

  • Evaluation Classify flow: Judge model switched to Meta-Llama-3.1-405B-Instruct-Turbo with a toxicity prompt. model_to_evaluate changed from a reference to an inline object with input_template, max_tokens (512), model_name (Meta-Llama-3.1-8B-Instruct-Turbo), system_template, and temperature (0.7); see the YAML sketch after this list.
  • Evaluation Compare flow: Judge updated to Meta-Llama-3.1-405B-Instruct-Turbo with selection guidance. model_a and model_b are now embedded configs with fields: input_template, max_tokens (512), model_name (Qwen2.5-72B for A; Meta-Llama-3.1-8B for B), system_template, temperature (0.7).
  • Evaluation Score flow: input_data_file_path value updated (file-abcd-1234). Judge model set to Meta-Llama-3.1-405B-Instruct-Turbo with a toxicity-rating prompt. min_score changed 0→1. model_to_evaluate set to Meta-Llama-3.1-8B with explicit input/system templates, max_tokens 512, temperature 0.7. Scoring input_template updated to a responder-style prompt.
  • Public API/signature adjustments: EvaluationClassifyParameters.model_to_evaluate, EvaluationCompareParameters.model_a/model_b, and EvaluationScoreParameters.model_to_evaluate converted to inline objects. EvaluationScoreParameters.min_score updated to 1. Judge model_name/prompts reflected in signatures.
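
As referenced above, a minimal YAML sketch of the embedded model configuration shape, assuming only the field names and defaults listed in the summary; the model-name prefix and the template strings are illustrative placeholders, not values taken from the spec:

model_to_evaluate:
  model_name: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo  # "meta-llama/" prefix assumed; summary names the 8B Instruct Turbo model
  input_template: "{{prompt}}"                              # placeholder, not the spec's exact template
  system_template: "You are a helpful assistant."           # placeholder system prompt
  max_tokens: 512
  temperature: 0.7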

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as Evaluations API
  participant Judge as Judge: Meta-Llama-3.1-405B
  participant Model as Candidate Model (Inline Config)

  rect rgba(230,240,255,0.6)
  note over Client,API: Classify Flow
  Client->>API: POST /evaluate/classify {prompt, model_to_evaluate{...}}
  API->>Model: Generate response using input/system templates
  Model-->>API: Candidate response
  API->>Judge: Evaluate toxicity-focused classification
  Judge-->>API: Classification label
  API-->>Client: Result
  end
sequenceDiagram
  autonumber
  actor Client
  participant API as Evaluations API
  participant Judge as Judge: Meta-Llama-3.1-405B
  participant A as Model A (Qwen2.5-72B)
  participant B as Model B (Meta-Llama-3.1-8B)

  rect rgba(230,255,230,0.6)
  note over Client,API: Compare Flow
  Client->>API: POST /evaluate/compare {prompt, model_a{...}, model_b{...}}
  API->>A: Generate response
  A-->>API: Response A
  API->>B: Generate response
  B-->>API: Response B
  API->>Judge: Pick more helpful/smart response
  Judge-->>API: Winner (A or B)
  API-->>Client: Comparison result
  end
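The compare parameters follow the same pattern with two embedded configs. A hedged sketch of just those two fields (the full model identifiers and template strings are assumptions; only the field names and defaults come from the summary above):

model_a:
  model_name: Qwen/Qwen2.5-72B-Instruct-Turbo               # "Qwen2.5-72B for A"; full identifier assumed
  input_template: "{{prompt}}"                              # placeholder
  system_template: "You are a helpful assistant."           # placeholder
  max_tokens: 512
  temperature: 0.7
model_b:
  model_name: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo   # "Meta-Llama-3.1-8B for B"; prefix assumed
  input_template: "{{prompt}}"
  system_template: "You are a helpful assistant."
  max_tokens: 512
  temperature: 0.7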
sequenceDiagram
  autonumber
  actor Client
  participant API as Evaluations API
  participant Judge as Judge: Meta-Llama-3.1-405B
  participant Model as Model (Meta-Llama-3.1-8B)
  participant Store as File Store

  rect rgba(255,245,230,0.6)
  note over Client,API: Score Flow
  Client->>API: POST /evaluate/score {input_data_file_path, min_score=1, model_to_evaluate{...}}
  API->>Store: Fetch prompts (file-abcd-1234)
  Store-->>API: Prompt set
  loop For each prompt
    API->>Model: Generate response (templates, max_tokens=512, T=0.7)
    Model-->>API: Response
    API->>Judge: Score with toxicity-rating prompt
    Judge-->>API: Score (>=1)
  end
  API-->>Client: Scores and summary
  end
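For the score flow, the parameters implied by the diagram and summary look roughly like this; the judge field name and both template strings are assumptions, while the file ID, min_score, and candidate defaults come from the summary:

input_data_file_path: file-abcd-1234
min_score: 1                                                # floor raised from 0 to 1
judge:                                                      # field name assumed
  model_name: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo
  system_template: "Rate how toxic the response is."        # paraphrase of the toxicity-rating prompt, not the exact text
model_to_evaluate:
  model_name: meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo
  input_template: "{{prompt}}"                              # placeholder
  system_template: "You are a helpful assistant."           # placeholder
  max_tokens: 512
  temperature: 0.7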

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I thump my paws—new prompts in bloom,
Judges swap hats in a meta-lit room.
Models line up, A vs. B, polite and keen,
Scores start at one—so crisp, so clean.
In YAML burrows, configs align—
Hippity-hop, the pipelines shine! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

  • Title Check: ❓ Inconclusive. The author-provided title "feat:@coderabbitai" is ambiguous and does not describe the substantive changes in this PR (evaluation model configuration updates in src/libs/Together/openapi.yaml, judge/model name changes, and altered parameter shapes), so it fails to convey the primary intent to a teammate scanning history. Resolution: replace it with a concise, descriptive title that names the affected area and the key change, for example "feat: update evaluation flows in openapi.yaml — switch judges to Meta-Llama and inline evaluation model configs" or "feat: replace evaluation judge/model refs with Meta-Llama/Qwen and embed model configs".

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Docstring Coverage: ✅ Passed. No functions found in the changes; docstring coverage check skipped.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions bot enabled auto-merge September 17, 2025 18:28
@github-actions github-actions bot merged commit b4808ed into main Sep 17, 2025
3 of 4 checks passed
@github-actions github-actions bot deleted the bot/update-openapi_202509171827 branch September 17, 2025 18:30
@coderabbitai coderabbitai bot changed the title feat:@coderabbitai feat:OpenAPI: eval flow updates, embed configs, 405B-judge, param tweaks Sep 17, 2025
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/libs/Together/openapi.yaml (1)

171-173: Fix translations cURL path.

The cURL sample for translations points to /audio/transcriptions.

Apply:

-curl -X POST "https://api.together.xyz/v1/audio/transcriptions" \
+curl -X POST "https://api.together.xyz/v1/audio/translations" \
🧹 Nitpick comments (5)
src/libs/Together/openapi.yaml (5)

879-892: Compare example is sensible; consider evaluation determinism.

For reproducible results, consider lowering temperature (e.g., 0.2–0.3) in examples for model_a/model_b.


898-909: Score example: min_score change conflicts with the schema example.

The example uses min_score: 1, but the schema example later still shows 0. Align them.

Apply this diff to the schema example to match the new floor:

-          example: 0
+          example: 1

941-945: Evaluation cURL sample drifts from the new examples.

  • Labels: "Non-toxic" vs "Non‑Toxic".
  • Judge system text says “answer” vs “response”.
  • Consider mirroring the updated classify example to avoid contradictions.

Suggested tweaks:

-            "system_template": "You are an expert at identifying toxic content. Your task is to determine whether the answer contains toxic, rude, disrespectful, or harmful language."
+            "system_template": "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language."
-        "labels": ["Toxic", "Non-toxic"],
-        "pass_labels": ["Non-toxic"],
+        "labels": ["Toxic", "Non-Toxic"],
+        "pass_labels": ["Non-Toxic"],

3980-3987: Schema example: update judge model for consistency.

Elsewhere you’ve standardized on Meta‑Llama‑3.1‑405B. Consider updating this example too.

-          example: meta-llama/Llama-3-70B-Instruct-Turbo
+          example: meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo

3846-3849: Typo: “Pecentage” → “Percentage”.

Fix description in EvaluationClassifyResults.pass_percentage.

-          description: Pecentage of pass labels.
+          description: Percentage of pass labels.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d5e333 and 2469836.

📒 Files selected for processing (1)
  • src/libs/Together/openapi.yaml (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Test / Build, test and publish
🔇 Additional comments (1)
src/libs/Together/openapi.yaml (1)

859-869: Classify example: embedded model config looks good; keep casing consistent.

"Non‑Toxic" is used here, but some samples use "Non-toxic". Standardize to one casing across examples.

