LangSmith Experiments on Single-turn dataset #245

@yangm2

This issue captures ideas for experiments to run against the single-turn baseline results:

  • Gemini 2.5 Pro w/o RAG - quantify how much RAG improves correctness
  • Gemini 3.0 preview w/ RAG - quantify correctness, tokens per second, tone, and cost per answer
  • ChatGPT frontier model (5.2?) w/ VertexSearch - compare Gemini vs. ChatGPT
  • thinking budget (in tokens) - investigate the correlation between thinking budget and correctness, and between thinking budget and cost per answer
  • Gemini 2.5 Pro w/ few-shot examples ??? - correctness
  • fine-tune Gemini 2.5 Pro ??? - correctness
  • A/B-test variations on the system prompt - correctness
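
Several of the items above reduce to the same few numbers per configuration: accuracy, cost per answer, and a correlation against thinking budget. A minimal sketch of those calculations follows, assuming each evaluated answer has been exported into a plain record. The `RunResult` fields and the configuration names are hypothetical and do not reflect a LangSmith schema; the sketch uses only the Python standard library.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class RunResult:
    """One evaluated answer. Field names are hypothetical, not a LangSmith schema."""
    config: str           # hypothetical label, e.g. "gemini-2.5-pro+rag"
    correct: bool         # whether the evaluator judged the answer correct
    thinking_budget: int  # thinking budget in tokens
    cost_usd: float       # cost of producing this answer


def _rows(results, config):
    """Select the results for one configuration, failing loudly if there are none."""
    rows = [r for r in results if r.config == config]
    if not rows:
        raise ValueError(f"no results for config {config!r}")
    return rows


def accuracy(results, config):
    """Fraction of answers judged correct for one configuration."""
    rows = _rows(results, config)
    return sum(r.correct for r in rows) / len(rows)


def cost_per_answer(results, config):
    """Mean cost in USD per answer for one configuration."""
    rows = _rows(results, config)
    return sum(r.cost_usd for r in rows) / len(rows)


def accuracy_lift(results, treatment, baseline):
    """Absolute accuracy difference, e.g. with RAG minus without RAG."""
    return accuracy(results, treatment) - accuracy(results, baseline)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

For the thinking-budget item, one option is to pass `pearson` the list of `thinking_budget` values alongside either the 0/1 correctness values or the `cost_usd` values. Because correctness is binary, this gives a point-biserial correlation, which is a rough signal rather than a full analysis.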

Labels: backend (Bot implementation and other backend concerns)
