Feature Request: Beam Search-style Credit Assignment for Step-level Agent Evaluation #11838
TekkenSteve started this conversation in Ideas
Replies: 1 comment
Thanks for adding this here, @TekkenSteve!
Describe the feature or potential improvement
Core Idea: Beyond single-step scoring, we need Beam Search-like evaluation: does a change at Node X increase the probability of reaching high-value final outcomes?
Essence: Step-level Credit Assignment — quantifying the marginal contribution of a specific node modification to final success (e.g., "If I change Node 3, how does it affect the win rate 5 steps later?").
Concrete Scenarios:
"Whack-a-mole" regressions (RAG Pipeline): Tuning the retrieval prompt to boost recall silently broke the synthesis node downstream. Current tools show that Node A improved (90→95) while missing the spike in Node C's failure rate (5%→40%). We need to detect subtle trade-offs between nodes, not just local improvements.
Turing Test Chatbot: After tweaking a personality node, how do we prove the change is engineering progress rather than degradation? Single-turn fluency looks good, yet multi-turn consistency collapses after 10 exchanges. We need trajectory-level value estimation to quantify true improvements between versions.
Current Limitation: Isolated step scoring optimizes for a local optimum, not global success. Optimizing one node can create invisible damage downstream.
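To make the trade-off concrete, here is a minimal sketch of the kind of per-node comparison I have in mind. The function `find_tradeoffs` and the score dictionaries are hypothetical, not an existing API. It assumes every node reports a higher-is-better score in [0, 1], so Node A's 90→95 is read as a percentage and Node C's failure rate (5%→40%) is converted to a success rate (0.95→0.60).

```python
from typing import Dict, List


def find_tradeoffs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> Dict[str, List[str]]:
    """Split nodes into improved and regressed (hypothetical helper).

    Scores are assumed to be higher-is-better. Only nodes present in
    both versions are compared.
    """
    improved: List[str] = []
    regressed: List[str] = []
    for node in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[node] - baseline[node]
        if delta > tolerance:
            improved.append(node)
        elif delta < -tolerance:
            regressed.append(node)
    return {"improved": improved, "regressed": regressed}


# Numbers from the RAG scenario, under the assumptions stated above.
baseline = {"A": 0.90, "C": 0.95}
candidate = {"A": 0.95, "C": 0.60}

report = find_tradeoffs(baseline, candidate)
print(report)
```

A report that lists a node under "regressed" whenever another node is listed under "improved" is the signal that a local win may have been paid for downstream.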
Questions:
Do you support checkpoint forking + Monte Carlo rollout to estimate node "future value"?
If not, is this on the roadmap?
I'm eager to contribute implementation/feedback.
Possible Technical Direction: Leverage LangGraph checkpoints to fork state at any node, sample N continuations (different temperatures/seeds), compute the success rate as the node's Beam Value, and surface this in regression comparisons.
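A rough sketch of the estimation loop, to show the shape of what I'm asking for. Everything here is an assumption for illustration: `estimate_node_value` and `toy_rollout` are hypothetical names, the forked state is a plain dict, and the rollout is a stand-in that succeeds with a fixed probability rather than a real call that resumes a graph from a checkpoint.

```python
import copy
import random
from typing import Callable

# Hypothetical signature: run the rest of the pipeline from a forked
# state with a given seed, and report whether the final outcome succeeded.
Rollout = Callable[[dict, int], bool]


def estimate_node_value(
    state: dict,
    rollout: Rollout,
    n_samples: int = 200,
    base_seed: int = 0,
) -> float:
    """Monte Carlo estimate of the success rate when continuing from `state`."""
    successes = 0
    for i in range(n_samples):
        # Deep-copy so every continuation starts from the same snapshot.
        forked = copy.deepcopy(state)
        if rollout(forked, base_seed + i):
            successes += 1
    return successes / n_samples


def toy_rollout(state: dict, seed: int) -> bool:
    """Stand-in for the real continuation: succeeds with probability state['quality']."""
    rng = random.Random(seed)
    return rng.random() < state["quality"]


# Compare two versions of the same node using the same seeds, so the
# difference reflects the node change rather than sampling noise.
baseline_value = estimate_node_value({"quality": 0.6}, toy_rollout)
candidate_value = estimate_node_value({"quality": 0.8}, toy_rollout)
delta = candidate_value - baseline_value
print(f"baseline={baseline_value:.2f} candidate={candidate_value:.2f} delta={delta:+.2f}")
```

In a real integration, `toy_rollout` would be replaced by whatever resumes execution from the forked checkpoint, and `delta` would be the number surfaced in the regression comparison. Reusing the same seeds for both versions is a design choice to reduce variance in the comparison.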
Additional information
No response