"""Evaluates if the generated mermaid diagram is valid.
322
+
323
+
This method validates the mermaid diagram in the output, handling
324
+
retries and logging the results.
325
+
326
+
Args:
327
+
ctx: The evaluator context.
328
+
329
+
Returns:
330
+
1.0 if the diagram is valid, 0.0 otherwise.
331
+
"""
277
332
# Skip validation if there was a failure
278
333
ifctx.outputandctx.output.failure_reason:
279
334
logfire.info(
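The evaluator above scores a diagram as 1.0/0.0 and short-circuits when an upstream failure was recorded. A minimal standalone sketch of that contract, using simplified stand-in types rather than the PR's actual `EvaluatorContext` (the trivial header check below is a hypothetical placeholder for the real mermaid validator):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MermaidOutput:
    """Simplified stand-in for the PR's output type."""
    diagram: str = ""
    failure_reason: Optional[str] = None


def evaluate_validity(output: Optional[MermaidOutput]) -> float:
    """Return 1.0 for a plausibly valid diagram, 0.0 otherwise.

    Mirrors the evaluator's contract: skip validation (score 0.0) when a
    failure reason was recorded; the header check stands in for the real
    mermaid validator.
    """
    if output and output.failure_reason:
        return 0.0  # upstream failure: nothing to validate
    if output and output.diagram.lstrip().startswith(
        ("graph", "flowchart", "sequenceDiagram")
    ):
        return 1.0
    return 0.0
```

Returning a float rather than a bool keeps the score composable with other evaluators that average results across cases.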
@@ -331,14 +386,17 @@ async def evaluate(
 async def fix_mermaid_diagram(
     inputs: MermaidInput, model: str = DEFAULT_MODEL
 ) -> MermaidOutput:
-    """Fix an invalid mermaid diagram using the agent with multiple MCP servers.
+    """Fixes an invalid mermaid diagram using an agent with multiple MCP servers.
+
+    This function runs an agent to fix a given mermaid diagram, handling
+    various exceptions and capturing metrics.
 
     Args:
-        inputs: The input containing the invalid diagram
-        model: The model to use for the agent
+        inputs: The input containing the invalid diagram.
+        model: The model to use for the agent.
 
     Returns:
-        MermaidOutput with the fixed diagram and captured metrics
+        A MermaidOutput object with the fixed diagram and captured metrics.
     """
     query = f"Add the current time and fix the mermaid diagram syntax using the validator: {inputs.invalid_diagram}. Return only the fixed mermaid diagram between backticks."
@@ -477,13 +535,16 @@ async def _run_agent():
 def create_evaluation_dataset(
     judge_model: str = DEFAULT_MODEL,
 ) -> Dataset[MermaidInput, MermaidOutput, Any]:
-    """Create the dataset for evaluating mermaid diagram fixing.
+    """Creates the dataset for evaluating mermaid diagram fixing.
+
+    This function constructs a dataset with test cases of varying difficulty
+    and a set of evaluators for judging the results.
 
     Args:
-        judge_model: The model to use for LLM judging
+        judge_model: The model to use for LLM judging.
 
     Returns:
-        The evaluation dataset
+        The evaluation dataset.
     """
     return Dataset[MermaidInput, MermaidOutput, Any](
         # Construct 3 tests, each asks the LLM to fix an invalid mermaid diagram of increasing difficulty
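`create_evaluation_dataset` builds three cases of increasing difficulty plus evaluators. A minimal standalone sketch of that shape, using hypothetical plain dataclasses instead of the actual `Dataset` generic (the broken diagrams below are illustrative placeholders, not the PR's test cases):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Case:
    """One evaluation case: a name and an invalid diagram to fix."""
    name: str
    invalid_diagram: str


@dataclass
class EvalDataset:
    cases: List[Case] = field(default_factory=list)


def build_dataset() -> EvalDataset:
    # Three invalid diagrams of increasing difficulty, mirroring the
    # PR's three-test setup.
    return EvalDataset(
        cases=[
            Case("easy", "graph TD; A-->"),          # dangling edge
            Case("medium", "flowchart LR; A--|x|B"), # malformed edge label
            Case("hard", "sequenceDiagram\nAlice->>"),  # incomplete message
        ]
    )
```

Keeping each case named makes per-case scores easy to report alongside the aggregate.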