This document focuses on running Code2Logic benchmarks **with an LLM enabled**, including how to force **Claude (Anthropic)** via provider/model selection.
## What benchmarks measure (important)

- **Format / project benchmarks** measure **reproduction quality from a spec** (structure + syntax + similarity heuristics).
- **High scores are not proof of runtime equivalence.** Runtime equivalence is validated only by tests / behavioral checks.
- `--no-llm` is a **pipeline/sanity mode** (template fallback), not meaningful for comparing LLM quality.

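To make "similarity heuristics" concrete, here is a minimal sketch of one such check using Python's `difflib`. This is illustrative only; the benchmark's actual scoring combines structural, syntactic, and similarity signals and may differ.

```python
import difflib


def similarity_score(original: str, reproduced: str) -> float:
    """Rough text-similarity heuristic in [0.0, 1.0] between the original
    source and an LLM reproduction. Illustrative only -- not the
    benchmark's real metric."""
    return difflib.SequenceMatcher(None, original, reproduced).ratio()
```

A score of 1.0 means the texts match exactly; values near 1.0 indicate close surface similarity, which (per the caveat above) still says nothing about runtime equivalence.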
## Key artifacts
### `project.toon`
Project-level TOON (structure of modules/classes/functions). Good for “big picture”.
### `function.toon`
Function-logic TOON (detailed per-function index). In this repo, `function.toon` is generated by:

```bash
python examples/15_unified_benchmark.py --type format --workers 2
```

### Output token limit
```bash
python examples/15_unified_benchmark.py --type format --max-tokens 2500
```
Guidance:
- Lower `--max-tokens` reduces cost and latency, but may reduce reproduction quality.
- Increase `--max-tokens` for larger files/specs, but expect slower runs.
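A rough way to sanity-check a `--max-tokens` budget against a source file is a character-based estimate. The ~4-characters-per-token ratio below is a common rule of thumb for English text and code, not the provider's real tokenizer:

```python
def rough_token_estimate(text: str) -> int:
    """Very rough token estimate (~4 characters per token).
    Useful for sanity-checking a --max-tokens budget; a provider's
    actual tokenizer will give different counts."""
    return max(1, len(text) // 4)
```

If the estimate for a source file comfortably exceeds your `--max-tokens` value, expect truncated reproductions and raise the limit.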
## Troubleshooting
- **No provider available**: configure keys/models in `.env` or via `code2logic llm ...` (see `08-llm-integration.md`).
- **Rate limited**: reduce `--workers`, switch to a cheaper/faster model (e.g. Haiku), or change providers.
- **Weird output (explanations instead of code)**: use stricter prompts or lower the temperature on the provider side; the benchmark runner already tries to extract fenced code blocks.
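The fenced-block extraction mentioned above can be sketched as follows. This is a minimal illustration using a regular expression, not the runner's actual implementation:

```python
import re


def extract_fenced_code(text: str) -> str:
    """Return the body of the first fenced code block in an LLM reply,
    or the raw text unchanged if no fence is found. Sketch only --
    the benchmark runner's extraction logic may differ."""
    match = re.search(r"```[A-Za-z0-9_+-]*\n(.*?)```", text, re.DOTALL)
    return match.group(1) if match else text
```

Falling back to the raw text when no fence is present keeps the pipeline running even when a model ignores formatting instructions entirely.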