Generate synthetic query-citation pairs from a knowledge base using Claude CLI for RAG evaluation. Uses a two-call pipeline (Sonnet for queries, Haiku for citations) with exact span validation. Includes quality analysis and an iterative prompt improvement loop.
```bash
# 1. Install requirements
brew install jq
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Generate synthetic data (sample 10 files)
./generate.sh --kb-path ./kb --queries-path ./queries/queries.json --sample 10

# 3. Analyze quality against real queries
./analysis.sh

# 4. Improve prompt based on analysis
./improve_prompt.sh

# 5. Or run the full automated pipeline
./run_pipeline.sh --kb-path ./kb --queries-path ./queries/queries.json

# 6. (Optional) Upload to LangSmith
cp example.env .env
# Edit .env with your LANGSMITH_API_KEY
./upload_langsmith.sh
```

| Tool | Installation | Purpose |
|---|---|---|
| Claude CLI | Download from https://claude.ai/download | Generate queries & extract citations |
| jq | `brew install jq` | JSON processing |
| uv | `curl -LsSf https://astral.sh/uv/install.sh \| sh` | Python package manager |
Python dependencies (langsmith) are auto-managed by uv - no manual install needed.
The generation pipeline uses two Claude calls per KB file:
```
KB Directory ──[--sample N]──→ N random files
                    │
               ┌────┴────┐
               │ Phase 1 │  Call 1 per file (Sonnet, parallel ×20)
               │         │  prompt.md → generates diverse queries + metadata
               └────┬────┘
                    │
               ┌────┴────┐
               │ Phase 2 │  Call 2 per file (Haiku, parallel ×20)
               │         │  citation_prompt.md → extracts verbatim citations
               └────┬────┘
                    │
               ┌────┴────┐
               │ Phase 3 │  Merge + Validate (Python)
               │         │  exact match, compute start_index/end_index
               └────┬────┘
                    │
                    ├──→ output.jsonl   (valid pairs with spans)
                    └──→ rejected.jsonl (invalid pairs with reasons)
```
Why two calls?
- Call 1 (Sonnet) focuses on generating realistic, diverse queries — typos, multilingual, varied tone/style — without the constraint of finding exact citations
- Call 2 (Haiku) focuses purely on finding exact verbatim text in the document, which is a simpler task suited for a faster/cheaper model
- This separation improves both query diversity and citation accuracy
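Phase 3's exact-match and span computation can be sketched roughly as follows. This is a minimal illustration, not the actual helpers/validate.py; the whitespace-normalization fallback in particular is an assumption about how minor formatting differences might be handled.

```python
# Sketch of exact-citation validation and span computation (illustrative only;
# the real logic lives in helpers/validate.py and may differ in its fallback).
import re


def validate_citation(source: str, citation: str):
    """Return (start_index, end_index) if the citation is found in source, else None."""
    start = source.find(citation)
    if start != -1:
        return start, start + len(citation)

    # Assumed fallback: tolerate whitespace differences by matching the
    # citation's tokens separated by any run of whitespace.
    pattern = r"\s+".join(re.escape(token) for token in citation.split())
    match = re.search(pattern, source)
    if match:
        return match.start(), match.end()

    return None  # pair would be rejected with reason "citation_not_found"
```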
```bash
./generate.sh --kb-path ./kb --queries-path ./queries/queries.json --sample 8
```

CLI flags:

| Flag | Description | Default |
|---|---|---|
| `--kb-path <dir>` | Knowledge base directory | prompt user |
| `--queries-path <file>` | Real user queries for style reference | prompt user |
| `--sample N` | Randomly select N files from KB | all files |
| `--no-resume` | Start fresh, ignore saved state | resume if available |
Environment variables:
| Variable | Description | Default |
|---|---|---|
| `MAX_PARALLEL` | Max concurrent Claude calls (see the sketch below) | 20 |
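The bounded parallelism that generate.sh applies can be pictured with the pattern below. This is an illustrative Python sketch, not the shell implementation; the `claude -p ... --model sonnet` invocation and the prompt wiring are simplified assumptions.

```python
# Illustrative bounded-parallelism pattern for Phase 1 (assumptions: generate.sh
# itself handles this in shell, and prompt assembly is more involved than shown).
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = int(os.environ.get("MAX_PARALLEL", "20"))


def generate_queries(kb_file: str) -> str:
    """Run one Phase 1 call for a single KB file and return Claude's raw output."""
    prompt = open("prompt.md", encoding="utf-8").read()
    document = open(kb_file, encoding="utf-8").read()
    result = subprocess.run(
        ["claude", "-p", f"{prompt}\n\n{document}", "--model", "sonnet"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


kb_files = ["kb/refund-policy.md", "kb/pricing.md"]  # example inputs
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    outputs = list(pool.map(generate_queries, kb_files))
```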
Features:
- Two-phase parallel generation (Sonnet → Haiku)
- Exact citation validation with character span computation
- Question deduplication (95% fuzzy match)
- Resume capability if interrupted
- Rejected pairs logged with reasons
Output:
- `output.jsonl` - Valid query-citation pairs with spans
- `rejected.jsonl` - Failed validations with reasons
```bash
./analysis.sh
```

The script will interactively prompt for:
- Synthetic data path (required): Path to `output.jsonl`
- Real queries path (required): Path to real production queries JSON
What it does:
- Sends both datasets to Claude CLI for comparison
- Scores 8 dimensions (0-100%) with letter grades:
  - Language distribution
  - Typos & messiness
  - Query length & complexity
  - Topic coverage (2x weight)
  - Intent & behavior (2x weight)
  - Tone & formality
  - Formatting artifacts
  - Question style
- Calculates a weighted overall similarity score (see the sketch after this list)
- Provides actionable recommendations
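The overall score is a weighted average of the eight dimensions, with topic coverage and intent & behavior counted twice. A toy illustration follows; the per-dimension scores are made up, and the exact weighting inside analysis.sh is an assumption:

```python
# Toy weighted-average computation; example scores are fabricated and the
# 2x/1x weighting mirrors the list above rather than analysis.sh internals.
scores = {
    "language_distribution": 90,
    "typos_messiness": 70,
    "query_length_complexity": 85,
    "topic_coverage": 80,    # 2x weight
    "intent_behavior": 75,   # 2x weight
    "tone_formality": 88,
    "formatting_artifacts": 92,
    "question_style": 84,
}
weights = {dim: 2 if dim in ("topic_coverage", "intent_behavior") else 1 for dim in scores}
overall = sum(scores[d] * weights[d] for d in scores) / sum(weights.values())
print(f"Overall similarity: {overall:.1f}%")
```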
Output:
- `analysis_report.md` - Full report with scores, examples, and recommendations
- Summary table displayed in terminal
```bash
./improve_prompt.sh
```

The script will interactively prompt for:
- Real queries path (required): Path to real production queries JSON
What it does:
- Reads the current `prompt.md` and `analysis_report.md`
- Samples 15 diverse real queries as style reference
- Sends everything to Claude CLI to generate an improved prompt
- Shows a diff of proposed changes
- Asks for approval before applying
Before applying changes, it backs up the current version:
```
backups/
├── v1/
│   ├── prompt.md            # The prompt used
│   ├── output.jsonl         # The synthetic data generated
│   ├── analysis_report.md   # The analysis that triggered changes
│   └── metadata.json        # Run metadata (score, timestamp, counts)
├── v2/
│   └── ...
```
The iterative improvement loop:
```
./generate.sh  →  ./analysis.sh  →  ./improve_prompt.sh  →  (repeat)
      ↓                  ↓                     ↓
 output.jsonl    analysis_report.md    prompt.md (improved)
```
Each iteration should produce a higher analysis score as the prompt gets refined.
```bash
./run_pipeline.sh
```

Automates the full feedback loop: generate -> analyze -> (improve -> generate -> analyze) until the target score is reached or max iterations are exhausted (see the conceptual sketch after the flags table).
CLI flags (skip interactive prompts):
| Flag | Description | Default |
|---|---|---|
| `--kb-path <dir>` | Knowledge base directory | prompt user |
| `--queries-path <file>` | Real production queries JSON | prompt user |
| `--max-iterations <n>` | Maximum iteration count | 10 |
| `--target-score <n>` | Target similarity score (0-100) | 85 |
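Conceptually, the loop looks like the Python sketch below. This is not the script's actual implementation: how run_pipeline.sh extracts the overall score from analysis_report.md is an assumption, and the real scripts prompt interactively unless flags are passed.

```python
# Conceptual sketch of the automated feedback loop (assumed control flow; the
# score-parsing helper is hypothetical, not part of the repository).
import re
import subprocess

TARGET_SCORE, MAX_ITERATIONS = 85, 10
GENERATE = ["./generate.sh", "--kb-path", "./kb", "--queries-path", "./queries/queries.json"]


def read_overall_score(path: str = "analysis_report.md") -> float:
    """Hypothetical: pull the first percentage-looking number out of the report."""
    text = open(path, encoding="utf-8").read()
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(match.group(1)) if match else 0.0


subprocess.run(GENERATE, check=True)
subprocess.run(["./analysis.sh"], check=True)

for _ in range(MAX_ITERATIONS):
    if read_overall_score() >= TARGET_SCORE:
        break
    subprocess.run(["./improve_prompt.sh"], check=True)
    subprocess.run(GENERATE, check=True)
    subprocess.run(["./analysis.sh"], check=True)
```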
Example (fully non-interactive):
```bash
./run_pipeline.sh \
  --kb-path ./kb \
  --queries-path ./queries/queries.json \
  --max-iterations 5 \
  --target-score 70
```

```bash
# Setup credentials
cp example.env .env
# Edit .env and add your LANGSMITH_API_KEY

# Upload
./upload_langsmith.sh
```

Get your API key from: https://smith.langchain.com/settings
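Under the hood, the upload is handled by helpers/upload_langsmith.py. A minimal sketch of what such an upload can look like with the langsmith SDK is shown below; the default dataset name and the input/output field mapping are assumptions, not the helper's exact logic.

```python
# Hypothetical upload sketch using the langsmith SDK; field mapping and the
# default dataset name are assumptions about helpers/upload_langsmith.py.
import json
import os

from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY / LANGSMITH_ENDPOINT from the environment

dataset = client.create_dataset(
    dataset_name=os.environ.get("LANGSMITH_DATASET_NAME", "synthetic-rag-eval")
)

with open("output.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

client.create_examples(
    inputs=[{"query": r["query"]} for r in records],
    outputs=[{"citation": r["citation"], "doc_id": r["doc_id"]} for r in records],
    dataset_id=dataset.id,
)
```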
```
├── generate.sh              # Two-call generation pipeline
├── prompt.md                # Call 1 prompt — generate queries only (Sonnet)
├── citation_prompt.md       # Call 2 prompt — extract verbatim citations (Haiku)
├── analysis.sh              # Analyze quality against real queries
├── improve_prompt.sh        # Improve prompt based on analysis
├── run_pipeline.sh          # Automated feedback loop
├── upload_langsmith.sh      # Upload to LangSmith
├── CLAUDE.md                # Claude instructions
├── pyproject.toml           # Python dependencies
├── example.env              # Environment template
├── helpers/
│   ├── validate.py          # Exact citation matching & span computation
│   └── upload_langsmith.py  # LangSmith upload logic
├── backups/                 # Version history (auto-created)
│   └── v1/
│       ├── prompt.md
│       ├── output.jsonl
│       ├── analysis_report.md
│       └── metadata.json
├── kb/                      # Your knowledge base (markdown files)
└── queries/                 # Real user queries
    └── queries.json
```
Place markdown files in your KB directory. Two formats supported:
With YAML frontmatter:
```markdown
---
url: https://example.com/docs/page
title: Page Title
---

# Content

Your content here...
```

Plain markdown:

```markdown
# Content

Just content, no metadata...
```

Real user queries file (`queries/queries.json`):

```json
{
  "metadata": {
    "total_queries": 10,
    "type": "Valid Queries"
  },
  "queries": [
    {"query": "How do I reset my password?", "topic": "account"},
    {"query": "What are the pricing tiers?", "topic": "billing"}
  ]
}
```

Each line in `output.jsonl`:
```json
{
  "query": "whats ur refund polcy?",
  "doc_id": "refund-policy.md",
  "citation": "Refunds are available within **30 days** of purchase.",
  "start_index": 1842,
  "end_index": 1895,
  "category": "billing",
  "subcategory": "refunds",
  "chunks": ["Full paragraph containing the citation..."],
  "source": ["https://example.com/docs/billing"],
  "query_metadata": {
    "language": "en",
    "has_typos": true,
    "tone": "casual",
    "style": "question"
  }
}
```

| Field | Type | Description |
|---|---|---|
| query | string | Realistic user input (may contain typos, non-English, etc.) |
| doc_id | string | KB filename the citation comes from |
| citation | string | Exact verbatim text from source document |
| start_index | int | Character offset where citation begins in source |
| end_index | int | Character offset where citation ends in source |
| category | string | Topic category |
| subcategory | string? | More specific classification |
| chunks | string[] | Broader passages containing the citation |
| source | string[] | Source URLs or filenames |
| query_metadata | object | Language, typos, tone, style metadata |
Span verification: source_content[start_index:end_index] == citation
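This invariant can be checked over a generated file with a few lines of Python; the assumption here is that doc_id names a file under the kb/ directory, matching the layout described above:

```python
# Verify that each record's character span reproduces its citation exactly.
# Assumes doc_id is a filename in ./kb (an assumption based on the repo layout).
import json
from pathlib import Path

for line in Path("output.jsonl").read_text(encoding="utf-8").splitlines():
    rec = json.loads(line)
    source = (Path("kb") / rec["doc_id"]).read_text(encoding="utf-8")
    span = source[rec["start_index"]:rec["end_index"]]
    assert span == rec["citation"], f"span mismatch in {rec['doc_id']}"
print("All spans verified.")
```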
Claude automatically decides based on content length:
- Short pages: 3-5 queries
- Medium pages: 5-8 queries
- Long pages: 8-10 queries
- Maximum: 10 per page
| Check | Method | Action if failed |
|---|---|---|
| Citation match | Exact `str.find()` + whitespace normalization fallback | Rejected (`citation_not_found`) |
| Null citation | Citation returned as null by Haiku | Rejected (`citation_null`) |
| Question duplicate | 95% fuzzy match | Rejected (`duplicate_question`) |
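The 95% fuzzy-match deduplication can be approximated with the standard library. Whether generate.sh uses difflib or a different similarity measure is an assumption; the sketch only illustrates the threshold check:

```python
# Approximate duplicate detection at a 0.95 similarity threshold (illustrative;
# the pipeline's actual fuzzy matcher may differ).
from difflib import SequenceMatcher


def is_duplicate(question: str, seen: list, threshold: float = 0.95) -> bool:
    q = question.lower().strip()
    return any(SequenceMatcher(None, q, s.lower().strip()).ratio() >= threshold for s in seen)


seen = ["How do I reset my password?"]
print(is_duplicate("How do i reset my password ?", seen))  # True -> duplicate_question
```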
If interrupted, the script saves progress to .generation_state.json. On restart:
```
Found previous session with 3/15 files processed. Resume? [Y/n]
```
- Edit `prompt.md` to customize query generation (types, tone, language distribution)
- Edit `citation_prompt.md` to customize citation extraction rules
```bash
# .env file
LANGSMITH_API_KEY=your_api_key
LANGSMITH_ENDPOINT=https://api.smith.langchain.com  # Optional
LANGSMITH_DATASET_NAME=my-dataset                   # Optional
```

```bash
# Shell environment
MAX_PARALLEL=10 ./generate.sh --kb-path ./kb  # Limit concurrent calls
```

If the Claude CLI, jq, or uv is missing:
- Claude CLI: install from https://claude.ai/download
- jq: `brew install jq`
- uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`

Check `rejected.jsonl` for `citation_not_found` entries. This means Haiku's extracted text didn't exactly match the source. The whitespace normalization fallback handles minor differences, but paraphrased citations are rejected by design.
Delete .generation_state.json to start fresh:
```bash
rm .generation_state.json
```

Or use the `--no-resume` flag.