# Skill: Generate Domain Questions

Generate high-quality multiple-choice questions for the Knowledge Mapper application.

## When to Use

Use this skill when asked to generate or regenerate questions for a domain (e.g., "generate questions for biology", "regenerate the physics question set").

## Context

Knowledge Mapper is a GP-based knowledge estimation app. Users answer multiple-choice questions positioned on a 2D map of Wikipedia articles. Question quality directly impacts the usefulness of knowledge estimation.

### Current Problems with Questions
- Questions are too long and verbose (avg ~190 chars, options up to 1000 chars)
- Many questions can be answered by logic alone rather than actual knowledge
- Difficulty levels don't clearly separate vocabulary knowledge from deep understanding
- Distractors are often obviously wrong or implausibly long compared to the correct answer

### Target Question Format

```json
{
  "id": "<16-char hex>",
  "question_text": "...",
  "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
  "correct_answer": "A",
  "difficulty": 3,
  "x": 0.224806,
  "y": 0.56408,
  "z": 0.0,
  "source_article": "photosynthesis",
  "domain_ids": ["biology"],
  "concepts_tested": ["photosynthesis", "cellular respiration"]
}
```

### Length Targets
- **Question text**: 50-100 words (roughly 250-600 characters). Concise but specific.
- **Each answer option**: 25-50 words (roughly 125-300 characters). All four options should be similar in length and style.
- **All options must be plausible** to someone without domain expertise.
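
The word-count targets above can be checked mechanically before a question is accepted. A minimal sketch (the helper name is illustrative, not part of the app):

```python
def check_lengths(question_text, options):
    """Flag violations of the length targets: 50-100 words for the
    question text, 25-50 words per answer option."""
    issues = []
    q_words = len(question_text.split())
    if not 50 <= q_words <= 100:
        issues.append(f"question is {q_words} words (target 50-100)")
    for slot, text in options.items():
        words = len(text.split())
        if not 25 <= words <= 50:
            issues.append(f"option {slot} is {words} words (target 25-50)")
    return issues
```

Running this on every draft and regenerating until it returns an empty list keeps the set within the targets without manual counting.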

### Difficulty Levels (1-4)

Assign each of the 50 concepts a difficulty level (roughly equal distribution: ~12-13 per level):

- **Level 1 — High-level vocabulary**: Can someone identify what this concept IS? Tests recognition of major terms and their basic definitions. Someone who has heard of the field can likely answer these.
- **Level 2 — Low-level vocabulary**: Can someone identify specific technical terms, sub-components, or named results within this concept? Tests familiarity with the detailed terminology that practitioners use.
- **Level 3 — Basic working knowledge**: Can someone apply or reason about this concept? Tests understanding that goes beyond definitions — requires knowing how things relate, why they matter, or what happens when you combine them. Cannot be answered through logic alone or rote memorization alone.
- **Level 4 — Deep knowledge**: Can someone handle nuance, edge cases, or cross-cutting implications of this concept? Tests expert-level understanding — subtle distinctions, historical context of discoveries, common misconceptions among practitioners, or non-obvious connections to other concepts. Cannot be answered through logic alone or rote memorization alone.

### Critical Quality Rules

1. **Logic-proof**: A smart person with NO domain knowledge should NOT be able to answer correctly through reasoning alone. Avoid options that are self-contradictory, obviously absurd, or eliminable by logic.
2. **Uniform option length**: All 4 options for a given question must be approximately the same length (within ~30% of each other). The correct answer must NOT be systematically longer or shorter.
3. **Uniform option style**: All 4 options should use the same grammatical structure, level of specificity, and tone. No option should stand out stylistically.
4. **Plausible distractors**: Each distractor must sound reasonable to a non-expert. It should contain real terminology from the domain, not made-up terms.
5. **LaTeX formatting**: Use `$...$` for inline math expressions. Escape dollar signs in non-math contexts. Use proper LaTeX for equations, variables, and mathematical notation.
6. **No giveaways**: Avoid "all of the above", "none of the above", absolute qualifiers ("always", "never") that signal incorrectness, or hedging language ("sometimes", "may") that signals correctness.
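
Rules 2 and 6 are the easiest to verify automatically. A rough sketch (the 30% tolerance is interpreted here as max length no more than 1.3x the min; the giveaway list is a crude substring check, not exhaustive):

```python
def options_uniform(options, tolerance=0.30):
    """Rule 2: all four option lengths within ~30% of each other,
    taken as max length <= (1 + tolerance) * min length."""
    lengths = [len(text) for text in options.values()]
    return max(lengths) <= (1 + tolerance) * min(lengths)

# Illustrative, partial list of phrases that rule 6 forbids.
GIVEAWAYS = ("all of the above", "none of the above", "always", "never")

def has_giveaway(options):
    """Rule 6: flag options containing common giveaway phrases."""
    return any(g in text.lower() for text in options.values() for g in GIVEAWAYS)
```

Rules 1, 3, and 4 (logic-proofness, style, plausibility) still require judgment and cannot be reduced to string checks.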

## Procedure

IMPORTANT: Use the TodoWrite tool throughout this entire process to track every step and every question. This allows resuming if context runs out.

### Phase 1: Concept Generation

**Step 1.1**: Create the master todo list:

```
TodoWrite([
  { content: "Phase 1: Generate 50 core concepts for {domain}", status: "in_progress", activeForm: "Generating core concepts" },
  { content: "Phase 2: Curate and deduplicate concept list", status: "pending", activeForm: "Curating concept list" },
  { content: "Phase 3: Generate questions (0/50 complete)", status: "pending", activeForm: "Generating questions" },
  { content: "Phase 4: Quality review and validation", status: "pending", activeForm: "Reviewing question quality" },
  { content: "Phase 5: Assemble domain JSON file", status: "pending", activeForm: "Assembling domain JSON" },
])
```

**Step 1.2**: Generate 60 candidate concepts that are central to the domain. For each concept, note:
- The concept name
- A 1-sentence description of why it's central to this domain
- The most relevant Wikipedia article title

**Step 1.3**: Write the candidate list to a working file: `data/domains/.working/{domain-id}-concepts.json`

### Phase 2: Concept Curation

**Step 2.1**: Review the 60 candidates. Remove:
- Duplicates or near-duplicates (e.g., "DNA replication" and "replication of DNA")
- Concepts that are too broad (e.g., "science") or too narrow (e.g., "Figure 3 in Smith et al. 2019")
- Concepts that heavily overlap (keep the more central one)

**Step 2.2**: Rank remaining concepts by centrality to the domain. Keep the top 50.

**Step 2.3**: Assign difficulty levels to the 50 concepts:
- ~12-13 concepts at Level 1 (high-level vocabulary)
- ~12-13 concepts at Level 2 (low-level vocabulary)
- ~12-13 concepts at Level 3 (basic working knowledge)
- ~12-13 concepts at Level 4 (deep knowledge)

Level assignment should reflect how specialized the concept is, NOT how hard the question will be to write. Central, well-known concepts get Level 1-2. Specialized, nuanced concepts get Level 3-4.

**Step 2.4**: Update the working file with the curated, ranked, leveled list.

### Phase 3: Question Generation (per concept)

For EACH of the 50 concepts, follow this sub-procedure. Update TodoWrite after EVERY question:

**Step 3.1 — Research**: Use WebFetch to read the Wikipedia article for this concept:
```
WebFetch({ url: "https://en.wikipedia.org/wiki/{article_title}", prompt: "Summarize the key facts, definitions, relationships, and nuances of {concept}. Focus on what distinguishes expert knowledge from surface knowledge." })
```

**Step 3.2 — Generate question**: Based on the Wikipedia content and the assigned difficulty level, write the question text. Follow the level definitions:
- Level 1: Test recognition of what this concept IS
- Level 2: Test knowledge of specific technical terms within the concept
- Level 3: Test ability to reason about or apply the concept
- Level 4: Test expert-level nuance, edge cases, or cross-cutting connections

The question must be 50-100 words. It must be impossible to answer through logic alone.

**Step 3.3 — Generate correct answer**: Write the correct answer (25-50 words). It must be factually accurate per the Wikipedia article.

**Step 3.4 — Generate distractors**: Generate 3 incorrect options. For EACH distractor:
1. Start from the correct answer
2. Change ONE specific factual claim to make it incorrect
3. Verify the distractor:
   - Uses real domain terminology (not made-up words)
   - Is approximately the same length as the correct answer (within ~30%)
   - Uses the same grammatical structure as the correct answer
   - Would sound plausible to a non-expert
   - Cannot be eliminated through logic alone
4. If the distractor fails any check, regenerate it

**Step 3.5 — Assign option slots**: Randomly assign the correct answer and 3 distractors to slots A, B, C, D. Use a different random arrangement for each question (do NOT always put the correct answer in slot A). Record which slot contains the correct answer.
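
The slot assignment in Step 3.5 can be done with a simple shuffle. A sketch (the helper name is illustrative):

```python
import random

def assign_slots(correct, distractors, rng=random):
    """Shuffle the correct answer and three distractors into slots A-D
    and record which slot holds the correct answer."""
    texts = [correct] + list(distractors)
    rng.shuffle(texts)
    options = dict(zip("ABCD", texts))
    correct_slot = next(slot for slot, text in options.items() if text == correct)
    return options, correct_slot
```

Using a fresh shuffle per question (rather than cycling A, B, C, D) avoids any detectable pattern in correct-answer placement; Phase 4 then rebalances if the random draw happens to skew.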

**Step 3.6 — Self-check**: Before finalizing, verify:
- [ ] Question is 50-100 words
- [ ] Each option is 25-50 words
- [ ] All options are within ~30% length of each other
- [ ] No option is eliminable through logic alone
- [ ] LaTeX is properly formatted (if applicable)
- [ ] The correct answer is factually accurate
- [ ] Each distractor contains exactly one changed factual claim
- [ ] Correct answer slot varies across questions

**Step 3.7 — Update progress**: Update the TodoWrite with current progress:
```
{ content: "Phase 3: Generate questions ({N}/50 complete)", ... }
```

Also write each completed question to the working file incrementally:
`data/domains/.working/{domain-id}-questions.json`

### Phase 4: Quality Review

**Step 4.1**: Read through ALL 50 questions and check for:
- Option length uniformity within each question
- Correct answer position distribution (should be ~even across A/B/C/D)
- No repeated patterns in distractor construction
- LaTeX consistency
- No two questions testing the exact same knowledge

**Step 4.2**: Fix any issues found. Log fixes in TodoWrite.

**Step 4.3**: Verify the correct answer position distribution:
- Count how many times each slot (A/B/C/D) contains the correct answer
- If any slot has fewer than 9 or more than 16, reassign some correct answer slots to balance
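
The distribution check in Step 4.3 is mechanical. A sketch against the question format defined above (helper names are illustrative):

```python
from collections import Counter

def slot_counts(questions):
    """Count how often each slot holds the correct answer."""
    counts = Counter(q["correct_answer"] for q in questions)
    return {slot: counts.get(slot, 0) for slot in "ABCD"}

def is_balanced(counts, low=9, high=16):
    """Apply the 9-16 per-slot bounds for a 50-question set."""
    return all(low <= counts[slot] <= high for slot in "ABCD")
```

If `is_balanced` fails, swap the correct answer into a different slot for a few questions drawn from the over-represented slot.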

### Phase 5: Assembly

**Step 5.1**: Read the existing domain file to preserve the `domain`, `labels`, and `articles` sections:
```
Read({ file_path: "data/domains/{domain-id}.json" })
```

**Step 5.2**: Determine spatial coordinates for each question. Each question needs `x`, `y` coordinates within the domain's region (from index.json). Use the source article's coordinates if available in the articles array, otherwise distribute evenly within the region.

**Step 5.3**: Generate a unique 16-character hex ID for each question:
```python
import hashlib

# `question_id` avoids shadowing the built-in `id`; MD5 is used only for a stable, deterministic ID, not for security.
question_id = hashlib.md5(f"{domain_id}:{concept}:{question_text[:50]}".encode()).hexdigest()[:16]
```

**Step 5.4**: Assemble the complete domain JSON file with the new questions replacing the old ones. Write to `data/domains/{domain-id}.json`.

**Step 5.5**: Update `data/domains/all.json` — replace the old questions for this domain with the new ones, preserving questions from other domains.
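
A sketch of the Step 5.5 merge. This assumes `all.json` holds a top-level `"questions"` list whose entries carry `"domain_ids"` as in the question format above; if the real schema differs, adapt the filter accordingly:

```python
import json

def merge_domain_questions(all_path, domain_id, new_questions):
    """Replace one domain's questions in all.json while keeping every
    other domain's questions. Schema assumption: a top-level "questions"
    list of objects with a "domain_ids" array."""
    with open(all_path) as f:
        data = json.load(f)
    # Drop any question tagged with this domain; keep the rest untouched.
    kept = [q for q in data["questions"] if domain_id not in q.get("domain_ids", [])]
    data["questions"] = kept + list(new_questions)
    with open(all_path, "w") as f:
        json.dump(data, f, indent=2)
```

Note that a question tagged with multiple `domain_ids` is treated as belonging to this domain and replaced; if cross-domain questions exist, handle them explicitly.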

**Step 5.6**: Update `data/domains/index.json` — update the `question_count` for this domain if it changed.

**Step 5.7**: Clean up working files in `data/domains/.working/`.

## Important Notes

- **Checkpointing**: Write progress to `data/domains/.working/` after EVERY question. If context runs out, the next agent can resume from the working files.
- **Model quality**: This skill should be run with Claude Opus for highest question quality. Do NOT use a smaller model.
- **One domain at a time**: Generate questions for one domain per invocation. The caller can invoke this skill in parallel for multiple domains.
- **Preserve coordinates**: If the existing questions have well-placed coordinates near their source articles, try to reuse those coordinates for questions about the same articles.
- **TodoWrite is mandatory**: Every phase transition and every completed question MUST be reflected in TodoWrite. This is non-negotiable for resumability.