# PROJECT KNOWLEDGE BASE

**Generated:** 2026-02-16 13:54 UTC
**Commit:** d3cc534
**Branch:** main

## OVERVIEW

Wikipedia Knowledge Map: a distributed GPU pipeline that generates semantic embeddings for 250K Wikipedia articles, projects them to 2D via UMAP, enriches cells with LLM-generated labels and questions across 5 difficulty levels, and serves an interactive adaptive quiz visualization. Stack: Python (sentence-transformers, UMAP, OpenAI Batch API, TensorFlow 2.19) + vanilla HTML/JS frontend.

## STRUCTURE

```
mapper/
├── index.html                       # 3591-line monolithic visualization (KaTeX, adaptive quiz UI)
├── adaptive_sampler_multilevel.js   # RBF-based adaptive testing algorithm (MultiLevelAdaptiveSampler)
├── run_full_pipeline.sh             # Shell orchestrator: L4→L3→L2 simplification + merge
├── scripts/                         # Core pipeline (see scripts/AGENTS.md)
│   ├── run_full_pipeline.py         # Python orchestrator: 9-step pipeline with idempotency
│   ├── generate_level_n.py          # Hierarchical article expansion (1272 lines, most complex)
│   ├── utils/                       # Shared API/embedding/Wikipedia utilities (see scripts/utils/AGENTS.md)
│   ├── tests/                       # Module tests + benchmarks (pytest)
│   └── diagnostics/                 # Pipeline verification scripts
├── tests/                           # Root-level integration tests (embedding dims, UMAP, model loading)
├── notes/                           # 63 session notes, implementation plans, troubleshooting guides
├── data/benchmarks/                 # Batch size benchmark results
├── embeddings/                      # Generated .pkl embedding files (gitignored, multi-GB)
├── backups/                         # Data backups
└── logos/                           # Logo assets
```

## WHERE TO LOOK

| Task | Location | Notes |
|------|----------|-------|
| Run full pipeline | `scripts/run_full_pipeline.py` | Idempotent; use `--force` flags to rerun steps |
| Simplify questions by level | `scripts/simplify_questions.py --level N` | Levels 2-4 need simplification |
| Shell pipeline (simplify + merge) | `run_full_pipeline.sh` | Only runs simplification + merge, not full generation |
| Generate new difficulty level | `scripts/generate_level_n.py --level N` | LLM-based article suggestion + embedding + questions |
| Heatmap labels (GPT-5-nano) | `scripts/generate_heatmap_labels_gpt5.py` | Uses OpenAI Batch API |
| Heatmap labels (LM Studio) | `scripts/generate_heatmap_labels.py` | Local LLM alternative |
| UMAP rebuild | `scripts/rebuild_umap.py` | 30-60 min for 250K articles |
| Merge levels into final JSON | `scripts/merge_multi_level_data.py` | Deduplicates articles, merges questions by cell |
| OpenAI API integration | `scripts/utils/api_utils.py` | Key in `.credentials/openai.key` |
| Batch API helpers | `scripts/utils/openai_batch.py` | `batch_with_cache()` for cached LLM calls |
| Wikipedia download | `scripts/utils/wikipedia_utils.py` | `download_articles_batch()` |
| Distributed GPU embeddings | `scripts/utils/sync_and_merge_embeddings.py` | SSH/SFTP via paramiko to tensor01/tensor02 |
| Frontend visualization | `index.html` | Served via `python -m http.server 8000` |
| Adaptive sampling logic | `adaptive_sampler_multilevel.js` | `MultiLevelAdaptiveSampler` class |
| Debug pipeline issues | `scripts/diagnostics/diagnose_pipeline.py` | General pipeline diagnostics |
| Verify labels | `scripts/diagnostics/verify_cell_labels.py` | Label quality checks |

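The name `batch_with_cache()` suggests a hash-and-reuse pattern around Batch API calls. A minimal sketch of that idea, assuming a content-hash cache keyed on the request payload (the real signature, cache layout, and submission logic in `openai_batch.py` may differ; `submit_fn` here stands in for the actual Batch API round trip):

```python
import hashlib
import json
from pathlib import Path

def batch_with_cache(requests, submit_fn, cache_dir="batch_cache"):
    """Sketch: return responses for `requests`, calling `submit_fn`
    only for requests whose hash is not already cached on disk."""
    cache = Path(cache_dir)
    cache.mkdir(exist_ok=True)

    def key_of(req):
        return hashlib.sha256(json.dumps(req, sort_keys=True).encode()).hexdigest()

    results, misses = {}, []
    for req in requests:
        key = key_of(req)
        hit = cache / f"{key}.json"
        if hit.exists():
            results[key] = json.loads(hit.read_text())  # cache hit: no API call
        else:
            misses.append((key, req))

    if misses:
        # One batched round trip for all uncached requests.
        fresh = submit_fn([req for _, req in misses])
        for (key, _), resp in zip(misses, fresh):
            (cache / f"{key}.json").write_text(json.dumps(resp))
            results[key] = resp

    return [results[key_of(req)] for req in requests]
```

Each unique request is submitted at most once; rerunning a pipeline step hits the on-disk cache instead of the Batch API.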
## DATA FLOW

```
wikipedia.pkl (250K articles, 752MB, gitignored)
    ↓ rebuild_umap.py
umap_coords.pkl + umap_reducer.pkl + umap_bounds.pkl
    ↓ find_optimal_coverage_rectangle.py
optimal_rectangle.json
    ↓ export_wikipedia_articles.py
wikipedia_articles_level_0.json
    ↓ generate_heatmap_labels_gpt5.py (OpenAI Batch)
heatmap_cell_labels.json (1,521 labels)
    ↓ generate_level_n.py --level 0 (concepts + questions)
level_0_concepts.json + cell_questions_level_0.json
    ↓ generate_level_n.py --level 1..4 (broader articles + questions)
cell_questions_level_{1..4}.json + wikipedia_articles_level_{1..4}.json
    ↓ simplify_questions.py --level {2,3,4}
cell_questions_level_{2,3,4}_simplified.json
    ↓ merge_multi_level_data.py
wikipedia_articles.json + cell_questions.json → consumed by index.html
```
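
The final merge step ("deduplicates articles, merges questions by cell") can be sketched like this. This is an illustration of the described behavior, not the actual `merge_multi_level_data.py` code; the field names `title` and `cell` are assumptions:

```python
def merge_levels(article_lists, question_lists):
    """Sketch: deduplicate articles across levels by title (first
    occurrence wins) and group questions from all levels by grid cell."""
    seen, articles = set(), []
    for level in article_lists:
        for art in level:
            if art["title"] not in seen:
                seen.add(art["title"])
                articles.append(art)

    questions_by_cell = {}
    for level in question_lists:
        for q in level:
            questions_by_cell.setdefault(q["cell"], []).append(q)

    return articles, questions_by_cell
```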

## CONVENTIONS

- **Credentials**: `.credentials/` directory (gitignored). `openai.key` for API, `tensor01.txt`/`tensor02.txt` for clusters.
- **API key validation**: Must start with `sk-` (enforced in `api_utils.py`).
- **macOS env vars**: Scripts set `TOKENIZERS_PARALLELISM=false`, `OMP_NUM_THREADS=1`, `MKL_NUM_THREADS=1` to prevent Metal threading issues.
- **TensorFlow pinned**: `tensorflow==2.19.0` — 2.20 has a macOS mutex blocking bug.
- **Imports**: Scripts in `scripts/` use `sys.path.append(str(Path(__file__).parent.parent))` to import from `scripts.utils.*`.
- **LLM model**: Pipeline uses `gpt-5-nano` via the OpenAI Batch API (cost: ~$0.50/level).
- **Embedding model**: `Qwen/Qwen3-Embedding-0.6B` (distributed) or `google/embeddinggemma-300m` (local).
- **JSON data in root**: Pipeline outputs (`cell_questions*.json`, `wikipedia_articles.json`, etc.) live in the project root, not `data/`.
- **Two pipeline scripts**: `run_full_pipeline.sh` (shell, simplification only) vs `scripts/run_full_pipeline.py` (Python, full 9-step pipeline). Use the Python one.
- **Idempotency**: The Python pipeline checks for existing outputs and skips completed steps. Use `--force` to rerun.
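
Taken together, the macOS and import conventions above amount to a boilerplate header at the top of a `scripts/` entry point, roughly like this (a sketch; the key point is that the env vars must be set before importing tokenizers or TensorFlow):

```python
import os
import sys
from pathlib import Path

# Set threading env vars BEFORE importing tokenizers/TensorFlow,
# to prevent Metal threading issues on macOS.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# Make `scripts.utils.*` importable when run as a standalone script:
# scripts/<this_file>.py -> append the project root to sys.path.
sys.path.append(str(Path(__file__).parent.parent))
```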

## ANTI-PATTERNS (THIS PROJECT)

- **Never commit `.pkl` files** — multi-GB embedding/UMAP files are gitignored. `remove_large_files_from_history.sh` exists to clean up mistakes.
- **Never commit `.credentials/`** — API keys, cluster passwords.
- **Never use TensorFlow >= 2.20** — macOS mutex blocking error.
- **Never mock in tests** — project policy requires real API calls, real models, real I/O (see CLAUDE.md instructions).
- **Never use `generate_heatmap_labels.py` in production** — use `generate_heatmap_labels_gpt5.py` (Batch API) instead.
- **Never run `build_wikipedia_knowledge_map.py`** — legacy (nvidia/nemotron). Use `build_wikipedia_knowledge_map_v2.py` or the pipeline.

## COMMANDS

```bash
# Full pipeline (idempotent, skips completed steps)
python scripts/run_full_pipeline.py

# Force rerun everything
python scripts/run_full_pipeline.py --force

# Simplification-only pipeline
./run_full_pipeline.sh

# Single level generation
python scripts/generate_level_n.py --level 2

# Simplify specific level
python scripts/simplify_questions.py --level 4

# Serve visualization
python -m http.server 8000

# Run tests
pytest tests/ scripts/tests/

# Distributed GPU (requires .credentials/)
scripts/launch_distributed.sh
python scripts/utils/sync_and_merge_embeddings.py
```

## NOTES

- `index.html` is a 3591-line monolith — all CSS, JS, and HTML inline. No build system.
- `adaptive_sampler_multilevel.js` is the only extracted JS module. Contains the RBF uncertainty estimation math.
- Questions use **LaTeX notation** (`$x^2$`, `$\frac{1}{2}$`) rendered by KaTeX in the frontend.
- LaTeX `$` signs require careful handling to distinguish them from currency `$` — see commits d3cc534, 7018ca3, 3ab088c.
- No CI/CD — all testing is manual. No GitHub Actions, no Makefile.
- `notes/` contains 63 implementation logs — useful for understanding decisions, but not a code reference.
- Cluster config: 2 clusters (tensor01, tensor02) × 8 GPUs = 16 workers. Uses `screen` sessions + `paramiko` SSH.
- License: CC BY-NC-SA 4.0 (non-commercial).
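
The RBF idea behind the sampler is roughly: each answered cell contributes evidence to nearby cells through a Gaussian kernel, and the least-covered cell is probed next. A minimal Python sketch of that scheme (the actual `MultiLevelAdaptiveSampler` lives in JS and differs in detail; `sigma` and the scoring formula here are assumptions):

```python
import math

def rbf_uncertainty(cells, answered, sigma=1.5):
    """Sketch: uncertainty per cell = 1 / (1 + summed Gaussian-kernel
    evidence from already-answered cells). Cells are (x, y) grid coords."""
    scores = {}
    for cx, cy in cells:
        evidence = sum(
            math.exp(-((cx - ax) ** 2 + (cy - ay) ** 2) / (2 * sigma ** 2))
            for ax, ay in answered
        )
        scores[(cx, cy)] = 1.0 / (1.0 + evidence)
    return scores

def next_cell(cells, answered, sigma=1.5):
    """Pick the most uncertain (least covered) cell to quiz next."""
    scores = rbf_uncertainty(cells, answered, sigma)
    return max(scores, key=scores.get)
```

Answering a question near a cell lowers that neighborhood's uncertainty, so the sampler naturally spreads questions across the map.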

## ACTIVE TECHNOLOGIES

- JavaScript ES2020+ (frontend), Python 3.11+ (pipeline) (001-demo-public-release)
- localStorage (browser-side, versioned schema per FR-007 clarification). No server-side storage. (001-demo-public-release)

## RECENT CHANGES

- 001-demo-public-release: Added JavaScript ES2020+ (frontend), Python 3.11+ (pipeline)