Commit fae9ae4
Merge pull request #19 from ContextLab/001-demo-public-release
feat: interactive knowledge mapper demo for public release
2 parents d3cc534 + 95b8b20 commit fae9ae4

144 files changed: +1,914,200 −3,478 lines


.github/workflows/deploy.yml (40 additions, 0 deletions)

```yaml
name: Deploy to GitHub Pages

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run build
      - uses: actions/upload-pages-artifact@v3
        with:
          path: dist

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - uses: actions/deploy-pages@v4
        id: deployment
```

.gitignore (22 additions, 0 deletions)

```diff
@@ -209,6 +209,12 @@ marimo/_static/
 marimo/_lsp/
 __marimo__/
 
+# Node.js / Frontend
+node_modules/
+dist/
+.vite/
+*.local
+
 # Project-specific
 CLAUDE.md
 
@@ -247,6 +253,22 @@ checkpoints/level_*_final.json
 # Large output files (>50MB)
 wikipedia_articles_level_*.json
 level_*_concepts.json
+
+# Legacy pipeline data files (root-level, superseded by data/domains/)
+cell_distances.json
+cell_questions_level_*.json
+cell_questions.json
+level_*_concepts_checkpoint.json
+neighbor_analysis.json
+optimal_rectangle.json
+question_coordinates.json
+questions.json
+review.html
+find_domain_clusters.py
+review.html
+
+# Playwright test artifacts
+test-results/
 .DS_Store
 */.DS_Store
 */*/.DS_Store
```

AGENTS.md (142 additions, 0 deletions)

# PROJECT KNOWLEDGE BASE

**Generated:** 2026-02-16 13:54 UTC
**Commit:** d3cc534
**Branch:** main

## OVERVIEW

Wikipedia Knowledge Map: a distributed GPU pipeline that generates semantic embeddings for 250K Wikipedia articles, projects them to 2D via UMAP, enriches cells with LLM-generated labels/questions across 5 difficulty levels, and serves an interactive adaptive quiz visualization. Stack: Python (sentence-transformers, UMAP, OpenAI Batch API, TensorFlow 2.19) + vanilla HTML/JS frontend.
## STRUCTURE

```
mapper/
├── index.html                      # 3591-line monolithic visualization (KaTeX, adaptive quiz UI)
├── adaptive_sampler_multilevel.js  # RBF-based adaptive testing algorithm (MultiLevelAdaptiveSampler)
├── run_full_pipeline.sh            # Shell orchestrator: L4→L3→L2 simplification + merge
├── scripts/                        # Core pipeline (see scripts/AGENTS.md)
│   ├── run_full_pipeline.py        # Python orchestrator: 9-step pipeline with idempotency
│   ├── generate_level_n.py         # Hierarchical article expansion (1272 lines, most complex)
│   ├── utils/                      # Shared API/embedding/Wikipedia utilities (see scripts/utils/AGENTS.md)
│   ├── tests/                      # Module tests + benchmarks (pytest)
│   └── diagnostics/                # Pipeline verification scripts
├── tests/                          # Root-level integration tests (embedding dims, UMAP, model loading)
├── notes/                          # 63 session notes, implementation plans, troubleshooting guides
├── data/benchmarks/                # Batch size benchmark results
├── embeddings/                     # Generated .pkl embedding files (gitignored, multi-GB)
├── backups/                        # Data backups
└── logos/                          # Logo assets
```
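The RBF uncertainty math itself lives in `adaptive_sampler_multilevel.js` and is not reproduced in this knowledge base. As a rough conceptual sketch only (in Python, with a made-up length scale, and simplified relative to the real `MultiLevelAdaptiveSampler`), the core idea of RBF-based adaptive sampling is to prefer the candidate question farthest, in kernel-similarity terms, from questions already asked:

```python
import math

def rbf_uncertainty(candidates, asked, length_scale=0.5):
    """Toy RBF uncertainty: a candidate 2D point is 'uncertain' when no
    already-asked question lies near it in the map space.
    NOTE: illustrative sketch only; the real logic in
    adaptive_sampler_multilevel.js differs in detail."""
    scores = []
    for cx, cy in candidates:
        if not asked:
            scores.append(1.0)  # nothing asked yet: everything is uncertain
            continue
        # Max kernel similarity to any asked point; high similarity => low uncertainty.
        sim = max(
            math.exp(-((cx - ax) ** 2 + (cy - ay) ** 2) / (2 * length_scale ** 2))
            for ax, ay in asked
        )
        scores.append(1.0 - sim)
    return scores

def next_question(candidates, asked):
    """Pick the index of the most uncertain candidate."""
    scores = rbf_uncertainty(candidates, asked)
    return max(range(len(candidates)), key=scores.__getitem__)
```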
## WHERE TO LOOK

| Task | Location | Notes |
|------|----------|-------|
| Run full pipeline | `scripts/run_full_pipeline.py` | Idempotent; use `--force` flags to rerun steps |
| Simplify questions by level | `scripts/simplify_questions.py --level N` | Levels 2-4 need simplification |
| Shell pipeline (simplify+merge) | `run_full_pipeline.sh` | Only runs simplification + merge, not full generation |
| Generate new difficulty level | `scripts/generate_level_n.py --level N` | LLM-based article suggestion + embedding + questions |
| Heatmap labels (GPT-5-nano) | `scripts/generate_heatmap_labels_gpt5.py` | Uses OpenAI Batch API |
| Heatmap labels (LM Studio) | `scripts/generate_heatmap_labels.py` | Local LLM alternative |
| UMAP rebuild | `scripts/rebuild_umap.py` | 30-60 min for 250K articles |
| Merge levels into final JSON | `scripts/merge_multi_level_data.py` | Deduplicates articles, merges questions by cell |
| OpenAI API integration | `scripts/utils/api_utils.py` | Key in `.credentials/openai.key` |
| Batch API helpers | `scripts/utils/openai_batch.py` | `batch_with_cache()` for cached LLM calls |
| Wikipedia download | `scripts/utils/wikipedia_utils.py` | `download_articles_batch()` |
| Distributed GPU embeddings | `scripts/utils/sync_and_merge_embeddings.py` | SSH/SFTP via paramiko to tensor01/tensor02 |
| Frontend visualization | `index.html` | Served via `python -m http.server 8000` |
| Adaptive sampling logic | `adaptive_sampler_multilevel.js` | `MultiLevelAdaptiveSampler` class |
| Debug pipeline issues | `scripts/diagnostics/diagnose_pipeline.py` | General pipeline diagnostics |
| Verify labels | `scripts/diagnostics/verify_cell_labels.py` | Label quality checks |
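The real `batch_with_cache()` signature is in `scripts/utils/openai_batch.py`; as a hedged sketch of the caching pattern the name implies (all names here are hypothetical, and no real API call is made), the idea is to hash each prompt, serve hits from disk, and only send misses to the expensive batch call:

```python
import hashlib
import json
from pathlib import Path

def batch_with_cache(prompts, run_llm_batch, cache_dir="llm_cache"):
    """Sketch of a cached-batch pattern: only prompts without a cached
    response are forwarded to run_llm_batch (e.g. an OpenAI Batch job).
    Hypothetical implementation; see scripts/utils/openai_batch.py
    for the project's actual helper."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)

    def key_for(prompt):
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    results, missing = {}, []
    for p in prompts:
        f = cache / f"{key_for(p)}.json"
        if f.exists():
            results[p] = json.loads(f.read_text())  # cache hit
        else:
            missing.append(p)

    if missing:
        # One batch call for all misses, then persist each response.
        for p, resp in zip(missing, run_llm_batch(missing)):
            (cache / f"{key_for(p)}.json").write_text(json.dumps(resp))
            results[p] = resp

    return [results[p] for p in prompts]
```

A second run with the same prompts would hit the cache and make no batch call at all, which is what makes repeated pipeline runs cheap.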
## DATA FLOW

```
wikipedia.pkl (250K articles, 752MB, gitignored)
  ↓ rebuild_umap.py
umap_coords.pkl + umap_reducer.pkl + umap_bounds.pkl
  ↓ find_optimal_coverage_rectangle.py
optimal_rectangle.json
  ↓ export_wikipedia_articles.py
wikipedia_articles_level_0.json
  ↓ generate_heatmap_labels_gpt5.py (OpenAI Batch)
heatmap_cell_labels.json (1,521 labels)
  ↓ generate_level_n.py --level 0 (concepts + questions)
level_0_concepts.json + cell_questions_level_0.json
  ↓ generate_level_n.py --level 1..4 (broader articles + questions)
cell_questions_level_{1..4}.json + wikipedia_articles_level_{1..4}.json
  ↓ simplify_questions.py --level {2,3,4}
cell_questions_level_{2,3,4}_simplified.json
  ↓ merge_multi_level_data.py
wikipedia_articles.json + cell_questions.json → consumed by index.html
```
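The ordering above also explains the pipeline's idempotency: each step's output is the next step's input, so a rerun can resume at the first missing artifact. A minimal sketch of that check (using representative filenames from the diagram; the real dependency logic is in `scripts/run_full_pipeline.py`):

```python
from pathlib import Path

# One representative artifact per pipeline stage, in dependency order.
# Filenames are taken from the DATA FLOW diagram; the real orchestrator
# (scripts/run_full_pipeline.py) tracks more outputs per step.
PIPELINE_ARTIFACTS = [
    "umap_coords.pkl",
    "optimal_rectangle.json",
    "wikipedia_articles_level_0.json",
    "heatmap_cell_labels.json",
    "level_0_concepts.json",
    "cell_questions.json",
]

def first_missing_step(root="."):
    """Return the first artifact not yet produced, i.e. where an
    idempotent rerun would resume; None if everything exists."""
    for name in PIPELINE_ARTIFACTS:
        if not (Path(root) / name).exists():
            return name
    return None
```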
## CONVENTIONS

- **Credentials**: `.credentials/` directory (gitignored). `openai.key` for API, `tensor01.txt`/`tensor02.txt` for clusters.
- **API key validation**: Must start with `sk-` (enforced in `api_utils.py`).
- **macOS env vars**: Scripts set `TOKENIZERS_PARALLELISM=false`, `OMP_NUM_THREADS=1`, `MKL_NUM_THREADS=1` to prevent Metal threading issues.
- **TensorFlow pinned**: `tensorflow==2.19.0` — 2.20 has a macOS mutex blocking bug.
- **Imports**: Scripts in `scripts/` use `sys.path.append(str(Path(__file__).parent.parent))` to import from `scripts.utils.*`.
- **LLM model**: Pipeline uses `gpt-5-nano` via OpenAI Batch API (cost: ~$0.50/level).
- **Embedding model**: `Qwen/Qwen3-Embedding-0.6B` (distributed) or `google/embeddinggemma-300m` (local).
- **JSON data in root**: Pipeline outputs (`cell_questions*.json`, `wikipedia_articles.json`, etc.) live in the project root, not `data/`.
- **Two pipeline scripts**: `run_full_pipeline.sh` (shell, simplification only) vs `scripts/run_full_pipeline.py` (Python, full 9-step pipeline). Use the Python one.
- **Idempotency**: The Python pipeline checks for existing outputs and skips steps. Use `--force` to rerun.
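The credential and environment conventions above can be sketched together; the real checks live in `scripts/utils/api_utils.py` and may differ in detail (this is a minimal illustration, not the project's code):

```python
import os
from pathlib import Path

# macOS/Metal threading workarounds noted above; set before heavy imports
# (sentence-transformers, TensorFlow) so they take effect.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

def load_openai_key(path=".credentials/openai.key"):
    """Sketch of the key convention: read the gitignored key file and
    enforce the 'sk-' prefix (the real validation is in api_utils.py)."""
    key = Path(path).read_text().strip()
    if not key.startswith("sk-"):
        raise ValueError(f"Invalid OpenAI key in {path}: must start with 'sk-'")
    return key
```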
## ANTI-PATTERNS (THIS PROJECT)

- **Never commit `.pkl` files** — multi-GB embedding/UMAP files are gitignored. `remove_large_files_from_history.sh` exists to clean mistakes.
- **Never commit `.credentials/`** — API keys, cluster passwords.
- **Never use TensorFlow >= 2.20** — macOS mutex blocking error.
- **Never mock in tests** — project policy requires real API calls, real models, real I/O (see CLAUDE.md instructions).
- **Never use `generate_heatmap_labels.py` in production** — use `generate_heatmap_labels_gpt5.py` (Batch API) instead.
- **Never run `build_wikipedia_knowledge_map.py`** — legacy (nvidia/nemotron). Use `build_wikipedia_knowledge_map_v2.py` or the pipeline.
## COMMANDS

```bash
# Full pipeline (idempotent, skips completed steps)
python scripts/run_full_pipeline.py

# Force rerun everything
python scripts/run_full_pipeline.py --force

# Simplification-only pipeline
./run_full_pipeline.sh

# Single level generation
python scripts/generate_level_n.py --level 2

# Simplify specific level
python scripts/simplify_questions.py --level 4

# Serve visualization
python -m http.server 8000

# Run tests
pytest tests/ scripts/tests/

# Distributed GPU (requires .credentials/)
scripts/launch_distributed.sh
python scripts/utils/sync_and_merge_embeddings.py
```
## NOTES

- `index.html` is a 3591-line monolith — all CSS, JS, and HTML inline. No build system.
- `adaptive_sampler_multilevel.js` is the only extracted JS module. Contains the RBF uncertainty estimation math.
- Questions use **LaTeX notation** (`$x^2$`, `$\frac{1}{2}$`) rendered by KaTeX in the frontend.
- LaTeX `$` signs require careful handling to distinguish them from currency `$` — see commits d3cc534, 7018ca3, 3ab088c.
- No CI/CD — all testing is manual. No GitHub Actions, no Makefile.
- `notes/` contains 63 implementation logs — useful for understanding decisions but not code reference.
- Cluster config: 2 clusters (tensor01, tensor02) x 8 GPUs = 16 workers. Uses `screen` sessions + `paramiko` SSH.
- License: CC BY-NC-SA 4.0 (non-commercial).
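One common heuristic for the `$`-disambiguation problem mentioned above (a hedged sketch, not the fix from the cited commits, which live in `index.html`): treat a `$` as starting a currency amount when it is immediately followed by digits, and only pair the remaining `$...$` spans as inline LaTeX:

```python
import re

# Hypothetical heuristic, not the project's actual rule: "$" followed
# directly by digits (e.g. "$5", "$1,200.50") is currency; any balanced
# $...$ pair among the remaining dollar signs is inline LaTeX.
CURRENCY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")

def find_math_spans(text):
    """Return (start, end) spans of $...$ pairs that are not currency."""
    # Blank out currency amounts so their "$" can't pair with a math "$".
    masked = CURRENCY.sub(lambda m: " " * len(m.group()), text)
    return [(m.start(), m.end()) for m in re.finditer(r"\$[^$]+\$", masked)]
```

Real question text has messier cases (e.g. "$5 to $10" ranges), which is presumably why the project needed several commits to get this right.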
## Active Technologies

- JavaScript ES2020+ (frontend), Python 3.11+ (pipeline) (001-demo-public-release)
- localStorage (browser-side, versioned schema per FR-007 clarification). No server-side storage. (001-demo-public-release)

## Recent Changes

- 001-demo-public-release: Added JavaScript ES2020+ (frontend), Python 3.11+ (pipeline)
