#!/usr/bin/env bash
set -euo pipefail
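
# Bootstraps the English word prefix-trie timelapse pipeline: creates a
# virtualenv, downloads the Wiktionary dump and the Google Books 1-gram
# shards, builds the trie artifacts, then renders frames and encodes the
# timelapse. Safe to re-run: completed downloads and cached artifacts are
# reused when present.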

# Constants
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="$REPO_ROOT/venv"
PYTHON="python3"
DATA_DIR="$REPO_ROOT/data"
WIKTIONARY_URL="https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2"
NGRAM_BASE="https://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-"
NGRAM_SHARDS=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
ARTIFACT_CACHE="$REPO_ROOT/artifacts/cache.tar.gz"
ARTIFACT_META_DIR="$REPO_ROOT/artifacts/metadata"
SHARD_RECORD="$ARTIFACT_META_DIR/ngram_shards.txt"
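
# Print a tagged progress message.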
log() {
  printf '[setup] %s\n' "$1"
}
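
# Fail fast when Python 3 is unavailable.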
ensure_python() {
  if ! command -v "$PYTHON" >/dev/null 2>&1; then
    log "python3 not found; install Python 3 before running this script"
    exit 1
  fi
}
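
# Select curl or wget for downloads and record the choice in HTTP_CLIENT.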
ensure_http_clients() {
  if command -v curl >/dev/null 2>&1; then
    HTTP_CLIENT="curl"
  elif command -v wget >/dev/null 2>&1; then
    HTTP_CLIENT="wget"
  else
    log "neither curl nor wget is available; install one to download datasets"
    exit 1
  fi
}
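
# Create the virtualenv on first use, activate it, and upgrade pip.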
create_venv() {
  if [ ! -d "$VENV_DIR" ]; then
    log "creating virtual environment"
    "$PYTHON" -m venv "$VENV_DIR"
  fi
  # shellcheck disable=SC1091
  source "$VENV_DIR/bin/activate"
  log "upgrading pip"
  pip install --upgrade pip
}
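
# Install the project's Python dependencies into the active virtualenv.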
install_requirements() {
  log "installing Python dependencies"
  pip install -r "$REPO_ROOT/requirements.txt"
}
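
# Create every data, artifact, and output directory the pipeline writes to.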
ensure_dirs() {
  log "creating data and artifact directories"
  mkdir -p \
    "$REPO_ROOT/data/wiktionary" \
    "$REPO_ROOT/data/ngrams" \
    "$REPO_ROOT/artifacts/lemmas" \
    "$REPO_ROOT/artifacts/years" \
    "$REPO_ROOT/artifacts/trie" \
    "$REPO_ROOT/artifacts/layout" \
    "$REPO_ROOT/outputs/frames" \
    "$ARTIFACT_META_DIR"
}
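
# Unpack previously checkpointed artifacts so finished stages can be skipped.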
restore_artifact_cache() {
  if [ -f "$ARTIFACT_CACHE" ]; then
    log "restoring cached artifacts"
    tar -xzf "$ARTIFACT_CACHE" -C "$REPO_ROOT"
  fi
}
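
# Archive the artifacts directory, excluding the cache tarball itself so the
# old checkpoint is not archived into the new one.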
checkpoint_artifacts() {
  if [ -d "$REPO_ROOT/artifacts" ]; then
    log "saving artifact cache"
    tar --exclude="$(basename "$ARTIFACT_CACHE")" -czf "$ARTIFACT_CACHE" -C "$REPO_ROOT" artifacts
  fi
}
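
# Fetch the latest English Wiktionary dump unless it is already on disk.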
download_wiktionary() {
  local target="$DATA_DIR/wiktionary/enwiktionary-latest-pages-articles.xml.bz2"
  if [ -f "$target" ]; then
    log "wiktionary dump already present"
    return
  fi
  log "downloading wiktionary dump"
  mkdir -p "$(dirname "$target")"
  if [ "$HTTP_CLIENT" = "curl" ]; then
    # -f makes curl exit nonzero on HTTP errors instead of saving an error page
    curl -fL "$WIKTIONARY_URL" -o "$target"
  else
    wget -O "$target" "$WIKTIONARY_URL"
  fi
}
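
# Return success when the file starts with the gzip magic bytes (0x1f 0x8b);
# used to detect corrupt or incomplete shard downloads.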
is_gzip() {
  local file="$1"
  "$PYTHON" -c 'import sys
from pathlib import Path
path = Path(sys.argv[1])
try:
    with path.open("rb") as handle:
        head = handle.read(2)
    sys.exit(0 if head == b"\x1f\x8b" else 1)
except OSError:  # includes FileNotFoundError
    sys.exit(1)' "$file"
}
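
# Download all 26 Google Books 1-gram shards, validating each as gzip and
# clearing any stale shards left over from an older naming scheme.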
download_ngrams() {
  mkdir -p "$DATA_DIR/ngrams"
  # The glob stays literal when nothing matches; the -e test guards that case.
  for legacy in "$DATA_DIR"/ngrams/eng-all-1gram-*.gz; do
    if [ -e "$legacy" ]; then
      log "removing legacy shard $(basename "$legacy")"
      rm -f "$legacy"
    fi
  done
  for shard in "${NGRAM_SHARDS[@]}"; do
    local name="${NGRAM_BASE##*/}${shard}.gz"
    local target="$DATA_DIR/ngrams/${name}"
    local url="${NGRAM_BASE}${shard}.gz"
    if [ -f "$target" ]; then
      if is_gzip "$target"; then
        log "ngram shard ${name} already present"
        continue
      fi
      log "existing shard ${name} is invalid; re-downloading"
      rm -f "$target"
    fi
    log "downloading ngram shard ${name}"
    if [ "$HTTP_CLIENT" = "curl" ]; then
      curl -fL "$url" -o "$target"
    else
      wget -O "$target" "$url"
    fi
    if ! is_gzip "$target"; then
      log "downloaded shard ${name} is not a valid gzip; please check the URL"
      exit 1
    fi
  done
}
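
# Rebuild the lemma, first-year, and trie artifacts when the trie output is
# missing or the recorded shard set differs from the current one, then
# render frames and encode the video and gif.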
run_pipelines() {
  # shellcheck disable=SC1091
  source "$VENV_DIR/bin/activate"
  local expected_shards="${NGRAM_SHARDS[*]}"
  local rebuild=0
  if [ ! -f "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl" ]; then
    rebuild=1
  elif [ ! -f "$SHARD_RECORD" ]; then
    rebuild=1
  else
    local recorded
    recorded=$(<"$SHARD_RECORD")
    if [ "$recorded" != "$expected_shards" ]; then
      rebuild=1
    fi
  fi
  if [ "$rebuild" -eq 1 ]; then
    log "extracting lemmas from wiktionary"
    python -m src.ingest.wiktionary_extract "$DATA_DIR/wiktionary/enwiktionary-latest-pages-articles.xml.bz2" "$REPO_ROOT/artifacts/lemmas/lemmas.tsv"
    log "computing first-year data"
    python -m src.ingest.ngram_first_year "$REPO_ROOT/artifacts/lemmas/lemmas.tsv" "$DATA_DIR/ngrams" "$REPO_ROOT/artifacts/years/first_years.tsv"
    log "building prefix trie"
    python -m src.build.build_prefix_trie "$REPO_ROOT/artifacts/years/first_years.tsv" "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl"
    printf '%s\n' "$expected_shards" >"$SHARD_RECORD"
    checkpoint_artifacts
  else
    log "cached prefix counts match shard set; skipping ingest and build"
  fi
  log "rendering frames"
  python -m src.viz.render_frames "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl" "$REPO_ROOT/outputs/frames"
  log "encoding video and gif"
  python -m src.viz.encode "$REPO_ROOT/outputs/frames" "$REPO_ROOT/outputs/english_trie_timelapse.mp4" "$REPO_ROOT/outputs/english_trie_timelapse.gif"
}
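
# Entry point: run every stage in dependency order.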
main() {
  ensure_python
  ensure_http_clients
  create_venv
  install_requirements
  ensure_dirs
  restore_artifact_cache
  download_wiktionary
  download_ngrams
  run_pipelines
  log "setup complete. activate with 'source venv/bin/activate'"
}

main "$@"