
Commit b25bf92 (initial commit)

English Words as a Knowledge Graph, 1800–2019

16 files changed: 1592 additions, 0 deletions

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
# Project artifacts
venv/
data/
artifacts/
outputs/

# Python cache
__pycache__/
*.pyc

# macOS metadata
.DS_Store

README.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# English Lexicon Time Machine

> Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.

## How this repo is put together

- **Zero-config takeover** – `./setup.sh` spins up the virtualenv, fetches every dataset, caches the heavy lifting, and ships final MP4/GIF output.
- **Radial growth cinematics** – the trie erupts from the core alphabet, framing decades of linguistic evolution as a neon fractal.
- **Repeatable science** – every artifact (lemmata, first-year inference, trie counts, layouts) checkpoints to disk and into a reusable tarball for instant re-renders.
- **Battle-tested** – streams 26 full 1-gram shards, handles a 1.4 GB Wiktionary dump, and renders 220 frames in glorious 1080p.

Share it, remix it, drop it in your next data-viz thread.

## Quickstart

```bash
cd /path/to/graph-visualizations
bash setup.sh
```

The script will:

1. Create or upgrade `venv/` with Python 3.
2. Download the Wiktionary dump and the Google Books 1-gram shards (`a`–`z`).
3. Extract English lemmas, infer first-use years, and aggregate prefix counts.
4. Render 220 radial frames (`outputs/frames/frame-0000.png` through `frame-0219.png`).
5. Encode `outputs/english_trie_timelapse.mp4` and a share-ready GIF.

Rerun the script anytime; artifact caching means future passes jump straight to rendering.
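Curious whether a rerun will skip the heavy ingest? Here is a minimal sketch of the same cache check `setup.sh` performs (run from the repo root; the paths and shard-record format mirror the script):

```python
from pathlib import Path

# setup.sh skips ingest/build when the trie artifact exists and the
# recorded shard set matches the expected a-z shards.
trie = Path("artifacts/trie/prefix_counts.jsonl")
record = Path("artifacts/metadata/ngram_shards.txt")
expected = " ".join("abcdefghijklmnopqrstuvwxyz")

will_skip = (
    trie.exists()
    and record.exists()
    and record.read_text(encoding="utf-8").strip() == expected
)
print("cache hit: straight to rendering" if will_skip else "full ingest + build will run")
```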
## Anatomy

| Stage | Script | Output |
|-------|--------|--------|
| Lemma extraction | `src/ingest/wiktionary_extract.py` | `artifacts/lemmas/lemmas.tsv` |
| First-year inference | `src/ingest/ngram_first_year.py` | `artifacts/years/first_years.tsv` |
| Prefix aggregation | `src/build/build_prefix_trie.py` | `artifacts/trie/prefix_counts.jsonl` |
| Layout generation | `src/viz/layout.py` | `artifacts/layout/prefix_positions.json` (legacy back-compat) |
| Frame rendering | `src/viz/render_frames.py` | `outputs/frames/` |
| Encoding | `src/viz/encode.py` | `outputs/english_trie_timelapse.mp4` + `.gif` |
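Each line of `artifacts/trie/prefix_counts.jsonl` is one JSON object with `prefix`, `depth`, `year`, and `cumulative_count` keys (the schema written by `src/build/build_prefix_trie.py`). A quick sketch for poking at the artifact once it exists; the prefix and year below are arbitrary examples:

```python
import json
from pathlib import Path

# How many words starting with "th" had appeared by 1900?
path = Path("artifacts/trie/prefix_counts.jsonl")
with path.open(encoding="utf-8") as handle:
    for line in handle:
        record = json.loads(line)
        if record["prefix"] == "th" and record["year"] == 1900:
            print(record["cumulative_count"])
            break
```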
## Render Only (after initial run)

```bash
source venv/bin/activate
python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif
```

Use flags such as `--min-radius`, `--max-radius`, `--base-edge-alpha`, or `--start-progress` to tune the vibe.
## Neo4j Playground (Optional)

Load `artifacts/years/first_years.tsv` to explore in Neo4j (works on both Community and Enterprise):

```cypher
// Supply the TSV rows as a $rows parameter first, e.g. in Neo4j Browser:
//   :param rows => [{word: "example", first_year: "1802"}]
UNWIND $rows AS row
WITH row WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE
  WHEN row.first_year = "" THEN NULL
  ELSE toInteger(row.first_year)
END;
```
## Share-Worthy Ideas

- Drop the GIF in language history threads (#linguistics #dataart).
- Remix the radial layout with alternative color ramps or depth cutoffs.
- Pair the timelapse with poetry readings for maximum feels.

## Credits

- Wiktionary community & Google Books N-gram team for open data.
- You, for showing the world how beautifully language grows.

## Community

For more open source software and content on Knowledge Graphs, GNNs, and Graph Databases, [join our community on X!](https://x.com/i/communities/1977449294861881612)

neo4j/cypher/load_words.cypher

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
// words.tsv: word<TAB>first_year
// Supply rows first, e.g. :param rows => [{word: "example", first_year: "1802"}]
UNWIND $rows AS row
WITH row
WHERE row.word IS NOT NULL AND row.word <> ""
MERGE (w:Word {text: row.word})
SET w.first_year = CASE
  WHEN row.first_year IS NULL OR row.first_year = "" THEN NULL
  ELSE toInteger(row.first_year)
END;

requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
lxml>=5.2,<6
Pillow>=10.0,<11

setup.sh

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
#!/usr/bin/env bash
set -euo pipefail

# Constants
REPO_ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV_DIR="$REPO_ROOT/venv"
PYTHON="python3"
DATA_DIR="$REPO_ROOT/data"
WIKTIONARY_URL="https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles.xml.bz2"
NGRAM_BASE="https://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-"
NGRAM_SHARDS=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
ARTIFACT_CACHE="$REPO_ROOT/artifacts/cache.tar.gz"
ARTIFACT_META_DIR="$REPO_ROOT/artifacts/metadata"
SHARD_RECORD="$ARTIFACT_META_DIR/ngram_shards.txt"

log() {
  printf '[setup] %s\n' "$1"
}

ensure_python() {
  if ! command -v "$PYTHON" >/dev/null 2>&1; then
    log "python3 not found; install Python 3 before running this script"
    exit 1
  fi
}

# Pick whichever downloader is available; curl is preferred.
ensure_http_clients() {
  if command -v curl >/dev/null 2>&1; then
    HTTP_CLIENT="curl"
  elif command -v wget >/dev/null 2>&1; then
    HTTP_CLIENT="wget"
  else
    log "neither curl nor wget is available; install one to download datasets"
    exit 1
  fi
}

create_venv() {
  if [ ! -d "$VENV_DIR" ]; then
    log "creating virtual environment"
    "$PYTHON" -m venv "$VENV_DIR"
  fi
  # shellcheck disable=SC1091
  source "$VENV_DIR/bin/activate"
  log "upgrading pip"
  pip install --upgrade pip
}

install_requirements() {
  log "installing Python dependencies"
  pip install -r "$REPO_ROOT/requirements.txt"
}

ensure_dirs() {
  log "creating data and artifact directories"
  mkdir -p \
    "$REPO_ROOT/data/wiktionary" \
    "$REPO_ROOT/data/ngrams" \
    "$REPO_ROOT/artifacts/lemmas" \
    "$REPO_ROOT/artifacts/years" \
    "$REPO_ROOT/artifacts/trie" \
    "$REPO_ROOT/artifacts/layout" \
    "$REPO_ROOT/outputs/frames" \
    "$ARTIFACT_META_DIR"
}

restore_artifact_cache() {
  if [ -f "$ARTIFACT_CACHE" ]; then
    log "restoring cached artifacts"
    tar -xzf "$ARTIFACT_CACHE" -C "$REPO_ROOT"
  fi
}

checkpoint_artifacts() {
  if [ -d "$REPO_ROOT/artifacts" ]; then
    log "saving artifact cache"
    tar --exclude="$(basename "$ARTIFACT_CACHE")" -czf "$ARTIFACT_CACHE" -C "$REPO_ROOT" artifacts
  fi
}

download_wiktionary() {
  local target="$DATA_DIR/wiktionary/enwiktionary-latest-pages-articles.xml.bz2"
  if [ -f "$target" ]; then
    log "wiktionary dump already present"
    return
  fi
  log "downloading wiktionary dump"
  mkdir -p "$(dirname "$target")"
  if [ "$HTTP_CLIENT" = "curl" ]; then
    curl -L "$WIKTIONARY_URL" -o "$target"
  else
    wget -O "$target" "$WIKTIONARY_URL"
  fi
}

# Succeed only if the file starts with the gzip magic bytes (0x1f 0x8b).
is_gzip() {
  local file="$1"
  "$PYTHON" -c 'import sys
from pathlib import Path
path = Path(sys.argv[1])
try:
    with path.open("rb") as handle:
        head = handle.read(2)
    sys.exit(0 if head == b"\x1f\x8b" else 1)
except OSError:
    sys.exit(1)' "$file"
}

download_ngrams() {
  mkdir -p "$DATA_DIR/ngrams"
  # Clear out shards saved under the old naming scheme.
  for legacy in "$DATA_DIR"/ngrams/eng-all-1gram-*.gz; do
    if [ -e "$legacy" ]; then
      log "removing legacy shard $(basename "$legacy")"
      rm -f "$legacy"
    fi
  done
  for shard in "${NGRAM_SHARDS[@]}"; do
    local name="${NGRAM_BASE##*/}${shard}.gz"
    local target="$DATA_DIR/ngrams/${name}"
    local url="${NGRAM_BASE}${shard}.gz"
    if [ -f "$target" ]; then
      if is_gzip "$target"; then
        log "ngram shard ${name} already present"
        continue
      fi
      log "existing shard ${name} is invalid; re-downloading"
      rm -f "$target"
    fi
    log "downloading ngram shard ${name}"
    if [ "$HTTP_CLIENT" = "curl" ]; then
      curl -L "$url" -o "$target"
    else
      wget -O "$target" "$url"
    fi
    if ! is_gzip "$target"; then
      log "downloaded shard ${name} is not a valid gzip; please check the URL"
      exit 1
    fi
  done
}

run_pipelines() {
  # shellcheck disable=SC1091
  source "$VENV_DIR/bin/activate"
  # Rebuild only when the trie artifact is missing or the recorded shard set changed.
  local expected_shards="${NGRAM_SHARDS[*]}"
  local rebuild=0
  if [ ! -f "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl" ]; then
    rebuild=1
  elif [ ! -f "$SHARD_RECORD" ]; then
    rebuild=1
  else
    local recorded
    recorded=$(<"$SHARD_RECORD")
    if [ "$recorded" != "$expected_shards" ]; then
      rebuild=1
    fi
  fi
  if [ "$rebuild" -eq 1 ]; then
    log "extracting lemmas from wiktionary"
    python -m src.ingest.wiktionary_extract "$DATA_DIR/wiktionary/enwiktionary-latest-pages-articles.xml.bz2" "$REPO_ROOT/artifacts/lemmas/lemmas.tsv"
    log "computing first-year data"
    python -m src.ingest.ngram_first_year "$REPO_ROOT/artifacts/lemmas/lemmas.tsv" "$DATA_DIR/ngrams" "$REPO_ROOT/artifacts/years/first_years.tsv"
    log "building prefix trie"
    python -m src.build.build_prefix_trie "$REPO_ROOT/artifacts/years/first_years.tsv" "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl"
    printf '%s\n' "$expected_shards" >"$SHARD_RECORD"
    checkpoint_artifacts
  else
    log "cached prefix counts match shard set; skipping ingest and build"
  fi
  log "rendering frames"
  python -m src.viz.render_frames "$REPO_ROOT/artifacts/trie/prefix_counts.jsonl" "$REPO_ROOT/outputs/frames"
  log "encoding video and gif"
  python -m src.viz.encode "$REPO_ROOT/outputs/frames" "$REPO_ROOT/outputs/english_trie_timelapse.mp4" "$REPO_ROOT/outputs/english_trie_timelapse.gif"
}

main() {
  ensure_python
  ensure_http_clients
  create_venv
  install_requirements
  ensure_dirs
  restore_artifact_cache
  download_wiktionary
  download_ngrams
  run_pipelines
  log "setup complete. activate with 'source venv/bin/activate'"
}

main "$@"

src/__init__.py

Whitespace-only changes.

src/build/__init__.py

Whitespace-only changes.

src/build/build_prefix_trie.py

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
"""Build cumulative prefix counts per year, up to a configurable depth (6 by default)."""

from __future__ import annotations

import argparse
import json
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path


@dataclass(slots=True)
class Config:
    first_years_path: Path
    output_path: Path
    depth: int = 6
    start_year: int = 1800
    end_year: int = 2019


def load_first_years(path: Path) -> list[tuple[str, int]]:
    """Read word<TAB>first_year rows, skipping words with no inferred year."""
    results: list[tuple[str, int]] = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            word, _, year_str = line.rstrip("\n").partition("\t")
            if not year_str:
                continue
            results.append((word, int(year_str)))
    return results


def build_counts(
    data: list[tuple[str, int]], config: Config
) -> dict[tuple[str, int], dict[int, int]]:
    """For each (prefix, depth) pair, count how many words first appear in each year."""
    counts: dict[tuple[str, int], dict[int, int]] = defaultdict(lambda: defaultdict(int))
    for word, year in data:
        if year < config.start_year or year > config.end_year:
            continue
        for depth in range(1, min(len(word), config.depth) + 1):
            prefix = word[:depth]
            counts[(prefix, depth)][year] += 1
    return counts


def cumulative_counts(
    counts: dict[tuple[str, int], dict[int, int]], config: Config
) -> dict[tuple[str, int, int], int]:
    """Expand per-year increments into a running total for every year in range."""
    cumulative: dict[tuple[str, int, int], int] = {}
    for (prefix, depth), year_counts in counts.items():
        total = 0
        for year in range(config.start_year, config.end_year + 1):
            total += year_counts.get(year, 0)
            cumulative[(prefix, depth, year)] = total
    return cumulative


def write_jsonl(data: dict[tuple[str, int, int], int], path: Path) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        for (prefix, depth, year), count in sorted(data.items()):
            obj = {
                "prefix": prefix,
                "depth": depth,
                "year": year,
                "cumulative_count": count,
            }
            handle.write(json.dumps(obj) + "\n")


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("first_years", type=Path)
    parser.add_argument("output", type=Path)
    parser.add_argument("--depth", type=int, default=6)
    parser.add_argument("--start", type=int, default=1800)
    parser.add_argument("--end", type=int, default=2019)
    args = parser.parse_args()
    config = Config(
        first_years_path=args.first_years,
        output_path=args.output,
        depth=args.depth,
        start_year=args.start,
        end_year=args.end,
    )
    data = load_first_years(config.first_years_path)
    counts = build_counts(data, config)
    cumulative = cumulative_counts(counts, config)
    write_jsonl(cumulative, config.output_path)


if __name__ == "__main__":
    main()
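To make the data contract above concrete, a tiny worked example against the two counting passes (run from the repo root; the words and years are made up):

```python
from src.build.build_prefix_trie import Config, build_counts, cumulative_counts

# Paths are unused by these two functions, so placeholders are fine here.
config = Config(
    first_years_path=None,
    output_path=None,
    depth=3,
    start_year=1800,
    end_year=1802,
)

counts = build_counts([("cat", 1800), ("car", 1802)], config)
assert counts[("c", 1)] == {1800: 1, 1802: 1}  # both words share prefix "c"
assert counts[("cat", 3)] == {1800: 1}

cumulative = cumulative_counts(counts, config)
# The running total for prefix "c" rises from 1 to 2 once "car" appears.
assert cumulative[("c", 1, 1800)] == 1
assert cumulative[("c", 1, 1801)] == 1
assert cumulative[("c", 1, 1802)] == 2
```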

src/ingest/__init__.py

Whitespace-only changes.
