Summary
Automatically link semantically similar concepts across different sources — both during ingestion and offline. Expand ingestion beyond wiki pages to include podcast transcripts and technical books on relevant topics. The graph should continuously discover cross-source relationships, surfacing connections the user might never have noticed.
Motivation
Knowledge doesn't live in silos. A concept from an Anthropic podcast episode might directly relate to a chapter in a distributed systems textbook, which connects to a wiki page on CRDT sync. Today, these connections only exist if they happen to be in the same ingestion batch. Semantic linking would make the knowledge graph a true web of understanding across all sources.
Key Ideas
Cross-source semantic linking
- During ingestion: compare new concepts against the entire existing graph using embeddings, link semantically similar nodes even if they come from completely different sources
- Offline/background: periodically re-scan the graph for semantic similarities that weren't caught at ingestion time (e.g., after new concepts shift the embedding space)
- Use cosine similarity on Claude-computed concept embeddings to suggest links, with a confidence threshold
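The ingestion-time comparison above can be sketched as follows. This is a minimal sketch, assuming the graph already stores Claude-computed embeddings keyed by node ID; `graph_embeddings`, `suggest_links`, and the threshold value are illustrative names, not existing APIs.

```python
import numpy as np

# Tunable confidence cutoff (illustrative value, not empirically chosen).
SIMILARITY_THRESHOLD = 0.80

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def suggest_links(new_embedding: np.ndarray,
                  graph_embeddings: dict[str, np.ndarray],
                  threshold: float = SIMILARITY_THRESHOLD) -> list[tuple[str, float]]:
    """Compare a newly ingested concept against every existing node and
    return (node_id, score) pairs above the threshold, strongest first."""
    scores = [(node_id, cosine_similarity(new_embedding, emb))
              for node_id, emb in graph_embeddings.items()]
    return sorted((s for s in scores if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)
```

The offline/background pass would reuse the same function, iterating over all node pairs (or an approximate-nearest-neighbor index once the graph grows) rather than only the newly ingested concepts.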
Expanded source types
- Podcast transcripts: Ingest transcripts (e.g., from Anthropic, Lex Fridman, technical podcasts) — either user-provided or fetched from podcast RSS feeds / transcript APIs
- Technical books: Ingest chapters or sections from relevant books (PDF, EPUB, or pasted text) — e.g., DDIA, SICP, or domain-specific references the user is studying
- Periodic ingestion: Schedule or prompt for re-ingestion of podcast feeds to pick up new episodes automatically
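For the periodic podcast ingestion above, a background job could diff a feed against the GUIDs already in the graph. A stdlib-only sketch, assuming the caller fetches the RSS XML and persists `seen_guids` between runs (both hypothetical; real feeds may also expose transcript URLs, e.g. via the Podcasting 2.0 `<podcast:transcript>` tag, which is not handled here):

```python
import xml.etree.ElementTree as ET

def new_episodes(feed_xml: str, seen_guids: set[str]) -> list[dict]:
    """Parse a podcast RSS feed and return episodes not yet ingested."""
    root = ET.fromstring(feed_xml)
    episodes = []
    for item in root.iter("item"):
        # Fall back to the link when a feed omits <guid>.
        guid = item.findtext("guid") or item.findtext("link") or ""
        if guid and guid not in seen_guids:
            episodes.append({
                "guid": guid,
                "title": item.findtext("title", default=""),
            })
    return episodes
```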
Relationship quality
- Semantic links should include an explanation of why two concepts are related (analogy, shared mechanism, contrast, etc.)
- Distinguish between explicit relationships (stated in source material) and inferred relationships (discovered via embedding similarity)
- Let users confirm, reject, or refine inferred links
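The three relationship-quality requirements above suggest a link record that carries an explanation, an explicit/inferred provenance flag, and a user-review state. A sketch under those assumptions (all names here are hypothetical, not an existing schema):

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    EXPLICIT = "explicit"   # stated in the source material
    INFERRED = "inferred"   # discovered via embedding similarity

class ReviewStatus(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"

@dataclass
class SemanticLink:
    source_id: str
    target_id: str
    relation_type: str        # e.g. "analogy", "contrast", "shared-mechanism"
    explanation: str          # why the two concepts are related
    provenance: Provenance
    confidence: float = 1.0   # cosine similarity, for inferred links
    status: ReviewStatus = ReviewStatus.PENDING

    def review(self, accept: bool) -> None:
        """Record the user's confirm/reject decision on an inferred link."""
        self.status = ReviewStatus.CONFIRMED if accept else ReviewStatus.REJECTED
```

"Refine" could then be modeled as rewriting `relation_type` or `explanation` on a confirmed link rather than as a separate state.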
Research
Transfer Learning: The Far Transfer Problem
Transfer of learning — applying knowledge from one context to another — is one of the most studied and most elusive goals in education. Barnett & Ceci (2002) formalized the near/far transfer taxonomy.
The sobering finding: Sala & Gobet (2019) found in a second-order meta-analysis that when controlling for placebo effects and publication bias, far-transfer effects are small or null. Spontaneous far transfer essentially doesn't happen.
However, analogical encoding changes this picture dramatically — the research consensus is not that far transfer is impossible, but that it requires specific instructional conditions that most learning environments fail to provide.
Barnett, S. M., & Ceci, S. J. (2002). When and where do we apply what we learn? A taxonomy for far transfer. Psychological Bulletin, 128(4), 612-637.
Sala, G., & Gobet, F. (2019). Near and far transfer in cognitive training. Collabra: Psychology, 5(1), 18.
Analogical Reasoning: Gentner and Holyoak
Structure-Mapping Theory (Gentner, 1983): Analogy works by mapping relational structure from a source domain to a target domain. Successful analogies preserve relational structure, not surface features — two concepts from different wiki collections may share relational structure even when their surface content is entirely different.
Analogical Encoding (Gentner, Loewenstein & Thompson, 2003): The landmark finding — comparing two cases side-by-side produces far more transfer than studying them separately. Graduate students who drew an analogy from two cases were nearly three times more likely to incorporate strategies from training cases into real negotiations. This directly supports a feature that surfaces cross-source connections and prompts learners to compare them.
Multiconstraint Theory (Holyoak & Thagard, 1989): Analogical mapping is governed by three interacting constraints: (1) structural consistency, (2) semantic similarity, and (3) pragmatic centrality. These map directly to an implementation: cosine similarity captures semantic similarity, knowledge graph structure captures structural consistency, and learner context provides pragmatic centrality.
Gentner, D., Loewenstein, J., & Thompson, L. (2003). Learning and transfer: A general role for analogical encoding. Journal of Educational Psychology, 95(2), 393-408.
Holyoak, K. J., & Thagard, P. (1989). Analogical mapping by constraint satisfaction. Cognitive Science, 13(3), 295-355.
LLMs and Analogical Reasoning
Recent studies (2024-2025) show that advanced LLMs match human performance on analogical reasoning tasks, validating the use of Claude for detecting cross-domain structural parallels at extraction time — aligning with the existing extraction pipeline.
Embedding-Based Semantic Discovery in Education
- Cosine similarity for educational content (MDPI Information, 2023): Knowledge graphs combined with cosine similarity of concept embeddings generate personalized educational recommendations.
- Prerequisite discovery via embeddings (JEDM, 2024): AI-assisted construction of educational knowledge graphs uses cosine similarity between concept embeddings to detect semantic references between concepts across different documents and courses.
- Contextual knowledge graphs (arXiv, 2024): Combining visual graph structures with semantic analysis can reveal novel intersections between fields — connections invisible to keyword searches but potentially transformative.
Cross-Disciplinary Knowledge Integration
- Cross-disciplinary learning (arXiv, 2020): Properly scaffolded cross-disciplinary connections lead to deeper understanding, while unscaffolded exposure to multiple disciplines does not. Scaffolding is essential.
- Knowledge integration from distant fields (ScienceDirect, 2022): Integrating knowledge from seemingly distant fields is positively associated with uniqueness in contribution when properly supported.
Key Takeaways for Implementation
- Far transfer is hard but achievable with analogical encoding: Explicitly surfacing cross-domain structural parallels (rather than hoping learners discover them) makes transfer 3x more likely (Gentner et al., 2003).
- Holyoak's three constraints map to a scoring function: Semantic similarity (cosine similarity of embeddings), structural consistency (graph topology), and pragmatic centrality (learner context/mastery) — all computable.
- Embeddings are proven for educational concept discovery: Multiple 2023-2024 studies demonstrate transformer-based embeddings reliably identify semantic relationships across documents and disciplines.
- Scaffolding is essential: The system must actively scaffold comparison — present concepts side by side, highlight structural parallels, and prompt reflection. A cross-source quiz item could ask "How does concept X from Collection A relate to concept Y from Collection B?"
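The Holyoak-constraint scoring function named in the takeaways could be a simple weighted combination. A sketch assuming all three inputs are normalized to [0, 1]; the weights and the Jaccard proxy for structural consistency are illustrative assumptions, not tuned or prescribed values:

```python
def jaccard(neighbors_a: set[str], neighbors_b: set[str]) -> float:
    """One possible structural-consistency proxy: neighbor-set overlap
    of the two concepts in the knowledge graph."""
    if not neighbors_a and not neighbors_b:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(neighbors_a | neighbors_b)

def link_score(semantic: float, structural: float, pragmatic: float,
               weights: tuple[float, float, float] = (0.5, 0.3, 0.2)) -> float:
    """Weighted combination of Holyoak's three constraints:
      semantic   - cosine similarity of concept embeddings
      structural - graph-topology overlap (e.g. jaccard above)
      pragmatic  - relevance to the learner's current context/mastery
    """
    w_sem, w_str, w_prag = weights
    return w_sem * semantic + w_str * structural + w_prag * pragmatic
```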
Related
- Feature: typed relationships for cross-discipline semantic connections #38 — Typed relationships (semantic links need relationship types: analogy, contrast, shared-mechanism, etc.)
- Feature: Claude-computed concept embeddings for semantic discovery #39 — Concept embeddings (foundation for cosine similarity discovery)
- feat: video-synchronized knowledge graph highlighting with relationship explanations #74 — Video-synchronized graph highlighting (another expanded source type)
- Extraction skill in `.claude/skills/extracting-knowledge-graph/` (needs to support new source formats)