Turn your scattered AI coding sessions into a queryable knowledge graph.
Developers use 5+ AI tools every day -- Claude Code, ChatGPT, Cursor, Copilot, Grok, DeepSeek, Warp. Each session is an isolated silo. Knowledge dies when the tab closes.
You have solved the same problem three times across different tools and cannot find any of them. You debugged a Supabase auth flow in Claude Code last Tuesday, discussed the same pattern in ChatGPT a month ago, and asked Grok about JWT refresh tokens somewhere in between. None of these tools talk to each other.
Existing solutions are single-platform and flat-file. They give you search over one tool's history, not structured relationships across all of them. A grep over session logs does not tell you that FastAPI uses Pydantic or that Neo4j is a type of graph database. It just gives you walls of text.
session-graph fixes this.
session-graph extracts structured knowledge triples -- (subject, predicate, object) -- from all your AI coding sessions, links entities to Wikidata for universal disambiguation, and loads everything into a SPARQL-queryable triplestore with full provenance back to the source conversation.
"What technologies have I used across all sessions?" --> SPARQL query --> structured answer
"How does FastAPI relate to Pydantic?" --> FastAPI --uses--> Pydantic
"What sessions discussed authentication?" --> 3 sessions across Claude Code + DeepSeek
The key insight: a knowledge graph without relationships is just a tag cloud. The minimum viable extraction unit is (subject, predicate, object), not [topic1, topic2, topic3].
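To make that concrete, here is the difference in miniature (labels are illustrative, not actual extraction output):

```python
# Flat tags: no structure, nothing to traverse.
tags = ["fastapi", "pydantic", "validation"]

# Knowledge triples: every fact is an edge you can query.
triples = [
    ("FastAPI", "uses", "Pydantic"),
    ("Pydantic", "provides", "data validation"),
]
```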
- Multi-platform: Ingests Claude Code, ChatGPT, DeepSeek, Grok, and Warp into a single unified graph. No other tool does this.
- Formal ontology: Composes 5 W3C/ISO standards (PROV-O, SIOC, SKOS, Dublin Core, Schema.org) instead of inventing a custom schema.
- Wikidata linking: Entities are disambiguated against 100M+ Wikidata items via `owl:sameAs`. "k8s", "kubernetes", and "K8s" all resolve to Q22661306.
- Full provenance: Every knowledge triple traces back to the exact source message, session, platform, and file path.
- Federated queries: SPARQL can query your local graph and Wikidata in a single query.
From real-world usage across 600+ sessions:
| Metric | Value |
|---|---|
| Total triples in Fuseki | 1,334,432 |
| Sessions indexed | 607+ |
| Knowledge triples extracted | 47,868+ |
| Distinct entities | ~8,000+ |
| Wikidata-linked entities | 4,774 (~33%) |
| Curated predicates | 24 (with <1% relatedTo fallback) |
| Platforms supported | 4 (Claude Code, DeepSeek, Grok, Warp) |
| Entity linking precision | 7/7 (agentic ReAct linker) |
| Cost per 600 sessions | ~$0.60 (Vertex AI batch pricing) |
Real data from SPARQL — technologies, concepts, and session provenance linked across multiple Claude Code sessions:
Hub nodes (large blue) are highly connected technologies. Green nodes are concepts/outputs. Purple rectangles are session IDs with dashed provenance edges. The "W" badge indicates entities linked to Wikidata.
```text
Scattered Sources         Adapter Layer               Knowledge Graph
-----------------         -------------               ---------------
Claude Code (.jsonl) --+
DeepSeek (.json zip) --+     triple_extraction.py
Grok (.json zip)     --+---> (LLM extracts s,p,o  ---> Apache Jena Fuseki
Warp (SQLite)        --+      from each assistant      (SPARQL endpoint)
ChatGPT (planned)    --+      message using 24               |
Cursor (planned)     --+      curated predicates)            |
                                    |                        v
                                    v                 SPARQL Queries
                             link_entities.py         (14 local templates
                             (LangGraph ReAct         + 6 Wikidata templates)
                              agent links to                 |
                              Wikidata QIDs)                 v
                                                      Claude Code Skill
                                                      (natural language -> SPARQL)

Real-time Loop (Claude Code):
  Session pause → stop_hook.sh → RabbitMQ → pipeline-runner → Fuseki
  (triple cache: 0 API calls for seen messages)
```
```text
1. SOURCE PARSING (per platform --> RDF Turtle)
   Each parser reads a platform-specific format and produces
   PROV-O + SIOC session structure plus knowledge triples.

2. TRIPLE EXTRACTION (LLM-powered)
   Each assistant message --> LLM --> top 10 (subject, predicate, object) triples
   24 curated predicates | capped at 10 triples/message (prioritizes architecture)
   Closed-world vocabulary (deviations fuzzy-matched) | retry on JSON truncation

3. ENTITY FILTERING (two-level)
   Level 1: is_valid_entity() in triple_extraction.py -- rejects garbage at extraction
   Level 2: is_linkable_entity() in link_entities.py -- pre-filters before Wikidata
   Catches: filenames (*.py), hex colors (#8776f6), CLI flags (--force),
            ICD codes (j458), snake_case identifiers, DOM selectors, etc.
   48 whitelisted short terms bypass filters (ai, api, llm, rdf, sql, etc.)

4. ENTITY LINKING (context-aware, agentic)
   For each entity:
   +-- Normalize via entity_aliases.json (161 mappings: k8s-->kubernetes, etc.)
   +-- Frequency filter: --min-sessions 2 (default) -- only links entities
   |   appearing in 2+ sessions (~77% reduction)
   +-- Check SQLite cache
   +-- If miss --> LangGraph ReAct agent (LLM + Wikidata API tool)
   +-- Confidence threshold 0.7 --> owl:sameAs link
   +-- Entity dedup: same QID --> owl:sameAs between aliases

5. LOAD --> Apache Jena Fuseki (SPARQL endpoint)

6. QUERY --> SPARQL (via Claude Code skill or directly)
```
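Step 3's filters are worth making concrete. A rough sketch of the idea in Python (the real `is_valid_entity()` in `pipeline/triple_extraction.py` has many more rules, plus the full 48-term whitelist):

```python
import re

# Sketch only: a few of the whitelisted short terms and garbage patterns.
WHITELIST = {"ai", "api", "llm", "rdf", "sql"}

GARBAGE_PATTERNS = [
    r"^.*\.\w{1,4}$",        # filenames like script.py
    r"^#[0-9a-fA-F]{3,8}$",  # hex colors like #8776f6
    r"^--[\w-]+$",           # CLI flags like --force
    r"^[a-z]\d{2,3}$",       # ICD-style codes like j458
]

def is_valid_entity(label: str) -> bool:
    label = label.strip().lower()
    if label in WHITELIST:       # short terms like "ai" bypass all filters
        return True
    if len(label) < 3:
        return False
    return not any(re.match(p, label) for p in GARBAGE_PATTERNS)
```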
| Platform | Parser | Format | Status |
|---|---|---|---|
| Claude Code | jsonl_to_rdf.py | JSONL | Production |
| DeepSeek | deepseek_to_rdf.py | JSON zip export | Production |
| Grok | grok_to_rdf.py | JSON (MongoDB export) | Production |
| Warp | warp_to_rdf.py | SQLite | Production |
| ChatGPT | -- | JSON export | Planned |
| Cursor | -- | SQLite / Markdown | Planned |
| VS Code Copilot | -- | JSON | Planned |
All parsers produce the same RDF schema. Entities merge by label across platforms.
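Because every parser emits the same schema, cross-platform merging needs no special logic: parsing multiple Turtle files into one rdflib graph is enough (paths here are illustrative):

```python
from rdflib import Graph

# Same schema + same URIs => entities merge automatically on parse.
g = Graph()
for path in ["output/claude/session.ttl", "output/deepseek/deepseek.ttl"]:
    g.parse(path, format="turtle")

print(f"{len(g)} triples in the merged graph")
```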
```bash
git clone https://github.com/robertoshimizu/session-graph.git
cd session-graph
./setup.sh
```

The setup script checks prerequisites, creates .env with your LLM provider, installs Python dependencies, starts Docker services (Fuseki + RabbitMQ), and runs a smoke test — all interactively.
After setup: http://localhost:3030 (Fuseki SPARQL UI) and http://localhost:15672 (RabbitMQ, devkg/devkg).
Manual setup (without setup.sh):

```bash
# 1. Configure
cp .env.example .env
# Edit .env with your LLM provider API key (see Provider Support below)

# 2. Install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# 3. Create output directories
mkdir -p output/claude output/deepseek output/grok output/warp logs

# 4. Start all services (Fuseki + RabbitMQ + pipeline-runner)
docker compose up -d
# Fuseki SPARQL UI: http://localhost:3030
# RabbitMQ Management UI: http://localhost:15672 (devkg/devkg)

# 5. Process a single session (manual)
python -m pipeline.jsonl_to_rdf path/to/session.jsonl output/claude/session.ttl

# 6. Link entities to Wikidata
PYTHONUNBUFFERED=1 python -m pipeline.link_entities \
    --input output/*.ttl --output output/wikidata_links.ttl

# 7. Load into Fuseki (--auth required for Docker Fuseki)
python -m pipeline.load_fuseki output/*.ttl --auth admin:admin

# 8. Query at http://localhost:3030
```

With Docker Compose running, every Claude Code session is automatically processed:
```text
Claude Code session ends
  → stop_hook.sh publishes to RabbitMQ (~33ms, non-blocking)
  → pipeline-runner container picks up the job
  → Extracts triples, generates .ttl, uploads to Fuseki
  → Failed jobs go to dead-letter queue for inspection
```
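The consumer side of that loop lives in docker/queue_consumer.py; in rough outline it behaves like this sketch (the queue name, payload shape, and `run_pipeline` are assumptions, not the real implementation):

```python
import json
import pika

# Sketch of the RabbitMQ consumer; see docker/queue_consumer.py for the real one.
credentials = pika.PlainCredentials("devkg", "devkg")
conn = pika.BlockingConnection(
    pika.ConnectionParameters("localhost", credentials=credentials)
)
channel = conn.channel()
channel.queue_declare(queue="sessions", durable=True)  # queue name assumed

def on_job(ch, method, properties, body):
    job = json.loads(body)            # e.g. {"path": ".../session.jsonl"} (assumed)
    try:
        run_pipeline(job["path"])     # hypothetical: extract -> .ttl -> Fuseki
        ch.basic_ack(method.delivery_tag)
    except Exception:
        # nack without requeue routes the failed job to the dead-letter queue
        ch.basic_nack(method.delivery_tag, requeue=False)

channel.basic_consume(queue="sessions", on_message_callback=on_job)
channel.start_consuming()
```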
Configure the hook in `~/.claude/settings.json`:

```json
{
  "hooks": {
    "Stop": [{"hooks": [{"type": "command", "command": "/path/to/hooks/stop_hook.sh", "timeout": 5}]}]
  }
}
```

Once automatic processing is running, it only captures new sessions going forward. But you likely have weeks or months of past Claude Code sessions already sitting on disk — and that's where most of the value is.
Claude Code stores every session as a .jsonl file under ~/.claude/projects/. Each project directory contains one file per session. A typical developer accumulates hundreds of sessions over a few months. Bulk processing lets you backfill all of them into the knowledge graph in one shot.
This is optional but highly recommended. The more sessions in the graph, the richer the connections — you'll find patterns and relationships you didn't know existed across your past work.
```bash
source .venv/bin/activate

# Option A: Batch (50% cheaper, parallel via Vertex AI — requires GCP setup)
python -m pipeline.bulk_batch submit --sort newest
python -m pipeline.bulk_batch status --wait --poll-interval 60
python -m pipeline.bulk_batch collect

# Option B: Sequential (simpler, works with any provider)
python -m pipeline.bulk_process --limit 50 --sort newest --skip-linking

# Then link entities to Wikidata (both options)
PYTHONUNBUFFERED=1 python -m pipeline.link_entities \
    --input output/claude/*.ttl --output output/claude/wikidata_links.ttl --workers 8

# Load into Fuseki (--auth required for Docker Fuseki)
python -m pipeline.load_fuseki output/claude/*.ttl --auth admin:admin
```

After the backfill, automatic processing takes over — every future session is indexed as you work, with no manual steps.
Most AI tools let you export your conversation history — DeepSeek and Grok offer JSON/zip downloads, Warp stores sessions in a local SQLite database. session-graph ingests all of them into the same knowledge graph, using the same ontology and entity vocabulary.
This is where it gets interesting: entities are linked across platforms. If you discussed "Kubernetes" in Claude Code, "k8s" in DeepSeek, and "container orchestration" in Grok, they all resolve to the same Wikidata entity and connect in the graph. You can query relationships that span tools you used months apart, on different projects, without remembering where you had each conversation.
```bash
# Export your chat history from each platform, then:
python -m pipeline.deepseek_to_rdf data/deepseek_export.zip output/deepseek/deepseek.ttl
python -m pipeline.grok_to_rdf data/grok_export.zip output/grok/grok.ttl
python -m pipeline.warp_to_rdf output/warp/warp.ttl --min-exchanges 5

# Link entities and load — same as Claude sessions
PYTHONUNBUFFERED=1 python -m pipeline.link_entities \
    --input output/**/*.ttl --output output/wikidata_links.ttl
python -m pipeline.load_fuseki output/**/*.ttl --auth admin:admin
```

Once Fuseki has data, you don't need to write SPARQL by hand. session-graph ships with a Claude Code skill (devkg-sparql) that translates natural language questions into SPARQL queries, runs them against Fuseki, and returns formatted results.
The skill is automatically available when you work inside the session-graph repo. From any project, you can invoke it with /devkg-sparql:
```text
You: /devkg-sparql What technologies have I used the most?
Claude: [runs SPARQL hub detection query → returns top 20 entities by degree]

You: /devkg-sparql How does FastAPI relate to Pydantic?
Claude: FastAPI --uses--> Pydantic (source: session abc123, Jan 15)

You: /devkg-sparql What sessions discussed authentication?
Claude: [returns 3 sessions across Claude Code + DeepSeek with dates and source files]

You: /devkg-sparql What do I know about Kubernetes?
Claude: [runs entity lookup, finds 12 relationships + Wikidata link to Q22661306]
```
The skill includes 14 local query templates (entity lookup, path discovery, hub detection, cross-session overlap, etc.) and 6 Wikidata traversal templates for enriching local entities with external knowledge. It falls back to grep-based session search if Fuseki is unreachable.
To use it from other projects, add the skill path to your Claude Code settings or symlink .claude/skills/devkg-sparql/ into your project.
Most developer tools reach for Neo4j, vector databases, or JSON files. Here is why session-graph uses RDF and SPARQL instead.
session-graph does not invent a custom schema. It composes 5 battle-tested W3C/ISO standards:
| Standard | Role | Maturity |
|---|---|---|
| PROV-O | Provenance: who did what, when, derived from what | W3C Recommendation |
| SIOC | Conversation structure: messages, threads, containers | W3C Member Submission |
| SKOS | Taxonomy: topics, broader/narrower hierarchies | W3C Recommendation |
| Dublin Core | Metadata: dates, titles, creators | ISO 15836 |
| Schema.org | Cherry-pick: SoftwareSourceCode | De facto standard |
This same composition approach was validated by IBM's GRAPH4CODE project at 2 billion triples.
Every entity in the graph can be linked to Wikidata via owl:sameAs. This gives you:
- Universal disambiguation: "k8s", "kubernetes", and "K8s" all resolve to the same Wikidata item.
- Cross-language dedup: "medication" and "medicamento" both map to Q12140.
- External enrichment: Query Wikidata to discover that Neo4j is written in Java, or that fosfomycin is an antibiotic -- knowledge that does not exist in your local sessions.
Apache Jena Fuseki runs as a single JAR file. No JVM tuning required. It handles the full graph here — 1.3M+ triples — without breaking a sweat. Compare this to Neo4j (Docker + plugins + configuration) or a hosted vector database (monthly fees).
SPARQL's SERVICE keyword lets you query your local graph and Wikidata in a single request:

```sparql
# Find what Wikidata knows about entities in your local graph
PREFIX devkg: <http://devkg.local/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX schema: <http://schema.org/>

SELECT ?localLabel ?wikidataDescription WHERE {
  ?entity a devkg:Entity ;
          rdfs:label ?localLabel ;
          owl:sameAs ?wd .
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd schema:description ?wikidataDescription .
    FILTER(LANG(?wikidataDescription) = "en")
  }
}
```

No other query language can do this.
PROV-O gives you provenance for free. Every knowledge triple links back to:
- The exact message it was extracted from (with full text)
- The session it belongs to
- The platform (Claude Code, DeepSeek, Grok, Warp)
- The source file on disk
RDF is a W3C standard. Your data is portable. You can move it to any triplestore (Fuseki, Blazegraph, GraphDB, Stardog, Amazon Neptune) or convert it to Neo4j via n10s. Try doing that with a proprietary vector database.
session-graph supports multiple LLM providers for triple extraction and entity linking:
| Provider | Triple Extraction | Entity Linking | Batch Processing |
|---|---|---|---|
| Google Gemini (Vertex AI) | Yes | Yes | Yes (50% discount) |
| Google Gemini (AI Studio) | Yes | Yes | No |
| OpenAI | Yes | Yes | No |
| Anthropic (Claude) | Yes | Yes | No |
| Ollama (local) | Yes | Yes | No |
Configure your provider in `.env`:

```bash
# Pick one:
PROVIDER=gemini-vertex   # Google Vertex AI (supports batch)
PROVIDER=gemini          # Google AI Studio
PROVIDER=openai          # OpenAI API
PROVIDER=anthropic       # Anthropic API
PROVIDER=ollama          # Local Ollama
```

Example query — hub detection:

```sparql
PREFIX devkg: <http://devkg.local/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label (COUNT(DISTINCT ?triple) AS ?degree) WHERE {
  { ?triple a devkg:KnowledgeTriple ; devkg:tripleSubject ?e .
    ?e rdfs:label ?label . FILTER(LANG(?label) = "") }
  UNION
  { ?triple a devkg:KnowledgeTriple ; devkg:tripleObject ?e .
    ?e rdfs:label ?label . FILTER(LANG(?label) = "") }
}
GROUP BY ?label
ORDER BY DESC(?degree)
LIMIT 20
```

This returns the most connected entities in your graph -- the core technologies and concepts across all your sessions.
Example query — how two entities relate, with provenance:

```sparql
PREFIX devkg: <http://devkg.local/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>

SELECT DISTINCT ?predicate (SUBSTR(?content, 1, 150) AS ?sourceSnippet) WHERE {
  ?triple a devkg:KnowledgeTriple ;
          devkg:tripleSubject ?s ;
          devkg:triplePredicateLabel ?predicate ;
          devkg:tripleObject ?o ;
          devkg:extractedFrom ?msg .
  ?s rdfs:label ?sLabel .
  ?o rdfs:label ?oLabel .
  OPTIONAL { ?msg sioc:content ?content }
  FILTER(
    CONTAINS(LCASE(STR(?sLabel)), "fastapi") &&
    CONTAINS(LCASE(STR(?oLabel)), "pydantic")
  )
}
```

Result: FastAPI --uses--> Pydantic, with a snippet from the source conversation.
Example query — knowledge that spans platforms:

```sparql
PREFIX devkg: <http://devkg.local/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?label (GROUP_CONCAT(DISTINCT ?platform; separator=", ") AS ?platforms)
       (COUNT(DISTINCT ?platform) AS ?platformCount) WHERE {
  ?triple a devkg:KnowledgeTriple ;
          devkg:tripleSubject ?e ;
          devkg:extractedInSession ?session .
  ?session devkg:hasSourcePlatform ?platform .
  ?e rdfs:label ?label .
}
GROUP BY ?label
HAVING(COUNT(DISTINCT ?platform) > 1)
ORDER BY DESC(?platformCount)
```

This reveals knowledge that spans platforms -- things you discussed in both Claude Code and DeepSeek, for example.
Example query — Wikidata-linked entities:

```sparql
PREFIX devkg: <http://devkg.local/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?label ?wikidataURI WHERE {
  ?entity a devkg:Entity ;
          rdfs:label ?label ;
          owl:sameAs ?wikidataURI .
  FILTER(STRSTARTS(STR(?wikidataURI), "http://www.wikidata.org"))
  FILTER(CONTAINS(LCASE(STR(?label)), "kubernetes"))
}
```

The full SPARQL skill includes 14 local query templates and 6 Wikidata traversal templates. See pipeline/sample_queries.sparql for the complete reference.
session-graph composes 5 W3C/ISO standards into a minimal OWL ontology with 24 curated predicates for developer knowledge:

```turtle
@prefix prov:    <http://www.w3.org/ns/prov#> .
@prefix sioc:    <http://rdfs.org/sioc/ns#> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix schema:  <http://schema.org/> .
@prefix devkg:   <http://devkg.local/ontology#> .
# xsd: and ex: prefixes added so this snippet parses as standalone Turtle
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

# A session is both a PROV Activity (provenance) and a SIOC Forum (conversation)
ex:session-001 a prov:Activity, sioc:Forum ;
    dcterms:created "2026-02-13T14:30:00Z"^^xsd:dateTime ;
    dcterms:title "Debugging auth flow" ;
    prov:wasAssociatedWith ex:developer, ex:agent-claude-code .

# A message in that session
ex:message-001 a sioc:Post, prov:Entity ;
    sioc:has_container ex:session-001 ;
    sioc:content "How do I handle JWT refresh?" ;
    prov:wasGeneratedBy ex:session-001 .

# An extracted knowledge triple with full provenance
ex:triple-001 a devkg:KnowledgeTriple ;
    devkg:tripleSubject ex:entity-fastapi ;
    devkg:triplePredicateLabel "uses" ;
    devkg:tripleObject ex:entity-pydantic ;
    devkg:extractedFrom ex:message-042 ;
    devkg:extractedInSession ex:session-001 .
```

Closed-world design: the LLM is constrained to use only these predicates. Any deviation is fuzzy-matched to the closest one (fallback: `relatedTo`, kept under 1%).
| Category | Predicates |
|---|---|
| Dependencies | uses, dependsOn, requires, builtWith |
| Capabilities | enables, provides, solves, produces |
| Structure | isPartOf, hasPart, extends, implements |
| Taxonomy | isTypeOf, broader, narrower |
| Infrastructure | deployedOn, storesIn, queriedWith, configures |
| Relationships | integratesWith, composesWith, alternativeTo, servesAs, relatedTo |
Full ontology: ontology/devkg.ttl
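That fuzzy-matching fallback can be pictured as a difflib closest-match lookup — a sketch of the idea, not the actual normalization code in pipeline/triple_extraction.py:

```python
import difflib

# The 24 curated predicates from the table above.
PREDICATES = [
    "uses", "dependsOn", "requires", "builtWith",
    "enables", "provides", "solves", "produces",
    "isPartOf", "hasPart", "extends", "implements",
    "isTypeOf", "broader", "narrower",
    "deployedOn", "storesIn", "queriedWith", "configures",
    "integratesWith", "composesWith", "alternativeTo", "servesAs", "relatedTo",
]

def normalize_predicate(raw: str) -> str:
    """Map an off-vocabulary predicate to the closest curated one."""
    if raw in PREDICATES:
        return raw
    match = difflib.get_close_matches(raw, PREDICATES, n=1, cutoff=0.6)
    return match[0] if match else "relatedTo"   # last-resort fallback

# e.g. normalize_predicate("depends_on") -> "dependsOn"
```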
```text
session-graph/
+-- ontology/devkg.ttl               # OWL ontology (24 predicates)
+-- pipeline/
|   +-- common.py                    # Shared: namespaces, URI helpers
|   +-- llm_providers.py             # LLM provider abstraction (Gemini, OpenAI, Anthropic, Ollama)
|   +-- triple_extraction.py         # LLM prompt, extraction, normalization
|   +-- jsonl_to_rdf.py              # Claude Code JSONL --> RDF
|   +-- deepseek_to_rdf.py           # DeepSeek JSON --> RDF
|   +-- grok_to_rdf.py               # Grok JSON --> RDF
|   +-- warp_to_rdf.py               # Warp SQLite --> RDF
|   +-- link_entities.py             # Wikidata entity linking (agentic)
|   +-- agentic_linker_langgraph.py  # LangGraph ReAct agent
|   +-- entity_aliases.json          # 161 tech synonym mappings
|   +-- bulk_process.py              # Sequential bulk processor
|   +-- bulk_batch.py                # Vertex AI Batch Prediction
|   +-- snapshot_links.py            # Inspect entity linking progress
|   +-- load_fuseki.py               # Upload .ttl to Fuseki
|   +-- sample_queries.sparql        # 14 SPARQL query templates
|   +-- .entity_cache.db             # SQLite cache for Wikidata links (auto-created)
|   +-- .triple_cache.db             # SQLite cache for extracted triples (auto-created)
+-- docker/
|   +-- queue_consumer.py            # RabbitMQ consumer: dequeues jobs, runs pipeline
+-- hooks/stop_hook.sh               # Post-session hook: publishes to RabbitMQ (~33ms)
+-- Dockerfile.pipeline              # Python 3.12 image with pipeline deps
+-- docker-compose.yml               # fuseki + rabbitmq + pipeline-runner
+-- .claude/skills/devkg-sparql/     # SPARQL skill for Claude Code
+-- tests/test_integration.sh        # 16-point end-to-end integration test
+-- output/                          # Generated .ttl files
+-- requirements.txt
+-- .env.example
+-- LICENSE
```
To add support for a new AI platform, implement a parser that reads the platform's native format and produces an rdflib.Graph with the same schema.
The key contract:
- Create sessions as `devkg:Session` (subclass of `prov:Activity` + `sioc:Forum`)
- Create messages as `devkg:UserMessage` or `devkg:AssistantMessage`
- Call `triple_extraction.extract_triples(text)` on each assistant message
- Use `common.py` helpers for URI generation and namespace management
See any existing parser (e.g., pipeline/jsonl_to_rdf.py) as a template. The shared modules handle all RDF construction, triple extraction, and entity normalization.
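Under those contracts, a new parser's skeleton might look roughly like this (the export format, URI scheme, and everything except `extract_triples` are assumptions — the shared helpers in common.py do the real URI and namespace work):

```python
import json

from rdflib import Graph, Namespace, RDF, URIRef

from pipeline import triple_extraction  # real module; call per the contract above

DEVKG = Namespace("http://devkg.local/ontology#")

def parse_export(path: str) -> Graph:
    """Sketch: convert a hypothetical platform export into the shared schema."""
    g = Graph()
    g.bind("devkg", DEVKG)
    with open(path) as f:
        export = json.load(f)                 # export format is an assumption
    session = URIRef(f"http://devkg.local/session/{export['id']}")
    g.add((session, RDF.type, DEVKG.Session))
    for i, msg in enumerate(export["messages"]):
        muri = URIRef(f"{session}/msg-{i}")
        is_assistant = msg["role"] == "assistant"
        g.add((muri, RDF.type,
               DEVKG.AssistantMessage if is_assistant else DEVKG.UserMessage))
        if is_assistant:
            triples = triple_extraction.extract_triples(msg["text"])
            # ...turn each (s, p, o) into a devkg:KnowledgeTriple
            # (see pipeline/jsonl_to_rdf.py for the real construction)
    return g
```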
| Component | Cost |
|---|---|
| Triple extraction (batch) | ~$0.60 / 600 sessions |
| Triple extraction (real-time) | ~$1.20 / 600 sessions |
| Entity linking | ~$0.10 / 1,000 entities |
| Apache Jena Fuseki | Free (local) |
| Wikidata API | Free (no auth required) |
| Total for 600 sessions | ~$0.70 - $1.30 |
The entire pipeline runs for less than $2 on a typical developer's full session history.
- Assistant-only extraction: Only assistant messages are sent to the LLM for triple extraction. User messages are short prompts with no extractable knowledge.
- Closed-world predicates: The LLM is constrained to 24 predicates. The prompt includes wrong/correct examples to keep the `relatedTo` fallback under 1%.
- Top-10 extraction cap: Extracts at most 10 triples per message, prioritizing architectural decisions and technology choices over trivial details.
- Two-level entity filtering: `is_valid_entity()` at extraction time + `is_linkable_entity()` before Wikidata linking. Rejects ~6% garbage (filenames, hex colors, CLI flags, ICD codes, DOM selectors, version strings). 48 whitelisted short terms bypass all filters.
- Frequency-based linking: `--min-sessions 2` (default) only links entities appearing in 2+ sessions. ~77% of entities are single-session noise, dramatically reducing linking cost.
- Dual storage: Direct edges for fast graph traversal AND reified `KnowledgeTriple` nodes for provenance. Query either depending on your needs.
- Context-aware entity linking: Neighboring KnowledgeTriple relationships are passed as disambiguation context to the ReAct agent. "condition" resolves to the disease (not the programming conditional) when surrounded by medical triples.
- Agentic linker over heuristic: The LangGraph ReAct agent (LLM + Wikidata API tool) achieves 7/7 precision vs ~50% for a keyword heuristic. Resolves abbreviations like k8s, otel, tf.
- Triple extraction cache: SQLite cache (`.triple_cache.db`) keyed by message UUID. The stop hook fires on every Claude Code pause, causing re-processing. The cache ensures each message's LLM extraction only happens once — re-runs rebuild the RDF graph but skip API calls for cached messages.
- Incremental real-time ingestion: Stop hook → RabbitMQ → pipeline-runner → Fuseki. Each session pause triggers automatic extraction and loading. The triple cache makes repeated processing free.
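The cache pattern behind that last pair of decisions, in miniature (a sketch — the actual schema of .triple_cache.db may differ):

```python
import json
import sqlite3

def get_or_extract(db: sqlite3.Connection, message_uuid: str, text: str, extract):
    """Return cached triples for a message, calling the LLM only on a miss."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples (uuid TEXT PRIMARY KEY, payload TEXT)"
    )
    row = db.execute(
        "SELECT payload FROM triples WHERE uuid = ?", (message_uuid,)
    ).fetchone()
    if row:                        # cache hit: zero API calls
        return json.loads(row[0])
    triples = extract(text)        # cache miss: one LLM extraction, ever
    db.execute(
        "INSERT INTO triples VALUES (?, ?)", (message_uuid, json.dumps(triples))
    )
    db.commit()
    return triples
```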
| Problem | Fix |
|---|---|
| Fuseki returns 401 Unauthorized | Docker Fuseki requires auth. Use `--auth admin:admin` with `load_fuseki.py`, or pass `auth=('admin', 'admin')` to the Python functions. |
| RabbitMQ management UI unreachable | Wait 30s after `docker compose up`. Check with `docker compose logs rabbitmq`. Default credentials: devkg/devkg. |
| No sessions to process | `bulk_process.py` looks for `.jsonl` files under `~/.claude/projects/`. Run at least one Claude Code session first. |
| `link_entities.py` output buffered | Use the `PYTHONUNBUFFERED=1` prefix: `PYTHONUNBUFFERED=1 python -m pipeline.link_entities ...` |
| Stop hook not firing | Verify `~/.claude/settings.json` has the hook entry. The path must be absolute. Run `./setup.sh` to install it automatically. |
| `ModuleNotFoundError` | Activate the virtualenv first: `source .venv/bin/activate` |
A knowledge graph without relationships is just a tag cloud. The minimum viable extraction unit is `(subject, predicate, object)`, not `[topic1, topic2, topic3]`.
Put the schema in the prompt, not in post-processing. If you want the LLM to use specific predicates, give it the vocabulary explicitly with examples.
"It loads" does not mean "it answers questions." Always verify with semantic queries, not just structural ones.
- GRAPH4CODE (IBM Research) -- 2B triples, same ontology composition approach
- PROV-O: The PROV Ontology -- W3C Recommendation
- SIOC Core Ontology -- Semantically-Interlinked Online Communities
- SKOS Reference -- Simple Knowledge Organization System
- Apache Jena Fuseki -- SPARQL server
- LangGraph -- Agent orchestration framework
See CONTRIBUTING.md for guidelines on adding parsers, improving extraction, and submitting pull requests.
