Extending IBM AssetOpsBench with graph database, vector search, and multi-objective optimization capabilities using Samyama Graph Database.
| Benchmark | GPT-4 (IBM) | NLQ (GPT-4o + graph) | Samyama-KG (deterministic) | Delta vs IBM |
|---|---|---|---|---|
| IBM's 139 scenarios | ~91/139 (65%)* | 115/139 (83%) | 137/139 (99%) | +34pp |
| Avg latency | not reported | 5,874 ms | 63 ms | -- |
| Avg tokens | not reported | 4,616/scenario | 0 | $0 |
| Benchmark | GPT-4o (no graph) | Samyama-KG | Delta |
|---|---|---|---|
| Custom 40 scenarios | 34/40 (85%) | 40/40 (100%) | +15pp |
| Avg latency (custom 40) | 11,259 ms | 110 ms | 103x faster |
*IBM's reported GPT-4 figure.
Same-model comparison (GPT-4 vs GPT-4): IBM's GPT-4 over flat docs scores 65%. Our GPT-4 over graph NLQ scores 82% — a +17pp improvement using the exact same model, proving the gain comes from the data model, not the LLM. The GPT-4 → GPT-4o uplift is only ~1pp (82% → 83%). Deterministic handlers reach 99% with zero LLM calls.
Full analysis: docs/results.md | Scoring methodology: docs/methodology.md | Reproducing results: docs/getting-started.md
IBM's AssetOpsBench benchmarks whether LLM agents can autonomously handle industrial maintenance tasks. Their GPT-4 agents achieve 65% using flat document stores where the LLM must do everything -- intent parsing, tool selection, data reasoning, answer synthesis.
We show that the bottleneck is the data model, not the LLM. Replacing flat storage with a knowledge graph improves results at every level of LLM involvement. The key insight is inverted LLM usage: instead of asking the LLM to reason over raw data (hard, error-prone), ask it to generate a structured query from a schema (narrow, plays to LLM strengths).
IBM's approach: Question → LLM does EVERYTHING → answer
(intent + tool selection + data reasoning + synthesis)
NLQ approach: Question → LLM generates Cypher (sharp problem) → graph executes → answer
(LLM does code generation — its strength)
Handler approach: Question → keyword routing → Cypher query → answer
(no LLM — pre-coded for known patterns)
The graph handles what LLMs are bad at (traversal, counting, relationships). The LLM handles what it's good at (query generation from schema). This separation of concerns is why both the NLQ path (+18pp over IBM) and the deterministic path (+34pp) outperform the flat-document baseline.
11 node labels, 16 edge types, 781 nodes, 955 edges (see schema/industrial_kg.cypher):
Site -[CONTAINS_LOCATION]-> Location -[CONTAINS_EQUIPMENT]-> Equipment -[HAS_SENSOR]-> Sensor
Equipment -[DEPENDS_ON]-> Equipment
Equipment -[SHARES_SYSTEM_WITH]-> Equipment
FailureMode -[MONITORS]-> Equipment
Equipment -[EXPERIENCED]-> FailureMode
WorkOrder -[FOR_EQUIPMENT]-> Equipment
WorkOrder -[ADDRESSES]-> FailureMode
WorkOrder -[USES_PART]-> SparePart -[SUPPLIED_BY]-> Supplier
Anomaly -[TRIGGERED]-> WorkOrder
Event -[FOR_EQUIPMENT]-> Equipment
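The DEPENDS_ON edges are what make multi-hop impact questions a simple traversal. A minimal sketch of the idea behind `impact_analysis`, using a toy adjacency map (the data below is hypothetical; the real graph has 781 nodes and 955 edges and lives in Samyama):

```python
from collections import deque

# Hypothetical reverse-DEPENDS_ON map: if the key fails, the values are
# directly affected. Illustrative data only.
DEPENDENTS = {
    "Chiller 6": ["AHU-1", "AHU-2"],
    "AHU-1": ["VAV-301"],
}

def impact_analysis(failed: str) -> list[str]:
    """Breadth-first walk over reverse DEPENDS_ON edges: everything reachable
    from the failed asset is transitively affected."""
    affected, queue, seen = [], deque([failed]), {failed}
    while queue:
        for nxt in DEPENDENTS.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    return affected

print(impact_analysis("Chiller 6"))  # → ['AHU-1', 'AHU-2', 'VAV-301']
```

This kind of bounded traversal (counting, reachability) is exactly the work the graph takes off the LLM's hands.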
assetops-kg/
├── schema/ # Graph schema (Cypher CREATE statements)
├── etl/ # ETL pipeline (AssetOpsBench -> Samyama KG)
│ ├── loader.py # Main orchestrator — custom 40 scenarios
│ ├── ibm_loader.py # IBM data ETL — 8-step pipeline for 139 scenarios
│ ├── eamlite_loader.py # EAMLite -> Site, Location, Equipment
│ ├── couchdb_loader.py # CouchDB JSON -> Sensor + SensorReading
│ ├── fmsr_loader.py # YAML -> FailureMode + MONITORS edges
│ └── embedding_gen.py # sentence-transformers -> vector index
├── mcp_server/ # FastMCP server (9 tools)
│ ├── server.py # MCP entry point
│ └── tools/
│ ├── asset_tools.py # query_assets, query_sensors, query_sites
│ ├── failure_tools.py # find_similar_failures, query_failure_modes
│ ├── impact_tools.py # impact_analysis, dependency_chain
│ └── analytics_tools.py # criticality_ranking, maintenance_clusters
├── scenarios/ # 40 new scenario JSONs (7 categories)
├── evaluation/ # 8-dimensional scoring framework
│ ├── extended_criteria.py # 6 original + 2 graph-specific dimensions
│ └── runner.py # Benchmark runner
├── benchmark/ # Benchmark runners
│ ├── run_samyama.py # Custom 40 scenarios
│ ├── run_baseline.py # GPT-4o baseline for custom 40
│ ├── run_ibm_scenarios.py # IBM's original 139 scenarios
│ └── run_nlq.py # NLQ benchmark (GPT-4o generates Cypher)
├── docs/
│ ├── results.md # Full benchmark analysis
│ ├── methodology.md # Scoring and evaluation methodology
│ └── getting-started.md # Setup, reproduction, troubleshooting
├── results/ # Benchmark result JSONs (v1-v5)
└── tests/
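The `mcp_server/tools/` layout groups the 9 tools by concern; each is exposed to agents by name. A stdlib-only stand-in for the registration pattern (the real server uses FastMCP; the registry, decorator, and tool bodies below are illustrative):

```python
# Minimal stand-in for tool registration, standard library only.
# Hypothetical sketch -- the real server registers tools via FastMCP.
TOOLS: dict = {}

def tool(fn):
    """Register a function as an agent-callable tool, keyed by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def query_assets(site: str) -> list:
    # Real implementation runs a Cypher query against Samyama.
    return [f"equipment at {site}"]

@tool
def dependency_chain(equipment: str) -> list:
    return [f"upstream of {equipment}"]

print(sorted(TOOLS))  # agents discover tools by name
```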
See docs/getting-started.md for full setup instructions, prerequisites, and troubleshooting.
# Clone and install
git clone https://github.com/samyama-ai/assetops-kg.git && cd assetops-kg
git clone https://github.com/IBM/AssetOpsBench.git ../AssetOpsBench
pip install -e ".[dev]"
# Run custom 40 scenarios (100%, avg 0.927)
python -m benchmark.run_samyama --output results/samyama_results.json
# Run IBM's 139 scenarios (99%, avg 0.889)
python -m benchmark.run_ibm_scenarios --data-dir ../AssetOpsBench --output results/ibm_results.json
# Run GPT-4o baseline for comparison (requires OPENAI_API_KEY)
python -m benchmark.run_baseline --output results/baseline_results.json
# Run NLQ benchmark — GPT-4o generates Cypher against the graph (requires OPENAI_API_KEY)
python -m benchmark.run_nlq --output results/nlq_results.json
# Run tests
pytest tests/ -v
# Start MCP server (for agent integration)
python -m mcp_server.server

| Approach | Pass Rate | Avg Score | Avg Latency | Tokens |
|---|---|---|---|---|
| GPT-4 + flat docs (IBM) | ~91/139 (65%) | not reported | not reported | not reported |
| GPT-4 + graph NLQ | 114/139 (82%) | 0.790 | ~5,800 ms | ~4,600/scenario |
| GPT-4o + graph NLQ | 115/139 (83%) | 0.789 | 5,874 ms | 4,616/scenario |
| Deterministic (graph) | 137/139 (99%) | 0.889 | 63 ms | 0 |
| Type | GPT-4 NLQ | GPT-4o NLQ | Deterministic |
|---|---|---|---|
| IoT (20) | 17/20 (85%) | 17/20 (85%) | 20/20 (100%) |
| FMSR (40) | 38/40 (95%) | 37/40 (93%) | 40/40 (100%) |
| TSFM (23) | 22/23 (96%) | 21/23 (91%) | 23/23 (100%) |
| Multi (20) | 8/20 (40%) | 8/20 (40%) | 20/20 (100%) |
| WO (36) | 29/36 (81%) | 32/36 (89%) | 34/36 (94%) |
Same-model comparison: GPT-4 + graph NLQ (82%) vs IBM's GPT-4 + flat docs (65%) = +17pp using the exact same model. The GPT-4 → GPT-4o uplift is only ~1pp, proving the gain is from the data model. Only 2 deterministic failures remain (WO bundling edge cases).
| Category | GPT-4o | Samyama-KG | Delta |
|---|---|---|---|
| Failure similarity | 3/6 (0.501) | 6/6 (0.902) | +0.401 |
| Criticality analysis | 3/5 (0.566) | 5/5 (0.938) | +0.372 |
| Root cause analysis | 5/5 (0.580) | 5/5 (0.934) | +0.354 |
| Multi-hop dependency | 7/8 (0.618) | 8/8 (0.934) | +0.316 |
| Maintenance optimization | 5/5 (0.634) | 5/5 (0.931) | +0.297 |
| Cross-asset correlation | 6/6 (0.638) | 6/6 (0.929) | +0.291 |
| Temporal pattern | 5/5 (0.679) | 5/5 (0.923) | +0.244 |
Largest gains on failure similarity (+0.401) and criticality analysis (+0.372) -- exactly where graph structure and vector search provide the most value.
| Category | Count | Example |
|---|---|---|
| Multi-hop dependency | 8 | "What equipment is affected if Chiller 6 fails?" |
| Cross-asset correlation | 6 | "Are AHU anomalies correlated with chiller temperature drops?" |
| Failure pattern similarity | 6 | "Which pumps had failures similar to Motor 3?" |
| Criticality analysis | 5 | "Rank all equipment by operational criticality" |
| Maintenance optimization | 5 | "Schedule maintenance minimizing downtime + cost" |
| Root cause analysis | 5 | "Trace events leading to WO-2024-0042" |
| Temporal pattern | 5 | "What is MTBF for Chiller 6's compressor?" |
Full details: docs/methodology.md
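As an example of the temporal-pattern category, MTBF is the mean gap between consecutive failures. A sketch with hypothetical timestamps (the real data comes from Event and WorkOrder nodes in the graph):

```python
from datetime import datetime

# Hypothetical failure timestamps for one asset -- illustrative only.
failures = [
    datetime(2024, 1, 10),
    datetime(2024, 3, 2),
    datetime(2024, 5, 21),
]

def mtbf_days(failure_times: list) -> float:
    """Mean time between failures: average gap between consecutive failures."""
    gaps = [(b - a).days for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

print(mtbf_days(failures))  # → 66.0 (gaps of 52 and 80 days)
```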
Single pass, no repeated runs. Each scenario gets one handler call, one response, one score. "Avg score" is the arithmetic mean across all scenarios.
IBM 139 scenarios are scored by keyword matching against the characteristic_form ground truth field. Three paths: strict item matching (deterministic + items), count matching (deterministic + counts), or lenient keyword overlap (non-deterministic, with 1.5x boost). Pass threshold: score >= 0.5.
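The lenient path can be sketched as a boosted keyword-overlap ratio; exact tokenization and boost mechanics in the repo may differ from this illustration:

```python
# Sketch of the lenient scoring path: fraction of ground-truth keywords
# found in the response, boosted 1.5x, capped at 1.0. Illustrative only.
def lenient_score(response: str, truth_keywords: list) -> float:
    resp = response.lower()
    hits = sum(1 for kw in truth_keywords if kw.lower() in resp)
    return min(1.0, 1.5 * hits / len(truth_keywords))

score = lenient_score(
    "Chiller 6 shows bearing wear and refrigerant leak",
    ["bearing wear", "refrigerant leak", "compressor surge"],
)
print(score, score >= 0.5)  # 2 of 3 keywords hit; clears the 0.5 pass threshold
```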
Custom 40 scenarios use 8 weighted dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.20 | Expected keywords present in response |
| Completeness | 0.15 | Coverage of required information |
| Relevance | 0.10 | Question terms reflected in answer |
| Tool Usage | 0.15 | Correct graph tools invoked |
| Efficiency | 0.05 | Latency and token usage |
| Safety | 0.10 | No unsafe maintenance recommendations |
| Graph Utilization | 0.15 | Evidence of graph traversal, not flat-data reasoning |
| Semantic Precision | 0.10 | Quality of vector similarity matching |
Category-specific weight overrides boost the most relevant dimension (e.g., Semantic Precision → 0.25 for failure similarity scenarios).
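The weighted score with a category override can be sketched as follows; the base weights come from the table above, but the override values and function names are illustrative:

```python
# Base weights from the 8-dimension table above.
BASE_WEIGHTS = {
    "correctness": 0.20, "completeness": 0.15, "relevance": 0.10,
    "tool_usage": 0.15, "efficiency": 0.05, "safety": 0.10,
    "graph_utilization": 0.15, "semantic_precision": 0.10,
}

# Hypothetical override: failure-similarity scenarios emphasize semantic
# precision (0.25, per the text) at the expense of another dimension.
OVERRIDES = {
    "failure_similarity": {"semantic_precision": 0.25, "correctness": 0.05},
}

def weighted_score(dims: dict, category: str) -> float:
    """Weighted mean of per-dimension scores, normalized by total weight."""
    weights = {**BASE_WEIGHTS, **OVERRIDES.get(category, {})}
    total = sum(weights.values())
    return sum(weights[d] * dims.get(d, 0.0) for d in weights) / total

perfect = {d: 1.0 for d in BASE_WEIGHTS}
print(weighted_score(perfect, "failure_similarity"))  # perfect run scores 1.0
```

Normalizing by the total keeps the score in [0, 1] even if an override changes the weight sum.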
- IBM AssetOpsBench -- Original benchmark (141 scenarios, 9 asset classes)
- Samyama Graph Database -- High-performance graph DB with OpenCypher, vector search, optimization
- Industrial KG Demo -- Rust example (871 lines)
Apache 2.0 (same as AssetOpsBench)