Extending IBM AssetOpsBench with graph database, vector search, and multi-objective optimization capabilities using Samyama Graph Database.
| Benchmark | GPT-4 (IBM) | NLQ (GPT-4o + graph) | Samyama-KG (deterministic) | Delta vs IBM |
|---|---|---|---|---|
| IBM's 139 scenarios | ~91/139 (65%)* | 115/139 (83%) | 137/139 (99%) | +34pp |
| Avg latency | not reported | 5,874 ms | 63 ms | -- |
| Avg tokens | not reported | 4,616/scenario | 0 | $0 |
| Benchmark | GPT-4o (no graph) | Samyama-KG | Delta |
|---|---|---|---|
| Custom 40 scenarios | 34/40 (85%) | 40/40 (100%) | +15pp |
| Avg latency (custom 40) | 11,259 ms | 110 ms | 103x faster |
*IBM's reported GPT-4 figure.
Same-model comparison (GPT-4 vs GPT-4): IBM's GPT-4 over flat docs scores 65%. Our GPT-4 over graph NLQ scores 82% — a +17pp improvement using the exact same model, proving the gain comes from the data model, not the LLM. The GPT-4 → GPT-4o uplift is only ~1pp (82% → 83%). Deterministic handlers reach 99% with zero LLM calls.
Full analysis: docs/results.md | Scoring methodology: docs/methodology.md | Reproducing results: docs/getting-started.md
IBM's AssetOpsBench benchmarks whether LLM agents can autonomously handle industrial maintenance tasks. Their GPT-4 agents achieve 65% using flat document stores where the LLM must do everything -- intent parsing, tool selection, data reasoning, answer synthesis.
We show that the bottleneck is the data model, not the LLM. Replacing flat storage with a knowledge graph improves results at every level of LLM involvement. The key insight is inverted LLM usage: instead of asking the LLM to reason over raw data (hard, error-prone), ask it to generate a structured query from a schema (narrow, plays to LLM strengths).
IBM's approach: Question → LLM does EVERYTHING → answer
(intent + tool selection + data reasoning + synthesis)
NLQ approach: Question → LLM generates Cypher (sharp problem) → graph executes → answer
(LLM does code generation — its strength)
Handler approach: Question → keyword routing → Cypher query → answer
(no LLM — pre-coded for known patterns)
The graph handles what LLMs are bad at (traversal, counting, relationships). The LLM handles what it's good at (query generation from schema). This separation of concerns is why both the NLQ path (+18pp over IBM) and the deterministic path (+34pp) outperform the flat-document baseline.
11 node labels, 16 edge types, 781 nodes, 955 edges (see schema/industrial_kg.cypher):
Site -[CONTAINS_LOCATION]-> Location -[CONTAINS_EQUIPMENT]-> Equipment -[HAS_SENSOR]-> Sensor
Equipment -[DEPENDS_ON]-> Equipment
Equipment -[SHARES_SYSTEM_WITH]-> Equipment
FailureMode -[MONITORS]-> Equipment
Equipment -[EXPERIENCED]-> FailureMode
WorkOrder -[FOR_EQUIPMENT]-> Equipment
WorkOrder -[ADDRESSES]-> FailureMode
WorkOrder -[USES_PART]-> SparePart -[SUPPLIED_BY]-> Supplier
Anomaly -[TRIGGERED]-> WorkOrder
Event -[FOR_EQUIPMENT]-> Equipment
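The DEPENDS_ON edges are what make multi-hop impact questions a simple traversal. A minimal sketch of the idea behind `impact_analysis`, using a toy adjacency map (the data below is hypothetical; the real graph has 781 nodes and 955 edges and lives in Samyama):

```python
from collections import deque

# Hypothetical reverse-DEPENDS_ON map: if the key fails, the values are
# directly affected. Illustrative data only.
DEPENDENTS = {
    "Chiller 6": ["AHU-1", "AHU-2"],
    "AHU-1": ["VAV-301"],
}

def impact_analysis(failed: str) -> list[str]:
    """Breadth-first walk over reverse DEPENDS_ON edges: everything reachable
    from the failed asset is transitively affected."""
    affected, queue, seen = [], deque([failed]), {failed}
    while queue:
        for nxt in DEPENDENTS.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                affected.append(nxt)
                queue.append(nxt)
    return affected

print(impact_analysis("Chiller 6"))  # → ['AHU-1', 'AHU-2', 'VAV-301']
```

This kind of bounded traversal (counting, reachability) is exactly the work the graph takes off the LLM's hands.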
assetops-kg/
├── schema/ # Graph schema (Cypher CREATE statements)
├── etl/ # ETL pipeline (AssetOpsBench -> Samyama KG)
│ ├── loader.py # Main orchestrator — custom 40 scenarios
│ ├── ibm_loader.py # IBM data ETL — 8-step pipeline for 139 scenarios
│ ├── eamlite_loader.py # EAMLite -> Site, Location, Equipment
│ ├── couchdb_loader.py # CouchDB JSON -> Sensor + SensorReading
│ ├── fmsr_loader.py # YAML -> FailureMode + MONITORS edges
│ └── embedding_gen.py # sentence-transformers -> vector index
├── mcp_server/ # FastMCP server (9 tools)
│ ├── server.py # MCP entry point
│ └── tools/
│ ├── asset_tools.py # query_assets, query_sensors, query_sites
│ ├── failure_tools.py # find_similar_failures, query_failure_modes
│ ├── impact_tools.py # impact_analysis, dependency_chain
│ └── analytics_tools.py # criticality_ranking, maintenance_clusters
├── scenarios/ # 40 new scenario JSONs (7 categories)
├── evaluation/ # 8-dimensional scoring framework
│ ├── extended_criteria.py # 6 original + 2 graph-specific dimensions
│ └── runner.py # Benchmark runner
├── benchmark/ # Benchmark runners
│ ├── run_samyama.py # Custom 40 scenarios
│ ├── run_baseline.py # GPT-4o baseline for custom 40
│ ├── run_ibm_scenarios.py # IBM's original 139 scenarios
│ └── run_nlq.py # NLQ benchmark (GPT-4o generates Cypher)
├── docs/
│ ├── results.md # Full benchmark analysis
│ ├── methodology.md # Scoring and evaluation methodology
│ └── getting-started.md # Setup, reproduction, troubleshooting
├── results/ # Benchmark result JSONs (v1-v5)
└── tests/
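The `mcp_server/tools/` layout groups the 9 tools by concern; each is exposed to agents by name. A stdlib-only stand-in for the registration pattern (the real server uses FastMCP; the registry, decorator, and tool bodies below are illustrative):

```python
# Minimal stand-in for tool registration, standard library only.
# Hypothetical sketch -- the real server registers tools via FastMCP.
TOOLS: dict = {}

def tool(fn):
    """Register a function as an agent-callable tool, keyed by name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def query_assets(site: str) -> list:
    # Real implementation runs a Cypher query against Samyama.
    return [f"equipment at {site}"]

@tool
def dependency_chain(equipment: str) -> list:
    return [f"upstream of {equipment}"]

print(sorted(TOOLS))  # agents discover tools by name
```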
See docs/getting-started.md for full setup instructions, prerequisites, and troubleshooting.
# Clone and install
git clone https://github.com/samyama-ai/assetops-kg.git && cd assetops-kg
git clone https://github.com/IBM/AssetOpsBench.git ../AssetOpsBench
pip install -e ".[dev]"
# Run custom 40 scenarios (100%, avg 0.927)
python -m benchmark.run_samyama --output results/samyama_results.json
# Run IBM's 139 scenarios (99%, avg 0.889)
python -m benchmark.run_ibm_scenarios --data-dir ../AssetOpsBench --output results/ibm_results.json
# Run GPT-4o baseline for comparison (requires OPENAI_API_KEY)
python -m benchmark.run_baseline --output results/baseline_results.json
# Run NLQ benchmark — GPT-4o generates Cypher against the graph (requires OPENAI_API_KEY)
python -m benchmark.run_nlq --output results/nlq_results.json
# Run tests
pytest tests/ -v
# Start MCP server (for agent integration)
python -m mcp_server.server

| Approach | Pass Rate | Avg Score | Avg Latency | Tokens |
|---|---|---|---|---|
| GPT-4 + flat docs (IBM) | ~91/139 (65%) | not reported | not reported | not reported |
| GPT-4 + graph NLQ | 114/139 (82%) | 0.790 | ~5,800 ms | ~4,600/scenario |
| GPT-4o + graph NLQ | 115/139 (83%) | 0.789 | 5,874 ms | 4,616/scenario |
| Deterministic (graph) | 137/139 (99%) | 0.889 | 63 ms | 0 |
| Type | GPT-4 NLQ | GPT-4o NLQ | Deterministic |
|---|---|---|---|
| IoT (20) | 17/20 (85%) | 17/20 (85%) | 20/20 (100%) |
| FMSR (40) | 38/40 (95%) | 37/40 (93%) | 40/40 (100%) |
| TSFM (23) | 22/23 (96%) | 21/23 (91%) | 23/23 (100%) |
| Multi (20) | 8/20 (40%) | 8/20 (40%) | 20/20 (100%) |
| WO (36) | 29/36 (81%) | 32/36 (89%) | 34/36 (94%) |
Same-model comparison: GPT-4 + graph NLQ (82%) vs IBM's GPT-4 + flat docs (65%) = +17pp using the exact same model. The GPT-4 → GPT-4o uplift is only ~1pp, proving the gain is from the data model. Only 2 deterministic failures remain (WO bundling edge cases).
| Category | GPT-4o | Samyama-KG | Delta |
|---|---|---|---|
| Failure similarity | 3/6 (0.501) | 6/6 (0.902) | +0.401 |
| Criticality analysis | 3/5 (0.566) | 5/5 (0.938) | +0.372 |
| Root cause analysis | 5/5 (0.580) | 5/5 (0.934) | +0.354 |
| Multi-hop dependency | 7/8 (0.618) | 8/8 (0.934) | +0.316 |
| Maintenance optimization | 5/5 (0.634) | 5/5 (0.931) | +0.297 |
| Cross-asset correlation | 6/6 (0.638) | 6/6 (0.929) | +0.291 |
| Temporal pattern | 5/5 (0.679) | 5/5 (0.923) | +0.244 |
Largest gains on failure similarity (+0.401) and criticality analysis (+0.372) -- exactly where graph structure and vector search provide the most value.
| Category | Count | Example |
|---|---|---|
| Multi-hop dependency | 8 | "What equipment is affected if Chiller 6 fails?" |
| Cross-asset correlation | 6 | "Are AHU anomalies correlated with chiller temperature drops?" |
| Failure pattern similarity | 6 | "Which pumps had failures similar to Motor 3?" |
| Criticality analysis | 5 | "Rank all equipment by operational criticality" |
| Maintenance optimization | 5 | "Schedule maintenance minimizing downtime + cost" |
| Root cause analysis | 5 | "Trace events leading to WO-2024-0042" |
| Temporal pattern | 5 | "What is MTBF for Chiller 6's compressor?" |
Full details: docs/methodology.md
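As an example of the temporal-pattern category, MTBF is the mean gap between consecutive failures. A sketch with hypothetical timestamps (the real data comes from Event and WorkOrder nodes in the graph):

```python
from datetime import datetime

# Hypothetical failure timestamps for one asset -- illustrative only.
failures = [
    datetime(2024, 1, 10),
    datetime(2024, 3, 2),
    datetime(2024, 5, 21),
]

def mtbf_days(failure_times: list) -> float:
    """Mean time between failures: average gap between consecutive failures."""
    gaps = [(b - a).days for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

print(mtbf_days(failures))  # → 66.0 (gaps of 52 and 80 days)
```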
Single pass, no repeated runs. Each scenario gets one handler call, one response, one score. "Avg score" is the arithmetic mean across all scenarios.
IBM 139 scenarios are scored by keyword matching against the characteristic_form ground truth field. Three paths: strict item matching (deterministic + items), count matching (deterministic + counts), or lenient keyword overlap (non-deterministic, with 1.5x boost). Pass threshold: score >= 0.5.
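The lenient path can be sketched as a boosted keyword-overlap ratio; exact tokenization and boost mechanics in the repo may differ from this illustration:

```python
# Sketch of the lenient scoring path: fraction of ground-truth keywords
# found in the response, boosted 1.5x, capped at 1.0. Illustrative only.
def lenient_score(response: str, truth_keywords: list) -> float:
    resp = response.lower()
    hits = sum(1 for kw in truth_keywords if kw.lower() in resp)
    return min(1.0, 1.5 * hits / len(truth_keywords))

score = lenient_score(
    "Chiller 6 shows bearing wear and refrigerant leak",
    ["bearing wear", "refrigerant leak", "compressor surge"],
)
print(score, score >= 0.5)  # 2 of 3 keywords hit; clears the 0.5 pass threshold
```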
Custom 40 scenarios use 8 weighted dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Correctness | 0.20 | Expected keywords present in response |
| Completeness | 0.15 | Coverage of required information |
| Relevance | 0.10 | Question terms reflected in answer |
| Tool Usage | 0.15 | Correct graph tools invoked |
| Efficiency | 0.05 | Latency and token usage |
| Safety | 0.10 | No unsafe maintenance recommendations |
| Graph Utilization | 0.15 | Evidence of graph traversal, not flat-data reasoning |
| Semantic Precision | 0.10 | Quality of vector similarity matching |
Category-specific weight overrides boost the most relevant dimension (e.g., Semantic Precision → 0.25 for failure similarity scenarios).
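The weighted score with a category override can be sketched as follows; the base weights come from the table above, but the override values and function names are illustrative:

```python
# Base weights from the 8-dimension table above.
BASE_WEIGHTS = {
    "correctness": 0.20, "completeness": 0.15, "relevance": 0.10,
    "tool_usage": 0.15, "efficiency": 0.05, "safety": 0.10,
    "graph_utilization": 0.15, "semantic_precision": 0.10,
}

# Hypothetical override: failure-similarity scenarios emphasize semantic
# precision (0.25, per the text) at the expense of another dimension.
OVERRIDES = {
    "failure_similarity": {"semantic_precision": 0.25, "correctness": 0.05},
}

def weighted_score(dims: dict, category: str) -> float:
    """Weighted mean of per-dimension scores, normalized by total weight."""
    weights = {**BASE_WEIGHTS, **OVERRIDES.get(category, {})}
    total = sum(weights.values())
    return sum(weights[d] * dims.get(d, 0.0) for d in weights) / total

perfect = {d: 1.0 for d in BASE_WEIGHTS}
print(weighted_score(perfect, "failure_similarity"))  # perfect run scores 1.0
```

Normalizing by the total keeps the score in [0, 1] even if an override changes the weight sum.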
- IBM AssetOpsBench -- Original benchmark (141 scenarios, 9 asset classes)
- Samyama Graph Database -- High-performance graph DB with OpenCypher, vector search, optimization
- Industrial KG Demo -- Rust example (871 lines)
Apache 2.0 (same as AssetOpsBench)