- Python >= 3.10
- Rust toolchain (for building the Samyama Python SDK from source)
- IBM AssetOpsBench clone (for IBM's 139 scenarios)
- OpenAI API key (only if running the GPT-4o baseline or NLQ benchmark)
```bash
# This repo
git clone https://github.com/samyama-ai/assetops-kg.git
cd assetops-kg

# IBM AssetOpsBench (for data + 139 scenarios)
git clone https://github.com/IBM/AssetOpsBench.git ../AssetOpsBench
```

Create and activate a virtual environment, then install the package:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

The `samyama>=0.6.0` dependency builds the Rust-based Python SDK via maturin. This requires a working Rust toolchain (rustup + cargo). If you don't have Rust installed:

```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
```

Verify the installation:

```bash
python -c "from samyama import SamyamaClient; c = SamyamaClient.embedded(); print('OK')"
```

These scenarios test graph-native capabilities: multi-hop traversal, vector similarity, PageRank, cascade analysis, and maintenance optimization.
```bash
# Run all 40 scenarios
python -m benchmark.run_samyama --output results/my_samyama_results.json

# Run a single category
python -m benchmark.run_samyama --category criticality_analysis
```

What happens:
- Creates an embedded Samyama graph database (in-memory)
- Loads synthetic industrial data: 1 Site, 4 Locations, 20 Equipment, 60 Sensors, 15 FailureModes, WorkOrders, Anomalies (781 nodes, 955 edges)
- Runs each scenario by calling graph tools directly via the Samyama Python SDK
- Scores responses against `expected_output_contains` using the 8-dimensional evaluation framework
- Prints summary table and writes JSON results
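As a rough illustration, the keyword-containment dimension of that scoring can be sketched as follows (a simplified stand-in; the real framework in `evaluation/extended_criteria.py` scores eight dimensions, and the field name is taken from the scenario JSON):

```python
def keyword_containment_score(response: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords found in the response (case-insensitive).

    A minimal stand-in for one dimension of the 8-dimensional framework.
    """
    if not expected_keywords:
        return 1.0
    text = response.lower()
    hits = sum(1 for kw in expected_keywords if kw.lower() in text)
    return hits / len(expected_keywords)

# Example: a scenario whose expected_output_contains lists two terms
score = keyword_containment_score(
    "Chiller-3 is the most critical asset by PageRank.",
    ["chiller-3", "pagerank"],
)
print(score)  # 1.0
```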
Expected output:

```
Summary: 40/40 passed (100%), avg score 0.927

Per-category breakdown:
  criticality_analysis   5/5   avg=0.938
  multi_hop_dependency   8/8   avg=0.934
  root_cause_analysis    5/5   avg=0.934
  ...
```
No external services required. The graph database runs embedded in the Python process.
These are IBM's original AssetOpsBench scenarios covering IoT, FMSR, Work Orders, TSFM, and Multi-agent queries.
```bash
# Run all 139 scenarios (uses ../AssetOpsBench as default data dir)
python -m benchmark.run_ibm_scenarios --output results/my_ibm_results.json

# Specify a custom data directory
python -m benchmark.run_ibm_scenarios --data-dir /path/to/AssetOpsBench --output results/my_ibm_results.json

# Run a single scenario type
python -m benchmark.run_ibm_scenarios --category iot
python -m benchmark.run_ibm_scenarios --category fmsr
python -m benchmark.run_ibm_scenarios --category wo
python -m benchmark.run_ibm_scenarios --category tsfm
python -m benchmark.run_ibm_scenarios --category multi
```

What happens:
- Loads IBM scenario JSON files from `AssetOpsBench/src/tmp/assetopsbench/scenarios/`
- Creates an embedded Samyama graph and runs the 8-step IBM ETL pipeline:
  - 11 Chillers, 110 Sensors, 12 Failure Modes from EAMLite/CouchDB/FMSR data
  - Work Orders, Alerts, Anomalies from CSV exports
  - 6,256 unified Events from `event.csv`
- Dispatches each scenario to the appropriate handler (IoT/FMSR/WO/TSFM/Multi)
- Evaluates responses against IBM's `characteristic_form` ground truth using keyword matching
- Prints summary table with per-type breakdown
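The dispatch step can be sketched as a type-to-handler mapping (the handler names and scenario fields below are hypothetical illustrations, not the actual `run_ibm_scenarios` internals):

```python
def handle_iot(scenario: dict) -> str:
    # Placeholder: the real handler queries the graph for sensor data
    return f"IoT answer for {scenario['id']}"

def handle_fmsr(scenario: dict) -> str:
    # Placeholder: the real handler looks up failure modes
    return f"FMSR answer for {scenario['id']}"

HANDLERS = {
    "iot": handle_iot,
    "fmsr": handle_fmsr,
    # "wo", "tsfm", and "multi" handlers omitted for brevity
}

def dispatch(scenario: dict) -> str:
    """Route a scenario to the handler for its type."""
    handler = HANDLERS.get(scenario["type"])
    if handler is None:
        raise ValueError(f"unknown scenario type: {scenario['type']}")
    return handler(scenario)

print(dispatch({"id": "iot_001", "type": "iot"}))  # IoT answer for iot_001
```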
Expected output:

```
Summary: 137/139 passed (99%), avg score 0.889

Per-type breakdown:
  IoT    20/20 passed, avg=0.988
  FMSR   40/40 passed, avg=0.907
  WO     34/36 passed, avg=0.801
  TSFM   23/23 passed, avg=0.920
  Multi  20/20 passed, avg=0.877
```
Required data files (from the AssetOpsBench clone):
- `src/tmp/assetopsbench/scenarios/single_agent/*.json` — scenario definitions
- `src/tmp/assetopsbench/scenarios/multi_agent/*.json` — multi-agent scenarios
- `src/tmp/assetopsbench/sample_data/event.csv` — 6,256 unified events
- `src/servers/fmsr/failure_modes.yaml` — failure mode definitions
- `src/couchdb/sample_data/` — sensor metadata JSON
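A minimal sketch of discovering those scenario files, assuming each JSON file holds either a single scenario object or a list of them (the actual loader may differ):

```python
import json
from pathlib import Path

def load_scenarios(data_dir: str) -> list[dict]:
    """Collect single- and multi-agent scenario definitions from an
    AssetOpsBench checkout, using the directory layout listed above."""
    root = Path(data_dir) / "src/tmp/assetopsbench/scenarios"
    scenarios: list[dict] = []
    for sub in ("single_agent", "multi_agent"):
        for path in sorted((root / sub).glob("*.json")):
            with path.open() as f:
                data = json.load(f)
            # A file may hold one scenario object or a list of them.
            scenarios.extend(data if isinstance(data, list) else [data])
    return scenarios
```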
This runs the same 40 custom scenarios against GPT-4o with no graph access — flat data only.
```bash
# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Run the baseline
python -m benchmark.run_baseline --output results/my_baseline_results.json

# With a different model
python -m benchmark.run_baseline --model gpt-4-turbo --output results/baseline_turbo.json
```

What happens:
- Loads the same 40 scenarios
- Sends each scenario description to GPT-4o with a system prompt stating no graph/vector tools are available
- Scores responses using the same 8-dimensional framework
- Prints comparison summary
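The no-tools framing amounts to building a chat payload like the following (the exact system prompt used by `benchmark/run_baseline.py` may differ; this is an illustrative sketch):

```python
def build_baseline_messages(scenario_description: str, model: str = "gpt-4o") -> dict:
    """Build a chat-completion payload for the no-tools baseline.

    The system prompt makes explicit that no graph or vector tools
    are available, so the model must answer from flat data alone.
    """
    system = (
        "You are an asset-operations assistant. You have NO access to a "
        "knowledge graph, vector search, or any external tools. Answer "
        "only from the flat data provided in the prompt."
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": scenario_description},
        ],
    }

payload = build_baseline_messages("Which chiller has the most open work orders?")
print(payload["messages"][0]["role"])  # system
```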
Note: Results may vary slightly between runs due to LLM non-determinism. Our published baseline (34/40, avg 0.602) was recorded on 2026-03-11.
GPT-4o generates Cypher queries against the same knowledge graph used by the deterministic handlers. Note: IBM's published baseline used GPT-4 (not GPT-4o), so comparing NLQ against it involves both a model difference and a data-layer difference. Use `--model gpt-4` for a true same-model comparison.
```bash
# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Run all 139 scenarios via NLQ
python -m benchmark.run_nlq --output results/my_nlq_results.json

# Run a single category
python -m benchmark.run_nlq --category fmsr

# Use a different provider (if supported)
python -m benchmark.run_nlq --provider anthropic --output results/nlq_claude.json
```

What happens:
- Creates an embedded Samyama graph and runs the IBM ETL pipeline (same as `run_ibm_scenarios`)
- For each scenario, sends the question + graph schema + few-shot examples to GPT-4o
- GPT-4o generates a Cypher query; the runner executes it against the graph
- If execution fails, the error is fed back to GPT-4o for retry (up to 2 retries)
- GPT-4o synthesizes a natural language answer from the query results
- Scores responses against IBM's `characteristic_form` ground truth
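The generate-execute-retry loop above can be sketched with stubbed LLM and graph calls (`generate_cypher` and `execute_cypher` are stand-ins, not the actual runner's API):

```python
def run_nlq_with_retries(question, generate_cypher, execute_cypher, max_retries=2):
    """Generate a Cypher query, execute it, and feed errors back for retry.

    generate_cypher(question, error) and execute_cypher(query) are
    stand-ins for the LLM call and graph execution in benchmark/run_nlq.py.
    """
    error = None
    for _ in range(max_retries + 1):  # initial attempt + up to max_retries retries
        query = generate_cypher(question, error)
        try:
            return execute_cypher(query)
        except Exception as exc:  # execution failed: retry with the error text
            error = str(exc)
    raise RuntimeError(f"query failed after {max_retries} retries: {error}")

# Stub demo: the first generated query fails, the retry succeeds
def gen(question, error):
    return "BAD" if error is None else "MATCH (n) RETURN count(n)"

def execute(query):
    if query == "BAD":
        raise ValueError("syntax error")
    return [{"count": 781}]

print(run_nlq_with_retries("How many nodes?", gen, execute))  # [{'count': 781}]
```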
Expected output:

```
Summary: 115/139 passed (83%), avg score 0.789

Per-type breakdown:
  IoT    17/20 passed, avg=0.742
  FMSR   37/40 passed, avg=0.880
  WO     32/36 passed, avg=0.723
  TSFM   21/23 passed, avg=0.936
  Multi   8/20 passed, avg=0.605
```
Note: Multi stays at 40% because 12/20 Multi scenarios require TSFM pipeline execution (forecasting, anomaly detection) that cannot be expressed as Cypher queries. This is a structural limitation, not a prompt engineering problem.
```bash
# All tests
pytest tests/ -v

# Schema validation (18 tests)
pytest tests/test_schema.py -v

# Scenario JSON validation (30+ tests)
pytest tests/test_scenarios.py -v
```

Tests validate:
- Graph schema file exists and contains all 11 node labels + 16 edge types
- All 40 scenarios have required fields, valid IDs, correct categories
- No duplicate scenario IDs
- Expected tools reference known tool names
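Two of those checks can be sketched like this (the field names are assumptions; the real assertions live in `tests/test_scenarios.py`):

```python
def check_no_duplicate_ids(scenarios: list[dict]) -> None:
    """Fail if any scenario ID appears more than once."""
    ids = [s["id"] for s in scenarios]
    dupes = {i for i in ids if ids.count(i) > 1}
    assert not dupes, f"duplicate scenario IDs: {sorted(dupes)}"

def check_required_fields(scenarios, required=("id", "category", "description")):
    """Fail if any scenario is missing a required field."""
    for s in scenarios:
        missing = [f for f in required if f not in s]
        assert not missing, f"{s.get('id', '?')} missing fields: {missing}"

scenarios = [
    {"id": "crit_001", "category": "criticality_analysis", "description": "..."},
    {"id": "hop_001", "category": "multi_hop_dependency", "description": "..."},
]
check_no_duplicate_ids(scenarios)
check_required_fields(scenarios)
print("OK")
```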
For integrating with Claude or other LLM agents via the Model Context Protocol:
```bash
# Start MCP server (stdio transport)
python -m mcp_server.server
```

The server exposes 9 tools:
- `query_sites`, `query_assets`, `query_sensors` — asset hierarchy queries
- `query_failure_modes`, `find_similar_failures` — failure mode lookup + vector search
- `impact_analysis`, `dependency_chain` — graph traversal
- `criticality_ranking`, `maintenance_clusters` — graph algorithms
```
assetops-kg/
├── schema/
│   └── industrial_kg.cypher     # Graph schema (11 node labels, 16 edge types)
├── etl/
│   ├── loader.py                # Custom 40-scenario ETL (5 steps)
│   ├── ibm_loader.py            # IBM 139-scenario ETL (8 steps)
│   ├── eamlite_loader.py        # EAMLite → Site, Location, Equipment
│   ├── couchdb_loader.py        # CouchDB JSON → Sensor + SensorReading
│   ├── fmsr_loader.py           # YAML → FailureMode + MONITORS edges
│   ├── workorder_loader.py      # CSV → WorkOrder nodes
│   └── embedding_gen.py         # sentence-transformers → vector index
├── benchmark/
│   ├── run_samyama.py           # Custom 40 scenarios (graph-native)
│   ├── run_ibm_scenarios.py     # IBM's original 139 scenarios
│   ├── run_baseline.py          # GPT-4o baseline (requires OPENAI_API_KEY)
│   └── run_nlq.py               # NLQ benchmark — LLM generates Cypher (requires OPENAI_API_KEY)
├── evaluation/
│   ├── extended_criteria.py     # 8-dimensional scoring framework
│   └── runner.py                # Scenario loader + output formatter
├── mcp_server/
│   ├── server.py                # FastMCP server entry point
│   └── tools/                   # 4 tool modules (asset, failure, impact, analytics)
├── scenarios/                   # 40 scenario JSONs (7 categories)
├── results/                     # Benchmark result JSONs (v1-v5)
├── docs/
│   ├── results.md               # Full benchmark analysis
│   ├── methodology.md           # Scoring methodology
│   └── getting-started.md       # This file
└── tests/                       # pytest: schema + scenario validation
```
The Samyama Python SDK is built from Rust source via maturin. Make sure Rust is installed and that you have run `pip install -e ".[dev]"`.
Ensure `event.csv` exists at `<AssetOpsBench>/src/tmp/assetopsbench/sample_data/event.csv`. This is the canonical source for the unified event count (6,256 events).
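To sanity-check the event count yourself, counting data rows in such a CSV export is straightforward (the column names in the sample are made up):

```python
import csv
import io

def count_events(csv_text: str) -> int:
    """Count data rows in an event.csv export, excluding the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row if present
    return sum(1 for _ in reader)

sample = "timestamp,asset,type\n2024-01-01,Chiller-1,alert\n2024-01-02,Chiller-2,anomaly\n"
print(count_events(sample))  # 2
```

Against a real checkout you would pass the contents of `event.csv` and expect 6,256.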
LLM responses are non-deterministic. Scores may vary by ±5% between runs. The published results were recorded with gpt-4o on 2026-03-11.
If maturin fails to build the Rust extension, try:
```bash
pip install maturin
cd ../samyama-graph/sdk/python
maturin develop --release
cd ../../assetops-kg
pip install -e ".[dev]"
```