Getting Started — Reproducing Results

Prerequisites

Python >= 3.10
Rust toolchain (for building the Samyama Python SDK from source)
IBM AssetOpsBench clone (for IBM's 139 scenarios)
OpenAI API key (only if running the GPT-4o baseline or NLQ benchmark)

Setup

1. Clone the repositories

# This repo
git clone https://github.com/samyama-ai/assetops-kg.git
cd assetops-kg

# IBM AssetOpsBench (for data + 139 scenarios)
git clone https://github.com/IBM/AssetOpsBench.git ../AssetOpsBench

2. Create a virtual environment and install dependencies

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

The samyama>=0.6.0 dependency builds the Rust-based Python SDK via maturin during installation, which requires a working Rust toolchain (rustup + cargo). If you don't have Rust installed, install it with rustup:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

3. Verify installation

python -c "from samyama import SamyamaClient; c = SamyamaClient.embedded(); print('OK')"
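Before installing, it can help to confirm the prerequisites are in place. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only checks the Python version and whether `cargo` and `rustc` are on the PATH.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # from the Prerequisites list


def check_prerequisites(version_info=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    version = tuple((version_info or sys.version_info)[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} found, need >= 3.10")
    # maturin needs the Rust toolchain to compile the SDK
    for tool in ("cargo", "rustc"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH (install Rust via rustup)")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("\n".join(issues) if issues else "Prerequisites look OK")
```

The `version_info` and `which` parameters exist only so the check can be exercised without changing the real environment.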

Reproducing the Custom 40 Scenarios (Samyama-KG: 100%, avg 0.927)

These scenarios test graph-native capabilities: multi-hop traversal, vector similarity, PageRank, cascade analysis, and maintenance optimization.

# Run all 40 scenarios
python -m benchmark.run_samyama --output results/my_samyama_results.json

# Run a single category
python -m benchmark.run_samyama --category criticality_analysis

What happens:

Creates an embedded Samyama graph database (in-memory)
Loads synthetic industrial data: 1 Site, 4 Locations, 20 Equipment, 60 Sensors, 15 FailureModes, WorkOrders, Anomalies (781 nodes, 955 edges)
Runs each scenario by calling graph tools directly via the Samyama Python SDK
Scores responses against expected_output_contains using the 8-dimensional evaluation framework
Prints summary table and writes JSON results
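To make the scoring step concrete, here is a minimal sketch of a substring check against `expected_output_contains`. It is an illustration only: the function `contains_score` is hypothetical, case-insensitive matching is an assumption, and the repository's 8-dimensional framework in evaluation/extended_criteria.py is not reproduced here.

```python
def contains_score(response: str, expected_output_contains: list[str]) -> float:
    """Fraction of expected substrings found in the response (case-insensitive)."""
    if not expected_output_contains:
        return 1.0  # nothing to check, so nothing can be missing
    text = response.lower()
    hits = sum(1 for needle in expected_output_contains if needle.lower() in text)
    return hits / len(expected_output_contains)


if __name__ == "__main__":
    # Two of the three expected substrings appear in the response
    print(contains_score("Pump P-101 feeds Chiller 3", ["pump", "chiller", "boiler"]))
```

A real scorer would likely combine a check like this with other dimensions rather than rely on substring hits alone.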

Expected output:

Summary: 40/40 passed (100%), avg score 0.927

Per-category breakdown:
  criticality_analysis      5/5   avg=0.938
  multi_hop_dependency      8/8   avg=0.934
  root_cause_analysis       5/5   avg=0.934
  ...

No external services required. The graph database runs embedded in the Python process.

Reproducing IBM's 139 Scenarios (Samyama-KG: 99%, avg 0.889)

These are IBM's original AssetOpsBench scenarios covering IoT, FMSR, Work Orders, TSFM, and Multi-agent queries.

# Run all 139 scenarios (uses ../AssetOpsBench as default data dir)
python -m benchmark.run_ibm_scenarios --output results/my_ibm_results.json

# Specify a custom data directory
python -m benchmark.run_ibm_scenarios --data-dir /path/to/AssetOpsBench --output results/my_ibm_results.json

# Run a single scenario type
python -m benchmark.run_ibm_scenarios --category iot
python -m benchmark.run_ibm_scenarios --category fmsr
python -m benchmark.run_ibm_scenarios --category wo
python -m benchmark.run_ibm_scenarios --category tsfm
python -m benchmark.run_ibm_scenarios --category multi

What happens:

Loads IBM scenario JSON files from AssetOpsBench/src/tmp/assetopsbench/scenarios/
Creates an embedded Samyama graph and runs the 8-step IBM ETL pipeline:
- 11 Chillers, 110 Sensors, 12 Failure Modes from EAMLite/CouchDB/FMSR data
- Work Orders, Alerts, Anomalies from CSV exports
- 6,256 unified Events from event.csv
Dispatches each scenario to the appropriate handler (IoT/FMSR/WO/TSFM/Multi)
Evaluates responses against IBM's characteristic_form ground truth using keyword matching
Prints summary table with per-type breakdown
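The dispatch step above can be pictured as a lookup from scenario type to handler. This is a sketch under stated assumptions, not the runner's actual code: the scenario keys `type` and `text`, the `HANDLERS` table, and the stand-in handlers are all hypothetical.

```python
from typing import Callable, Dict


def make_handler(kind: str) -> Callable[[dict], str]:
    """Build a stand-in handler; a real handler would query the graph."""
    def handler(scenario: dict) -> str:
        return f"[{kind}] {scenario['text']}"
    return handler


# One handler per scenario type named in the text (IoT/FMSR/WO/TSFM/Multi)
HANDLERS: Dict[str, Callable[[dict], str]] = {
    kind: make_handler(kind) for kind in ("iot", "fmsr", "wo", "tsfm", "multi")
}


def dispatch(scenario: dict) -> str:
    """Route a scenario to the handler registered for its type."""
    kind = scenario["type"].lower()
    if kind not in HANDLERS:
        raise ValueError(f"unknown scenario type: {kind!r}")
    return HANDLERS[kind](scenario)


if __name__ == "__main__":
    print(dispatch({"type": "IoT", "text": "List sensors for Chiller 6"}))
```

Failing loudly on an unknown type keeps a typo in a scenario file from being silently scored as a wrong answer.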

Expected output:

Summary: 137/139 passed (99%), avg score 0.889

Per-type breakdown:
  IoT      20/20 passed, avg=0.988
  FMSR     40/40 passed, avg=0.907
  WO       34/36 passed, avg=0.801
  TSFM     23/23 passed, avg=0.920
  Multi    20/20 passed, avg=0.877
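
As a sanity check on your own run, the summary line can be recomputed from the per-type rows. The sketch below uses the published numbers and assumes each per-type average is taken over all scenarios of that type, so the overall average is a count-weighted mean; the helper name `summarize` is hypothetical.

```python
# (passed, total, average score) per type, copied from the expected output
PER_TYPE = {
    "IoT": (20, 20, 0.988),
    "FMSR": (40, 40, 0.907),
    "WO": (34, 36, 0.801),
    "TSFM": (23, 23, 0.920),
    "Multi": (20, 20, 0.877),
}


def summarize(per_type):
    """Return (passed, total, count-weighted average score)."""
    passed = sum(p for p, _, _ in per_type.values())
    total = sum(t for _, t, _ in per_type.values())
    # Weight each type's average by its number of scenarios
    avg = sum(t * a for _, t, a in per_type.values()) / total
    return passed, total, avg


passed, total, avg = summarize(PER_TYPE)
print(f"{passed}/{total} passed ({passed / total:.0%}), avg score {avg:.3f}")
# prints: 137/139 passed (99%), avg score 0.889
```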

Required data files (from the AssetOpsBench clone):

src/tmp/assetopsbench/scenarios/single_agent/*.json — scenario definitions
src/tmp/assetopsbench/scenarios/multi_agent/*.json — multi-agent scenarios
src/tmp/assetopsbench/sample_data/event.csv — 6,256 unified events
src/servers/fmsr/failure_modes.yaml — failure mode definitions
src/couchdb/sample_data/ — sensor metadata JSON
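
A quick way to confirm the clone contains these files is to glob for each path before starting a run. This is a minimal sketch: the paths are the ones listed above, while the helper name `missing_data_files` is hypothetical and not part of the benchmark runner.

```python
from pathlib import Path

# Paths relative to the AssetOpsBench clone, taken from the list above
REQUIRED = [
    "src/tmp/assetopsbench/scenarios/single_agent/*.json",
    "src/tmp/assetopsbench/scenarios/multi_agent/*.json",
    "src/tmp/assetopsbench/sample_data/event.csv",
    "src/servers/fmsr/failure_modes.yaml",
    "src/couchdb/sample_data",
]


def missing_data_files(data_dir):
    """Return the required patterns that match nothing under data_dir."""
    root = Path(data_dir)
    # A pattern counts as present if it matches at least one path
    return [pattern for pattern in REQUIRED if not any(root.glob(pattern))]


if __name__ == "__main__":
    missing = missing_data_files("../AssetOpsBench")
    print("All required data found" if not missing else f"Missing: {missing}")
```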

Reproducing the GPT-4o Baseline (85%, avg 0.602)

This runs the same 40 custom scenarios against GPT-4o with no graph access — flat data only.

# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Run the baseline
python -m benchmark.run_baseline --output results/my_baseline_results.json

# With a different model
python -m benchmark.run_baseline --model gpt-4-turbo --output results/baseline_turbo.json

What happens:

Loads the same 40 scenarios
Sends each scenario description to GPT-4o with a system prompt stating no graph/vector tools are available
Scores responses using the same 8-dimensional framework
Prints comparison summary

Note: Results may vary slightly between runs due to LLM non-determinism. Our published baseline (34/40, avg 0.602) was recorded on 2026-03-11.
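
Because the baseline makes one API call per scenario, it is worth failing fast when the key is missing. The following sketch shows one way to do that; `require_api_key` is a hypothetical helper, and whether the runner performs a similar check is not stated here.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a clear error."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the baseline")
    return key
```

Calling this once at startup avoids discovering a missing key partway through the 40 scenarios.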

Reproducing the NLQ Benchmark (83%, avg 0.789)

GPT-4o generates Cypher queries against the same knowledge graph used by the deterministic handlers. Note: IBM used GPT-4 (not GPT-4o), so comparisons between NLQ and IBM's baseline involve both a model difference and a data layer difference. Use --model gpt-4 for a true same-model comparison.

# Set your OpenAI API key
export OPENAI_API_KEY="sk-..."

# Run all 139 scenarios via NLQ
python -m benchmark.run_nlq --output results/my_nlq_results.json

# Run a single category
python -m benchmark.run_nlq --category fmsr

# Use a different provider (if supported)
python -m benchmark.run_nlq --provider anthropic --output results/nlq_claude.json

What happens:

Creates an embedded Samyama graph and runs the IBM ETL pipeline (same as run_ibm_scenarios)
For each scenario, sends the question + graph schema + few-shot examples to GPT-4o
GPT-4o generates a Cypher query; the runner executes it against the graph
If execution fails, the error is fed back to GPT-4o for retry (up to 2 retries)
GPT-4o synthesizes a natural language answer from the query results
Scores responses against IBM's characteristic_form ground truth
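
The generate, execute, and retry loop described above can be sketched as follows. This is an illustration under assumptions, not the code in benchmark/run_nlq.py: `run_with_retries`, `fake_generate`, and `fake_execute` are hypothetical names, the LLM and the graph are replaced by stand-in callables, and the final answer-synthesis step is omitted.

```python
MAX_RETRIES = 2  # from the description: up to 2 retries after the first attempt


def run_with_retries(question, generate, execute, max_retries=MAX_RETRIES):
    """generate(question, last_error) -> cypher; execute(cypher) -> rows.

    Returns (rows, number_of_retries_used).
    """
    last_error = None
    for attempt in range(1 + max_retries):
        cypher = generate(question, last_error)
        try:
            return execute(cypher), attempt
        except Exception as exc:  # any execution failure triggers a retry
            last_error = str(exc)  # fed back into the next generate() call
    raise RuntimeError(f"query failed after {1 + max_retries} attempts: {last_error}")


# Stand-in callables so the sketch runs without an LLM or a database
def fake_generate(question, last_error):
    # First attempt has a syntax error; the retry "fixes" it
    return "MATCH (n RETURN n" if last_error is None else "MATCH (n) RETURN n"


def fake_execute(cypher):
    if cypher.count("(") != cypher.count(")"):
        raise ValueError("syntax error: unbalanced parentheses")
    return [{"n": 1}]


if __name__ == "__main__":
    rows, retries = run_with_retries("How many chillers?", fake_generate, fake_execute)
    print(rows, "retries used:", retries)
```

Passing the previous error text into `generate` is what lets the model correct its own query rather than repeat it.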

Expected output:

Summary: 115/139 passed (83%), avg score 0.789

Per-type breakdown:
  IoT      17/20 passed, avg=0.742
  FMSR     37/40 passed, avg=0.880
  WO       32/36 passed, avg=0.723
  TSFM     21/23 passed, avg=0.936
  Multi     8/20 passed, avg=0.605

Note: Multi passes only 8/20 (40%) because the other 12 Multi scenarios require TSFM pipeline execution (forecasting, anomaly detection) that cannot be expressed as Cypher queries. This is a structural limitation, not a prompt engineering problem.

Running Tests

# All tests
pytest tests/ -v

# Schema validation (18 tests)
pytest tests/test_schema.py -v

# Scenario JSON validation (30+ tests)
pytest tests/test_scenarios.py -v

Tests validate:

Graph schema file exists and contains all 11 node labels + 16 edge types
All 40 scenarios have required fields, valid IDs, correct categories
No duplicate scenario IDs
Expected tools reference known tool names
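
For a sense of what these checks look like, here is a minimal sketch of a required-field and duplicate-ID validator. It is not the code in tests/test_scenarios.py: the function name and the field names `id` and `category` are assumptions, and only `expected_output_contains` is taken from the text above.

```python
from collections import Counter

# "id" and "category" are assumed field names; adjust to the real scenario schema
REQUIRED_FIELDS = ("id", "category", "expected_output_contains")


def validate_scenarios(scenarios):
    """Return a list of error strings; an empty list means all checks passed."""
    errors = []
    for index, scenario in enumerate(scenarios):
        missing = [field for field in REQUIRED_FIELDS if field not in scenario]
        if missing:
            errors.append(f"scenario #{index}: missing fields {missing}")
    # Count IDs across all scenarios to detect duplicates
    counts = Counter(s["id"] for s in scenarios if "id" in s)
    errors += [f"duplicate id: {sid}" for sid, n in counts.items() if n > 1]
    return errors
```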

Starting the MCP Server

To integrate with Claude or other LLM agents via the Model Context Protocol (MCP), start the server:

# Start MCP server (stdio transport)
python -m mcp_server.server

The server exposes 9 tools:

query_sites, query_assets, query_sensors — asset hierarchy queries
query_failure_modes, find_similar_failures — failure mode lookup + vector search
impact_analysis, dependency_chain — graph traversal
criticality_ranking, maintenance_clusters — graph algorithms
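
MCP clients that launch stdio servers usually need a command and argument list. The sketch below generates a config fragment in the `mcpServers` shape used by Claude Desktop. It rests on several assumptions: the server name `assetops-kg` is hypothetical, the interpreter should be the one from your virtual environment, and `PYTHONPATH` is set to the repo root so that `-m mcp_server.server` resolves. Check your client's documentation for the exact file location and supported keys.

```python
import json
import sys


def mcp_client_config(repo_dir, python=sys.executable, name="assetops-kg"):
    """Build an mcpServers-style config entry for the stdio server."""
    return {
        "mcpServers": {
            name: {
                "command": python,  # ideally the .venv interpreter
                "args": ["-m", "mcp_server.server"],
                # Assumption: the module is importable from the repo root
                "env": {"PYTHONPATH": str(repo_dir)},
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(mcp_client_config("/path/to/assetops-kg"), indent=2))
```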

Project Structure

assetops-kg/
├── schema/
│   └── industrial_kg.cypher      # Graph schema (11 node labels, 16 edge types)
├── etl/
│   ├── loader.py                 # Custom 40-scenario ETL (5 steps)
│   ├── ibm_loader.py             # IBM 139-scenario ETL (8 steps)
│   ├── eamlite_loader.py         # EAMLite → Site, Location, Equipment
│   ├── couchdb_loader.py         # CouchDB JSON → Sensor + SensorReading
│   ├── fmsr_loader.py            # YAML → FailureMode + MONITORS edges
│   ├── workorder_loader.py       # CSV → WorkOrder nodes
│   └── embedding_gen.py          # sentence-transformers → vector index
├── benchmark/
│   ├── run_samyama.py            # Custom 40 scenarios (graph-native)
│   ├── run_ibm_scenarios.py      # IBM's original 139 scenarios
│   ├── run_baseline.py           # GPT-4o baseline (requires OPENAI_API_KEY)
│   └── run_nlq.py                # NLQ benchmark — LLM generates Cypher (requires OPENAI_API_KEY)
├── evaluation/
│   ├── extended_criteria.py      # 8-dimensional scoring framework
│   └── runner.py                 # Scenario loader + output formatter
├── mcp_server/
│   ├── server.py                 # FastMCP server entry point
│   └── tools/                    # 4 tool modules (asset, failure, impact, analytics)
├── scenarios/                    # 40 scenario JSONs (7 categories)
├── results/                      # Benchmark result JSONs (v1-v5)
├── docs/
│   ├── results.md                # Full benchmark analysis
│   ├── methodology.md            # Scoring methodology
│   └── getting-started.md        # This file
└── tests/                        # pytest: schema + scenario validation

Troubleshooting

`ModuleNotFoundError: No module named 'samyama'`

The Samyama Python SDK is built from Rust source via maturin. Make sure Rust is installed and that you have run pip install -e ".[dev]" inside the activated virtual environment.

IBM scenarios show 0 events for work order queries

Ensure the event.csv file exists at <AssetOpsBench>/src/tmp/assetopsbench/sample_data/event.csv. This is the canonical source for unified event counts (6,256 events).
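
To confirm the file is complete, you can count its rows and compare against the expected 6,256 events. This is a minimal sketch that assumes event.csv has a single header row and one event per data row; the helper name `count_events` is hypothetical.

```python
import csv

EXPECTED_EVENTS = 6256  # from the documentation above


def count_events(csv_path):
    """Count data rows in a CSV file, skipping the header and blank lines."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for row in reader if row)
```

If `count_events(path)` differs from `EXPECTED_EVENTS`, the clone may be incomplete or the file layout may differ from the assumption above.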

GPT-4o baseline scores differ from published results

LLM responses are non-deterministic. Scores may vary by ±5% between runs. The published results were recorded with gpt-4o on 2026-03-11.

Build fails on Apple Silicon

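If maturin fails to build the Rust extension, build the SDK directly from a samyama-graph checkout (the commands assume it sits next to assetops-kg), then reinstall this repo:

pip install maturin
cd ../samyama-graph/sdk/python
maturin develop --release
cd ../../../assetops-kg
pip install -e ".[dev]"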
