X-Ray: Decision Observability for Multi-Step Systems

Introduction:

Modern software systems don’t fail loudly; they fail silently and logically.

In many real-world pipelines (LLMs, heuristics, ranking logic), systems technically work while producing the wrong outcome. A recommendation feels off. A competitor match looks irrelevant. A filter removes something it shouldn’t have.

Traditional logs tell us what executed, but not why a specific decision was made.

When a final result is wrong, engineers are forced to reverse-engineer intent from scattered logs, rerun pipelines, or manually inspect intermediate data. This slows iteration, increases cognitive load, and makes non-deterministic systems hard to trust.

X-Ray is designed to close this gap.

Instead of treating decision logic as opaque, X-Ray makes decision-making itself observable — capturing not just inputs and outputs, but the reasoning, filters, eliminations, and selections at every step.

What This System Does

X-Ray provides end-to-end visibility into multi-step decision pipelines by:

Capturing structured decision data at each step

Preserving context, reasoning, and failure explanations

Storing execution trails in a queryable JSON format

Visualizing the full decision flow in a developer-friendly dashboard

The demo simulates a competitor product selection system, but the architecture is intentionally general-purpose and reusable for:

Recommendation engines

Lead scoring systems

Content ranking pipelines

LLM-based evaluations

Any multi-step, non-deterministic decision process

Architecture Overview

The system is split into three cleanly separated layers:

X-Ray SDK (Instrumentation Layer)

A lightweight wrapper integrated into application logic that records step-level decision data:

Inputs and outputs

Candidate evaluations

Filters applied

Reasoning and explanations

Emits structured JSON logs for each execution.
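As a rough illustration of the instrumentation layer, the sketch below records one step and serializes it. The class `XRaySketch`, the method `record_step`, and the `step` and `timestamp` keys are hypothetical names chosen for this example; the actual interface in xray.py may differ. Only `step_in`, `step_out`, `reasoning`, and `evaluations` come from this README.

```python
import json
import time
from typing import Any, Optional


class XRaySketch:
    """Hypothetical stand-in for the SDK; the real xray.py may look different."""

    def __init__(self) -> None:
        self.steps: list[dict[str, Any]] = []

    def record_step(
        self,
        name: str,
        step_in: dict[str, Any],
        step_out: dict[str, Any],
        reasoning: str,
        evaluations: Optional[list[dict[str, Any]]] = None,
    ) -> dict[str, Any]:
        entry: dict[str, Any] = {
            "step": name,              # assumed key name
            "timestamp": time.time(),  # assumed key name
            "step_in": step_in,
            "step_out": step_out,
            "reasoning": reasoning,
        }
        # Candidate evaluations are optional, so only attach them when given.
        if evaluations is not None:
            entry["evaluations"] = evaluations
        self.steps.append(entry)
        return entry

    def to_json(self) -> str:
        return json.dumps(self.steps, indent=2)


xray = XRaySketch()
xray.record_step(
    "keyword_generation",
    step_in={"title": "Stainless Steel Water Bottle 32oz Insulated"},
    step_out={"keywords": ["sports water bottle"], "model": "mock-llm"},
    reasoning="Extracted search keywords from the product title",
)
trace = json.loads(xray.to_json())
```

Keeping the recorder a plain in-memory list with a JSON dump is one way to stay free of framework dependencies.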

Execution Log (Data Layer)

A single JSON file represents a full decision trace

Each step is self-contained and reconstructable

Designed to be human-readable and machine-queryable
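To make "machine-queryable" concrete, here is a minimal sketch that loads a trace and lists every rejected candidate. It assumes the log is a JSON array of step objects with a `step` name key, which is an assumption about the layout. The second candidate, its failure reason, and the `candidate_count` and `selected` fields are made up for illustration.

```python
import json

# Illustrative trace; in practice this would be the contents of xray_log.json.
raw_log = """
[
  {"step": "keyword_generation",
   "step_in": {"title": "Stainless Steel Water Bottle 32oz Insulated"},
   "step_out": {"keywords": ["sports water bottle"], "model": "mock-llm"},
   "reasoning": "Generated search keywords"},
  {"step": "filter_and_rank",
   "step_in": {"candidate_count": 2},
   "step_out": {"selected": "B0COMP11"},
   "reasoning": "Filtered and ranked candidates by review count, rating, price, relevance",
   "evaluations": [
     {"asin": "B0COMP11", "title": "Mock Product 11",
      "metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917},
      "qualified": true, "fail_reasons": []},
     {"asin": "B0COMP07", "title": "Mock Product 7",
      "metrics": {"price": 12.5, "rating": 3.1, "reviews": 40},
      "qualified": false, "fail_reasons": ["rating below threshold"]}
   ]}
]
"""


def failed_candidates(trace):
    """Return (step name, asin, fail_reasons) for every rejected candidate."""
    rows = []
    for step in trace:
        # Steps without candidates simply have no "evaluations" key.
        for evaluation in step.get("evaluations", []):
            if not evaluation["qualified"]:
                rows.append((step["step"], evaluation["asin"], evaluation["fail_reasons"]))
    return rows


trace = json.loads(raw_log)
failures = failed_candidates(trace)
```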

Dashboard (Visualization Layer)

Reads X-Ray logs and renders them visually, allowing engineers to:

Inspect each step independently

Compare passed vs failed candidates

Identify where and why decisions diverged

Optimized for debugging, not presentation.

Code Files Overview

xray.py – Core SDK for recording decision steps. Handles:
- Recording step name, inputs, outputs, reasoning, and candidate evaluations
- Persisting structured JSON logs (xray_log.json)
- Fail-safe operation, so a logging failure does not break the pipeline
demo_pipeline.py – Sample multi-step pipeline demonstrating X-Ray in action. Includes:
- Keyword generation (mock LLM step)
- Candidate search (mock API)
- Relevance evaluation
- Filtering and ranking by business constraints
- Selection of top candidate
- Automatic recording of each step to X-Ray logs
dashboard.py – Streamlit dashboard to visualize X-Ray logs:
- Step-by-step expanders with input/output side-by-side
- Candidate tables with pass/fail highlighting
- Top candidate marked visually
- Interactive filtering by pass/fail status
- Summary metrics and bar chart visualization
xray_log.json – Example log generated by demo_pipeline.py showing the full decision trace with reasoning and candidate-level evaluations.

Note: The system currently writes JSON logs for simplicity; in production, logs could be streamed to a database for cross-pipeline queries and large-scale sampling.

Pipeline Step Inputs & Outputs:

Each step in the pipeline is recorded using the X-Ray SDK. Every step captures:

Step Input (step_in)
- The raw data or parameters the step works on
- Example:

{
  "title": "Stainless Steel Water Bottle 32oz Insulated",
  "category": "Sports & Outdoors",
  "price": 29.99,
  "rating": 4.2,
  "reviews": 1247
}
Step Output (step_out)

The results produced by the step

Example:

{
  "keywords": ["stainless steel water bottle insulated", "vacuum insulated bottle 32oz", "sports water bottle"],
  "model": "mock-llm"
}
Candidate Evaluations (evaluations) – optional, for steps producing candidates

Includes per-candidate metrics, pass/fail status, and failure reasons

Example:

{
  "asin": "B0COMP11",
  "title": "Mock Product 11",
  "metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917},
  "qualified": true,
  "fail_reasons": []
}
Reasoning:
Human-readable explanation of what the step does or why decisions were made

Example:

"Filtered and ranked candidates by review count, rating, price, relevance"
This design ensures every step in the multi-step pipeline is traceable, explainable, and debuggable, giving developers full visibility into why each decision was made.


# Application Logic
      ↓
# X-Ray SDK (Decision Capture)
      ↓
# Structured JSON Execution Log
      ↓
# Streamlit Dashboard (Decision Visualization)

# Tech Stack

Language: Python (readability, fast prototyping)

SDK: Custom lightweight Python module

Data Format: JSON (transparent, portable, extensible)

Dashboard: Streamlit (fast internal tooling)

Charts: Plotly (interactive summaries)

Data Handling: Pandas (clean transformations)

The stack is intentionally chosen for clarity and debuggability, and it avoids over-engineering.

# Key Design Decisions:

# JSON instead of a database
Enables fast iteration and easy inspection without premature persistence complexity.

# Streamlit for UI
Keeps focus on decision observability rather than frontend plumbing.

# Mock data instead of real integrations
Keeps the demo focused on system design, not API reliability.

# Simple SDK abstraction
Encourages reuse across domains without tight coupling.

System Components
Decision Pipeline

# A simulated multi-step pipeline mirroring real AI-driven systems:

Keyword generation (intent extraction)

Candidate search

Relevance evaluation (mock LLM-style logic)

Filtering and ranking via business constraints

Each step produces structured input, output, and reasoning.
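The filtering step is where pass/fail status and failure reasons originate. The sketch below shows one way such a check could be written. The function `evaluate_candidate` and the three thresholds are hypothetical; the constraints used in demo_pipeline.py are not listed in this README.

```python
from typing import Any

# Hypothetical thresholds, chosen only for this example.
MIN_RATING = 3.8
MIN_REVIEWS = 100
MAX_PRICE = 60.0


def evaluate_candidate(candidate: dict[str, Any]) -> dict[str, Any]:
    """Attach `qualified` and `fail_reasons` to a candidate record."""
    metrics = candidate["metrics"]
    fail_reasons = []
    if metrics["rating"] < MIN_RATING:
        fail_reasons.append(f"rating {metrics['rating']} below {MIN_RATING}")
    if metrics["reviews"] < MIN_REVIEWS:
        fail_reasons.append(f"reviews {metrics['reviews']} below {MIN_REVIEWS}")
    if metrics["price"] > MAX_PRICE:
        fail_reasons.append(f"price {metrics['price']} above {MAX_PRICE}")
    # A candidate qualifies only if no constraint was violated.
    return {**candidate, "qualified": not fail_reasons, "fail_reasons": fail_reasons}


good = evaluate_candidate({
    "asin": "B0COMP11",
    "title": "Mock Product 11",
    "metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917},
})
bad = evaluate_candidate({
    "asin": "B0COMP07",  # made-up candidate for illustration
    "title": "Mock Product 7",
    "metrics": {"price": 12.5, "rating": 3.1, "reviews": 40},
})
```

Collecting every violated constraint, rather than stopping at the first, keeps the full explanation available when the step is inspected later.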

X-Ray SDK

# Core responsibilities:

Record decision steps

Persist structured logs

Attach optional candidate-level evaluations

# Design choices:

Schema-flexible JSON logging

No dependency on LLMs, databases, or frameworks
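A minimal sketch of schema-flexible, fail-safe persistence using only the standard library follows. The helper `safe_write_log` is a hypothetical name, not the SDK's own function. It assumes that a logging failure should be reported and swallowed rather than raised, in line with the goal that logging never breaks the pipeline.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def safe_write_log(steps, path):
    """Write steps as JSON; return False instead of raising if that fails."""
    try:
        with open(path, "w", encoding="utf-8") as fh:
            # default=str keeps the schema flexible: values json cannot encode
            # natively (datetimes, sets, ...) are stored as their string form.
            json.dump(steps, fh, indent=2, default=str)
        return True
    except (OSError, TypeError, ValueError) as exc:
        print(f"[xray] logging skipped: {exc}")
        return False


steps = [{
    "step": "candidate_search",
    "timestamp": datetime.now(timezone.utc),
    "step_out": {"tags": {"insulated"}},
}]

with tempfile.TemporaryDirectory() as tmp:
    written = safe_write_log(steps, os.path.join(tmp, "xray_log.json"))
    # The parent directory does not exist, so this write fails quietly.
    skipped = safe_write_log(steps, os.path.join(tmp, "missing_dir", "xray_log.json"))
```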

Dashboard

Built for inspection, not vanity metrics.

# Capabilities:

Chronological step expanders

Side-by-side input/output views

Candidate tables with pass/fail status

Interactive filtering

Visual summaries for quick diagnosis

Data Model (Simplified)
# Decision Step

Step name

Timestamp

Input payload

Output payload

Reasoning text

Optional candidate evaluations

Candidate Evaluation

Unique identifier (asin)

Title

Metrics (price, rating, reviews, relevance)

Qualified status (true / false)

Failure reasons (if any)
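The two record types above could be expressed as dataclasses, as in the sketch below. The class names and the `name` and `timestamp` field names are assumptions made for this example; the SDK itself logs schema-flexible JSON and may not define such types.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class CandidateEvaluation:
    asin: str
    title: str
    metrics: dict[str, float]
    qualified: bool
    fail_reasons: list[str] = field(default_factory=list)


@dataclass
class DecisionStep:
    name: str
    timestamp: str
    step_in: dict[str, Any]
    step_out: dict[str, Any]
    reasoning: str
    evaluations: Optional[list[CandidateEvaluation]] = None


step = DecisionStep(
    name="filter_and_rank",
    timestamp=datetime.now(timezone.utc).isoformat(),
    step_in={"candidate_count": 1},    # illustrative payload
    step_out={"selected": "B0COMP11"},  # illustrative payload
    reasoning="Filtered and ranked candidates by review count, rating, price, relevance",
    evaluations=[
        CandidateEvaluation(
            asin="B0COMP11",
            title="Mock Product 11",
            metrics={"price": 55.02, "rating": 4.8, "reviews": 9917},
            qualified=True,
        )
    ],
)

# asdict converts nested dataclasses too, so the result is JSON-serializable.
record = asdict(step)
encoded = json.dumps(record)
```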

# Visual encoding:

Green → qualified

Red → failed

Gold → final selection
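The color rules above reduce to a small mapping, sketched here. The function `row_color` is a hypothetical helper and is not taken from dashboard.py. It assumes the final selection takes precedence over the plain qualified color.

```python
def row_color(evaluation, selected_asin):
    """Map one candidate evaluation to the color used in the candidate table."""
    if evaluation["asin"] == selected_asin:
        return "gold"  # final selection wins over plain "qualified"
    return "green" if evaluation["qualified"] else "red"


evaluations = [
    {"asin": "B0COMP11", "qualified": True},
    {"asin": "B0COMP03", "qualified": True},   # made-up candidate
    {"asin": "B0COMP07", "qualified": False},  # made-up candidate
]
colors = [row_color(e, selected_asin="B0COMP11") for e in evaluations]
```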

# Setup & Usage
# Prerequisites:

Python 3.9+

pip

Installation
pip install streamlit pandas plotly

# Step 1: Run the Decision Pipeline
python demo_pipeline.py


Generates xray_log.json with the full execution trace.

# Step 2: Launch the Dashboard
streamlit run dashboard.py

Typical Debugging Flow

Start at the final selected output

Walk backward through earlier steps

Inspect eliminations and failure reasons

Identify whether issues came from:

Keyword generation

Candidate retrieval

Filtering thresholds

Ranking logic
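The backward walk can also be done programmatically. The sketch below scans a trace from the last step to the first and reports where a given candidate was rejected. The function `find_elimination`, the `step` key, and the sample trace are assumptions for illustration.

```python
from typing import Any, Optional


def find_elimination(trace: list[dict[str, Any]], asin: str) -> Optional[dict[str, Any]]:
    """Return the latest step that rejected `asin`, or None if it never failed."""
    for step in reversed(trace):
        # Steps without candidates may omit "evaluations" or set it to None.
        for evaluation in step.get("evaluations") or []:
            if evaluation["asin"] == asin and not evaluation["qualified"]:
                return {"step": step["step"], "fail_reasons": evaluation["fail_reasons"]}
    return None


# Made-up trace: the candidate passes relevance evaluation but fails filtering.
trace = [
    {"step": "keyword_generation"},
    {"step": "relevance_evaluation", "evaluations": [
        {"asin": "B0COMP07", "qualified": True, "fail_reasons": []},
        {"asin": "B0COMP11", "qualified": True, "fail_reasons": []},
    ]},
    {"step": "filter_and_rank", "evaluations": [
        {"asin": "B0COMP07", "qualified": False, "fail_reasons": ["rating below threshold"]},
        {"asin": "B0COMP11", "qualified": True, "fail_reasons": []},
    ]},
]

rejected = find_elimination(trace, "B0COMP07")
kept = find_elimination(trace, "B0COMP11")
```

If a candidate never appears as failed, the issue more likely lies upstream, for example in keyword generation or candidate retrieval.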

# Known Limitations & Future Improvements

This implementation is intentionally lightweight. Planned improvements include:

Persistent storage backends (S3, MongoDB, ClickHouse)

Execution IDs for multi-run comparison

Diff views between pipeline executions

Asynchronous event streaming for high-volume pipelines

Sampling strategies for large candidate sets

Alerting on abnormal decision patterns

API-backed dashboard with saved views

Authentication & access control

LLM-native reasoning capture (token-level or prompt traces)

# Why This Matters

# Early-stage systems fail subtly:

Logic drifts

Heuristics accumulate

LLM behavior changes silently

# X-Ray provides:

Faster feedback loops

Confidence in automated decisions

A shared debugging language across engineering, product, and data teams

# Conclusion

This prototype demonstrates decision observability using a competitor selection workflow, but the architecture is intentionally general-purpose.

X-Ray helps engineers see not just what a system decided, but why — enabling trust, faster iteration, and prevention of silent failures before they compound.

This is tooling built for engineers: lightweight, extensible, and focused on clarity over vanity.
