Modern software systems don’t fail loudly; they fail silently and logically.
In many real-world pipelines (LLMs, heuristics, ranking logic), systems technically work while producing the wrong outcome. A recommendation feels off. A competitor match looks irrelevant. A filter removes something it shouldn’t have.
Traditional logs tell us what executed, but not why a specific decision was made.
When a final result is wrong, engineers are forced to reverse-engineer intent from scattered logs, rerun pipelines, or manually inspect intermediate data. This slows iteration, increases cognitive load, and makes non-deterministic systems hard to trust.
X-Ray is designed to close this gap.
Instead of treating decision logic as opaque, X-Ray makes decision-making itself observable — capturing not just inputs and outputs, but the reasoning, filters, eliminations, and selections at every step.
X-Ray provides end-to-end visibility into multi-step decision pipelines by:
Capturing structured decision data at each step
Preserving context, reasoning, and failure explanations
Storing execution trails in a queryable JSON format
Visualizing the full decision flow in a developer-friendly dashboard
The demo simulates a competitor product selection system, but the architecture is intentionally general-purpose and reusable for:
Recommendation engines
Lead scoring systems
Content ranking pipelines
LLM-based evaluations
Any multi-step, non-deterministic decision process
- X-Ray SDK (Instrumentation Layer)
A lightweight wrapper integrated into application logic that records step-level decision data:
Inputs and outputs
Candidate evaluations
Filters applied
Reasoning and explanations
Emits structured JSON logs for each execution.
- Execution Log (Data Layer)
A single JSON file represents a full decision trace
Each step is self-contained and reconstructable
Designed to be human-readable and machine-queryable
- Dashboard (Visualization Layer)
Reads X-Ray logs and renders them visually, allowing engineers to:
Inspect each step independently
Compare passed vs failed candidates
Identify where and why decisions diverged
Optimized for debugging, not presentation.
- xray.py – Core SDK for recording decision steps. Handles:
  - Recording step name, inputs, outputs, reasoning, and candidate evaluations
  - Persisting structured JSON logs (xray_log.json)
  - Non-blocking operation to avoid pipeline failures if logging fails
- demo_pipeline.py – Sample multi-step pipeline demonstrating X-Ray in action. Includes:
  - Keyword generation (mock LLM step)
  - Candidate search (mock API)
  - Relevance evaluation
  - Filtering and ranking by business constraints
  - Selection of top candidate
  - Automatic recording of each step to X-Ray logs
- dashboard.py – Streamlit dashboard to visualize X-Ray logs:
  - Step-by-step expanders with input/output side-by-side
  - Candidate tables with pass/fail highlighting
  - Top candidate marked visually
  - Interactive filtering by pass/fail status
  - Summary metrics and bar chart visualization
- xray_log.json – Example log generated by demo_pipeline.py, showing the full decision trace with reasoning and candidate-level evaluations.
Note: The system currently writes JSON logs for simplicity; in production, logs could be streamed to a database for cross-pipeline queries and large-scale sampling.
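To make the responsibilities listed for xray.py concrete, here is a minimal sketch of what such a recorder might look like. The class name `XRay`, the method names `record_step` and `save`, and the `step` and `timestamp` keys are assumptions for illustration; the actual xray.py may be organized differently. "Non-blocking" is interpreted here as "a logging failure never raises into the pipeline".

```python
import json
import time
from typing import Any, Optional


class XRay:
    """Hypothetical minimal recorder; the real xray.py may differ."""

    def __init__(self, log_path: str = "xray_log.json") -> None:
        self.log_path = log_path
        self.steps: list[dict[str, Any]] = []

    def record_step(
        self,
        name: str,
        step_in: dict[str, Any],
        step_out: dict[str, Any],
        reasoning: str,
        evaluations: Optional[list[dict[str, Any]]] = None,
    ) -> None:
        """Append one self-contained step record to the in-memory trace."""
        step: dict[str, Any] = {
            "step": name,              # assumed key name
            "timestamp": time.time(),  # assumed: seconds since the epoch
            "step_in": step_in,
            "step_out": step_out,
            "reasoning": reasoning,
        }
        if evaluations is not None:
            step["evaluations"] = evaluations
        self.steps.append(step)

    def save(self) -> bool:
        """Write the trace to disk. Returns False instead of raising on failure."""
        try:
            # Serialize first so a non-serializable payload cannot leave a half-written file.
            payload = json.dumps(self.steps, indent=2)
            with open(self.log_path, "w", encoding="utf-8") as fh:
                fh.write(payload)
            return True
        except (OSError, TypeError, ValueError):
            return False


# Record the keyword-generation step from the demo (reasoning text is illustrative).
xray = XRay()
xray.record_step(
    name="keyword_generation",
    step_in={"title": "Stainless Steel Water Bottle 32oz Insulated"},
    step_out={"keywords": ["sports water bottle"], "model": "mock-llm"},
    reasoning="Extracted search keywords from the product title",
)
```

Calling `xray.save()` at the end of a pipeline run would write the trace to xray_log.json; because it returns a boolean rather than raising, the pipeline result is unaffected when the write fails.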
Each step in the pipeline is recorded using the X-Ray SDK. Every step captures:
Step Input (step_in)
The raw data or parameters the step works on
Example:
{
"title": "Stainless Steel Water Bottle 32oz Insulated",
"category": "Sports & Outdoors",
"price": 29.99,
"rating": 4.2,
"reviews": 1247
}
Step Output (step_out)
The results produced by the step
Example:
{
"keywords": ["stainless steel water bottle insulated", "vacuum insulated bottle 32oz", "sports water bottle"],
"model": "mock-llm"
}
Candidate Evaluations (evaluations) – optional, for steps producing candidates
Includes per-candidate metrics, pass/fail status, and failure reasons
Example:
{
"asin": "B0COMP11",
"title": "Mock Product 11",
"metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917},
"qualified": true,
"fail_reasons": []
}
Reasoning:
Human-readable explanation of what the step does or why decisions were made
Example:
"Filtered and ranked candidates by review count, rating, price, relevance"
This design ensures every step in the multi-step pipeline is traceable, explainable, and debuggable, giving developers full visibility into why each decision was made.
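Because each step carries its own evaluations and failure reasons, a recorded step can be queried directly. The sketch below assumes a step is stored as a dict with the field names described above plus a `step` name key; the second candidate (`B0COMP07`) and its failure reason are made up for illustration and do not come from the demo log.

```python
from collections import Counter
from typing import Any

# Hypothetical step record combining the four captured fields.
step: dict[str, Any] = {
    "step": "filter_and_rank",  # assumed step name
    "step_in": {"candidate_count": 2},
    "step_out": {"selected": "B0COMP11"},
    "reasoning": "Filtered and ranked candidates by review count, rating, price, relevance",
    "evaluations": [
        {
            "asin": "B0COMP11",
            "title": "Mock Product 11",
            "metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917},
            "qualified": True,
            "fail_reasons": [],
        },
        {
            "asin": "B0COMP07",  # illustrative candidate, not from the demo log
            "title": "Mock Product 7",
            "metrics": {"price": 12.5, "rating": 3.1, "reviews": 40},
            "qualified": False,
            "fail_reasons": ["rating below minimum"],
        },
    ],
}


def failure_summary(step: dict[str, Any]) -> Counter:
    """Count how often each failure reason appears in one step."""
    return Counter(
        reason
        for evaluation in step.get("evaluations", [])
        for reason in evaluation["fail_reasons"]
    )


print(failure_summary(step))
```

A summary like this gives a quick answer to "which constraint eliminated the most candidates?" without rerunning the pipeline.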
# Application Logic
↓
# X-Ray SDK (Decision Capture)
↓
# Structured JSON Execution Log
↓
# Streamlit Dashboard (Decision Visualization)
# Tech Stack
Language: Python (readability, fast prototyping)
SDK: Custom lightweight Python module
Data Format: JSON (transparent, portable, extensible)
Dashboard: Streamlit (fast internal tooling)
Charts: Plotly (interactive summaries)
Data Handling: Pandas (clean transformations)
The stack is intentionally simple: it prioritizes clarity and debuggability and avoids over-engineering.
# Key Design Decisions:
# JSON instead of a database
Enables fast iteration and easy inspection without premature persistence complexity.
# Streamlit for UI
Keeps focus on decision observability rather than frontend plumbing.
# Mock data instead of real integrations
Keeps the demo focused on system design, not API reliability.
# Simple SDK abstraction
Encourages reuse across domains without tight coupling.
System Components
Decision Pipeline
# A simulated multi-step pipeline mirroring real AI-driven systems:
Keyword generation (intent extraction)
Candidate search
Relevance evaluation (mock LLM-style logic)
Filtering and ranking via business constraints
Each step produces structured input, output, and reasoning.
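As an illustration of the filtering and ranking step, the sketch below evaluates each candidate against business constraints and records why a candidate failed. The thresholds, the function names, and every candidate other than `B0COMP11` are assumptions chosen for the example; demo_pipeline.py may use different constraints and a different ranking key.

```python
from typing import Any

# Assumed thresholds, for illustration only.
MIN_RATING = 4.0
MIN_REVIEWS = 100
MAX_PRICE = 60.0


def evaluate(candidate: dict[str, Any]) -> dict[str, Any]:
    """Return the candidate annotated with pass/fail status and failure reasons."""
    metrics = candidate["metrics"]
    fail_reasons: list[str] = []
    if metrics["rating"] < MIN_RATING:
        fail_reasons.append(f"rating {metrics['rating']} < {MIN_RATING}")
    if metrics["reviews"] < MIN_REVIEWS:
        fail_reasons.append(f"reviews {metrics['reviews']} < {MIN_REVIEWS}")
    if metrics["price"] > MAX_PRICE:
        fail_reasons.append(f"price {metrics['price']} > {MAX_PRICE}")
    return {**candidate, "qualified": not fail_reasons, "fail_reasons": fail_reasons}


def filter_and_rank(candidates: list[dict[str, Any]]):
    """Evaluate every candidate, then rank the qualified ones by reviews, then rating."""
    evaluations = [evaluate(c) for c in candidates]
    qualified = [e for e in evaluations if e["qualified"]]
    ranked = sorted(
        qualified,
        key=lambda e: (e["metrics"]["reviews"], e["metrics"]["rating"]),
        reverse=True,
    )
    return evaluations, ranked


candidates = [
    {"asin": "B0COMP11", "title": "Mock Product 11",
     "metrics": {"price": 55.02, "rating": 4.8, "reviews": 9917}},
    {"asin": "B0COMP02", "title": "Mock Product 2",   # illustrative
     "metrics": {"price": 19.99, "rating": 3.5, "reviews": 50}},
    {"asin": "B0COMP03", "title": "Mock Product 3",   # illustrative
     "metrics": {"price": 25.0, "rating": 4.5, "reviews": 12000}},
]
evaluations, ranked = filter_and_rank(candidates)
```

Returning the full `evaluations` list alongside `ranked` matters: the eliminated candidates and their reasons are exactly what a decision trace needs to keep.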
X-Ray SDK
# Core responsibilities:
Record decision steps
Persist structured logs
Attach optional candidate-level evaluations
# Design choices:
Schema-flexible JSON logging
No dependency on LLMs, databases, or frameworks
Dashboard
Built for inspection, not vanity metrics.
# Capabilities:
Chronological step expanders
Side-by-side input/output views
Candidate tables with pass/fail status
Interactive filtering
Visual summaries for quick diagnosis
Data Model (Simplified)
# Decision Step
Step name
Timestamp
Input payload
Output payload
Reasoning text
Optional candidate evaluations
Candidate Evaluation
Unique identifier (asin)
Title
Metrics (price, rating, reviews, relevance)
Qualified status (true / false)
Failure reasons (if any)
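The data model above could be expressed as typed structures along these lines. This is a sketch: the class names and the choice of dataclasses are assumptions, and the SDK itself is described as writing schema-flexible JSON rather than enforcing a fixed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class CandidateEvaluation:
    asin: str                      # unique identifier
    title: str
    metrics: dict[str, float]      # price, rating, reviews, relevance
    qualified: bool
    fail_reasons: list[str] = field(default_factory=list)


@dataclass
class DecisionStep:
    name: str
    timestamp: float
    step_in: dict[str, Any]
    step_out: dict[str, Any]
    reasoning: str
    evaluations: Optional[list[CandidateEvaluation]] = None


decision_step = DecisionStep(
    name="filter_and_rank",  # assumed step name
    timestamp=0.0,           # placeholder value
    step_in={"candidate_count": 1},
    step_out={"selected": "B0COMP11"},
    reasoning="Filtered and ranked candidates by review count, rating, price, relevance",
    evaluations=[
        CandidateEvaluation(
            asin="B0COMP11",
            title="Mock Product 11",
            metrics={"price": 55.02, "rating": 4.8, "reviews": 9917},
            qualified=True,
        )
    ],
)

# asdict() converts nested dataclasses recursively, so the result is JSON-serializable.
serialized = json.dumps(asdict(decision_step), indent=2)
```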
# Visual encoding:
Green → qualified
Red → failed
Gold → final selection
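The visual encoding amounts to a small mapping from evaluation status to color. A possible helper is sketched below; the function name `row_color` and the `selected_asin` parameter are hypothetical, and dashboard.py may apply the colors differently.

```python
from typing import Any, Optional


def row_color(evaluation: dict[str, Any], selected_asin: Optional[str] = None) -> str:
    """Map a candidate evaluation to its display color."""
    # The final selection takes precedence over plain pass/fail coloring.
    if selected_asin is not None and evaluation["asin"] == selected_asin:
        return "gold"
    return "green" if evaluation["qualified"] else "red"
```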
# Setup & Usage
# Prerequisites:
Python 3.9+
pip
Installation
pip install streamlit pandas plotly
# Step 1: Run the Decision Pipeline
python demo_pipeline.py
Generates xray_log.json with the full execution trace.
# Step 2: Launch the Dashboard
streamlit run dashboard.py
Typical Debugging Flow
Start at the final selected output
Walk backward through earlier steps
Inspect eliminations and failure reasons
Identify whether issues came from:
Keyword generation
Candidate retrieval
Filtering thresholds
Ranking logic
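The same backward walk can be done programmatically against a trace. The sketch below assumes the log is a list of step dicts in execution order, each with `step`, `reasoning`, and optional `evaluations` keys; the actual layout of xray_log.json may differ, and the in-memory trace here is illustrative.

```python
from typing import Any


def explain_backward(steps: list[dict[str, Any]]) -> list[str]:
    """Walk the trace from the final step to the first, listing eliminations."""
    lines: list[str] = []
    for step in reversed(steps):
        lines.append(f"{step['step']}: {step['reasoning']}")
        for evaluation in step.get("evaluations", []):
            if not evaluation["qualified"]:
                reasons = ", ".join(evaluation["fail_reasons"])
                lines.append(f"  rejected {evaluation['asin']}: {reasons}")
    return lines


# Illustrative trace; in practice this would come from json.load() on the log file.
trace = [
    {"step": "keyword_generation", "reasoning": "Extracted keywords from the title"},
    {
        "step": "filter_and_rank",
        "reasoning": "Filtered and ranked candidates by review count, rating, price, relevance",
        "evaluations": [
            {"asin": "B0COMP11", "qualified": True, "fail_reasons": []},
            {"asin": "B0COMP02", "qualified": False, "fail_reasons": ["rating below minimum"]},
        ],
    },
]

for line in explain_backward(trace):
    print(line)
```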
# Known Limitations & Future Improvements
This implementation is intentionally lightweight. Planned improvements include:
Persistent storage backends (S3, MongoDB, ClickHouse)
Execution IDs for multi-run comparison
Diff views between pipeline executions
Asynchronous event streaming for high-volume pipelines
Sampling strategies for large candidate sets
Alerting on abnormal decision patterns
API-backed dashboard with saved views
Authentication & access control
LLM-native reasoning capture (token-level or prompt traces)
# Why This Matters
# Early-stage systems fail subtly:
Logic drifts
Heuristics accumulate
LLM behavior changes silently
# X-Ray provides:
Faster feedback loops
Confidence in automated decisions
A shared debugging language across engineering, product, and data teams
# Conclusion
This prototype demonstrates decision observability using a competitor selection workflow, but the architecture is intentionally general-purpose.
X-Ray helps engineers see not just what a system decided, but why — enabling trust, faster iteration, and prevention of silent failures before they compound.
This is tooling built for engineers: lightweight, extensible, and focused on clarity over vanity.