Nightly CI for Coding Agents — Capture, replay, evaluate, and automatically improve your AI coding assistant rules.
Blackbox helps you improve your AI coding assistant by:
- Capturing real LLM calls during development
- Replaying them against local models overnight
- Evaluating quality, detecting loops and issues
- Improving rules automatically with regression gating
- Shipping changes as GitHub PRs
Prerequisites:
- Bun 1.0+
- Docker & Docker Compose
- Ollama (for local model replay)
- Rust (for desktop app only)
# Clone and install
git clone https://github.com/yourorg/blackbox.git
cd blackbox
bun install
bun run build
# Start infrastructure
bun run docker:up
# Pull a local model
ollama pull llama3.2:3b
# Check services
node packages/cli/dist/index.js status

# 1. Capture traces (integrate SDK into your agent)
# See "Capture SDK" section below
# 2. Replay against local model
node packages/cli/dist/index.js replay -i ./traces -m llama3.2:3b
# 3. Evaluate traces
node packages/cli/dist/index.js evaluate -i ./traces
# 4. Generate improvements
node packages/cli/dist/index.js improve -t ./traces -e ./eval-results -r ./CLAUDE.md
# 5. Run full pipeline
node packages/cli/dist/index.js run -i ./traces --create-pr

blackbox/
├── packages/ # Core library packages
│ ├── shared/ # Shared types, schemas, utilities
│ ├── capture/ # SDK wrapper for capturing LLM calls
│ ├── replay/ # Replay engine for local model testing
│ ├── evaluate/ # Evaluation framework with Phoenix
│ ├── improve/ # Rules analysis and improvement
│ ├── pr-generator/ # Git/GitHub PR creation
│ └── cli/ # Command-line interface
├── apps/
│ └── desktop/ # Tauri desktop menu bar app
├── examples/ # Example agents, traces, and rules
└── tests/ # Integration tests
| Package | Description |
|---|---|
| @blackbox/shared | Core types and utilities |
| @blackbox/capture | SDK wrapper for capturing LLM calls |
| @blackbox/replay | Replay engine for local model testing |
| @blackbox/evaluate | Evaluation framework with Phoenix integration |
| @blackbox/improve | Rules analysis and improvement generation |
| @blackbox/pr-generator | Git/GitHub PR creation and automation |
| @blackbox/cli | Command-line interface |
| @blackbox/desktop | Tauri desktop menu bar app |
Integrate capture into your coding agent:
import { createCaptureClient } from "@blackbox/capture";
// Create a capture-enabled OpenAI client
const client = createCaptureClient(
  { apiKey: process.env.OPENAI_API_KEY },
  {
    langfuse: {
      host: "http://localhost:3213",
      publicKey: process.env.LANGFUSE_PUBLIC_KEY,
      secretKey: process.env.LANGFUSE_SECRET_KEY,
    },
  },
);
// Use like regular OpenAI SDK - calls are automatically captured
const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Help me fix this bug" }],
  tools: myTools,
});

Check health of all services.
Set up trace capture for your application.
Replay captured traces against local models.
node packages/cli/dist/index.js replay \
-i ./traces \
-o ./replay-results \
-m llama3.2:3b \
--mode semi-live

Evaluate traces for quality and issues.
node packages/cli/dist/index.js evaluate \
-i ./traces \
-o ./eval-results

Generate rule improvements from analysis.
node packages/cli/dist/index.js improve \
-t ./traces \
-e ./eval-results \
-r ./CLAUDE.md \
--model gpt-4o-mini

Run the full pipeline.
node packages/cli/dist/index.js run \
-i ./traces \
-r ./CLAUDE.md \
--create-pr \
--github-token $GITHUB_TOKEN \
--github-owner yourorg \
--github-repo yourrepo

Blackbox includes a Tauri v2-based desktop menu bar app for macOS. The app lives in your menu bar for quick access to settings and controls.
- Menu bar integration: Lives in the macOS menu bar (system tray)
- Global hotkey: Open Blackbox with ⌘Space (customizable)
- Settings panel: Configure launch at login, global hotkey, appearance (light/dark/system)
- Minimal footprint: No Dock icon, lightweight native performance
The system tray provides quick access to:
- Open Blackbox - Main app window
- Documentation - Manual and troubleshooting guides
- Community - Slack, X (Twitter), YouTube links
- Settings - App preferences
- About & Updates - Version info and update checking
- Tauri v2: Native Rust backend with web frontend
- React 18: UI framework with React Router
- TypeScript: Type-safe frontend code
- Tailwind CSS v4: CSS-first configuration, utility classes
- Vite: Fast development and building
cd apps/desktop
bun install
bun run tauri:dev # Development mode with hot reload
bun run tauri:build # Build for production
bun run test        # Run frontend tests

cd apps/desktop/src-tauri
cargo check # Type check Rust code
cargo test # Run Rust unit tests
cargo clippy        # Lint Rust code

apps/desktop/
├── src/ # React frontend
│ ├── App.tsx # Main app with routing
│ ├── App.css # Tailwind CSS configuration
│ ├── components/ # UI components
│ │ ├── ui/ # Reusable UI primitives (Button, Input, etc.)
│ │ ├── settings/ # Settings-specific components
│ │ └── theme-provider.tsx # Theme management (light/dark/system)
│ ├── views/ # Page views (MainView, SettingsView)
│ └── lib/ # Utilities and Tauri commands
├── src-tauri/ # Rust backend
│ ├── src/lib.rs # Tray menu, commands, window management
│ ├── tauri.conf.json # Tauri configuration
│ └── capabilities/ # Permission capabilities
├── vite.config.ts # Vite + Tailwind configuration
└── package.json # Frontend dependencies
Blackbox uses these services (via Docker Compose):
| Service | Port | Purpose |
|---|---|---|
| Langfuse | 3213 | LLM observability and tracing |
| Phoenix | 6013 | ML evaluation platform |
| LiteLLM | 4213 | AI gateway for model routing |
| Ollama | 11434 | Local model serving |
| PostgreSQL | 5413 | Database for Langfuse |
| ClickHouse | 8113 | Analytics for Langfuse |
| Redis | 6313 | Caching |
| MinIO | 9014 | Object storage |
Start all services:
bun run docker:up

Blackbox includes several evaluators:
Loop detection identifies stuck patterns:
- Repeated tool calls with same arguments
- Oscillation between states
- Stalled retrieval attempts
- Circular reasoning
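The repeated-call check can be sketched as follows; the `ToolCall` shape and the threshold of three are illustrative assumptions, not the actual @blackbox/evaluate API:

```typescript
// Illustrative sketch: flag tool calls repeated with identical arguments.
// ToolCall and the threshold are assumptions, not the real evaluator types.
interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

function findRepeatedCalls(calls: ToolCall[], threshold = 3): string[] {
  const counts = new Map<string, number>();
  for (const call of calls) {
    // Key each call by tool name plus serialized arguments
    const key = `${call.name}:${JSON.stringify(call.args)}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Any key at or above the threshold suggests the agent is stuck
  return [...counts.entries()]
    .filter(([, n]) => n >= threshold)
    .map(([key]) => key);
}

const stuck = findRepeatedCalls([
  { name: "read_file", args: { path: "a.ts" } },
  { name: "read_file", args: { path: "a.ts" } },
  { name: "read_file", args: { path: "a.ts" } },
  { name: "grep", args: { q: "bug" } },
]);
// stuck: ['read_file:{"path":"a.ts"}']
```

Oscillation and circular-reasoning checks follow the same idea, but key on sequences of calls rather than single calls.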
Tool-call quality measures tool usage effectiveness:
- Success rate
- Redundant calls
- Error recovery
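The first two metrics can be computed from the trace alone; a sketch, where `ToolResult` is an assumed shape rather than the actual @blackbox/evaluate output format:

```typescript
// Illustrative tool-usage metrics over a trace of tool results.
interface ToolResult {
  name: string;
  args: Record<string, unknown>;
  ok: boolean; // did the tool call succeed?
}

function toolUsageMetrics(results: ToolResult[]) {
  const total = Math.max(results.length, 1);
  const successRate = results.filter((r) => r.ok).length / total;
  // A call is redundant when the same name+args pair appeared earlier.
  const seen = new Set<string>();
  let redundant = 0;
  for (const r of results) {
    const key = `${r.name}:${JSON.stringify(r.args)}`;
    if (seen.has(key)) redundant++;
    else seen.add(key);
  }
  return { successRate, redundantRate: redundant / total };
}

const metrics = toolUsageMetrics([
  { name: "run_tests", args: {}, ok: false },
  { name: "edit_file", args: { path: "a.ts" }, ok: true },
  { name: "run_tests", args: {}, ok: true },
  { name: "run_tests", args: {}, ok: true },
]);
// metrics: { successRate: 0.75, redundantRate: 0.5 }
```

Note that a "redundant" retry after a failure may actually be error recovery, which is why recovery is tracked as its own signal.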
LLM-as-judge uses an LLM to evaluate response quality.
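A hedged sketch of the judge flow: build a grading prompt and parse a numeric score out of the model's reply. The prompt wording and the 1-5 scale are assumptions, not the shipped implementation:

```typescript
// Illustrative LLM-as-judge helpers; prompt wording and the 1-5 scale
// are assumptions, not the actual @blackbox/evaluate implementation.
function buildJudgePrompt(request: string, response: string): string {
  return [
    "Rate the assistant response from 1 (poor) to 5 (excellent).",
    `User request: ${request}`,
    `Assistant response: ${response}`,
    "Reply with a single digit.",
  ].join("\n");
}

// Pull the first 1-5 digit out of whatever the judge model replies.
function parseJudgeScore(reply: string): number | null {
  const match = reply.match(/[1-5]/);
  return match ? Number(match[0]) : null;
}

const prompt = buildJudgePrompt(
  "Fix the off-by-one bug",
  "Patched the loop bound.",
);
const score = parseJudgeScore("I would rate this 4 out of 5.");
// score: 4
```

Parsing defensively matters here: judge models frequently wrap the score in prose, so the evaluator should tolerate replies that are not a bare digit.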
The improvement engine:
- Analyzes traces to find failure patterns
- Identifies loop patterns and rule violations
- Generates improvement opportunities
- Creates new or modified rules using LLM
- Validates improvements don't cause regressions
- Ships changes as reviewed PRs
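The regression-gating step can be sketched as a score comparison: accept a proposed rule change only when no evaluator score drops by more than a tolerance. Metric names and the tolerance value are assumptions, not Blackbox's actual gating config:

```typescript
// Illustrative regression gate over baseline vs candidate eval scores.
type Scores = Record<string, number>;

function passesRegressionGate(
  baseline: Scores,
  candidate: Scores,
  tolerance = 0.02,
): boolean {
  // Every baseline metric must hold up (within tolerance) under the
  // candidate rules; a missing candidate metric counts as a regression.
  return Object.entries(baseline).every(
    ([metric, base]) => (candidate[metric] ?? 0) >= base - tolerance,
  );
}

const ok = passesRegressionGate(
  { quality: 0.8, loopFree: 0.9 },
  { quality: 0.82, loopFree: 0.89 },
);
// ok: true (quality improved; loopFree dipped within tolerance)

const blocked = passesRegressionGate(
  { quality: 0.8, loopFree: 0.9 },
  { quality: 0.85, loopFree: 0.7 },
);
// blocked: false (loopFree regressed)
```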
Create a .env file (see .env.example):
# OpenAI (for capture and improvement generation)
OPENAI_API_KEY=sk-...
# Langfuse (for tracing)
LANGFUSE_HOST=http://localhost:3213
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
# GitHub (for PR creation)
GITHUB_TOKEN=ghp_...
# Model settings
BLACKBOX_REPLAY_MODEL=llama3.2:3b
BLACKBOX_IMPROVE_MODEL=gpt-4o-mini

# Install dependencies
bun install
# Build all packages
bun run build
# Run tests
bun run test
# Run integration tests
bun run test:integration
# Type check
bun run typecheck
# Lint and format
bun run check:fix
# Clean build artifacts
bun run clean
# Development mode (watch)
bun run dev

┌─────────────────────────────────────────────────────────────┐
│ Blackbox CLI │
├─────────────┬─────────────┬─────────────┬───────────────────┤
│ Capture │ Replay │ Evaluate │ Improve │
│ SDK │ Engine │ Pipeline │ Generator │
├─────────────┴─────────────┴─────────────┴───────────────────┤
│ Shared Types & Utils │
├─────────────────────────────────────────────────────────────┤
│ Langfuse │ Phoenix │ LiteLLM │ Ollama │
│ (Tracing) │(Evaluation) │ (Gateway) │ (Local Models) │
└─────────────────────────────────────────────────────────────┘
License: MIT
