Experiment Backlog

Format: - [ ] <slug> - <one-line description>. Mark [x] when finished. Pick the topmost unchecked item unless a later one is clearly more timely.

Backlog

mcp-tool-latency-bench - Micro-benchmark measuring tool-call overhead across MCP transports (stdio vs SSE vs HTTP).
mcp-schema-linter - Lint MCP server schemas against the spec; flag common mistakes (missing descriptions, unclear param names, over-broad types).
mcp-prompt-injection-corpus - Small curated corpus of adversarial tool-arg payloads with a scorer that reports how often the target model passes them through unmodified.
agent-trajectory-viz - Parse a Claude / OpenAI transcript and render the tool-call trajectory as a Mermaid diagram.
agent-reliability-eval - Tiny eval framework for the reliability patterns I catalogued in my reliability-patterns paper (retry, fallback, external grounding, etc.) on a toy task.
eu-ai-act-tier-classifier - Rule-based classifier that takes a use-case description and returns the EU AI Act risk tier with citations to the article.
iso-42001-checklist-gen - Generator turning a YAML control spec into an auditor-friendly Markdown checklist aligned with ISO 42001.
nist-rmf-mapper - Map NIST AI RMF functions/categories to concrete code artefacts (tests, logs, cards) with a small demo mapping.
llm-routing-playground - Toy cost/latency-aware router between two mock model backends with a visualiser for routing decisions.
rag-robustness-eval - Inject controlled noise into a retrieval corpus and measure recall@k degradation on a toy dataset.
agent-memory-comparator - Same toy workload across three memory backends (vector / graph / KV); compare retrieval quality and token cost.
xai-explanation-diff - Diff SHAP vs LIME explanations on a shared tabular classifier; surface cases where they disagree.
tool-use-hallucination-detector - Flag inconsistent tool chains in agent transcripts (e.g., reading a file the agent never wrote).
openapi-to-mcp-gen - Toy generator that takes an OpenAPI spec and emits a minimal MCP server stub.
prompt-redactor - Regex + lightweight classifier pipeline that strips obvious PII / secrets from prompts before they hit a model.
consensus-sim - Minimal multi-agent consensus simulator (Byzantine-fault-tolerant voting among N agents with disagreement).
llm-cost-cli - CLI that estimates cost per request across providers given model, input tokens, output tokens.
agent-failure-taxonomy - Structured YAML taxonomy of observed agent failure modes + a small visualiser.
prompt-version-tool - Git-native tool for versioning prompts with diffable test cases per version.
synthetic-audit-trail - Generator for synthetic but realistic AI-system audit trails useful for compliance-tool testing.
tool-graph-score - Implementation of the Tool Graph Capability Score metric from my MCP governance paper on a small dataset.
mcp-transport-resilience - Fault-injection harness that drops / reorders messages on an MCP connection and measures recovery behaviour.
agent-policy-dsl - Tiny DSL for declaring agent tool-use policies with a runtime enforcer.
prompt-contrast-evals - Generate minimally-contrasting prompt pairs (one word changed) and measure output sensitivity as a brittleness proxy.
tool-description-ablation - Ablate fields in MCP tool descriptions; measure agent success-rate delta to identify which fields actually matter.

Monthly standalone-repo themes

When the monthly trigger fires, pick a larger theme that deserves its own public repo (not a monorepo subfolder):

mcp-schema-auditor - Static MCP schema auditor (lint + security + score), created as mcp-schema-auditor. Complements runtime mcp-governance-kit.
ai-act-navigator - Interactive CLI that walks you through the EU AI Act risk classification with citations.
reliability-patterns-playground - Runnable examples of each reliability pattern from the paper with adversarial test cases.
agent-trace-viewer - Full web UI (static) that turns agent JSONL traces into browsable trajectories.
iso-42001-workbench - Opinionated scaffolding for organisations starting an ISO 42001 implementation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Experiment Backlog

Backlog

Monthly standalone-repo themes

FilesExpand file tree

ideas.md

Latest commit

History

ideas.md

File metadata and controls

Experiment Backlog

Backlog

Monthly standalone-repo themes