Hi Guidance team,
I’ve been running a series of experiments to test whether a general-purpose LLM can behave like a deterministic structured runtime when placed under strong constraints — without any external tools, APIs, plugins, or execution frameworks.
Surprisingly, the results were stable and fully reproducible, so I'm sharing them here in case they are relevant to your work on structured prompting and constrained generation.
🚀 Summary of the experiment
I built a miniature Flight Readiness Review (FRR) Runtime that forces ChatGPT into an 8-step deterministic pipeline:
Input parsing
Normalization
Factor engine (F1–F12)
Global RiskMode
Subsystem evaluation
KernelBus arbitration
Counterfactual reasoning
Strict FRR_Result block emission (no free-form output allowed)
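To make the pipeline shape concrete, here is a toy Python sketch of the eight steps as pure functions. Every name, factor, and threshold below is my own illustrative invention for this post; the actual runtime lives entirely in the prompt, not in code, and the real spec is in the repo linked further down:

```python
import json

# Illustrative sketch of the 8-step FRR pipeline as pure functions.
# All names and thresholds are invented for this example.

def parse_input(raw: str) -> dict:
    # Step 1: input parsing (key=value pairs separated by ";")
    return dict(p.split("=", 1) for p in raw.split(";") if "=" in p)

def normalize(fields: dict) -> dict:
    # Step 2: normalization (floats, fixed key order)
    return {k: float(v) for k, v in sorted(fields.items())}

def factor_engine(t: dict) -> list[int]:
    # Step 3: factor vector F1..F12 (only three factors shown here)
    return [
        int(t.get("joint_temp_c", 20.0) < 5.0),      # F1: cold joints
        int(t.get("copv_pressure_var", 0.0) > 0.1),  # F2: COPV instability
        int(t.get("wind_shear_kt", 0.0) > 15.0),     # F3: wind shear
    ]

def risk_mode(factors: list[int]) -> str:
    # Step 4: global RiskMode derived from the factor count
    n = sum(factors)
    return "NOGO" if n >= 2 else ("CAUTION" if n == 1 else "GO")

def evaluate_subsystems(factors: list[int]) -> dict:
    # Step 5: per-subsystem verdicts (toy mapping of factors to subsystems)
    return {
        "structures": "FLAG" if factors[0] else "PASS",
        "propulsion": "FLAG" if factors[1] else "PASS",
        "range":      "FLAG" if factors[2] else "PASS",
    }

def kernel_bus(subsystems: dict, mode: str) -> str:
    # Step 6: KernelBus arbitration: any FLAG or global NOGO forces a hold
    if mode == "NOGO" or "FLAG" in subsystems.values():
        return "HOLD"
    return "PROCEED"

def counterfactual(factors: list[int]) -> list[str]:
    # Step 7: which active factors would, if cleared, change the decision
    return [f"F{i + 1}" for i, f in enumerate(factors) if f]

def frr_run(raw: str) -> str:
    # Step 8: strict FRR_Result block, serialized deterministically
    t = normalize(parse_input(raw))
    f = factor_engine(t)
    mode = risk_mode(f)
    subs = evaluate_subsystems(f)
    return json.dumps({
        "RiskMode": mode,
        "Subsystems": subs,
        "Decision": kernel_bus(subs, mode),
        "Counterfactuals": counterfactual(f),
    }, sort_keys=True)
```

Because every step is a pure function and the final block is serialized with a fixed key order, the same input string always yields byte-identical output, which is exactly the property the prompt-based runtime exhibited.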
Key property:
Same input → same output
(zero drift, zero narrative expansion)
This emergent deterministic behavior is what caught my attention.
📡 Reproducible Test Scenarios
I ran the runtime against several historical-style telemetry snapshots (e.g., cold O-rings, COPV instability, wind-shear cases).
Even though these are not aerospace simulations, the behavior was consistently deterministic:
Stable factor vectors
Stable subsystem arbitration
Stable final decision
No deviation across runs
This reminded me of the constraints and patterns that Guidance tries to formalize.
🎥 Demo Video (3 minutes)
A short screen recording of the deterministic FRR runtime running inside the ChatGPT client:
📦 GitHub Repo (safe, prompt-only)
Spec + soft-system prompt + sample telemetry inputs:
https://github.com/yuer-dsl/qtx-frr-runtime
🔍 Why post this here
This is not a feature request, only an observation:
Strong structural constraints appear to induce deterministic, pipeline-like execution behavior inside an LLM, even without tools.
Given Guidance’s focus on:
structured output
constrained LLM execution
reproducible reasoning
multi-step control flows
I thought this phenomenon might be of interest for future discussions or evaluation benchmarks.
Happy to provide simplified test cases or a reduced prompt if helpful.
Thanks!