DayDreaming Project Goals

This document states the objective, success criteria, scope, and design guidance for the DayDreaming project in a concise, decision‑oriented form.


Purpose

Show that DayDreaming‑style workflows help pre‑June‑2025 offline LLMs generate genuinely novel ideas. We use “DayDreaming LLMs” (from Gwern) as a falsifiable novelty benchmark while requiring generality beyond this single target.


Definitions

  • Fixed inputs: Artifacts authored by us before generation (system/developer messages, templates, concept lists, few‑shot exemplars).
  • Derived inputs: Artifacts generated by a model in an earlier step and then passed forward (e.g., outlines, scaffolding notes, or candidate fragments).

Benchmark Strategy

  • Existence goal: From a universal concept pool suitable for general idea‑generation, demonstrate that there exists at least one k‑sized concept combination which, under the structured prompting/scaffolding approach, yields a valid reinvention of DayDreaming LLMs as judged by the reinvention rubric.
  • Equivalent incremental view: Selecting up to k_max concepts at once is equivalent to adding concepts one at a time and refining the essay after each addition. The incremental framing does not alter the existence claim, since finding a valid concept set in either formulation satisfies the benchmark, but it may offer practical optimization opportunities (e.g., caching partial results, pruning, or beam expansion).
  • Trivial vs. practical existence: Token‑level brute force would trivially reproduce any text; this project instead requires existence within a structured, grammar‑constrained space under tractable budgets.
  • Instructional neutrality: All fixed inputs must remain neutral: they must not name the benchmark target or use its canonical phrases. Derived inputs may mention the target if the model produced that mention independently.
  • Generality: The same scaffolding should surface other novel syntheses with different concept mixes; the benchmark is a proxy for broader capability, not the sole goal.
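The equivalence between one-shot selection and incremental addition can be checked on a toy pool. This is a minimal sketch under stated assumptions: the pool contents, `K_MAX`, and both helper functions are hypothetical and not part of the project's code, and concept selections are treated as unordered sets without repetition.

```python
from itertools import combinations

# Hypothetical toy pool; a real pool is predeclared per run.
POOL = ["memory", "search", "feedback", "sampling"]
K_MAX = 3

def select_at_once(pool, k_max):
    """All non-empty concept sets of size <= k_max, chosen in one shot."""
    return {
        frozenset(combo)
        for k in range(1, k_max + 1)
        for combo in combinations(pool, k)
    }

def select_incrementally(pool, k_max):
    """Grow concept sets one concept at a time, starting from singletons."""
    frontier = {frozenset([c]) for c in pool}
    reached = set(frontier)
    for _ in range(k_max - 1):
        # Extend every set on the frontier by one concept it lacks.
        frontier = {s | {c} for s in frontier for c in pool if c not in s}
        reached |= frontier
    return reached

at_once = select_at_once(POOL, K_MAX)
incremental = select_incrementally(POOL, K_MAX)
same_sets = at_once == incremental
```

The check covers only which concept sets are reachable. It says nothing about whether an essay refined step by step matches one generated from the full set in a single pass.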

Search Space and Practicality

  • Structured search space: Candidate outputs are generated from finite selections of concepts (up to k_max) rendered through fixed, parseable templates.
  • Enumerability: For fixed k_max, the number of possible concept selections under a given grammar is finite and enumerable.
  • Practicality requirement: Success must be achievable within declared budgets of candidate generations and evaluations.
  • Detection: Evaluation verifies that a candidate contains every novel element identified in the current originality report, expressed coherently and with justification.
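To make enumerability and the budget requirement concrete, the size of the selection space can be computed directly. The sketch below assumes unordered selections without repetition, so a pool of n concepts yields the sum of C(n, k) for k from 1 to k_max. The function names, the `templates` multiplier, and the example numbers are illustrative assumptions rather than project values.

```python
from math import comb

def selection_count(pool_size: int, k_max: int) -> int:
    """Number of non-empty unordered selections of at most k_max concepts."""
    return sum(comb(pool_size, k) for k in range(1, k_max + 1))

def fits_budget(pool_size: int, k_max: int, templates: int, budget: int) -> bool:
    """True if exhaustively rendering every selection stays within budget.

    Assumes one candidate generation per (selection, template) pair.
    """
    return selection_count(pool_size, k_max) * templates <= budget

# Hypothetical example: 50 concepts, up to 4 per selection.
total = selection_count(50, 4)  # 50 + 1225 + 19600 + 230300 = 251175
```

Even a modest pool grows quickly with k_max, which is why exhaustive enumeration may exceed a declared budget and why pruning or incremental search can matter in practice.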

Concept Space (Bounded, Uniform, Neutral)

  • Predeclared and finite: Each run uses a predeclared, finite concept pool. All concepts invoked during generation must be drawn from this pool; no mid‑run additions.
  • High‑level abstraction: Concepts are high‑level and general rather than narrow micro‑concepts. This keeps the pool modest in size while remaining expressive, avoiding the “infinite monkeys” combinatorial blow‑up.
  • Uniform representation: Each concept follows a uniform schema (e.g., a short name, a single‑paragraph description, and minimal tags). Uniformity constrains the grammar and reduces degeneracy in generations.
  • Mapping, not proliferation: Emergent synonyms or phrasings in generations are mapped to existing concepts where appropriate rather than expanding the pool ad hoc.
  • Versioned snapshots: The pool is snapshotted and versioned per run. Changes (additions/removals/rewordings) occur only between runs and bump the version.
  • Traceability: For any generation, we can trace which pool concepts were selected or referenced.
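One way to realize the uniform schema, versioned snapshots, and traceability is sketched below. Everything here is an assumption for illustration: the `Concept` fields follow the example schema above, while the content-hash versioning and the helper names are hypothetical choices, not the project's actual mechanism.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Concept:
    """Uniform schema: short name, one-paragraph description, minimal tags."""
    name: str
    description: str
    tags: tuple = ()

def snapshot_version(pool):
    """Content hash of the pool, independent of concept order.

    Any addition, removal, or rewording yields a different version string.
    """
    records = sorted((asdict(c) for c in pool), key=lambda r: r["name"])
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def untraceable(selected_names, pool):
    """Names referenced by a generation that are not in the declared pool."""
    known = {c.name for c in pool}
    return sorted(set(selected_names) - known)
```

A run would record `snapshot_version(pool)` alongside each generation and reject any generation for which `untraceable(...)` is non-empty, which enforces the no-mid-run-additions rule.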

Reinvention Criteria

We commission originality reports on Gwern’s AI Daydreaming essay to identify its novel elements. Based on each report version, we prepare evaluation prompts that assign 0–9 scores to generated essays.

A generation counts as a reinvention when it scores highly on every core element identified in the current report and the evaluators agree on those scores.
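The reinvention rule can be sketched as a simple gate over per-evaluator, per-element scores. The threshold of 7, the function name, and the placeholder element names are hypothetical; the document specifies only a 0–9 scale and agreement across evaluators, and this sketch reads "agreement" as every evaluator clearing the threshold on every element.

```python
def is_reinvention(scores, core_elements, threshold=7):
    """Decide reinvention from 0-9 scores.

    scores: {evaluator_name: {element_name: score}}
    core_elements: element names from the current originality report.
    A missing score counts as 0, so an unscored element fails the gate.
    """
    if not scores or not core_elements:
        return False
    return all(
        evaluator_scores.get(element, 0) >= threshold
        for evaluator_scores in scores.values()
        for element in core_elements
    )

# Placeholder element names; the real ones come from the report.
elements = ["element_a", "element_b"]
example = {
    "evaluator_1": {"element_a": 8, "element_b": 7},
    "evaluator_2": {"element_a": 9, "element_b": 7},
}
result = is_reinvention(example, elements)
```

Using a minimum across elements rather than an average reflects the requirement that all core elements be present; a single weak element blocks the reinvention verdict.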


Scope

  • In scope: Structured idea generation from a neutral concept pool, evaluation, and bookkeeping for reproducibility. Internal scaffolding (e.g., outlines, link lists, critiques) is part of the generator’s process, not a mandated separate phase.
  • Out of scope: Prescribing a specific search algorithm, retrieval/browsing, or target‑specific prompt hints. The project demonstrates existence under structured search, not a particular engine.

Design Guidance

Templates

  • Target neutrality: No target‑specific names or canonical phrasing.
  • Parseable structure: Machine‑checkable sections for reproducibility.
  • Generality: Prompts should work across varied concept mixes.
  • Versioning: Changes to template layout must be tracked.
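Target neutrality of fixed inputs lends itself to an automated check. The sketch below is a minimal, hypothetical version: the function name, the input names, and the banned-phrase list are assumptions, and literal matching will miss paraphrases, so it complements rather than replaces human review. Derived inputs are deliberately not checked, since they may mention the target if the model produced it independently.

```python
import re

def neutrality_violations(fixed_inputs, banned_phrases):
    """Return (input_name, phrase) pairs where a banned phrase appears.

    fixed_inputs: {name: text} for templates, system messages, concept
    descriptions, and few-shot exemplars authored before generation.
    Matching is case-insensitive and respects word boundaries.
    """
    hits = []
    for name, text in fixed_inputs.items():
        for phrase in banned_phrases:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            if pattern.search(text):
                hits.append((name, phrase))
    return hits

# Hypothetical fixed inputs and banned list.
fixed = {
    "system": "Combine the selected concepts into one coherent essay.",
    "template": "Describe a daydreaming loop built from these concepts.",
}
violations = neutrality_violations(fixed, ["daydreaming"])
```

Running such a check whenever a template version changes would tie the neutrality gate to the same versioning discipline as the templates themselves.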

Concepts

  • Universal pool: Draw from general, defensible concepts.
  • Neutrality: Avoid target‑specific micro‑concepts.
  • Traceability: Metadata records whether each concept was drawn from the full pool or from a subset, and why it was chosen.

Evaluation (Concise)

  • Axes: Mechanistic completeness, clarity, justification, novelty of phrasing, domain grounding, coverage of novel elements.
  • Binary gates: Reinvention yes/no; all novel elements present yes/no; fixed‑input neutrality yes/no.
  • Agreement: Track evaluator variance (inter‑rater agreement) against predeclared thresholds.
  • Versioning: Lock each evaluation to the rubric version it was scored under.
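Tracking evaluator variance per axis could look like the following sketch. The variance ceiling of 1.0, the function name, and the example scores are hypothetical; population variance is one reasonable agreement measure among several, and the document does not prescribe one.

```python
from statistics import mean, pvariance

def axis_agreement(scores_by_axis, max_variance=1.0):
    """Summarize evaluator agreement for each evaluation axis.

    scores_by_axis: {axis_name: [one 0-9 score per evaluator]}
    Returns {axis_name: {"mean", "variance", "agreed"}} where "agreed"
    is True when the population variance is within max_variance.
    """
    summary = {}
    for axis, scores in scores_by_axis.items():
        variance = pvariance(scores)
        summary[axis] = {
            "mean": mean(scores),
            "variance": variance,
            "agreed": variance <= max_variance,
        }
    return summary

# Hypothetical scores from three evaluators on two of the axes.
report = axis_agreement({
    "clarity": [7, 8, 7],
    "justification": [2, 9, 5],
})
```

Storing this summary together with the rubric version used for scoring keeps agreement figures comparable only within a rubric version, in line with the versioning rule.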

Verifier (Out of Scope for This Phase)

  • Role: Operationalizes the benchmark by applying the reinvention criteria.
  • Assumptions: The verifier is reliable, neutral, and reproducible under a versioned rubric.
  • Future work: Stronger, low‑cost verifiers (ensembles, learned detectors, information‑theoretic measures) may improve scale and precision.