This document explains how the Dagster implementation in src/daydreaming_dagster/
is wired together and how it consumes the cohort contract described in
docs/cohorts.md. Read the cohorts guide first—it is the
authoritative reference for the spec bundle, manifest, and membership.csv
contract as produced from the spec DSL in docs/spec_dsl.md.
The architecture notes below focus on how those artifacts are used by the
assets, resources, and helpers inside the codebase.
```
cohort spec bundle (docs/cohorts.md)      raw catalogs (data/1_raw/*.csv)
            │                                        │
            ▼                                        ▼
cohort asset group (group_cohorts.py) ───► raw_data assets (raw_data.py)
            │                                        │
            ▼                                        ▼
membership.csv + metadata seeds            catalog DataFrames
            │
            ▼
dynamic partitions (partitions.py) ⇆ unified stage runner (unified/)
            │                                │
            ▼                                ▼
draft / essay / evaluation gens (group_* assets + GensPromptIOManager)
            │
            ▼
cohort-scoped reporting (results_processing.py, results_summary.py,
                         results_analysis.py, scripts/build_pivot_tables.py)
```
The cohort assets translate a declarative spec into deterministic gen_id
partitions and seed the gens store with metadata. Stage assets then materialize
per-partition runs through the unified runner, and reporting assets consume the
persisted artifacts using the cohort membership contract.
- Cohort-first orchestration. The cohort bundle is the only declarative input. Assets in `group_cohorts.py` compile it into `membership.csv`, which downstream code reads through helpers like `MembershipServiceResource` and `CohortScope`. No stage asset re-opens the spec or manifest directly. See `docs/cohorts.md` for the expected contents of the spec bundle and `docs/spec_dsl.md` for how structured couplings and factorial axes are authored.
- Deterministic partitions. Generation IDs are derived from the stage-specific signatures in `daydreaming_dagster.utils.ids`. The IDs are written to disk and registered as Dagster dynamic partitions via `register_cohort_partitions` (`assets/partitions.py`). Re-materialising the same cohort with unchanged specs reuses the same IDs and yields Dagster `SKIPPED` runs.
- Filesystem-backed transparency. All prompts, raw LLM responses, parsed outputs, and metadata live under `data/gens/<stage>/<gen_id>/` using the canonical filenames exported by `daydreaming_dagster.data_layer.paths`. This makes it possible to debug or reprocess artefacts offline.
- Unified execution primitives. The stage assets call into the shared runner (`src/daydreaming_dagster/unified/`) so that prompt rendering, LLM execution, parsing, and persistence follow the same contract across draft, essay, and evaluation stages.
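The deterministic-partitions principle can be sketched as a pure function of a stage name and its signature fields. This is a hypothetical stand-in for `daydreaming_dagster.utils.ids` (the real ID scheme may differ), illustrating why re-compiling an unchanged cohort reuses the same partitions:

```python
import hashlib

def deterministic_gen_id(stage: str, signature: tuple, length: int = 12) -> str:
    # Hash the stage name plus its signature fields into a stable hex suffix.
    # Identical inputs always produce identical IDs, so re-materialising an
    # unchanged cohort is idempotent.
    payload = "|".join((stage, *signature)).encode("utf-8")
    return f"{stage}-{hashlib.sha256(payload).hexdigest()[:length]}"
```

Because the ID depends only on its inputs, registering partitions for the same cohort twice is a no-op rather than a duplication.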
| Layer | Modules | Notes |
|---|---|---|
| Cohort compilation | `assets/group_cohorts.py`, `cohorts/`, `cohorts.md` | Builds `membership.csv`, seeds metadata via `seed_cohort_metadata`, and writes `manifest.json`. Validates against raw catalogs (`data/1_raw/*.csv`). |
| Dynamic partitions | `assets/partitions.py` | Registers per-stage `gen_id` partitions from the cohort membership DataFrame; add-only by design. |
| Stage execution | `assets/group_draft.py`, `assets/group_essay.py`, `assets/group_evaluation.py`, `unified/stage_core.py` | Assets invoke the unified runner with IO managers from the stage registry (see below). Essay assets can operate in copy mode by passing `pass_through_from` to the runner. |
| Raw inputs | `assets/raw_data.py` | Loads catalog CSVs and template text into Dagster assets. These are regular assets, not auto-refreshing source assets. |
| Reporting | `assets/results_processing.py`, `assets/results_summary.py`, `assets/results_analysis.py`, `results_summary/transformations.py`, `scripts/build_pivot_tables.py` | Operate exclusively on persisted gens artefacts and cohort membership lookups. |
| Shared utilities | `data_layer/paths.py`, `data_layer/gens_data_layer.py`, `resources/`, `utils/` | Provide filesystem indirection, IO managers, membership lookups, and parsing helpers. |
`src/daydreaming_dagster/definitions.py` builds the Dagster `Definitions` object
by combining:
- Stage registry (`STAGES`). Each stage entry enumerates its assets and exposes factories that build stage-specific IO managers. For example, the draft stage wires `draft_prompt_io_manager` (a `GensPromptIOManager`) and `draft_response_io_manager` (a `RehydratingIOManager`). Adding a new stage only requires extending the registry.
- Shared resources. The shared map injects the cohort spec compiler, membership service, experiment configuration, and CSV IO managers. These resources are accessed by cohort, generation, and reporting assets to resolve config and storage paths.
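The registry pattern can be sketched as follows. The entry shape, factory names, and placeholder factories here are all assumptions for illustration; the real `STAGES` structure in the codebase may differ:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass(frozen=True)
class StageSpec:
    """Hypothetical shape of one STAGES registry entry."""
    asset_names: tuple
    io_managers: dict = field(default_factory=dict)

STAGES = {
    "draft": StageSpec(
        asset_names=("draft_prompt", "draft_response"),
        io_managers={
            # dict stands in for the real GensPromptIOManager /
            # RehydratingIOManager factories in this sketch.
            "draft_prompt_io_manager": dict,
            "draft_response_io_manager": dict,
        },
    ),
}

def build_resources() -> dict:
    """Merge every stage's IO-manager factories into one resource map,
    mirroring how adding a registry entry wires a new stage in."""
    resources: dict = {}
    for spec in STAGES.values():
        for name, factory in spec.io_managers.items():
            resources[name] = factory()
    return resources
```

The point of the pattern is that `build_resources` never changes when a stage is added; only the registry grows.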
The `build_definitions` helper also loads the cohort assets, stage assets, raw
catalog assets (`RAW_SOURCE_ASSETS`), reporting assets, schedules, and asset
checks into a single Dagster `Definitions` object. The executor switches between
multi-process and in-process mode based on the `DD_IN_PROCESS` environment
variable.
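The executor switch reduces to an environment check along these lines (the exact truthiness rules applied to `DD_IN_PROCESS` are an assumption of this sketch):

```python
import os

def executor_mode() -> str:
    # Sketch: any non-empty DD_IN_PROCESS value selects the in-process
    # executor; otherwise the multi-process executor is used.
    return "in_process" if os.environ.get("DD_IN_PROCESS") else "multiprocess"
```

In-process mode is typically useful for debugging, since breakpoints and stack traces stay in a single interpreter.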
All stage IO managers derive paths from `Paths`, which resolves the data root
(`data/` by default or `DAYDREAMING_DATA_ROOT` when set). Key helpers include:

- `generation_dir(stage, gen_id)` – gens store directory for a partition.
- `prompt_path`, `raw_path`, `parsed_path`, `metadata_path` – canonical file locations used by the unified runner and tests.
- `cohort_membership_csv(cohort_id)` – location of the canonical cohort membership table consumed by `MembershipServiceResource` and reporting assets.
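The helpers above amount to path arithmetic over the data root. This is an illustrative stand-in for `data_layer/paths.py`, not the real class; the membership CSV location and exact filenames are assumptions:

```python
import os
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class PathsSketch:
    """Illustrative stand-in for the real Paths helper."""
    root: Path

    @classmethod
    def from_env(cls) -> "PathsSketch":
        # data/ by default, overridden by DAYDREAMING_DATA_ROOT when set.
        return cls(Path(os.environ.get("DAYDREAMING_DATA_ROOT", "data")))

    def generation_dir(self, stage: str, gen_id: str) -> Path:
        return self.root / "gens" / stage / gen_id

    def parsed_path(self, stage: str, gen_id: str) -> Path:
        # parsed.txt matches the canonical filename mentioned in this doc.
        return self.generation_dir(stage, gen_id) / "parsed.txt"

    def cohort_membership_csv(self, cohort_id: str) -> Path:
        # Assumed to live under the cohort directory.
        return self.root / "cohorts" / cohort_id / "membership.csv"
```

Keeping every consumer on these helpers means relocating the data root is a one-line change.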
`GensPromptIOManager` writes prompt files into the gens store, and the
`RehydratingIOManager` reloads parsed responses or metadata when assets are
retrieved.
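The write/rehydrate behavior can be shown without Dagster machinery. This minimal sketch mimics the IO-manager contract (the real classes subclass Dagster's `IOManager`, and the `prompt.txt` filename is an assumption here):

```python
from pathlib import Path

class GensPromptIOManagerSketch:
    """Minimal sketch: persist prompts into the gens store on output,
    re-read them from disk on input."""

    def __init__(self, gens_root: Path, stage: str):
        self.gens_root = gens_root
        self.stage = stage

    def handle_output(self, gen_id: str, prompt: str) -> Path:
        # Write the prompt under data/gens/<stage>/<gen_id>/.
        path = self.gens_root / self.stage / gen_id / "prompt.txt"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(prompt, encoding="utf-8")
        return path

    def load_input(self, gen_id: str) -> str:
        # Rehydrate the persisted prompt for a downstream asset.
        return (self.gens_root / self.stage / gen_id / "prompt.txt").read_text(encoding="utf-8")
```

Because everything round-trips through the filesystem, any artefact can be inspected or re-run offline.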
- `seed_cohort_metadata` (called inside `cohort_membership`) populates each generation directory with a `metadata.json` stub so downstream stage runs know their parent IDs, templates, and cohort provenance before LLM calls happen.
- `MembershipServiceResource.stage_gen_ids` offers a thin wrapper around `utils.membership_lookup.stage_gen_ids`, allowing assets and scripts to pull stage-specific `gen_id` lists for the active cohort. Reporting assets use this to scope their work without touching raw CSVs.
- `ScoresAggregatorResource` reads `parsed.txt` and `metadata.json` for each evaluation `gen_id`, providing the foundation for `cohort_aggregated_scores`.
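The membership lookup is essentially a filtered read of `membership.csv`. This sketch shows the idea; the real helper lives in `utils.membership_lookup`, and the column names are assumptions:

```python
import csv
from pathlib import Path

def stage_gen_ids_sketch(membership_csv: Path, stage: str) -> list:
    """Return the gen_ids of membership rows belonging to one stage.
    Assumed columns: 'stage' and 'gen_id'."""
    with membership_csv.open(newline="", encoding="utf-8") as fh:
        return [row["gen_id"] for row in csv.DictReader(fh) if row.get("stage") == stage]
```

Routing all scoping through one lookup keeps reporting assets from re-parsing the spec or manifest themselves.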
1. Author a cohort spec. Follow `docs/cohorts.md` to produce a spec bundle under `data/cohorts/<cohort_id>/spec/`.
2. Materialize cohort assets. Run the `cohort_id` and `cohort_membership` assets. They load the spec via `CohortSpecResource`, validate against the raw catalogs (`raw_data.py`), write `manifest.json` and `membership.csv`, and seed gens metadata (`seed_cohort_metadata`).
3. Register partitions. `register_cohort_partitions` consumes the returned membership DataFrame and registers add-only dynamic partitions for draft, essay, and evaluation stages. Use `prune_dynamic_partitions` before this step if you need a clean slate.
4. Execute stage assets per partition. The unified runner renders prompts, makes LLM calls (or copies parsed drafts), parses outputs, and persists artefacts into the gens store. Parent/child relationships use the deterministic `gen_id` links generated during cohort compilation.
5. Aggregate and analyse results. Reporting assets read persisted files via `Paths` and `MembershipServiceResource`, produce cohort-scoped CSVs under `data/cohorts/<cohort_id>/reports/`, and expose convenience pivots and variance analyses.
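The aggregation step reduces to pairing each evaluation partition's `parsed.txt` with its `metadata.json`. A minimal sketch, assuming a `parent_gen_id` field in the metadata (the real `ScoresAggregatorResource` may read additional fields):

```python
import json
from pathlib import Path

def aggregate_scores(gens_root: Path, eval_gen_ids: list) -> list:
    """For each evaluation gen_id, join its parsed score with metadata
    from the gens store. Field names are illustrative assumptions."""
    rows = []
    for gen_id in eval_gen_ids:
        gen_dir = gens_root / "evaluation" / gen_id
        meta = json.loads((gen_dir / "metadata.json").read_text(encoding="utf-8"))
        rows.append({
            "gen_id": gen_id,
            "score": (gen_dir / "parsed.txt").read_text(encoding="utf-8").strip(),
            "parent_gen_id": meta.get("parent_gen_id"),
        })
    return rows
```

Since it reads only persisted files plus the membership-scoped ID list, the aggregation can be re-run at any time without re-invoking any LLM.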
This end-to-end flow keeps the experiment reproducible: the cohorts document defines the contract, the architecture described here turns it into deterministic Dagster partitions, and the reporting layer consumes only the persisted artefacts and membership metadata.