Merge pull request #36 from Lightricks/aheden/build-data-spec-skill

DanielB945 · web-flow · commit fdd2727e2f17 · 2026-03-10T12:49:17.000+02:00
feat: add build-data-spec skill + fix segmentation CTE case sensitivity
diff --git a/.claude/skills/build-data-spec/SKILL.md b/.claude/skills/build-data-spec/SKILL.md
@@ -0,0 +1,152 @@
+---
+name: build-data-spec
+description: "Build a structured data spec document for analytics topics by exploring the dbt codebase, identifying relevant events/models/columns, and producing a ready-to-use markdown reference. Use when: (1) starting a new data analysis, (2) documenting events for a feature or domain, (3) creating a reference for an agent or analyst."
+tags: [analytics, dbt, documentation, data-spec]
+metadata:
+  author: aheden
+  version: "1.0"
+---
+
+# Build Data Spec
+
+## Target Repository
+
+```
+~/dwh-data-model-transforms
+```
+
+Remote: `origin` -> `github.com/Lightricks/dwh-data-model-transforms.git`
+Default branch: `develop`
+
+**All file reads, searches, and explorations must target this directory**, regardless of which repo this skill is invoked from. Use absolute paths (e.g., `~/dwh-data-model-transforms/models/`) or set the working directory to this repo before running commands.
+
+## When to Use
+
+- User wants to analyze a feature, domain, or event category
+- User needs a reference document for an agent or analyst
+- User says "pull all events for X", "create a data spec", "document events for Y"
+- Starting an analysis that requires understanding which tables, columns, and filters to use
+
+
+## Output
+
+A markdown file saved to `~/ltx-analytics-agents/docs/{feature}_spec.md`.
+
+**Naming convention:** Use the feature name in snake_case (e.g., `brand_kits_spec.md`, `gen_space_spec.md`, `failed_generations_spec.md`). This filename must match what other agents (e.g., dashboard-builder) use to look up the spec.
+
+## Workflow
+
+### Phase 1: Scope the Topic
+
+Clarify with the user:
+
+1. **What to analyze** -- the feature, domain, or event category (e.g., "failed generations", "brand kit usage", "export events")
+2. **Which product** -- LTX Studio, LTX Model, API
+3. **Breadth** -- single feature deep-dive or cross-feature overview
+
+If the user's request is broad, propose a focused scope before proceeding.
+
+### Phase 2: Explore the Codebase
+
+Search systematically across all layers in `~/dwh-data-model-transforms`. Use parallel exploration agents for speed.
+
+**Search targets (in priority order):**
+
+All paths below are relative to `~/dwh-data-model-transforms/`.
+
+| Layer | Where to Look | What to Extract |
+|-------|--------------|-----------------|
+| **Event registry** | `docs/event-registry.yaml` | Canonical event names, key properties, status |
+| **Mart models** | `models/**/marts/` | Final columns, filters, action_name/action_category mapping |
+| **Intermediate models** | `models/**/intermediate/` | Business logic, joins, derived columns |
+| **Base models** | `models/base/` | Raw source columns, process_started/ended pairs |
+| **Macros** | `macros/` | Extraction logic, parsing, field derivation |
+| **Source definitions** | `models/sources.yml` | Raw event table names |
+| **Existing specs** | `docs/*_spec.md` | Related specs to cross-reference |
+
+**Search strategies:**
+
+- Filename search: `Glob` for model names containing the topic keyword
+- Content search: `Grep` for column names, event names, action categories
+- Semantic search: "How does X work?" scoped to relevant directories
+- YAML search: Look at `.yml` files alongside `.sql` for column descriptions and tests
+
+**Read priority:** Always read the SQL model files, not just the `.yml` files. The SQL reveals:
+- Actual column derivation logic (CASE statements, COALESCEs, joins)
+- Filter conditions that define the event scope
+- Macro calls that generate columns
+- Incremental predicates and partition fields
+
+### Phase 3: Read Key Models
+
+For each relevant model found in Phase 2, read the full `.sql` file to extract:
+
+1. **TL;DR block** -- model purpose and key features
+2. **Config block** -- partition_by, cluster_by, schema, tags
+3. **Column definitions** -- all SELECT columns with their derivation logic
+4. **Filter conditions** -- WHERE clauses that scope the data
+5. **Join logic** -- how tables connect (especially start/end event joins)
+6. **Macro calls** -- which macros generate columns (read the macro too)
+
+Also read the `.yml` file for:
+- Column descriptions (especially "In this table:" context)
+- Accepted values tests (reveal valid column values)
+- Data quality tests (reveal important constraints)
+
+### Phase 4: Compile the Spec Document
+
+Write `~/ltx-analytics-agents/docs/{feature}_spec.md` following the structure in [references/spec-template.md](references/spec-template.md).
+
+**Required sections:**
+
+1. **Title + metadata** -- topic, last updated date
+2. **Overview** -- what the spec covers, key definitions
+3. **Primary tables** -- fully-qualified BigQuery table names, partition/cluster info
+4. **Key columns** -- organized by category (error/result, context, timing, parameters, user)
+5. **Filtering patterns** -- ready-to-use WHERE clauses for common scenarios
+6. **Sample analysis queries** -- 4-6 BigQuery queries answering likely questions
+7. **Model lineage** -- ASCII diagram showing source -> base -> intermediate -> mart flow
+8. **Key macros** -- macros involved in column derivation
+9. **Important notes** -- gotchas, caveats, edge cases
+
+**Writing guidelines:**
+
+- Use fully-qualified BigQuery table names (`` `project.schema.table` ``)
+- Include column types (STRING, BOOLEAN, INT64, TIMESTAMP, FLOAT64)
+- Show accepted values inline when known from tests
+- Always include `NOT is_lt_team` in example queries
+- Filter on the partition column in `WHERE` so BigQuery can prune partitions and scan less data
+- Provide both simple filters and full analysis queries
+
+### Phase 5: Validate Completeness
+
+Before finalizing, check:
+
+- [ ] Every column referenced in queries exists in the column tables
+- [ ] Filtering patterns cover the main use cases the user described
+- [ ] Sample queries are syntactically valid BigQuery SQL
+- [ ] Lineage diagram traces from source through to mart
+- [ ] Important notes capture non-obvious behavior (nullability, edge cases, timing windows)
+- [ ] Table names use correct project/schema from model config blocks
+
+## Existing Specs as Reference
+
+Current data spec documents in the project:
+
+| File | Topic | Good Example Of |
+|------|-------|-----------------|
+| `docs/gen_space_events_spec.md` | Gen Space activity | Filtering patterns, page_workspace breakdown |
+| `docs/brand_kits_events_spec.md` | Brand Kit events | Event-to-column mapping, action_category usage |
+| `docs/gen_space_lightbox_actions_spec.md` | Lightbox/asset actions | UI-to-event mapping, cross-feature coverage |
+| `docs/ltxstudio_failed_generations_spec.md` | Failed generations | Error analysis, multi-layer column tracking |
+
+Read these for style and depth calibration when creating a new spec.
+
+## Checklist
+
+- [ ] Topic scoped and confirmed with user
+- [ ] Codebase explored across all layers (mart -> intermediate -> base -> macro)
+- [ ] Key model SQL files read (not just YMLs)
+- [ ] Spec document written with all required sections
+- [ ] Queries validated for correct table names and syntax
+- [ ] File saved to `~/ltx-analytics-agents/docs/{feature}_spec.md`
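
The naming convention in the Output section of the SKILL.md above can be sketched in Python. This is a minimal illustration and not part of the PR: `spec_path` is a hypothetical helper, and the slug rule (lowercase, with runs of non-alphanumeric characters collapsed to underscores) is an assumption about what "snake_case" means here.

```python
import re
from pathlib import Path


def spec_path(feature: str, docs_dir: str = "~/ltx-analytics-agents/docs") -> Path:
    """Return the spec file path for a feature name (hypothetical helper).

    Assumption: snake_case means lowercase with every run of
    non-alphanumeric characters replaced by a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", feature.strip().lower()).strip("_")
    if not slug:
        raise ValueError("feature name must contain at least one letter or digit")
    return Path(docs_dir).expanduser() / f"{slug}_spec.md"
```

Passing the docs directory explicitly keeps the helper usable from either repo; the default mirrors the output path stated in the skill.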
diff --git a/.claude/skills/build-data-spec/references/spec-template.md b/.claude/skills/build-data-spec/references/spec-template.md
@@ -0,0 +1,188 @@
+# Data Spec Template
+
+Use this template when creating a new data spec document. Adapt sections as needed for the specific topic -- not every spec needs every section, and some topics will need additional sections.
+
+---
+
+## File Naming
+
+```
+docs/{feature}_spec.md
+```
+
+Examples:
+- `docs/ltxstudio_failed_generations_spec.md`
+- `docs/gen_space_events_spec.md`
+- `docs/brand_kits_events_spec.md`
+
+---
+
+## Template
+
+```markdown
+# {Feature/Topic} - Data Spec
+
+**Last Updated**: {Month Year}
+**Context**: {Optional link to Figma, ticket, or brief context}
+
+---
+
+## Overview
+
+{1-3 paragraphs explaining what this spec covers, key definitions, and the core question it helps answer.}
+
+---
+
+## Primary Table(s)
+
+### Recommended: {Layer} Layer
+
+\`\`\`
+{fully-qualified BigQuery table name}
+\`\`\`
+
+{1-2 sentences on what it contains and why it's recommended.}
+
+**Partition**: `{column}` ({type}, {granularity})
+**Cluster**: `{col1}`, `{col2}`, `{col3}`
+
+### Alternative: {Layer} Layer (optional section)
+
+\`\`\`
+{alternative table if applicable}
+\`\`\`
+
+{When to use this instead.}
+
+---
+
+## Key Columns
+
+### {Category 1} Columns (e.g., Error & Result, Event Identity, etc.)
+
+| Column | Type | Description | Available In |
+|--------|------|-------------|-------------|
+| `column_name` | STRING | Description of what it contains | Both tables / Mart only |
+
+### {Category 2} Columns (e.g., Context, Timing, Parameters)
+
+| Column | Type | Description |
+|--------|------|-------------|
+| `column_name` | TYPE | Description |
+
+{Repeat for as many categories as needed. Common groupings:}
+{- Error/Result columns}
+{- Event identity columns (action_name, action_category)}
+{- Timing columns}
+{- Feature parameter columns}
+{- User & context columns}
+
+---
+
+## Filtering Patterns
+
+### {Scenario 1} (e.g., "All X events")
+
+\`\`\`sql
+SELECT *
+FROM \`project.schema.table\`
+WHERE {conditions}
+  AND NOT is_lt_team
+  AND date(action_ts) >= CURRENT_DATE() - 30
+\`\`\`
+
+### {Scenario 2}
+
+\`\`\`sql
+...
+\`\`\`
+
+{Include 3-5 filtering patterns covering the most common analysis scenarios.}
+
+---
+
+## Sample Analysis Queries
+
+### {Analysis 1} (e.g., "Daily event counts")
+
+\`\`\`sql
+SELECT
+  date(action_ts) AS dt,
+  {dimensions},
+  COUNT(*) AS total,
+  {metrics}
+FROM \`project.schema.table\`
+WHERE {conditions}
+  AND NOT is_lt_team
+  AND date(action_ts) >= CURRENT_DATE() - 30
+GROUP BY 1, 2
+ORDER BY 1 DESC
+\`\`\`
+
+{Include 4-6 queries answering the most likely analysis questions.}
+{Each query should have a descriptive heading.}
+
+---
+
+## Model Lineage
+
+\`\`\`
+Sources (BigQuery raw events)
+    |
+    +-- {source_table}  --> {base_model}
+    |                          |
+    |                          v
+    |                    {intermediate_model}
+    |                          |
+    |                          v
+    |                    {mart_model}  [MART]
+\`\`\`
+
+---
+
+## Key Macros (optional -- include when macros are involved)
+
+| Macro | Location | Purpose |
+|-------|----------|---------|
+| `macro_name` | `macros/path/to/macro.sql` | What it does |
+
+---
+
+## Important Notes
+
+1. **{Gotcha 1}**: Explanation of non-obvious behavior.
+
+2. **{Gotcha 2}**: Edge case or caveat.
+
+{Include 3-6 notes covering:}
+{- Column semantics that are easy to confuse}
+{- Filtering gotchas (e.g., excluded event types)}
+{- Null behavior and edge cases}
+{- Internal user exclusion reminders}
+{- Timing windows or data freshness caveats}
+```
+
+---
+
+## Style Guidelines
+
+1. **Table names**: Always fully qualified with backtick escaping in queries (`` `project.schema.table` ``)
+2. **Column types**: Use BigQuery types (STRING, INT64, FLOAT64, BOOLEAN, TIMESTAMP, DATE, ARRAY)
+3. **Accepted values**: Show inline when known (e.g., `'success'` or `'failure'`)
+4. **Queries**: Always include `NOT is_lt_team` and partition-based date filter
+5. **Headings**: Use `###` for sub-sections within main `##` sections
+6. **Tables**: Use markdown tables for structured column/event documentation
+7. **Code blocks**: Use the `sql` language tag for queries and untagged fences for table names and diagrams
+8. **Lineage**: ASCII art with arrows for flow direction
+9. **Tone**: Direct and reference-like -- this is a lookup document, not a tutorial
+
+## Calibration
+
+Look at these existing specs for style and depth:
+
+- `docs/gen_space_events_spec.md` -- medium complexity, good filtering patterns
+- `docs/brand_kits_events_spec.md` -- feature-specific, event-to-column mapping
+- `docs/ltxstudio_failed_generations_spec.md` -- multi-layer analysis, error handling
+- `docs/gen_space_lightbox_actions_spec.md` -- UI-to-event mapping, comprehensive
+
+Match the depth and structure of the most similar existing spec.
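
Style guideline 4 above (every query includes `NOT is_lt_team`) lends itself to a mechanical check. The sketch below is illustrative only: `queries_missing_team_filter` is a hypothetical name, and it assumes the finished spec uses ordinary `sql`-tagged fences rather than the escaped fences shown inside the template. It does not check the partition-based date filter, since the partition column differs per table.

```python
import re
from typing import List

# Build the fence marker programmatically so this example contains no
# literal triple-backtick sequence of its own.
FENCE = "`" * 3
SQL_BLOCK = re.compile(re.escape(FENCE) + r"sql\n(.*?)" + re.escape(FENCE), re.DOTALL)


def queries_missing_team_filter(spec_markdown: str) -> List[str]:
    """Return the sql-fenced queries that lack a `NOT is_lt_team` predicate."""
    missing = []
    for query in SQL_BLOCK.findall(spec_markdown):
        # Collapse whitespace and case so formatting differences do not matter.
        normalized = " ".join(query.lower().split())
        if "not is_lt_team" not in normalized:
            missing.append(query.strip())
    return missing
```

This is a plain substring match, so a query that mentions the filter only in a comment would still pass; treat it as a first-pass lint rather than a validator.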
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -26,6 +26,7 @@ If the request doesn't clearly match an agent, ask the user which they need.
 |---|---|---|
 | Create a PR, open a pull request, ship it, submit for review | **Create PR** | `.claude/skills/create-pr/SKILL.md` |
 | Create a new skill, automate a repeatable task, document a team pattern | **Create Skill** | `.claude/skills/create-skill/SKILL.md` |
+| Build a data spec, document events for a feature, pull all events for X | **Build Data Spec** | `.claude/skills/build-data-spec/SKILL.md` |
 
 ## Shared Knowledge
 
diff --git a/agents/dashboard-builder/SKILL.md b/agents/dashboard-builder/SKILL.md
@@ -72,7 +72,13 @@ PHASE 4: VALIDATE → QA Report + Slack notification
    - Extract any context that helps understand the feature's purpose, target users, or success criteria
    - Note any stakeholder comments about what metrics they care about
 
-6. **Present Feature Brief** using template from `agents/dashboard-builder/templates/feature-brief.md`:
+6. **Build or load a Data Spec:**
+   - Check if `docs/{feature}_spec.md` already exists
+   - If it exists, read it — it contains verified table names, key columns, filtering patterns, and sample queries
+   - If it doesn't exist, follow the build-data-spec skill (`.claude/skills/build-data-spec/SKILL.md`) to create one by exploring the dbt codebase at `~/dwh-data-model-transforms`
+   - The data spec feeds directly into Phase 2: use its table names, column details, and filtering patterns when building the Hex Context Brief
+
+7. **Present Feature Brief** using template from `agents/dashboard-builder/templates/feature-brief.md`:
    - List all events with verification status (✅ registry / ⚠️ code only / ❓ unverified)
    - Propose funnel steps
    - Propose retention activation + return events
@@ -350,6 +356,8 @@ After each metric executes in Hex, review the results and check for:
 ### Agent-specific
 | File | Read in phase |
 |------|---------------|
+| `docs/{feature}_spec.md` (if exists) | Phase 1, Phase 2 |
+| `.claude/skills/build-data-spec/SKILL.md` | Phase 1 (if spec missing) |
 | `agents/dashboard-builder/hex-prompts/patterns.md` | Phase 3 |
 | `agents/dashboard-builder/templates/feature-brief.md` | Phase 1 |
 | `agents/dashboard-builder/templates/dashboard-spec.md` | Phase 2 |
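
The "Build or load a Data Spec" step added to the dashboard-builder above amounts to a file-existence branch. A minimal sketch, assuming the caller already has the feature slug and the docs directory; `load_or_build` and its return convention are hypothetical and do not appear in the PR.

```python
from pathlib import Path
from typing import Optional, Tuple


def load_or_build(feature_slug: str, docs_dir: Path) -> Tuple[str, Optional[str]]:
    """Decide whether an existing spec can be loaded or one must be built.

    Returns ("load", <spec text>) when docs_dir/{feature_slug}_spec.md exists,
    otherwise ("build", None) to signal that the build-data-spec skill should run.
    """
    spec_file = docs_dir / f"{feature_slug}_spec.md"
    if spec_file.is_file():
        return "load", spec_file.read_text(encoding="utf-8")
    return "build", None
```

The lookup only succeeds when the slug matches the filename the spec was saved under, which is why the skill asks for a consistent naming convention.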
diff --git a/docs/failed_generations_spec.md b/docs/failed_generations_spec.md
diff --git a/shared/bq-schema.md b/shared/bq-schema.md