diff --git a/docs/docs.json b/docs/docs.json index 09c78e1..99aac1a 100644 --- a/docs/docs.json +++ b/docs/docs.json @@ -188,6 +188,17 @@ { "tab": "Guides", "groups": [ + { + "group": "Data Analysis", + "pages": [ + "guides/data/overview", + "guides/data/foundation", + "guides/data/ad-hoc-analysis", + "guides/data/repeated-reports", + "guides/data/pipelines", + "guides/data/migrations" + ] + }, { "group": "Power User", "pages": [ diff --git a/docs/guides/data/ad-hoc-analysis.mdx b/docs/guides/data/ad-hoc-analysis.mdx new file mode 100644 index 0000000..555d634 --- /dev/null +++ b/docs/guides/data/ad-hoc-analysis.mdx @@ -0,0 +1,182 @@ +--- +title: "Part 2: Ad-Hoc Analysis" +description: Product analytics, billing queries, engineering metrics, and structuring effective questions. +keywords: ['data', 'analytics', 'sql', 'queries', 'product', 'billing', 'metrics'] +--- + +This is where agent-native data work really shines. The feedback loop between question and answer shrinks from minutes to seconds. + +Traditional data analysis involves a lot of friction. You have a question. You open your SQL client. You remember (or look up) which tables you need. You write the query. You debug the syntax error. You realize you used the wrong field. You fix it. You get your answer. By then, you've forgotten why you asked. + +With an agent that knows your data, the workflow becomes conversational. You describe what you want to know in plain language. The agent writes the query, runs it, and shows you results. If something looks wrong, you ask follow-up questions. + +The key is learning how to structure these conversations for different types of analysis. + +## Product Analytics + +Understanding user behavior is one of the most common use cases. The questions usually follow a pattern: + +**User-level questions:** +- "What did user X do yesterday?" +- "Who are the most active users in org Y?" +- "Show me users who haven't logged in for 30 days" + +**Org-level questions:** +- "How is org X trending over the past month?" +- "Which orgs have the highest engagement?" +- "Show me orgs that started a trial but haven't converted" + +Here's how a real conversation might look: + +``` +You: Show me the top 10 users by consumption at Acme Corp this month + +Agent: I'll query the user activity table for that org. + +[runs query] + +| user_email | usage_30d | messages | +|------------------------|----------------|----------| +| alice@acme.com | 45.2M | 3,241 | +| bob@acme.com | 38.7M | 2,892 | +| charlie@acme.com | 22.1M | 1,456 | +... + +Alice is consuming about 17% more than the next highest user. +Her average usage per message (13,946) is also higher than +the org average (11,203), suggesting she's working on more +complex tasks. +``` + +The agent interprets results and surfaces interesting patterns automatically. + +## Billing and Revenue Analysis + +Billing data requires extra care because mistakes can affect real money. The agent needs to know which tables are authoritative. + +Your AGENTS.md should encode your source of truth. For example, if you use Stripe, Orb, Chargebee, or another billing system: + +```markdown +## Billing Source of Truth + +- Organization-level billing: Use `dim_customers.mrr` (synced from [your billing system]) +- User-level attribution: Use `fct_user_usage` (approximation from event data) +- Raw invoice data: Use `billing.invoices` + +Never use `events.amount` for billing - it's unreliable. 
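+
+Canonical plan-level MRR (sketch; adapt to your schema): `SELECT plan_name, SUM(mrr) FROM dim_customers WHERE is_active = true GROUP BY plan_name`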
+``` + +With this knowledge, the agent can answer questions like: + +``` +You: What's our monthly revenue by plan type? + +Agent: I'll aggregate from the billing tables. + +[runs query] + +| plan_name | mrr | customers | avg_per_customer | +|-----------------|---------------|------------|------------------| +| Enterprise | $42,361,500 | 1,150 | $36,836 | +| Pro | $11,728,000 | 62,350 | $188 | +| Free | $0 | 934,000 | $0 | + +Enterprise customers represent 78% of revenue from +less than 1% of the customer base. +``` + +## Engineering Metrics + +Engineering teams generate tons of useful data that often goes unanalyzed. + +**Error analysis:** +- "What are our most common error types this week?" +- "Which endpoints have the highest latency?" +- "Show me errors correlated with the deploy on Tuesday" + +**Pipeline health:** +- "When did our ETL job last run successfully?" +- "Are there any data freshness issues?" +- "How long is our dbt run taking?" + +**Usage patterns:** +- "Which API endpoints are getting the most traffic?" +- "What's our p99 response time trend?" +- "Show me requests that are hitting rate limits" + +If your observability data is queryable (whether you're using Datadog Log Archives, Axiom, OpenTelemetry exports to your warehouse, or logs in S3/GCS), the agent can analyze it the same way: + +``` +You: Show me the top error messages from the last 24 hours + +Agent: I'll query the traces table for error spans. + +[runs query] + +| error_message | count | affected_users | +|---------------------------------------|-------|----------------| +| Connection timeout to postgres | 1,247 | 89 | +| Rate limit exceeded | 834 | 234 | +| Invalid authentication token | 412 | 156 | + +The postgres connection timeouts spiked around 3pm yesterday. +That correlates with when we deployed v2.4.1. Want me to look +at which specific queries were timing out? +``` + +## Structuring Good Questions + +Some question patterns work better than others. Here's what we've learned: + + + + - Bad: "Show me user activity" + - Good: "Show me user activity for the past 7 days" + + + - Bad: "What's our usage looking like?" + - Good: "What's usage looking like for Acme Corp?" + + + - Bad: "Is this number good?" + - Good: "How does this compare to last month?" + + + - Bad: "Show me the data" + - Good: "Show me a breakdown by day in a table" + + + +The agent will usually figure out what you mean, but explicit questions get better results faster. + +## When Things Go Wrong + +Sometimes the agent will write a query that returns unexpected results. This is where domain knowledge matters. + +Common issues we've seen: + +**Wrong table**: The agent uses `events_raw` instead of the deduped staging table. Results are inflated due to duplicate events. Fix: add this to your pitfalls doc so it never happens again. + +**Missing filter**: The agent forgets to filter inactive records. Numbers include churned customers. Fix: make the default query template include the filter. + +**Timezone mismatch**: The agent uses UTC timestamps when comparing to billing data that uses a different timezone. Daily totals don't match. Fix: document the timezone convention and the correct conversion pattern for your warehouse. + +Each of these becomes a permanent lesson. You add it to your AGENTS.md or skills, and the agent won't make that mistake again. This is the compounding effect. + + +When you catch an agent mistake, always add it to your pitfalls documentation. This turns a one-time fix into a permanent lesson. 
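+
+For example, the "wrong table" pitfall above usually turns into a documented dedupe pattern. A minimal sketch of the staging query you might record, assuming `events_raw` has an `event_id` and an ingestion timestamp (column names are illustrative):
+
+```sql
+-- Keep one row per event_id, preferring the most recently ingested copy.
+-- Replace column names with your actual schema.
+SELECT event_id, user_id, event_type, occurred_at
+FROM (
+  SELECT
+    event_id, user_id, event_type, occurred_at,
+    ROW_NUMBER() OVER (
+      PARTITION BY event_id
+      ORDER BY ingested_at DESC
+    ) AS rn
+  FROM events_raw
+) deduped
+WHERE rn = 1;
+```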
+ + +## Building Intuition + +The more you use agent-assisted analysis, the more you develop intuition for what's possible. Questions that used to feel like "big projects" become quick checks: + +- "Before this meeting, pull up the usage trends for these 5 accounts" +- "That bug report mentions slow queries. What's the p99 for that endpoint?" +- "The CEO asked about trial conversions. What's our current rate?" + +The agent becomes an extension of your own analytical capability. You start asking questions you wouldn't have bothered with before because the cost of getting an answer is so low. + +## What's Next + +[Part 3](/guides/data/repeated-reports) covers building repeated reports: weekly metrics, automated monitoring, and SaaS replacement analysis. diff --git a/docs/guides/data/foundation.mdx b/docs/guides/data/foundation.mdx new file mode 100644 index 0000000..5c63e3c --- /dev/null +++ b/docs/guides/data/foundation.mdx @@ -0,0 +1,210 @@ +--- +title: "Part 1: Building Your Data Foundation" +description: Connect agents to your data warehouse and teach them your schema and domain knowledge. +keywords: ['data', 'warehouse', 'bigquery', 'snowflake', 'schema', 'skills', 'setup'] +--- + +The biggest unlock for agent-based data work is the access layer. + +When you give an agent the ability to query your data warehouse, something interesting happens. The barrier between "I wonder if..." and "here's the answer" collapses. Questions that used to require context-switching into a SQL client, remembering table schemas, and debugging syntax errors now flow naturally in conversation. + +But getting there requires some upfront investment. You need to connect your agent to your data sources, teach it your schema conventions, and encode the tribal knowledge that prevents costly mistakes. + +## The Access Layer + +Your agent needs three things to work effectively with data: + +1. **Tool access** to execute queries +2. **Schema knowledge** to know what's available +3. **Domain knowledge** to avoid common mistakes + +## Connecting to Your Warehouse + +The simplest path is giving your agent access to your warehouse's CLI. Once authenticated, it can run any query directly. + + + + ```bash + gcloud auth application-default login --account=you@company.com + gcloud config set project your-project + bq query --use_legacy_sql=false "SELECT * FROM analytics.users LIMIT 10" + ``` + + + ```bash + snowsql -c your_connection -q "SELECT * FROM analytics.users LIMIT 10" + ``` + + + ```bash + databricks sql execute --sql "SELECT * FROM analytics.users LIMIT 10" + ``` + + + ```bash + psql -h your-cluster.redshift.amazonaws.com -U user -d analytics -c "SELECT * FROM users LIMIT 10" + ``` + + + ```bash + clickhouse-client --query "SELECT * FROM analytics.users LIMIT 10" + ``` + + + ```bash + duckdb analytics.db "SELECT * FROM users LIMIT 10" + ``` + + + ```bash + psql -c "SELECT * FROM users LIMIT 10" + ``` + + + +The key is that your agent can execute queries and see results directly. Every major warehouse has a CLI, and they all work the same way from the agent's perspective: run a command, get tabular output. + + +If you're using MCP (Model Context Protocol), there are pre-built servers for most data warehouses. MCP servers handle connection pooling and provide a standardized interface. But for most teams, the CLI approach is simpler to set up and debug. + + +## Teaching Schema Knowledge + +Raw CLI access isn't enough. Your agent needs to know what tables exist and what they contain. 
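+
+Most warehouses expose this through an information schema the agent can query on demand. A minimal sketch using the standard `information_schema` views (exact qualification varies slightly by warehouse; BigQuery scopes it per dataset):
+
+```sql
+-- List every table and column in the analytics schema so the agent can orient itself.
+SELECT table_name, column_name, data_type
+FROM information_schema.columns
+WHERE table_schema = 'analytics'  -- replace with your schema or dataset name
+ORDER BY table_name, ordinal_position;
+```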
+ +The naive approach is dumping your entire schema into context. This works for small warehouses but falls apart quickly. A better pattern is building a quick-reference guide that maps common questions to specific tables. + +Here's an example: + +```markdown +## Quick Decision Guide + +| What You Need | Use This Table | Example | +|---------------|----------------|---------| +| Org-level metrics (30d, lifetime) | `dim_organizations` | `SELECT revenue_30d FROM dim_organizations WHERE id = 'ORG_ID'` | +| User-level activity | `fct_user_activity` | `SELECT user_email, activity_30d FROM fct_user_activity WHERE organization_id = 'ORG_ID'` | +| Session details | `fct_sessions` | `SELECT * FROM fct_sessions WHERE session_id = 'SESSION_ID'` | +| Trial analytics | `fct_trials` | `SELECT * FROM fct_trials WHERE is_trial_active = true` | +``` + +This approach scales because you're encoding the answer to "which table do I use for X?" rather than trying to document every column. + +## Encoding Domain Knowledge with Skills + +Schema knowledge tells your agent what exists. Domain knowledge tells it how to use things correctly. + +One effective pattern is organizing domain knowledge into "skills" that get loaded based on context. Each skill is a focused document covering one domain. + +``` +.factory/skills/ +├── salesforce.md # CRM queries and account lookups +├── bigquery-tables.md # Table schemas and structure +├── otel-pipeline.md # OpenTelemetry trace ingestion +├── token-queries.md # Usage and billing metrics +└── reports.md # Generating recurring reports +``` + +The critical one is `data-pitfalls`. This is the list of everything that will bite you if you don't know about it: + +```markdown +## Common Pitfalls + +### Don't Query Raw Tables Directly +- ❌ `events_raw` often has duplicates from ingestion +- ✅ Use `stg_events` or your deduped staging layer + +### Always Filter Inactive Records +- ❌ Counting all records in production analytics +- ✅ Add `WHERE is_active = true` or equivalent + +### Use Native Timezone Handling +- ❌ `DATEADD(hour, -8, timestamp)` breaks twice yearly for DST +- ✅ Use your warehouse's timezone conversion: + - BigQuery: `DATETIME(timestamp, 'America/Los_Angeles')` + - Snowflake: `CONVERT_TIMEZONE('America/Los_Angeles', timestamp)` + - Postgres: `timestamp AT TIME ZONE 'America/Los_Angeles'` + +### Never Query "Today" for Financial Data +- ❌ `revenue_today` is incomplete due to timing lag +- ✅ Use `revenue_yesterday` or completed periods for reliable comparisons +``` + +This is the compounding part. Every time someone discovers a gotcha, it goes into the pitfalls doc. The agent learns it permanently. The mistake never happens again. + +## Enabling Your Team + +Once you have the access layer working, sharing it with your team is straightforward. Check everything into your repo: + +``` +your-data-repo/ +├── AGENTS.md # Main agent instructions +├── .factory/skills/ # Domain-specific knowledge +│ ├── data-pitfalls.md +│ └── revenue-queries.md +├── dbt/ # Transformation models +└── docs/ + └── QUICK_REFERENCE.md # Table lookup guide +``` + +New team members clone the repo and immediately have an agent that understands your data warehouse. The instructions file at the root gives the overview: + +```markdown +# Data Engineering Agent Guidelines + +## Quick Start +1. Authenticate with your warehouse CLI +2. Query data: [your warehouse command here] +3. 
For transformation work: `cd dbt && dbt run --select model_name` + +## Table Quick Reference +[mapping table here] + +## Common Mistakes +[pitfalls here] +``` + +The goal is that anyone on your team can ask "show me our highest-usage customers this month" and get an accurate answer without knowing which tables to use or which gotchas to avoid. + +## Platform-Specific Setup + + + + | Platform | Install | Auth | Test Query | + |----------|---------|------|------------| + | BigQuery | `brew install google-cloud-sdk` | `gcloud auth application-default login` | `bq query "SELECT 1"` | + | Snowflake | `brew install snowflake-snowsql` | SSO via config file | `snowsql -q "SELECT 1"` | + | Databricks | `pip install databricks-cli` | Personal access token | `databricks sql execute --sql "SELECT 1"` | + | Redshift | `psql` (built-in) | IAM or password | `psql -c "SELECT 1"` | + + + + | Platform | Install | Test Query | + |----------|---------|------------| + | ClickHouse | `brew install clickhouse` | `clickhouse-client --query "SELECT 1"` | + | DuckDB | `brew install duckdb` | `duckdb -c "SELECT 1"` | + | Postgres | `brew install postgresql` | `psql -c "SELECT 1"` | + | Trino | `brew install trino` | `trino --execute "SELECT 1"` | + + + + | Tool | Install | Config Location | Run Models | + |------|---------|-----------------|------------| + | dbt | `pip install dbt-[adapter]` | `~/.dbt/profiles.yml` | `dbt run` | + | SQLMesh | `pip install sqlmesh` | `config.yaml` | `sqlmesh run` | + | Dataform | Via GCP Console | `dataform.json` | `dataform run` | + + + + | Tool | Data Access | + |------|-------------| + | Datadog | API or Log Archive to S3/GCS | + | Axiom | Direct SQL via ClickHouse protocol | + | Honeycomb | Query API | + | Grafana/Loki | LogQL or direct datasource | + | OpenTelemetry | Export to your warehouse | + + + +## What's Next + +With the foundation in place, you can start doing real work. [Part 2](/guides/data/ad-hoc-analysis) covers ad-hoc analysis: the kinds of questions you can ask and what to expect when things are working well. diff --git a/docs/guides/data/migrations.mdx b/docs/guides/data/migrations.mdx new file mode 100644 index 0000000..84ec71b --- /dev/null +++ b/docs/guides/data/migrations.mdx @@ -0,0 +1,308 @@ +--- +title: "Part 5: Data Migrations" +description: Schema mapping, transformation scripts, validation queries, and coordinating cutovers. +keywords: ['migration', 'schema', 'validation', 'cutover', 'etl'] +--- + +The final frontier: moving data between systems. Database migrations, warehouse consolidation, table restructuring. + +These projects are traditionally painful. Schema mapping, data validation, handling edge cases, coordinating cutovers. Agents don't eliminate the complexity, but they dramatically speed up the tedious parts. + +## The Migration Pattern + +Most migrations follow a pattern: + +1. **Map** the source schema to the target +2. **Transform** data to fit the new shape +3. **Validate** that nothing was lost or corrupted +4. **Cutover** to the new system +5. **Clean up** the old one + +Agents can help with each step. + +## Schema Mapping + +Say you're migrating between databases (Postgres to Snowflake, MySQL to BigQuery, on-prem to cloud, or even just restructuring within the same warehouse). First step is understanding both schemas: + +``` +You: Compare the schema of our source 'users' table with our +target 'dim_users' table. Identify any fields that exist in +one but not the other. Note any type mismatches. 
+``` + +The agent queries both systems: + +``` +Agent: Here's the comparison: + +Fields in source only: +- legacy_id (integer) - appears to be an old ID system +- phone_number (varchar) - not in target + +Fields in target only: +- user_id (string) - using new ID format +- enrichment_data (struct/json) - added from analytics pipeline + +Type mismatches: +- created_at: timestamp vs datetime +- preferences: jsonb vs string + +The legacy_id field maps to user_id via a lookup table +'legacy_user_mapping'. +``` + +This gives you a clear picture of what needs to happen. + +## Transformation Scripts + +Once you know the mapping, the agent can generate transformation logic: + +``` +You: Generate a transformation model that migrates the source users +data to match our target dim_users schema. Handle the legacy_id to +user_id mapping and convert the preferences field appropriately. +``` + +```sql +-- models/migrations/mig_users_source_to_target.sql +{{ + config( + materialized='table', + tags=['migration', 'one-time'] + ) +}} + +with source_users as ( + select * from {{ source('source_export', 'users') }} +), + +id_mapping as ( + select * from {{ source('source_export', 'legacy_user_mapping') }} +), + +transformed as ( + select + m.new_user_id as user_id, + p.email, + p.name, + cast(p.created_at as datetime) as created_at, + -- JSON conversion syntax varies by warehouse + cast(p.preferences as string) as preferences, + -- Fields not in source, will be enriched later + null as enrichment_data + from source_users p + left join id_mapping m on p.legacy_id = m.legacy_id +) + +select * from transformed +``` + +## Validation Queries + +Before cutting over, you need confidence that the migration worked: + +``` +You: Generate validation queries that compare row counts, null rates, +and sample values between source and target. +``` + +```sql +-- Validation 1: Row counts +SELECT 'source' as system, COUNT(*) as row_count +FROM source_export.users +UNION ALL +SELECT 'target', COUNT(*) +FROM analytics.dim_users; + +-- Validation 2: Null rates for key fields +SELECT + 'source' as system, + SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) as email_null_rate, + SUM(CASE WHEN created_at IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) as created_null_rate +FROM source_export.users +UNION ALL +SELECT + 'target', + SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*), + SUM(CASE WHEN created_at IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) +FROM analytics.dim_users; + +-- Validation 3: Sample comparison +SELECT + s.legacy_id, + s.email as source_email, + t.email as target_email, + s.email = t.email as email_matches +FROM source_export.users s +JOIN analytics.dim_users t + ON s.legacy_id = (SELECT legacy_id FROM source_export.legacy_user_mapping WHERE new_user_id = t.user_id) +LIMIT 100; +``` + +The agent can run these and interpret results: + +``` +Agent: Validation results: + +✅ Row counts match: 167,432 in both +✅ Email null rate: 0.02% in both +⚠️ Created_at null rate: 0.00% (source) vs 0.01% (target) - 12 records + +The 12 records with null created_at in target are users with +invalid timestamps in source (year 0001). I can show you +those records if you want to handle them specially. +``` + +## Handling Edge Cases + +Migrations always surface weird data. The agent can help investigate: + +``` +You: Show me those 12 users with invalid created_at timestamps. +What's special about them? 
+``` + +``` +Agent: Here they are: + +| legacy_id | email | created_at (pg) | +|-----------|--------------------|--------------------| +| 8234 | test1@example.com | 0001-01-01 00:00 | +| 8235 | test2@example.com | 0001-01-01 00:00 | +... + +These all have emails containing "test" or "example.com" and +were created in a single batch on 2019-03-15. They appear to +be test accounts from the early days. The invalid timestamp +was probably a default value in the old system. + +Recommendation: Set their created_at to the batch creation +date (2019-03-15) or exclude them if they're pure test data. +``` + +## Coordinating Cutovers + +The scariest part of migration is the cutover. The agent can help plan: + +``` +You: Generate a cutover checklist for switching from the source +users table to the target dim_users. Include rollback steps. +``` + + + + - Final validation queries pass + - All downstream consumers identified + - Rollback procedure tested + - On-call team notified + + + - Stop writes to source users table + - Run final incremental sync + - Validate row counts match + - Update application connection strings + - Deploy application changes + - Monitor error rates for 30 minutes + - Confirm queries hitting new table + + + - Revert application connection strings + - Deploy rollback application changes + - Verify queries hitting source + - Document what went wrong + + + - Confirm no queries hitting old table + - Archive source users table + - Update documentation + - Close migration ticket + + + +## Table Restructuring + +Sometimes you're not migrating between systems, just reorganizing within one. + +``` +You: Our dim_organizations table has gotten bloated. Split it into +dim_organizations (core fields only) and fct_org_metrics (the +computed metrics). Show me how the dependent models need to change. +``` + +The agent can: +1. Identify which fields are "core" vs "computed" +2. Generate the two new models +3. Find all models that reference dim_organizations +4. Generate the SQL changes needed for each +5. Suggest an order of operations that minimizes breakage + +This kind of refactoring used to be a multi-day project. With agent assistance, it's a few hours of review and testing. + +## Migration Best Practices + + + + Keep the old table around for at least a month after cutover. You'll be glad you did. + + + + If applications read from the table: + ```python + if feature_flags.use_new_users_table: + query = "SELECT * FROM analytics.dim_users" + else: + query = "SELECT * FROM postgres.users" + ``` + This lets you roll back without deploying code. + + + + Instead of big-bang migrations: + 1. Set up continuous sync from old to new + 2. Migrate readers one at a time + 3. Once all readers migrated, stop writing to old + 4. Clean up + + + + Future you will not remember why you made certain decisions: + ```markdown + ## Migration Notes: users table (2025-01) + + ### Why we migrated + Source users table was 50GB and growing. Queries were slow. + Target warehouse gives us columnar storage and better analytics tooling. 
+ + ### Decisions made + - Dropped phone_number field (unused, only 2% populated) + - Converted legacy_id to new user_id format + - Invalid timestamps set to 2019-03-15 (batch creation date) + + ### Known issues + - 3 duplicate emails exist (test accounts, left as-is) + - enrichment_data field populated async, may be null for new users + ``` + + + +## Closing Thoughts + +We've covered a lot of ground: building data access layers, doing ad-hoc analysis, creating repeated reports, building pipelines, and running migrations. + +The common thread is that agents change the economics of data work. Things that used to require specialized knowledge or significant time investment become accessible to anyone who can describe what they want. + +This doesn't eliminate the need for data expertise. Someone still needs to design the schema, understand the business logic, and catch mistakes. But it dramatically lowers the barrier to getting work done. + +### The Practical Advice + +1. **Start with access.** Get your agent connected to your data warehouse. Make sure it can run queries. + +2. **Encode your knowledge.** Build an AGENTS.md that captures your schema conventions and common pitfalls. Add to it every time something goes wrong. + +3. **Use skills for depth.** When you have complex domains (billing, timezones, specific pipelines), create focused documentation that the agent can load when relevant. + +4. **Let the agent iterate.** Don't try to specify everything upfront. Describe what you want, see what you get, refine. The feedback loop is fast. + +5. **Compound your learnings.** Every mistake the agent makes should become a permanent lesson in your documentation. This is how the system gets better over time. + +The tools will keep improving. But the patterns in this series should remain useful: give agents access, teach them your domain, and let them help you move faster. diff --git a/docs/guides/data/overview.mdx b/docs/guides/data/overview.mdx new file mode 100644 index 0000000..cd44200 --- /dev/null +++ b/docs/guides/data/overview.mdx @@ -0,0 +1,65 @@ +--- +title: Agent Native Data Analysis +description: A comprehensive guide to using coding agents for data science and engineering work. +keywords: ['data', 'analytics', 'sql', 'bigquery', 'snowflake', 'dbt', 'etl', 'pipeline'] +--- + +A 5-part series on the new way to use agents for data science and engineering. + + +On the web version of this guide, you can select your specific tools (BigQuery, Snowflake, Databricks, etc.) to see tailored examples. The patterns are the same across platforms; only the syntax differs. + + +## What You'll Learn + +This guide covers the full spectrum of agent-assisted data work: + + + + Connect agents to your warehouse, teach schema knowledge, and encode domain expertise through skills. + + + Product analytics, billing queries, engineering metrics, and structuring effective questions. + + + Weekly engineering reports, flaky test tracking, SaaS replacement analysis, and scheduling. + + + SQL-based transformations with dbt/SQLMesh, incremental loads, and debugging pipeline issues. + + + Schema mapping, transformation scripts, validation queries, and coordinating cutovers. + + + +## The Core Insight + +The biggest unlock for agent-based data work is the access layer. When you give an agent the ability to query your data warehouse, the barrier between "I wonder if..." and "here's the answer" collapses. + +But getting there requires upfront investment: + +1. **Tool access** to execute queries +2. 
**Schema knowledge** to know what's available +3. **Domain knowledge** to avoid common mistakes + +This guide walks through practical patterns you can adapt for whatever stack you're running: BigQuery, Snowflake, Databricks, Redshift, ClickHouse, DuckDB, Postgres, or any SQL-compatible warehouse. + +## Supported Platforms + +| Category | Platforms | +|----------|-----------| +| **Cloud Warehouses** | BigQuery, Snowflake, Databricks, Redshift | +| **Open Source** | ClickHouse, DuckDB, Postgres, Trino, MySQL | +| **Transformation** | dbt, SQLMesh, Dataform | +| **Observability** | Datadog, Axiom, Honeycomb, OpenTelemetry | + +## Quick Start + +If you want to get started immediately: + +1. Ensure your agent can access your warehouse CLI (see [Part 1](/guides/data/foundation)) +2. Create a quick-reference table mapping common questions to tables +3. Document your top 5 data pitfalls in your AGENTS.md +4. Start asking questions + +The rest of this guide provides depth on each step, with examples and patterns you can adapt. diff --git a/docs/guides/data/pipelines.mdx b/docs/guides/data/pipelines.mdx new file mode 100644 index 0000000..fbd4983 --- /dev/null +++ b/docs/guides/data/pipelines.mdx @@ -0,0 +1,271 @@ +--- +title: "Part 4: Data Pipelines" +description: SQL-based transformations with dbt/SQLMesh, incremental loads, and debugging pipeline issues. +keywords: ['dbt', 'sqlmesh', 'etl', 'pipeline', 'transformation', 'incremental'] +--- + +Scripts are great for one-off reports. But when you need data transformed consistently, you want a pipeline. + +Data pipelines take raw data from sources, transform it into useful shapes, and load it somewhere queryable. The classic ETL pattern. What's different with agents is how you build and maintain these pipelines. + +Instead of writing every transformation by hand, you describe what you want the data to look like. The agent generates the dbt models, serverless functions, or whatever infrastructure you need. When requirements change, you describe the change and the pipeline updates. + +## SQL-Based Transformation Tools + +The most common approach to building transformation pipelines is with SQL-based tools like dbt, SQLMesh, or Dataform. They're version controlled, testable, and well understood by agents. + + +The concepts are identical across tools. Pick your transformation tool (dbt, SQLMesh, Dataform) and the patterns translate directly. + + +A typical project structure: + +``` +dbt/ +├── models/ +│ ├── staging/ # Light transformations from raw +│ │ └── stg_events.sql +│ ├── intermediate/ # Business logic +│ │ └── int_org_metrics.sql +│ └── marts/ # Final tables for analysis +│ ├── dim_organizations.sql +│ └── fct_sessions.sql +├── dbt_project.yml +└── profiles.yml +``` + +The naming convention encodes the purpose: +- `stg_` = staging (clean up raw data) +- `int_` = intermediate (apply business logic) +- `dim_` = dimension (slowly changing entities) +- `fct_` = fact (events and transactions) + +## Building a New Model + +Say you need a new fact table tracking user sessions. The conversation might go: + +``` +You: Create a dbt model called fct_sessions that aggregates session +data from stg_events. Include session_id, user_id, org_id, +start_time, end_time, message_count, and total_usage. Partition by +date and cluster by org_id. 
+``` + +The agent generates: + +```sql +-- models/marts/fct_sessions.sql +{{ + config( + materialized='table', + partition_by={ + "field": "session_date", + "data_type": "date", + "granularity": "day" + }, + cluster_by=["organization_id"] + ) +}} + +with session_events as ( + select + session_id, + user_id, + organization_id, + timestamp, + cast(json_value(properties, '$.value') as float64) as usage + from {{ ref('stg_events') }} + where event_type = 'activity' +), + +session_metrics as ( + select + session_id, + user_id, + organization_id, + date(min(timestamp)) as session_date, + min(timestamp) as session_start, + max(timestamp) as session_end, + count(*) as message_count, + sum(usage) as total_usage + from session_events + group by session_id, user_id, organization_id +) + +select * from session_metrics +``` + +Plus the YAML documentation: + +```yaml +models: + - name: fct_sessions + description: Aggregated session metrics from events + columns: + - name: session_id + description: Unique session identifier + - name: message_count + description: Number of messages in the session + - name: total_usage + description: Sum of usage for the session +``` + +Test it locally, commit, push. Your scheduled job picks it up automatically. + +## Handling Incremental Loads + +For large tables, you don't want to rebuild everything every hour. dbt supports incremental models that only process new data: + +```sql +{{ + config( + materialized='incremental', + unique_key='event_id', + incremental_strategy='merge' + ) +}} + +select + message_id as event_id, + user_id, + timestamp, + usage +from {{ ref('stg_events') }} + +{% if is_incremental() %} +where timestamp > (select max(timestamp) from {{ this }}) +{% endif %} +``` + +The agent knows these patterns. When you ask for a model that handles "millions of events per day," it will suggest incremental materialization. + +## Real-Time and Streaming Pipelines + +Not everything fits the hourly batch model. Sometimes you need data loaded continuously. + +The approach varies by platform: + + + + - Kafka/Confluent with warehouse connectors + - AWS Kinesis to Redshift/S3 + - Google Pub/Sub to BigQuery + - Segment, Rudderstack, or similar CDPs + + + - AWS Lambda triggered by S3 events + - Google Cloud Functions triggered by GCS + - Azure Functions with Event Grid + + + +Example prompt for building a loader: + +``` +You: Create a serverless function that loads new event files from +cloud storage into our warehouse. It should discover all source +folders automatically and load the JSON files incrementally. +``` + +The agent generates code appropriate for your cloud: + +```python +def load_events(event, context): + # 1. List new files in storage bucket + new_files = list_files_since_last_run(bucket, prefix='events/') + + # 2. Load each file into warehouse + for file in new_files: + data = read_json_file(file) + insert_to_warehouse(data, table='events_raw') + + # 3. Record checkpoint for next run + update_checkpoint(new_files[-1].timestamp) + + return {"loaded": len(new_files)} +``` + +## Debugging Pipeline Issues + +Pipelines break. Data sources change. Schemas drift. The agent can help diagnose issues. + +``` +You: The fct_sessions table hasn't updated since yesterday. +What's wrong? +``` + +The agent might: +1. Check the transformation job logs (dbt Cloud, Dagster, Airflow, etc.) +2. Query for recent data in upstream tables +3. Look for schema changes in source tables +4. Check loader function execution logs + +``` +Agent: I found the issue. 
The stg_events model failed because +events_raw has a new field 'context.device.type' that's sometimes +an object instead of a string. The JSON parsing is failing. + +Here's a fix that handles both cases: + +[shows SQL fix] + +Want me to apply this and re-run the pipeline? +``` + +This diagnostic workflow used to take 30 minutes of clicking through logs. Now it's a 2-minute conversation. + +## Pipeline Best Practices + + + + Keep a reference organization with predictable data for testing: + ```sql + SELECT * FROM {{ ref('fct_sessions') }} + WHERE organization_id = 'test-org-12345' + ``` + + + + dbt has built-in testing: + ```yaml + models: + - name: fct_sessions + columns: + - name: session_id + tests: + - unique + - not_null + - name: message_count + tests: + - dbt_utils.accepted_range: + min_value: 1 + ``` + + + + Future you will thank present you: + ```yaml + models: + - name: fct_sessions + description: | + Aggregated session metrics from events. + + Updated hourly via scheduled job. + + Note: Only includes sessions with at least one message. + ``` + + + + When you make breaking changes, create a new model: + ``` + fct_sessions # current version + fct_sessions_v2 # new schema, running in parallel + ``` + Migrate consumers, then deprecate the old version. + + + +## What's Next + +[Part 5](/guides/data/migrations) covers data migrations: schema mapping, transformation scripts, validation queries, and coordinating cutovers. diff --git a/docs/guides/data/repeated-reports.mdx b/docs/guides/data/repeated-reports.mdx new file mode 100644 index 0000000..6e821c4 --- /dev/null +++ b/docs/guides/data/repeated-reports.mdx @@ -0,0 +1,253 @@ +--- +title: "Part 3: Repeated Reports" +description: Weekly engineering reports, flaky test tracking, SaaS replacement analysis, and scheduling. +keywords: ['reports', 'automation', 'slack', 'scheduling', 'cron', 'github-actions'] +--- + +Ad-hoc analysis is great for one-off questions. But some analyses need to happen regularly: weekly metrics, monthly reports, daily monitors. + +The traditional approach is building dashboards or scheduling SQL queries. Both work, but they're rigid. Dashboards require a BI tool and maintenance. Scheduled queries run whether you need them or not. + +Agent-native reporting is more flexible. You describe what you want, and the agent generates a script that can run on any schedule. When requirements change, you describe the change and the script updates. + +## The Weekly Engineering Report + +Here's a real example. Every Monday, we want a summary of platform health to share in Slack: + +- Key metrics (DAU, MAU, total sessions) +- Week-over-week changes +- Any anomalies worth investigating + +Instead of building a dashboard, we asked the agent to create a script: + +``` +You: Create a script that generates a weekly platform health report. +It should show DAU, MAU, sessions, and usage for the past week +compared to the previous week. Format it for Slack. +``` + +The agent produced a Python script that: +1. Queries the warehouse for current and previous week metrics +2. Calculates week-over-week changes +3. Flags any metrics that changed more than 20% +4. 
Formats everything as a Slack message + +```python +# weekly_platform_report.py +import pandas as pd + +def get_weekly_metrics(conn, weeks_ago=0): + query = f""" + SELECT + COUNT(DISTINCT user_id) as dau_avg, + COUNT(DISTINCT CASE WHEN days_active >= 7 THEN user_id END) as wau, + SUM(sessions_count) as total_sessions, + SUM(usage_amount) as total_usage + FROM analytics.fct_daily_metrics + WHERE date BETWEEN + CURRENT_DATE - INTERVAL '{7 + (weeks_ago * 7)} days' AND + CURRENT_DATE - INTERVAL '{1 + (weeks_ago * 7)} days' + """ + return pd.read_sql(query, conn).iloc[0] + +def format_slack_message(current, previous): + def change_emoji(pct): + if pct > 10: return "📈" + if pct < -10: return "📉" + return "➡️" + + metrics = [] + for name in ['dau_avg', 'wau', 'total_sessions', 'total_usage']: + curr = current[name] + prev = previous[name] + pct = ((curr - prev) / prev * 100) if prev else 0 + metrics.append(f"{change_emoji(pct)} {name}: {curr:,} ({pct:+.1f}%)") + + return "*Weekly Platform Report*\n" + "\n".join(metrics) + +if __name__ == "__main__": + # Connection setup varies by warehouse + conn = create_connection() + current = get_weekly_metrics(conn, 0) + previous = get_weekly_metrics(conn, 1) + print(format_slack_message(current, previous)) +``` + +Now this runs every Monday via cron, and posts to Slack via webhook. + +## Flaky Test Reports + +Engineering teams often want to track flaky tests. The data usually lives in CI/CD logs, which you can pull via API. + +The workflow: +1. Query GitHub Actions API for recent test runs +2. Identify tests that sometimes pass and sometimes fail +3. Generate a report showing the worst offenders +4. Email the relevant team + +``` +You: Create a script that identifies flaky tests from our GitHub Actions +runs. A test is flaky if it failed at least once but also passed at +least once in the past 7 days. Rank by flakiness rate and email the +report to the platform team. +``` + +The agent can generate this by combining: +- GitHub API calls to get workflow runs +- Parsing test result artifacts +- Calculating flakiness metrics +- Sending via your email provider (SendGrid, SES, or even Gmail CLI) + +The key insight is that you don't need to pre-build all this infrastructure. Describe the workflow you want, and let the agent figure out the implementation. + +## SaaS Replacement Analysis + +Here's a more interesting use case: figuring out which SaaS tools you're paying for that could be replaced with internal software built by coding agents. + +If you're using Ramp, Brex, Expensify, or similar tools, they have APIs. Same goes for cloud cost data from AWS Cost Explorer, GCP Billing, or Azure Cost Management. + +``` +You: Pull our engineering SaaS subscriptions from Ramp. For each tool +over $500/month, analyze whether we could build a replacement internally +using coding agents. Score each on: +- Feasibility (how hard would it be to build?) +- Time estimate (days/weeks to functional replacement) +- Risk (what could go wrong, compliance concerns, etc.) +- ROI (break-even timeline given our spend) + +Focus on tools where we're only using 20-30% of features. 
+``` + +This produces an analysis that looks at each vendor and actually reasons about replaceability: + +| Tool | Monthly Cost | Feasibility | Build Time | Risk | ROI | +|------|-------------|-------------|------------|------|-----| +| Internal wiki | $2,400 | High | 3 days | Low - just markdown files | 1 month | +| Status page | $800 | High | 1 day | Low - static site + monitoring | 1 month | +| Feature flags | $1,200 | Medium | 1 week | Medium - need rollout testing | 2 months | +| Error tracking | $3,600 | Low | 2+ weeks | High - Sentry is very mature | 6+ months | + +The agent can look at what each tool actually does, estimate complexity based on similar open source projects, and flag risks like compliance requirements or integration complexity. A human should review these recommendations, but having the analysis generated automatically turns a multi-day research project into a conversation. + +## Patterns for Report Scripts + +A few patterns make these scripts more maintainable: + + + + ```python + def fetch_metrics(): + # query logic here + return data + + def format_report(data, format_type): + if format_type == "slack": + return format_for_slack(data) + elif format_type == "email": + return format_for_email(data) + elif format_type == "pdf": + return format_for_pdf(data) + ``` + + + + ```python + if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--dry-run", action="store_true") + args = parser.parse_args() + + report = generate_report() + + if args.dry_run: + print(report) + else: + send_to_slack(report) + ``` + + + + ```python + import logging + logging.basicConfig(level=logging.INFO) + logger = logging.getLogger(__name__) + + logger.info(f"Fetching metrics for {date_range}") + logger.info(f"Found {len(results)} records") + logger.info(f"Sending report to {channel}") + ``` + + + + ```python + try: + data = fetch_from_api() + except APIError as e: + logger.error(f"API call failed: {e}") + send_alert("Report generation failed", str(e)) + sys.exit(1) + ``` + + + +## Scheduling and Delivery + +Once you have a script, you need to run it. Options: + + + + ```bash + # Every Monday at 9am + 0 9 * * 1 /path/to/venv/bin/python /path/to/weekly_report.py + ``` + + + ```yaml + name: Weekly Report + on: + schedule: + - cron: '0 9 * * 1' + jobs: + report: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - run: pip install -r requirements.txt + - run: python scripts/weekly_report.py + env: + SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }} + ``` + + + ```python + # AWS Lambda, Google Cloud Functions, Azure Functions, etc. + def generate_report(event, context): + report = build_report() + send_to_slack(report) + return {"statusCode": 200} + ``` + + + For complex workflows, use Dagster, Prefect, Airflow, or Temporal. Great when reports depend on data freshness checks or have multiple steps. + + + +The agent can help set up any of these. Just describe where you want it to run. + +## Iterating on Reports + +Reports evolve. Someone will ask "can we also include X?" or "can we change the format to Y?" + +This is where agent-native development shines. You don't edit the script manually. You describe the change: + +``` +You: Update the weekly report to also include our top 5 growing +accounts by usage. Show their name, this week's usage, +and the percentage increase from last week. +``` + +The agent modifies the script, you review the changes, and you're done. 
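+
+The added query might look something like this sketch (the per-org metrics table and its columns are assumptions; adjust to your schema):
+
+```sql
+-- Top 5 accounts by week-over-week usage growth.
+WITH weekly AS (
+  SELECT
+    organization_id,
+    SUM(CASE WHEN date >= CURRENT_DATE - INTERVAL '7 days'
+             THEN usage_amount ELSE 0 END) AS usage_this_week,
+    SUM(CASE WHEN date < CURRENT_DATE - INTERVAL '7 days'
+             THEN usage_amount ELSE 0 END) AS usage_last_week
+  FROM analytics.fct_daily_metrics
+  WHERE date >= CURRENT_DATE - INTERVAL '14 days'
+  GROUP BY organization_id
+)
+SELECT
+  o.name AS organization_name,  -- assumes a name column on dim_organizations
+  w.usage_this_week,
+  ROUND((w.usage_this_week - w.usage_last_week) * 100.0
+        / NULLIF(w.usage_last_week, 0), 1) AS pct_increase
+FROM weekly w
+JOIN analytics.dim_organizations o ON o.id = w.organization_id
+WHERE w.usage_last_week > 0
+ORDER BY pct_increase DESC
+LIMIT 5;
+```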
The barrier to iterating is low enough that reports actually improve over time instead of becoming stale. + +## What's Next + +[Part 4](/guides/data/pipelines) covers building data pipelines: SQL-based transformations, incremental loads, and debugging pipeline issues.