feat(ai): claude code agent #44
489add5 to c252e6f
b3c7f32 to 4625f2e
…progress plot script
- Add Data quality flags section: bias detection rules for news (outcome bias, survivorship), survey selection bias, temporal leakage, power-law targets, and when structured data (BigQuery) beats news
- Include clear rule for when news IS appropriate (sports, policy, elections) to avoid false alarms on positive test cases
- Add "Assess first, ask second" guideline to reduce tool calls before giving initial assessment; fixes max_turns errors and timeout failures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agent was burning all 10 allowed tool turns reading reference notebooks (~50k tokens each) before giving any response, causing max_turns errors on positive tests (golf, policy) and infrastructure timeouts on others.

Changes:
- Strengthen "first response is text" rule: explicitly forbid tool calls before the initial advisory response
- Change reference notebooks section from "Consult these" (proactive read) to "Read only when writing code" (conditional)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Judge scored prediction_date criterion 0.3 because agent mentioned the concept but didn't explicitly say 'set it to the event date, not the outcome date'. Also missing entity leakage warning for multi-company datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
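To make the entity leakage concern concrete, here is a minimal sketch of an entity-aware split, in which every row for a given company lands on the same side of the train/test boundary. It is an illustration only and not part of the Lightningrod SDK: `entity_bucket`, `split_by_entity`, the `company` and `prediction_date` keys, and the company names are all hypothetical.

```python
import hashlib


def entity_bucket(entity, test_fraction=0.2):
    """Deterministically assign an entity to 'train' or 'test' (hypothetical helper)."""
    digest = hashlib.sha256(entity.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if position < test_fraction else "train"


def split_by_entity(rows, entity_key="company"):
    """Group rows so that no entity appears in both train and test."""
    splits = {"train": [], "test": []}
    for row in rows:
        splits[entity_bucket(row[entity_key])].append(row)
    return splits


# Made-up rows; prediction_date is assumed to hold the event date,
# not the date the outcome became known.
rows = [
    {"company": "Acme", "prediction_date": "2024-02-01"},
    {"company": "Acme", "prediction_date": "2024-05-10"},
    {"company": "Globex", "prediction_date": "2024-03-15"},
    {"company": "Initech", "prediction_date": "2024-04-20"},
]
splits = split_by_entity(rows)
train_entities = {row["company"] for row in splits["train"]}
test_entities = {row["company"] for row in splits["test"]}
```

Hashing the entity name keeps the assignment stable across runs, so a company that was held out once stays held out when the dataset is regenerated.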
1. Temporal splitting: generalize from 'financial/market data' to 'all forecasting datasets'. Golf missed this criterion (scored 0.0 on temporal split with weight 0.2) because the rule only mentioned finance.
2. Binary label reuse: when data already has a binary outcome column (churned, funded, success), use it directly rather than predicting an intermediate score. Fixes bias-selection-survey criterion 3 (0.0 → 1.0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
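For illustration, the temporal splitting rule in item 1 might look like the sketch below: everything dated before a cutoff goes to training and everything on or after it goes to testing. This uses only the standard library and is not the SDK's API. The `Example` record, its fields, and `temporal_split` are hypothetical names, and the golf questions are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Example:
    """Hypothetical forecasting example; field names are assumptions."""
    question: str
    prediction_date: date  # assumed to be the event date, not the outcome date
    label: int  # existing binary outcome column, reused directly


def temporal_split(examples, cutoff):
    """Train on examples dated before the cutoff, test on the rest."""
    ordered = sorted(examples, key=lambda ex: ex.prediction_date)
    train = [ex for ex in ordered if ex.prediction_date < cutoff]
    test = [ex for ex in ordered if ex.prediction_date >= cutoff]
    return train, test


examples = [
    Example("Will player C make the cut?", date(2025, 1, 15), 1),
    Example("Will player A make the cut?", date(2024, 3, 1), 1),
    Example("Will player B make the cut?", date(2024, 6, 1), 0),
]
train, test = temporal_split(examples, cutoff=date(2025, 1, 1))
```

Splitting on a date rather than at random means the test set only contains questions from after the training period, which is the situation a forecasting model faces in use.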
The "first response is text" rule had a conditional ("for any planning or advisory query") that the agent correctly didn't apply to build requests like "set this up" or "build the pipeline". Those tasks caused the agent to immediately read reference notebooks (tool calls), which regularly exceeded the 240s eval timeout.

Removing the qualifier makes the rule unconditional. This closes the loophole for golf, policy, stocks, and cost-awareness tasks, all of which are phrased as build/setup requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The agent correctly warns about cost and calls estimate_cost() before large runs, but never suggests validating at intermediate scale (e.g. 500-1K questions) before committing the full budget. Added explicit guidance: when scaling from a small test to production, recommend an intermediate run first to validate quality at scale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
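One way to picture the intermediate-scale guidance is as a staged run that stops as soon as a quality check fails. The sketch below is hypothetical throughout: `staged_run_sizes`, `run_staged`, the `run_stage` and `quality_ok` callbacks, and the default of 1,000 questions are assumptions made for illustration, and nothing here calls the SDK or its `estimate_cost()`.

```python
def staged_run_sizes(pilot_size, full_size, intermediate_size=1000):
    """Return run sizes in order: pilot, intermediate (if it fits), full."""
    sizes = [pilot_size]
    if pilot_size < intermediate_size < full_size:
        sizes.append(intermediate_size)
    sizes.append(full_size)
    return sizes


def run_staged(sizes, run_stage, quality_ok):
    """Run each stage in order; stop at the first stage that fails the check."""
    completed = []
    for size in sizes:
        result = run_stage(size)
        if not quality_ok(result):
            break
        completed.append(size)
    return completed


sizes = staged_run_sizes(pilot_size=50, full_size=20000)

# Stand-in callbacks: pretend quality degrades once the run reaches 1,000 questions.
completed = run_staged(
    sizes,
    run_stage=lambda n: {"size": n, "valid_fraction": 0.95 if n < 1000 else 0.60},
    quality_ok=lambda result: result["valid_fraction"] >= 0.90,
)
```

In this made-up scenario the problem surfaces at the 1,000-question stage, so the full 20,000-question budget is never spent.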
Temporal splitting was only mentioned reactively (as a data quality flag), so the agent skipped it for clean forecasting tasks like golf where there are no data quality concerns. Added explicit guidance: mention temporal splitting proactively as a standard step in every forecasting proposal, not just when there are leakage issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This reverts commit 8b6332e.
paulwilczewski
left a comment
looks interesting, curious to try it out!
one thing that's a bit odd in the demo video is that the agent initially proposes using a data source that it doesn't have access to. maybe it needs context about which data is available?
That is meant to be intentional - I've instructed the agent to try to find the best data source for the problem, even if it's not natively supported by our SDK (it should know how to manually fetch and transform that data into seeds). But whenever it can, it should use the natively supported datasets like BigQuery, which I think it should already be doing. It should also be easy to add more guidance around that in the existing skill set; I just haven't finished setting up the evals. It'd be good if we could also test that every change we make to the prompt/skills actually helps the agent in that specific scenario.
This PR introduces a Claude Code subagent (`lightningrod-assistant`) that helps users build forecasting datasets and fine-tune models using the Lightningrod SDK.

The agent is defined in `.claude/agents/lightningrod-assistant.md` and runs as a Claude Code subagent with access to file tools, Bash, and a docs MCP server. It is guided by 8 domain-specific skills covering training pattern selection, data source choices, and example walkthroughs.

First clone this SDK repo branch:

```bash
git clone https://github.com/lightning-rod-labs/lightningrod-python-sdk.git && cd lightningrod-python-sdk  # if not there already
git switch bart/sdk-agent
```

Invoke directly from any terminal with Claude Code installed:

```bash
claude --dangerously-skip-permissions --agent lightningrod-assistant "I want to build a forecasting dataset on tech company layoffs"
```

https://github.com/user-attachments/assets/63fc4906-1f4f-4b83-a5c5-02ab7fce4025

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Lightningrod Assistant Agent
This PR introduces a Claude Code subagent (lightningrod-assistant) that helps users build forecasting datasets and fine-tune models using the Lightningrod SDK.
How it works
The agent is defined in .claude/agents/lightningrod-assistant.md and runs as a Claude Code subagent with access to file tools, Bash, and a docs MCP server. It is guided by 8 domain-specific skills covering training pattern selection, data source choices, and example walkthroughs.
Try it out
First clone this SDK repo branch:
git clone https://github.com/lightning-rod-labs/lightningrod-python-sdk.git && cd lightningrod-python-sdk  # if not there already
git switch bart/sdk-agent
Invoke directly from any terminal with Claude Code installed:
claude --dangerously-skip-permissions --agent lightningrod-assistant "I want to build a forecasting dataset on tech company layoffs"
Demo
Screen.Recording.2026-04-17.at.16.11.46.mov