feat(ai): claude code agent #44

Merged: bartolomej merged 42 commits into main from bart/sdk-agent on May 5, 2026

bartolomej (Collaborator) commented Mar 20, 2026

Lightningrod Assistant Agent

This PR introduces a Claude Code subagent (lightningrod-assistant) that helps users build forecasting datasets and fine-tune models using the Lightningrod SDK.

How it works

The agent is defined in .claude/agents/lightningrod-assistant.md and runs as a Claude Code subagent with access to file tools, Bash, and a docs MCP server. It is guided by 8 domain-specific skills covering training pattern selection, data source choices, and example walkthroughs.

Try it out

First clone this SDK repo branch:

git clone https://github.com/lightning-rod-labs/lightningrod-python-sdk.git && cd lightningrod-python-sdk # if not there already
git switch bart/sdk-agent

Invoke directly from any terminal with Claude Code installed:

claude --dangerously-skip-permissions --agent lightningrod-assistant "I want to build a forecasting dataset on tech company layoffs"

Demo

Screen.Recording.2026-04-17.at.16.11.46.mov

bartolomej (Collaborator, Author) commented Mar 20, 2026

bartolomej changed the title from "initial draft from the old branch" to "feat(agent): initial claude based agent experiment" on Mar 20, 2026
bartolomej changed the base branch from bart/update-training-examples to graphite-base/44 on April 1, 2026 12:23
bartolomej changed the base branch from graphite-base/44 to main on April 6, 2026 13:28
bartolomej and others added 22 commits April 6, 2026 16:36
- Add Data quality flags section: bias detection rules for news (outcome
  bias, survivorship), survey selection bias, temporal leakage, power-law
  targets, and when structured data (BigQuery) beats news
- Include clear rule for when news IS appropriate (sports, policy, elections)
  to avoid false alarms on positive test cases
- Add "Assess first, ask second" guideline to reduce tool calls before
  giving initial assessment — fixes max_turns errors and timeout failures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Agent was burning all 10 allowed tool turns reading reference notebooks
(~50k tokens each) before giving any response, causing max_turns errors
on positive tests (golf, policy) and infrastructure timeouts on others.

Changes:
- Strengthen "first response is text" rule: explicitly forbid tool calls
  before the initial advisory response
- Change reference notebooks section from "Consult these" (proactive read)
  to "Read only when writing code" (conditional)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Judge scored prediction_date criterion 0.3 because agent mentioned the
concept but didn't explicitly say 'set it to the event date, not the
outcome date'. Also missing entity leakage warning for multi-company
datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
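
The entity leakage warning in the commit above can be illustrated with a minimal sketch. It assumes each row is a plain dict with a hypothetical `company` field; neither that field name nor the `entity_split` helper comes from the Lightningrod SDK. The idea is only that every row for a given company lands on the same side of the split, so a model cannot score well by recognizing a company it already saw in training.

```python
import hashlib


def entity_split(rows, entity_key="company", test_fraction=0.2):
    """Split rows so that each entity appears in train or test, never both.

    `entity_key` is a hypothetical field name used for illustration.
    """
    train, test = [], []
    for row in rows:
        # Hash the entity name so the assignment is stable across runs.
        digest = hashlib.sha256(str(row[entity_key]).encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000  # stable value in [0, 1)
        (test if bucket < test_fraction else train).append(row)
    return train, test


if __name__ == "__main__":
    sample = [
        {"company": "acme", "question": "Will acme announce layoffs in Q3?"},
        {"company": "globex", "question": "Will globex announce layoffs in Q3?"},
        {"company": "acme", "question": "Will acme announce layoffs in Q4?"},
    ]
    train_rows, test_rows = entity_split(sample)
    print(len(train_rows), len(test_rows))
```

Because the assignment depends only on the entity name, the realized test share on a small dataset can differ noticeably from `test_fraction`; with few entities it may be worth checking the resulting sizes.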
1. Temporal splitting: generalize from 'financial/market data' to 'all
   forecasting datasets' — golf missed this criterion (scored 0.0 on
   temporal split with weight 0.2) because the rule only mentioned finance

2. Binary label reuse: when data already has a binary outcome column
   (churned, funded, success), use it directly rather than predicting an
   intermediate score. Fixes bias-selection-survey criterion 3 (0.0 → 1.0)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
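
The binary label reuse rule from the commit above can be sketched as follows. This is an illustration under stated assumptions, not SDK code: rows are plain dicts, and `find_binary_columns` and `to_labels` are hypothetical helpers. The sketch looks for columns whose values are already 0/1 (or True/False) and uses one of them directly as the label instead of deriving it from an intermediate score.

```python
BINARY_VALUES = (0, 1)  # True and False compare equal to 1 and 0


def find_binary_columns(rows):
    """Return the names of columns whose every value is 0/1 or True/False."""
    if not rows:
        return []
    return [
        col
        for col in rows[0]
        if all(row.get(col) in BINARY_VALUES for row in rows)
    ]


def to_labels(rows, label_column):
    """Use an existing binary outcome column directly as integer labels."""
    return [int(row[label_column]) for row in rows]


if __name__ == "__main__":
    survey = [
        {"customer": "a", "satisfaction_score": 0.7, "churned": 1},
        {"customer": "b", "satisfaction_score": 0.2, "churned": 0},
    ]
    candidates = find_binary_columns(survey)
    print(candidates, to_labels(survey, candidates[0]))
```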
The "first response is text" rule had a conditional — "for any planning
or advisory query" — that the agent correctly didn't apply to build
requests like "set this up" or "build the pipeline". Those tasks caused
the agent to immediately read reference notebooks (tool calls), which
regularly exceeded the 240s eval timeout.

Removing the qualifier makes the rule unconditional. This closes the
loophole for golf, policy, stocks, and cost-awareness tasks — all of
which are phrased as build/setup requests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The agent correctly warns about cost and calls estimate_cost() before
large runs, but never suggests validating at intermediate scale (e.g.
500-1K questions) before committing the full budget.

Added explicit guidance: when scaling from a small test to production,
recommend an intermediate run first to validate quality at scale.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
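
The intermediate-scale guidance in the commit above amounts to a staged run plan. A minimal sketch, assuming a hypothetical `plan_runs` helper and a default intermediate size of 1,000 questions (the upper end of the 500-1K range mentioned in the commit message); it does not call or model the SDK's `estimate_cost()`:

```python
def plan_runs(pilot_n, full_n, intermediate_n=1000):
    """Return the ordered run stages as (name, question_count) pairs.

    An intermediate stage is added only when it sits strictly between
    the pilot size and the full size.
    """
    stages = [("pilot", pilot_n)]
    if pilot_n < intermediate_n < full_n:
        stages.append(("intermediate", intermediate_n))
    stages.append(("full", full_n))
    return stages


if __name__ == "__main__":
    for name, count in plan_runs(pilot_n=50, full_n=20000):
        print(f"{name}: {count} questions")
```

Each stage would be inspected for quality before the next one is started, so that problems which only appear at scale are caught before the full budget is committed.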
Temporal splitting was only mentioned reactively (as a data quality
flag) — so the agent skipped it for clean forecasting tasks like golf
where there are no data quality concerns.

Added explicit guidance: mention temporal splitting proactively as a
standard step in every forecasting proposal, not just when there are
leakage issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
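
The temporal splitting step described in the commit above can be sketched in a few lines. This assumes rows are plain dicts carrying a `prediction_date` value as a `datetime.date`; the `temporal_split` helper and the row layout are illustrative assumptions, not the SDK's API. Everything before the cutoff goes to training and everything on or after it goes to testing, so the model is never evaluated on questions that predate its training data.

```python
from datetime import date


def temporal_split(rows, cutoff, date_key="prediction_date"):
    """Split rows into (train, test) around a cutoff date.

    Rows dated strictly before `cutoff` are training data; rows dated
    on or after `cutoff` are held out for testing.
    """
    train = [row for row in rows if row[date_key] < cutoff]
    test = [row for row in rows if row[date_key] >= cutoff]
    return train, test


if __name__ == "__main__":
    questions = [
        {"question": "Will player A make the cut?", "prediction_date": date(2025, 3, 1)},
        {"question": "Will player B make the cut?", "prediction_date": date(2025, 9, 1)},
        {"question": "Will player C make the cut?", "prediction_date": date(2026, 1, 15)},
    ]
    train_rows, test_rows = temporal_split(questions, cutoff=date(2026, 1, 1))
    print(len(train_rows), len(test_rows))
```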
bartolomej changed the title from "feat(agent): initial claude based agent experiment" to "feat(ai): claude code agent" on Apr 17, 2026
bartolomej changed the title from "feat(ai): claude code agent" to "feat(ai): claude code agent (experiment)" on Apr 17, 2026
bartolomej marked this pull request as ready for review on April 17, 2026 14:15
paulwilczewski (Contributor) left a comment

looks interesting, curious to try it out!

one thing that's a bit odd in the demo video is that the agent initially proposes using a data source that it doesn't have access to. maybe it needs context about which data is available?

bartolomej (Collaborator, Author) replied:

> looks interesting, curious to try it out!
>
> one thing that's a bit odd in the demo video is that the agent initially proposes using a data source that it doesn't have access to. maybe it needs context about which data is available?

That is intentional - I've instructed the agent to try to find the best data source for the problem, even if it's not natively supported by our SDK (it should know how to manually fetch and transform that data into seeds). But whenever it can, it should use the natively supported datasets like BigQuery, which I think it should already be doing.

It should also be easy to add more guidance around that in the existing skill set. I just haven't finished setting up the evals - it'd be good if we could also test that every change we make to the prompt/skills actually helps the agent in that specific scenario.

bartolomej changed the title from "feat(ai): claude code agent (experiment)" to "feat(ai): claude code agent" on Apr 24, 2026
bartolomej merged commit 6dc5795 into main on May 5, 2026
3 checks passed
bartolomej added a commit that referenced this pull request May 11, 2026
This PR introduces a Claude Code subagent (`lightningrod-assistant`)
that helps users build forecasting datasets and fine-tune models using
the Lightningrod SDK.

The agent is defined in `.claude/agents/lightningrod-assistant.md` and
runs as a Claude Code subagent with access to file tools, Bash, and a
docs MCP server. It is guided by 8 domain-specific skills covering
training pattern selection, data source choices, and example
walkthroughs.

First clone this SDK repo branch:

```bash
git clone https://github.com/lightning-rod-labs/lightningrod-python-sdk.git && cd lightningrod-python-sdk # if not there already
git switch bart/sdk-agent
```

Invoke directly from any terminal with Claude Code installed:

```bash
claude --dangerously-skip-permissions --agent lightningrod-assistant "I want to build a forecasting dataset on tech company layoffs"
```

https://github.com/user-attachments/assets/63fc4906-1f4f-4b83-a5c5-02ab7fce4025

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>