Add D4D AI Assistant GitHub integration (#55)

justaddcoffee · claude · web-flow · commit 49afe364c1c8 · 2025-11-05T15:57:28.000-05:00
* Add D4D AI Assistant GitHub integration

  Set up @d4dassistant GitHub AI agent for automatic D4D generation.

  Features:
  - GitHub Action workflow that triggers on @d4dassistant mentions
  - Claude Code agent (via dragon-ai-agent/run-goose-obo)
  - Automated D4D YAML generation from dataset documentation
  - Schema validation before PR creation
  - Unique timestamp-based ID generation to avoid conflicts

  Files added:
  - .github/ai-controllers.json - Authorized users list
  - .github/workflows/d4d-agent.yml - GitHub Action workflow
  - .goosehints - Agent instructions and workflow
  - .github/D4D_ASSISTANT_README.md - User documentation

  Generated D4Ds are saved to: html-demos/user_d4ds/

  Usage: @d4dassistant Create D4D for https://example.com/dataset

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

* Add goose config file

* Switch D4D assistant from Goose to Claude Code

  Replace run-goose-obo action with run-claude-obo@v1.0.2 to use Claude Code directly. Configure CBORG API endpoint via .claude/settings.json for consistent model access across environments.

  Changes:
  - Update workflow to use dragon-ai-agent/run-claude-obo@v1.0.2
  - Remove openai-api-key parameter (Claude-only)
  - Add Claude Code configuration parameters
  - Create .claude/settings.json with CBORG base URL and model settings
  - Remove obsolete .config/goose/ directory

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
diff --git a/.claude/settings.json b/.claude/settings.json
@@ -0,0 +1,20 @@
+{
+  "model": "anthropic/claude-sonnet",
+  "apiKeyHelper": "echo $ANTHROPIC_API_KEY",
+  "env": {
+    "ANTHROPIC_BASE_URL": "https://api.cborg.lbl.gov",
+    "ANTHROPIC_MODEL": "anthropic/claude-sonnet",
+    "ANTHROPIC_SMALL_FAST_MODEL": "anthropic/claude-haiku",
+    "DISABLE_NON_ESSENTIAL_MODEL_CALLS": "1"
+  },
+  "permissions": {
+    "allow": [
+      "Bash(git:*)",
+      "Bash(gh:*)",
+      "Bash(poetry:*)",
+      "Bash(make:*)",
+      "Bash(python:*)",
+      "Bash(uv:*)"
+    ]
+  }
+}
diff --git a/.github/D4D_ASSISTANT_README.md b/.github/D4D_ASSISTANT_README.md
@@ -0,0 +1,135 @@
+# D4D AI Assistant
+
+This repository has an AI assistant (`@d4dassistant`) that can automatically generate D4D (Datasheets for Datasets) YAML files from dataset documentation.
+
+## How to Use
+
+### Request D4D Generation
+
+Authorized users (listed in `.github/ai-controllers.json`) can mention `@d4dassistant` in GitHub issues to request D4D generation:
+
+```markdown
+@d4dassistant Please create a D4D for this dataset: https://example.com/dataset-page
+
+Additional context: This dataset contains medical imaging data for cancer research...
+```
+
+### What the Assistant Does
+
+1. **Analyzes** your dataset description and any provided URLs
+2. **Fetches** documentation from web pages, PDFs, or repositories
+3. **Generates** a valid D4D YAML file conforming to the LinkML schema
+4. **Validates** the YAML against the schema
+5. **Creates** a pull request with the D4D file in `html-demos/user_d4ds/`
+6. **Comments** on your issue with a link to the PR
+
+### Example Requests
+
+**With URL:**
+```markdown
+@d4dassistant Create a D4D for the Bridge2AI VOICE dataset
+
+URL: https://physionet.org/content/b2ai-voice/
+This is a voice biomarker dataset for health research.
+```
+
+**With description only:**
+```markdown
+@d4dassistant Generate a D4D for my diabetes study dataset
+
+Dataset name: T2D Longitudinal Study
+Description: 5-year longitudinal study of 1000 Type 2 diabetes patients
+Format: CSV files with clinical measurements and lab results
+License: CC-BY-4.0
+```
+
+**With GitHub repository:**
+```markdown
+@d4dassistant Create D4D from this repo: https://github.com/org/dataset-repo
+
+The README has all the dataset details.
+```
+
+## What Information to Provide
+
+The more information you provide, the better the D4D will be. Useful information includes:
+
+- **URLs**: Dataset landing pages, documentation, PDFs, GitHub repos
+- **Dataset name**: Short and descriptive
+- **Description**: What the dataset contains and why it exists
+- **Creators**: Who created/maintains the dataset
+- **Size**: Number of instances, file size
+- **Format**: CSV, JSON, Parquet, etc.
+- **License**: How the data can be used
+- **Collection details**: How and when data was gathered
+- **Use cases**: What tasks it's intended for
+
+## What Gets Generated
+
+The assistant creates a YAML file following the D4D schema with sections like:
+
+- **Motivation**: Why the dataset was created
+- **Composition**: What it contains (instances, splits, etc.)
+- **Collection**: How data was gathered
+- **Preprocessing**: Data cleaning steps
+- **Uses**: Recommended and discouraged applications
+- **Distribution**: Access information and licensing
+- **Maintenance**: Who maintains it and how to get support
+
+## File Location
+
+Generated D4D files are saved to: `html-demos/user_d4ds/{dataset_name}_d4d.yaml`
+
+Each filename includes a timestamp or unique identifier to avoid conflicts.
+
+## Reviewing the Generated D4D
+
+Once the PR is created:
+
+1. Review the generated YAML file
+2. Check that metadata is accurate
+3. Request changes if needed (comment on the PR)
+4. Merge when satisfied
+
+The assistant can update the D4D based on your feedback - just comment on the PR with your requested changes.
+
+## Authorization
+
+To add users who can invoke the assistant, edit `.github/ai-controllers.json`:
+
+```json
+["username1", "username2", "username3"]
+```
+
+Only authorized users can trigger the assistant by mentioning `@d4dassistant`.
+
+## Technical Details
+
+- **Agent**: Powered by Claude Code via the `dragon-ai-agent/run-claude-obo` GitHub Action
+- **Schema**: Uses LinkML schema from `src/data_sheets_schema/schema/`
+- **Validation**: Runs `make test-examples` to ensure schema compliance
+- **Examples**: References `src/data/examples/valid/` for guidance
+
+## Troubleshooting
+
+**Assistant didn't respond:**
+- Check that you're in the authorized users list
+- Ensure you mentioned `@d4dassistant` (not `@d4d-assistant` or similar)
+- Check GitHub Actions logs for errors
+
+**Generated D4D is incomplete:**
+- Provide more information in a follow-up comment
+- Share additional URLs or documentation
+- The assistant can update the D4D based on new info
+
+**Validation errors:**
+- The assistant should fix these automatically
+- If the PR has validation errors, comment with details
+- The assistant will update the PR
+
+## Support
+
+For issues or questions:
+- Open a GitHub issue
+- Tag authorized users for assistance
+- Check `.goosehints` file for assistant instructions
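Reviewer note on the README's "File Location" section: it gives the pattern `html-demos/user_d4ds/{dataset_name}_d4d.yaml` and also says each filename carries a timestamp or unique identifier, but the diff does not show how the two combine. The following is a minimal sketch of one possible scheme, assuming a slugified dataset name followed by a compact UTC timestamp. The helper name `d4dFilePath` and the exact filename layout are hypothetical and are not taken from the agent's instructions.

```javascript
// Hypothetical helper: the naming scheme below is an assumption for illustration,
// not the scheme the assistant is confirmed to use.
function d4dFilePath(datasetName, now = new Date()) {
  // Lowercase, collapse runs of non-alphanumerics to "_", trim stray "_" at the ends.
  const slug = datasetName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_')
    .replace(/^_+|_+$/g, '');

  // Compact UTC timestamp derived from the ISO string: drop "-", ":" and milliseconds.
  const stamp = now
    .toISOString()
    .replace(/[-:]/g, '')
    .replace(/\.\d{3}Z$/, 'Z');

  return `html-demos/user_d4ds/${slug}_${stamp}_d4d.yaml`;
}

console.log(d4dFilePath('T2D Longitudinal Study', new Date('2025-11-05T20:57:28Z')));
```

A second-resolution timestamp only avoids collisions when two requests for the same dataset do not land in the same second; a scheme that needs stronger guarantees could append the issue number instead.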
diff --git a/.github/ai-controllers.json b/.github/ai-controllers.json
@@ -0,0 +1 @@
+["justaddcoffee", "monicacecilia", "caufieldjh", "realmarcin", "jniestroy"]
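Reviewer note: the allow-list above is consumed by the `Detect AI mention` step, which parses the file and tests membership with `Array.prototype.includes`. A minimal standalone sketch of that check follows. The function names `parseControllers` and `isAuthorized` are hypothetical, and the shape validation is an addition that the workflow itself does not perform.

```javascript
// Hypothetical helpers illustrating how the allow-list is used; names are not from the commit.
function parseControllers(jsonText) {
  const parsed = JSON.parse(jsonText);
  // Extra guard (not in the workflow): insist on a flat array of strings.
  if (!Array.isArray(parsed) || !parsed.every((u) => typeof u === 'string')) {
    throw new TypeError('expected a JSON array of GitHub usernames');
  }
  return parsed;
}

function isAuthorized(controllers, login) {
  // Exact, case-sensitive comparison, matching the includes() call in the workflow.
  return controllers.includes(login);
}

const controllers = parseControllers(
  '["justaddcoffee", "monicacecilia", "caufieldjh", "realmarcin", "jniestroy"]'
);
console.log(isAuthorized(controllers, 'justaddcoffee'));
console.log(isAuthorized(controllers, 'JustAddCoffee'));
```

Because the comparison is case-sensitive, entries in the file presumably need to match the casing of the login that GitHub reports in the event payload.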
diff --git a/.github/workflows/d4d-agent.yml b/.github/workflows/d4d-agent.yml
@@ -0,0 +1,200 @@
+name: D4D AI Assistant GitHub Mentions
+
+on:
+  issues:
+    types: [opened, edited]
+  issue_comment:
+    types: [created, edited]
+  pull_request:
+    types: [opened, edited]
+  pull_request_review_comment:
+    types: [created, edited]
+  workflow_dispatch:
+    inputs:
+      item-type:
+        description: 'Type of item (issue or pull_request)'
+        required: true
+        type: choice
+        options:
+          - issue
+          - pull_request
+      item-number:
+        description: 'Issue or PR number'
+        required: true
+        type: number
+
+jobs:
+  check-mention:
+    runs-on: ubuntu-latest
+    outputs:
+      qualified-mention: ${{ steps.detect.outputs.qualified-mention }}
+      prompt: ${{ steps.detect.outputs.prompt }}
+      user: ${{ steps.detect.outputs.user }}
+      item-type: ${{ steps.detect.outputs.item-type }}
+      item-number: ${{ steps.detect.outputs.item-number }}
+      controllers: ${{ steps.detect.outputs.controllers }}
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Detect AI mention
+        id: detect
+        uses: actions/github-script@v7
+        with:
+          github-token: ${{ secrets.PAT_FOR_PR }}
+          script: |
+            // Load allowed users from config
+            const fs = require('fs');
+            let allowedUsers = [];
+            try {
+              const configContent = fs.readFileSync('.github/ai-controllers.json', 'utf8');
+              allowedUsers = JSON.parse(configContent);
+            } catch (error) {
+              console.log('Error loading allowed users:', error);
+              // Use fallback controllers if provided
+              const fallback = 'jtr4v';
+              allowedUsers = fallback ? fallback.split(',').map(u => u.trim()) : [];
+            }
+
+            // Get content and user from event payload
+            let content = '';
+            let userLogin = '';
+            let itemType = '';
+            let itemNumber = 0;
+
+            if (context.eventName === 'workflow_dispatch') {
+              // Manual trigger - fetch the issue/PR from GitHub API
+              itemType = context.payload.inputs['item-type'];
+              itemNumber = parseInt(context.payload.inputs['item-number'], 10);
+              userLogin = context.actor; // Use the person who triggered the workflow
+
+              if (itemType === 'issue') {
+                // First check issue body
+                const issue = await github.rest.issues.get({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  issue_number: itemNumber
+                });
+                content = issue.data.body || '';
+
+                // If no @d4dassistant in body, check comments
+                if (!content.includes('@d4dassistant')) {
+                  const comments = await github.rest.issues.listComments({
+                    owner: context.repo.owner,
+                    repo: context.repo.repo,
+                    issue_number: itemNumber
+                  });
+                  // Find the most recent comment with @d4dassistant
+                  for (let i = comments.data.length - 1; i >= 0; i--) {
+                    if (comments.data[i].body && comments.data[i].body.includes('@d4dassistant')) {
+                      content = comments.data[i].body;
+                      break;
+                    }
+                  }
+                }
+              } else if (itemType === 'pull_request') {
+                const pr = await github.rest.pulls.get({
+                  owner: context.repo.owner,
+                  repo: context.repo.repo,
+                  pull_number: itemNumber
+                });
+                content = pr.data.body || '';
+
+                // If no @d4dassistant in body, check comments
+                if (!content.includes('@d4dassistant')) {
+                  const comments = await github.rest.issues.listComments({
+                    owner: context.repo.owner,
+                    repo: context.repo.repo,
+                    issue_number: itemNumber
+                  });
+                  // Find the most recent comment with @d4dassistant
+                  for (let i = comments.data.length - 1; i >= 0; i--) {
+                    if (comments.data[i].body && comments.data[i].body.includes('@d4dassistant')) {
+                      content = comments.data[i].body;
+                      break;
+                    }
+                  }
+                }
+              }
+            } else if (context.eventName === 'issues') {
+              content = context.payload.issue.body || '';
+              userLogin = context.payload.issue.user.login;
+              itemType = 'issue';
+              itemNumber = context.payload.issue.number;
+            } else if (context.eventName === 'pull_request') {
+              content = context.payload.pull_request.body || '';
+              userLogin = context.payload.pull_request.user.login;
+              itemType = 'pull_request';
+              itemNumber = context.payload.pull_request.number;
+            } else if (context.eventName === 'issue_comment') {
+              content = context.payload.comment.body || '';
+              userLogin = context.payload.comment.user.login;
+              itemType = 'issue';
+              itemNumber = context.payload.issue.number;
+            } else if (context.eventName === 'pull_request_review_comment') {
+              content = context.payload.comment.body || '';
+              userLogin = context.payload.comment.user.login;
+              itemType = 'pull_request';
+              itemNumber = context.payload.pull_request.number;
+            }
+
+            // Check if user is allowed and mention exists
+            const isAllowed = allowedUsers.includes(userLogin);
+            const mentionRegex = new RegExp('@d4dassistant\\s+(.*)', 'i');
+            const mentionMatch = content.match(mentionRegex);
+
+            const qualifiedMention = isAllowed && mentionMatch !== null;
+            const prompt = qualifiedMention ? mentionMatch[1].trim() : '';
+
+            console.log(`User: ${userLogin}, Allowed: ${isAllowed}, Has mention: ${mentionMatch !== null}, Content: "${content}"`);
+
+            // Set outputs
+            core.setOutput('qualified-mention', qualifiedMention);
+            core.setOutput('prompt', prompt);
+            core.setOutput('user', userLogin);
+            core.setOutput('item-type', itemType);
+            core.setOutput('item-number', itemNumber);
+            core.setOutput('controllers', allowedUsers.map(u => '@' + u).join(', '));
+
+            return {
+              qualifiedMention,
+              itemType,
+              itemNumber,
+              prompt,
+              user: userLogin,
+              controllers: allowedUsers.map(u => '@' + u).join(', ')
+            };
+
+  respond-to-mention:
+    needs: check-mention
+    if: needs.check-mention.outputs.qualified-mention == 'true'
+    permissions:
+      contents: write
+      pull-requests: write
+      issues: write
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+          token: ${{ secrets.PAT_FOR_PR }}
+
+      - name: Respond with AI Agent
+        uses: dragon-ai-agent/run-claude-obo@v1.0.2
+        with:
+          anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }}
+          github-token: ${{ secrets.PAT_FOR_PR }}
+          prompt: ${{ needs.check-mention.outputs.prompt }}
+          user: ${{ needs.check-mention.outputs.user }}
+          item-type: ${{ needs.check-mention.outputs.item-type }}
+          item-number: ${{ needs.check-mention.outputs.item-number }}
+          controllers: ${{ needs.check-mention.outputs.controllers }}
+          agent-name: 'd4dassistant'
+          branch-prefix: 'd4dassistant'
+          robot-version: 'v1.9.7'
+          enable-robot: 'true'
+          enable-obo-scripts: 'true'
+          enable-python-tools: 'true'
+          python-packages: 'aurelian jinja2-cli "wrapt>=1.17.2"'
+          claude-allowed-tools: '["Bash(git:*)", "Bash(gh:*)", "FileSystem(*)"]'
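Reviewer note on the mention pattern in `Detect AI mention`: in a JavaScript regular expression without the `s` flag, `.` does not match line terminators, so `(.*)` captures only the remainder of the line that follows `@d4dassistant`. The sketch below reproduces that pattern next to a multi-line variant. `extractPrompt` and the `MULTI_LINE` pattern are hypothetical and not part of the commit. Whether the shorter capture matters depends on whether the downstream action re-reads the issue body, which this diff does not show.

```javascript
// Pattern copied from the workflow: "." stops at a newline,
// so the capture ends with the line that follows the mention.
const SINGLE_LINE = new RegExp('@d4dassistant\\s+(.*)', 'i');

// Hypothetical alternative (not in the commit): [\s\S] also matches newlines,
// so the capture runs to the end of the comment body.
const MULTI_LINE = new RegExp('@d4dassistant\\s+([\\s\\S]*)', 'i');

// Hypothetical helper: returns the trimmed prompt, or null when no mention is found.
// (The workflow uses an empty string instead of null in that case.)
function extractPrompt(content, pattern = SINGLE_LINE) {
  const match = content.match(pattern);
  return match ? match[1].trim() : null;
}

const body =
  '@d4dassistant Create a D4D for the Bridge2AI VOICE dataset\n\n' +
  'URL: https://physionet.org/content/b2ai-voice/';

console.log(extractPrompt(body));             // first line only
console.log(extractPrompt(body, MULTI_LINE)); // full request, including the URL line
```

Note that both patterns require at least one whitespace character after the mention, so a comment consisting of `@d4dassistant` alone does not qualify.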
diff --git a/.goosehints b/.goosehints
diff --git a/ASSISTANT_PR_STATUS.md b/ASSISTANT_PR_STATUS.md