
Conversation

@realmarcin
Collaborator

No description provided.

realmarcin and others added 2 commits December 3, 2025 23:12
This commit adds four detailed analysis documents investigating the root
causes of D4D validation failures:

1. SCHEMA_COMPLIANCE_REPORT.md - Complete LinkML schema validation
   analysis confirming schemas are 100% compliant. Documents that all
   validation errors stem from missing 'id' fields in generated data,
   not schema design issues. Provides 4 solution options with pros/cons
   and phased implementation recommendations.

2. ROOT_CAUSE_ANALYSIS.md - Deep dive into why GPT-5 omits required
   'id' fields. Proves that GPT-5 correctly uses 'response' field (as
   per schema definition) but fails to generate IDs because prompts
   lack explicit ID generation guidance. Documents inheritance chain:
   Purpose/Task → DatasetProperty → NamedThing (where id is required).

3. PROMPT_COMPARISON.md - Side-by-side comparison of Aurelian agent
   (GPT-5) and Claude Code deterministic prompts. Shows prompts are
   similar but not identical, with neither explicitly specifying field
   names or ID generation patterns. Proves prompt differences are not
   the root cause of validation failures.

4. SCHEMA_FIXES_REPORT.md - Documents schema enhancements made to
   support nested dataset resources. Added 'resources' field to Dataset
   class with inlined_as_list support. Fixed VOICE missing top-level
   metadata. Shows remaining errors are data quality issues, not schema
   structure problems.

Key findings:
- Schemas validate successfully (114 warnings, 0 errors)
- 80 validation errors across 4 projects due to missing 'id' fields
- LLMs omit IDs because prompts say "only populate what you're sure about"
- Solution options range from quick fix (make ID optional) to systematic
  fix (post-process YAML to auto-generate IDs)
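As a loose illustration of the systematic option, a post-processing pass could walk the generated YAML and synthesize any missing identifiers. The slot names and CURIE prefix below are assumptions for the sketch, not the fix prescribed by the reports.

```python
# Sketch only: add synthetic ids to objects reached via id-requiring slots.
# "purposes"/"tasks" and the "d4d:" prefix are illustrative assumptions.
import uuid
import yaml

ID_REQUIRED_KEYS = {"purposes", "tasks"}

def add_missing_ids(node, parent_key=None):
    """Recursively add an id to dicts nested under id-requiring slots."""
    if isinstance(node, dict):
        if parent_key in ID_REQUIRED_KEYS and "id" not in node:
            node["id"] = f"d4d:{uuid.uuid4()}"
        for key, value in node.items():
            add_missing_ids(value, parent_key=key)
    elif isinstance(node, list):
        for item in node:
            add_missing_ids(item, parent_key=parent_key)

with open("AI_READI_d4d.yaml") as fh:          # hypothetical input path
    data = yaml.safe_load(fh)
add_missing_ids(data)
with open("AI_READI_d4d.yaml", "w") as fh:
    yaml.safe_dump(data, fh, sort_keys=False)
```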

These reports provide complete context for deciding on validation fix
strategy and serve as documentation for future D4D generation improvements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit implements Phase 1 of making D4D prompts interchangeable between
the Aurelian agent (GPT-5) and Claude Code deterministic approaches.

New Infrastructure:

1. D4DPromptLoader class (src/download/prompt_loader.py)
   - Unified interface for loading D4D prompts from external files (see the sketch after this list)
   - Supports multiple prompt sets: "shared", "aurelian", "claude"
   - Configurable schema source with LOCAL as default (per user requirement)
   - Schema caching and SHA-256 hashing for provenance
   - Full metadata generation for reproducibility
   - Backward compatible helper functions

2. Unified Prompt Set (src/download/prompts/shared/)
   - d4d_system_prompt.txt - Best-of-both approach combining:
     * Aurelian's clarity and simplicity
     * Claude's detailed 17-point extraction checklist
     * Mode-agnostic (works for URL mode and content mode)
     * **Includes explicit ID generation guidance** (addresses validation issues!)

   - d4d_user_prompt_content_mode.txt - For pre-loaded content
   - d4d_user_prompt_url_mode.txt - For URL-based agent tool workflow
   - prompt_versions.yaml - Comprehensive version tracking with:
     * Version history and changelog
     * Compatibility matrix
     * Migration guide
     * Known issues tracking

3. Legacy Prompt Preservation (src/download/prompts/claude/)
   - Copied existing Claude prompts to claude/ subdirectory
   - Maintains backward compatibility
   - Allows gradual migration to shared prompts

4. Comprehensive Analysis Report (PROMPT_ARCHITECTURE_ANALYSIS.md)
   - 400+ line architectural analysis
   - Documents all compatibility differences
   - Provides detailed comparison of both approaches
   - Recommends migration strategies
   - Includes implementation guide for remaining phases
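
For orientation, here is a minimal sketch of the loader interface described in item 1 above; the class shape, method names, and default paths are assumptions, not the actual src/download/prompt_loader.py implementation.

```python
# Minimal sketch of a D4DPromptLoader-style interface (assumed API).
import hashlib
from pathlib import Path

class D4DPromptLoader:
    """Load a D4D prompt set from version-controlled files, with provenance."""

    def __init__(self, prompt_set: str = "shared",
                 prompts_dir: Path = Path("src/download/prompts"),
                 schema_path: Path | None = None):
        self.prompt_set = prompt_set
        self.prompts_dir = prompts_dir / prompt_set
        # Local schema file is the default source, for reproducibility.
        self.schema_path = schema_path or Path(
            "src/data_sheets_schema/schema/data_sheets_schema.yaml")
        self._schema_cache: str | None = None

    def load_system_prompt(self) -> str:
        return (self.prompts_dir / "d4d_system_prompt.txt").read_text()

    def load_schema(self) -> str:
        if self._schema_cache is None:
            self._schema_cache = self.schema_path.read_text()
        return self._schema_cache

    def provenance(self) -> dict:
        """SHA-256 hashes of prompt and schema, for run metadata."""
        def sha(text: str) -> str:
            return hashlib.sha256(text.encode()).hexdigest()
        return {
            "prompt_set": self.prompt_set,
            "system_prompt_sha256": sha(self.load_system_prompt()),
            "schema_sha256": sha(self.load_schema()),
        }
```

A caller would then record something like `D4DPromptLoader("shared").provenance()` alongside each extraction run.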

Key Features:

✅ Prompts work with both GPT-5 and Claude models
✅ Local schema source as default (reproducibility)
✅ ID generation guidance added to address validation failures
✅ Full provenance tracking with file hashing
✅ Backward compatible - existing code continues to work
✅ Mode-agnostic - supports both URL fetching and content embedding

Design Decisions:

- Schema source: Local file (default) for reproducibility
- Prompt storage: External version-controlled files
- Temperature: 0.0 recommended (deterministic)
- Execution modes: Both URL mode and content mode supported
- Backward compatibility: Maintained via legacy prompt sets

Benefits:

- Consistent prompts across all D4D extraction approaches
- Better prompt quality (combines best practices from both)
- Easier iteration and improvement of prompts
- Full reproducibility with local schema and version tracking
- Flexible execution modes for different use cases

Next Steps (Phases 2-6):

- Refactor Aurelian and Claude scripts to use D4DPromptLoader
- Add configuration options for prompt set selection
- Comprehensive testing
- Documentation updates

This is Phase 1 of a 6-phase plan. See PROMPT_ARCHITECTURE_ANALYSIS.md for
complete implementation roadmap.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI (Contributor) left a comment


Pull request overview

This pull request introduces a unified prompt infrastructure for D4D (Datasheets for Datasets) metadata extraction, enabling prompt sharing between the Aurelian agent approach (GPT-5) and Claude Code deterministic approach. The changes include new shared prompt templates, a flexible prompt loader system, and comprehensive documentation analyzing the architectural differences and compatibility between the two approaches.

Key changes:

  • Created unified prompt templates compatible with both URL-based and content-based extraction modes
  • Implemented D4DPromptLoader class for flexible prompt and schema loading with provenance tracking
  • Added extensive documentation analyzing schema compliance, validation issues, and migration strategies

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/download/prompts/shared/prompt_versions.yaml | Version tracking configuration for unified D4D prompts with compatibility matrix and migration guidance |
| src/download/prompts/shared/d4d_user_prompt_url_mode.txt | User prompt template for URL-based D4D extraction mode |
| src/download/prompts/shared/d4d_user_prompt_content_mode.txt | User prompt template for content-embedded D4D extraction mode |
| src/download/prompts/shared/d4d_system_prompt.txt | Unified system prompt combining best practices from both Aurelian and Claude approaches |
| src/download/prompts/claude/d4d_concatenated_user_prompt.txt | Legacy Claude user prompt template (v1.0.0) |
| src/download/prompts/claude/d4d_concatenated_system_prompt.txt | Legacy Claude system prompt template (v1.0.0) |
| src/download/prompt_loader.py | Unified prompt loader infrastructure with schema injection, caching, and provenance tracking |
| data/d4d_concatenated/claudecode/SCHEMA_FIXES_REPORT.md | Analysis of schema validation issues and fixes applied to support nested dataset resources |
| data/d4d_concatenated/claudecode/SCHEMA_COMPLIANCE_REPORT.md | Comprehensive report on LinkML schema compliance and data validation status |
| data/d4d_concatenated/claudecode/ROOT_CAUSE_ANALYSIS.md | Root cause analysis of validation errors identifying missing ID fields as core issue |
| data/d4d_concatenated/claudecode/PROMPT_COMPARISON.md | Detailed comparison of Aurelian agent vs Claude Code deterministic prompts |
| data/d4d_concatenated/claudecode/PROMPT_ARCHITECTURE_ANALYSIS.md | Architectural analysis of prompt compatibility and implementation strategy for unification |


```python
def _get_default_schema_path(self) -> Path:
    """Get default local schema path."""
    # Navigate up from prompts dir to project root
    project_root = self.prompts_dir.parent.parent.parent
```

Copilot AI commented Dec 4, 2025

Path navigation logic may fail in different project structures. The code assumes a specific directory depth (parent.parent.parent) which is fragile. If the prompts directory is moved or the file structure changes, this will break. Consider using a more robust approach, such as searching upward for a marker file (e.g., pyproject.toml) or using an environment variable.
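
A sketch of the kind of fix being suggested, assuming pyproject.toml is a reliable marker for this repository's root:

```python
# Walk upward until a marker file is found instead of hard-coding the depth.
from pathlib import Path

def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the first ancestor of `start` (or `start` itself) containing `marker`."""
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found above {start}")
```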


realmarcin (Collaborator, Author)


@copilot add a commit to this new pull request to apply changes based on this feedback

realmarcin and others added 21 commits December 5, 2025 16:42
New Claude Code agents:
- d4d-schema-expert.md: Schema knowledge and guidance
- d4d-validator.md: Validation workflows (linkml-validate, term-validator, reference-validator)
- d4d-mapper.md: Schema mapping with linkml-map

New hooks (warn-only mode):
- protect_schema_hook.py: Warns about editing auto-generated files
- validate_d4d_yaml_hook.py: Validates D4D YAML files after Edit/Write
- term_validator_hook.py: Validates ontology term references in schema files

New D4D extraction metadata schema:
- d4d_extract_process.yaml: LinkML schema for extraction metadata
- d4d_extract_metadata.py: Python utility class for building metadata
- Updated d4d_agent_wrapper.py to emit schema-conformant metadata
- Renamed process_d4d_deterministic.py to process_d4d_claude_API_temp0.py

Dependencies added:
- linkml-map ^0.3.8
- linkml-term-validator >=0.1.1 (Python >=3.10)
- linkml-reference-validator >=0.1.1 (Python >=3.10)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Schema changes:
- Add resources attribute to Dataset class for nested sub-resources
- Add inlined_as_list: true to resources fields for proper serialization

Updated D4D YAML files:
- AI_READI, CHORUS, CM4AI, VOICE (claudecode and gpt5 versions)

Regenerated artifacts:
- JSON Schema, JSON-LD context, OWL ontology
- Python datamodel

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Based on GitHub Actions workflow instructions, creates separate prompt
directories for different LLM approaches:

Claude Code Deterministic Assistant (src/download/prompts/claudecode/):
- d4d_deterministic_create.md - Creating new datasheets
- d4d_deterministic_edit.md - Editing existing datasheets
- README.md - Documentation

GPT-5 Assistant (src/download/prompts/gpt5/):
- d4d_assistant_create.md - Creating new datasheets
- d4d_assistant_edit.md - Editing existing datasheets
- README.md - Documentation

Key features:
- Minimal modifications from GitHub Actions workflow instructions
- Tool availability adapted for each environment
- Output locations: claudecode/ and gpt5/ directories
- Validation requirements preserved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implements comprehensive LLM-based quality evaluation using Claude Sonnet 4.5
to complement existing field-presence detection. Provides deep content
analysis with evidence-based scoring and actionable recommendations.

New Features:
- Two evaluation agents (d4d-rubric10, d4d-rubric20) for interactive use
- Batch evaluation via Python backend (evaluate_d4d_llm.py)
- Comparison tool for LLM vs presence-based evaluation
- Temperature: 0.0 for fully deterministic quality assessments
- Multi-format export (JSON, CSV, Markdown)

Components Added:

Agent Definitions:
- .claude/agents/d4d-rubric10.md - 10-element hierarchical rubric (50 pts)
- .claude/agents/d4d-rubric20.md - 20-question detailed rubric (84 pts)

Prompt Engineering:
- src/download/prompts/rubric10_system_prompt.md - Rubric10 LLM prompt
- src/download/prompts/rubric10_output_format.json - Expected JSON output
- src/download/prompts/rubric20_system_prompt.md - Rubric20 LLM prompt
- src/download/prompts/rubric20_output_format.json - Expected JSON output

Python Backend:
- src/evaluation/evaluate_d4d_llm.py - Main LLM evaluation script (700+ lines)
- src/evaluation/compare_evaluation_methods.py - Comparison analysis (300 lines)

Documentation:
- notes/LLM_EVALUATION.md - Complete methodology guide (450+ lines)
- notes/RUBRIC_AGENT_USAGE.md - Usage examples for Claude Code prompting
- CLAUDE.md - Updated with LLM evaluation section

Infrastructure:
- project.Makefile - Added 7 new targets for LLM evaluation

Key Design Decisions:
- Temperature 0.0 (not 0.5) for full determinism and reproducibility
- Complements (not replaces) existing presence-based evaluation
- Quality-based scoring assesses completeness, actionability, usefulness
- Same D4D file → Same quality score every time
- Enables reliable tracking of improvements over time

Usage:
  # Interactive (conversational)
  "Evaluate VOICE_d4d.yaml with rubric10"

  # Batch (Makefile)
  make evaluate-d4d-llm-both
  make compare-evaluations

Cost: ~$0.10-0.30 per file via Anthropic API
Time: ~30-60 seconds per evaluation
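
For reference, the reproducibility settings above (date-pinned model, temperature 0.0) amount to an Anthropic API call along these lines; the prompt handling and response parsing are simplified placeholders rather than the actual evaluate_d4d_llm.py code.

```python
# Simplified sketch of a deterministic rubric evaluation call (assumed wiring).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate(system_prompt: str, d4d_yaml: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",  # date-pinned model
        temperature=0.0,                     # deterministic scoring
        max_tokens=4096,
        system=system_prompt,                # rubric10/rubric20 system prompt
        messages=[{"role": "user", "content": d4d_yaml}],
    )
    return response.content[0].text          # expected to be rubric JSON
```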

🤖 Generated with Claude Code

Implements comprehensive, reproducible batch evaluation system with
Make integration, dry-run capability, and clear documentation.

New Scripts (Committed to Repo):
- src/evaluation/batch_evaluate_concatenated.sh
  * Evaluates all concatenated D4D files (15 files)
  * Time: ~25 minutes, Cost: ~$6
  * Includes dry-run mode
  * Built-in cost/time estimates
  * User confirmation before proceeding

- src/evaluation/batch_evaluate_individual.sh
  * Evaluates all individual D4D files (85 files)
  * Time: ~2 hours, Cost: ~$34
  * Supports filtering by PROJECT or METHOD
  * WARNING prompts for long-running evaluations

Makefile Targets Added:
- make evaluate-d4d-llm-batch-concatenated   # Main target
- make evaluate-d4d-llm-batch-dry-run        # Preview mode
- make evaluate-d4d-llm-batch-individual     # All individual files
- make evaluate-d4d-llm-batch-individual-filtered PROJECT=X|METHOD=Y
- make evaluate-d4d-llm-batch-all           # Complete evaluation

Documentation Updates:
- .claude/agents/d4d-rubric10.md: Added Reproducibility & Requirements sections
- .claude/agents/d4d-rubric20.md: Added Reproducibility & Requirements sections
- CLAUDE.md: Updated with batch evaluation workflow and reproducibility notes

Reproducibility Features:
- Temperature: 0.0 (fully deterministic)
- Model: claude-sonnet-4-5-20250929 (date-pinned)
- Rubrics: Version-controlled (data/rubric/)
- Prompts: Version-controlled (src/download/prompts/)
- Scripts: Version-controlled (src/evaluation/)
- Same D4D file → Same quality score every time

Key Design Decisions:
- Separate scripts for concatenated vs individual files
- Dry-run capability to preview evaluations
- Cost/time estimates before running
- User confirmation prompts for expensive operations
- Clear error messages for missing ANTHROPIC_API_KEY
- Progress tracking with counters

Usage Examples:
  # Preview what would be evaluated
  make evaluate-d4d-llm-batch-dry-run

  # Run batch evaluation (requires API key)
  export ANTHROPIC_API_KEY=sk-ant-...
  make evaluate-d4d-llm-batch-concatenated

  # Evaluate specific project only
  make evaluate-d4d-llm-batch-individual-filtered PROJECT=VOICE

All scripts are now committed to the repository and documented in
CLAUDE.md and agent files for easy reproducibility.

🤖 Generated with Claude Code

- Update d4d-rubric10 and d4d-rubric20 agents to show conversational workflow as PRIMARY mode
- Clarify agents work like d4d-agent/d4d-assistant (no API key needed)
- Mark external batch scripts as OPTIONAL for CI/CD automation only
- Add reproducibility notes emphasizing conversational use within Claude Code
- Add 'How These Agents Work' section explaining conversational evaluation
- Clarify no API key required for normal use (you're already using Claude Code)
- Update Example 4 to show conversational batch evaluation workflow
- Add Q&A about API key requirements and reproducibility
- Move Makefile targets to 'Optional: External Automation' section
- Emphasize conversational batch evaluation as primary method
- Document temperature=0.0 and model pinning for reproducibility

This commit adds a complete evaluation system for assessing D4D (Datasheets
for Datasets) generation quality across multiple methods and projects.

## New Evaluation Scripts

- `scripts/batch_evaluate_rubric10_hybrid.py` - Evaluates D4Ds using 10-element
  hierarchical rubric (50 sub-elements, max 50 points, binary scoring)
- `scripts/batch_evaluate_rubric20_hybrid.py` - Evaluates D4Ds using 20-question
  detailed rubric (4 categories, max 84 points, mixed numeric/pass-fail scoring)
- `scripts/summarize_rubric10_results.py` - Generates summary reports for rubric10
- `scripts/summarize_rubric20_results.py` - Generates summary reports for rubric20
- `scripts/evaluate_all_d4ds_rubric10.py` - Batch evaluation utility

## Evaluation Results (127 files evaluated)

**Rubric10 Results:**
- Average: 13.4/50 (26.9%)
- Best: 35/50 (70%) - VOICE/claudecode_agent
- Claude Code methods: 54% avg vs GPT-5: 23% avg (2.4× advantage)

**Rubric20 Results:**
- Average: 18.8/84 (22.4%)
- Best: 63/84 (75%) - VOICE/claudecode_agent
- Claude Code methods: 51% avg vs GPT-5: 16% avg (3.1× advantage)

**Files Generated:**
- `data/evaluation_llm/rubric10/` - 127 evaluations (16 concatenated + 111 individual)
- `data/evaluation_llm/rubric20/` - 127 evaluations (16 concatenated + 111 individual)
- CSV summaries, markdown reports, structured YAML summaries
- Cross-rubric comparison analysis

## LinkML Schema

- `src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml` - Structured schema
  for evaluation summaries with classes for performance metrics, insights, and
  comparative analysis

## Key Findings

1. **Method Rankings (consistent across both rubrics):**
   - claudecode: 52.6% (best)
   - claudecode_agent: 29.6%
   - claudecode_assistant: 26.1%
   - gpt5: 19.6%

2. **Synthesis Advantage:** Concatenated files (multi-source synthesis) score
   2× higher than individual files

3. **Common Gaps:** Version control (weakest), ethics/privacy documentation,
   file-level metadata, technical detail

4. **Top Performers:** All >60% scores are concatenated files using Claude Code
   methods

## Documentation

- `data/evaluation_llm/STRUCTURED_EVALUATION_SUMMARY.md` - 24KB comprehensive
  analysis with all tables, comparisons, and insights
- `data/evaluation_llm/EVALUATION_COVERAGE.md` - Coverage report documenting
  what's been evaluated (91% complete - missing curated files)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This commit updates both evaluation rubrics and their corresponding evaluation
scripts to align with the current D4D schema (v2.0), addressing field reference
mismatches and adding coverage for recently added schema fields.

## Changes to Rubrics

### data/rubric/rubric10.txt
- Add comprehensive field reference guide (96 lines) documenting all D4D schema fields
- Update all 10 elements with correct schema field names:
  - Element 1: Add resources, parent_datasets, related_datasets
  - Element 2: Add ip_restrictions, regulatory_restrictions, confidentiality_level
  - Element 3: Add intended_uses, prohibited_uses, discouraged_uses, variables
  - Element 4: Add participant_compensation, vulnerable_populations, participant_privacy
  - Element 5: Add variables, is_tabular, anomalies
  - Element 6: Add errata, was_derived_from
  - Element 7: Update to purposes, tasks, funders, creators
  - Element 8: Add collection_mechanisms, acquisition_methods, labeling_strategies
  - Element 9: Add known_biases, known_limitations, sensitive_elements, content_warnings
  - Element 10: Add citation, related_datasets with typed relationships
- Maintain max score: 50 points (10 elements × 5 sub-elements)

### data/rubric/rubric20.txt
- Add same comprehensive field reference guide
- Update all 20 questions with correct schema field names:
  - Q1: Add 9 new fields (resources, parent_datasets, variables, confidentiality_level, etc.)
  - Q4-Q5: Update to distribution_formats, format, media_type, bytes, instances
  - Q7: Add creators alongside funders
  - Q8: Expand to 9 fields for comprehensive ethics coverage
  - Q9: Add ip_restrictions, regulatory_restrictions, confidentiality_level
  - Q10: Update to format, encoding, conforms_to, conforms_to_schema
  - Q11-Q15: Update technical documentation fields
  - Q16-Q20: Add provenance, use guidance, and relationship fields
- Maintain max score: 84 points (16 numeric × 5 + 4 pass/fail × 1)

## Changes to Evaluation Scripts

### scripts/batch_evaluate_rubric10_hybrid.py
- Update RUBRIC10_ELEMENTS dictionary with all new schema fields
- Add new fields to all 10 elements' field lists for evaluation
- Maintain backward compatibility with old field names

### scripts/batch_evaluate_rubric20_hybrid.py
- Update RUBRIC20_QUESTIONS dictionary with all new schema fields
- Expand Q8 ethics coverage from 5 to 9 fields
- Add governance, provenance, and relationship fields across questions
- Maintain backward compatibility with old field names

## New Field Coverage

Added coverage for 20+ recently introduced schema fields:
- Human subjects: participant_compensation, vulnerable_populations, participant_privacy
- Data governance: ip_restrictions, regulatory_restrictions, confidentiality_level
- Hierarchical datasets: resources, parent_datasets, related_datasets
- Use guidance: intended_uses, prohibited_uses, discouraged_uses
- Data quality: anomalies, known_biases, known_limitations, variables

## Schema Alignment

All rubrics now reference actual schema class/attribute names instead of
non-existent dot-notation fields. Field references updated from patterns like
'access_and_licensing.access_policy' to correct schema fields like
'license_and_use_terms', 'ip_restrictions'.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

… v2.0

- Re-ran batch_evaluate_rubric10_hybrid.py on all D4D files
- Updated evaluation results reflect rubric10 changes:
  - New field reference guide with schema v2.0 alignment
  - Added 20+ recently added schema fields
  - Corrected field references from dot-notation to class names
- Evaluated 127 files total:
  - 16 concatenated D4D files (4 projects × 4 methods)
  - 111 individual D4D files (3 projects × 3 methods + varying counts)
- Results summary:
  - Average score: 14.7/50 (29.3%)
  - Best method: claudecode at 29.3/50 (58.7%)
  - Best project: CHORUS at 19.4/50 (38.9%)
  - Top performers: AI_READI_claudecode_assistant (78%), CM4AI_claudecode_agent (76%)
- Generated outputs:
  - all_scores.csv with comprehensive scoring data
  - summary_report.md with method/project comparisons
  - summary_table.md with tabular results
  - 127 JSON evaluation files with detailed assessments

… v2.0

- Re-ran batch_evaluate_rubric20_hybrid.py on all D4D files
- Updated evaluation results reflect rubric20 changes:
  - New field reference guide with schema v2.0 alignment
  - Expanded ethics coverage (9 fields in Q8)
  - Added governance fields (Q9)
  - Enhanced use guidance (Q18)
  - Updated hierarchical dataset relationships (Q20)
- Evaluated 127 files total:
  - 16 concatenated D4D files (4 projects × 4 methods)
  - 111 individual D4D files (3 projects × 3 methods + varying counts)
- Results summary:
  - Average score: 16.3/84 (19.4%)
  - Best method: claudecode at 38.7/84 (46.0%)
  - Best individual score: VOICE_claudecode_agent (69.0%)
  - Best project: CHORUS at 20.1/84 (23.9%)
- Rubric20 categories:
  - Cat1 (Structural): 7.2/20 avg (best)
  - Cat2 (Metadata): 3.1/20 avg
  - Cat3 (Technical): 2.7/25 avg (weakest)
  - Cat4 (FAIRness): 3.3/19 avg
- Generated outputs:
  - all_scores.csv with comprehensive scoring data
  - summary_report.md with method/project/category comparisons
  - summary_table.md with tabular results
  - 127 JSON evaluation files with detailed assessments

Issue: AI_READI files were being parsed incorrectly due to underscore in project name
- Project was extracted as 'AI' instead of 'AI_READI'
- Method was extracted as 'READI_claudecode' instead of 'claudecode'
- This caused AI_READI results to be excluded from summary reports

Root cause: Simple split on '_' doesn't handle project names with underscores

Fixes:
- Updated batch_evaluate_rubric10_hybrid.py: Added special handling for AI_READI
- Updated batch_evaluate_rubric20_hybrid.py: Added special handling for AI_READI
- Fixed all AI_READI evaluation JSON metadata (project/method fields)
- Regenerated summary tables with AI_READI now included

Results now show AI_READI claudecode concatenated scores:
- Rubric10: 34/50 (68.0%)
- Rubric20: 49/84 (58.3%)

Impact: AI_READI is now properly included in all evaluation summaries and comparisons
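
The committed fix special-cases AI_READI; a slightly more general version of the same idea, sketched below with an assumed `<PROJECT>_<method>` filename pattern, matches the prefix against the known project list before falling back to the plain underscore split.

```python
# Sketch: prefix-match against known project names instead of naive splitting.
KNOWN_PROJECTS = ["AI_READI", "CHORUS", "CM4AI", "VOICE"]

def parse_project_and_method(stem: str) -> tuple[str, str]:
    """Split e.g. 'AI_READI_claudecode' into ('AI_READI', 'claudecode')."""
    for project in sorted(KNOWN_PROJECTS, key=len, reverse=True):
        if stem.startswith(project + "_"):
            return project, stem[len(project) + 1:]
    project, _, method = stem.partition("_")   # previous behavior as fallback
    return project, method

assert parse_project_and_method("AI_READI_claudecode") == ("AI_READI", "claudecode")
```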

Features:
- Python script: scripts/generate_gc_approach_comparison.py
  - Loads evaluation results from rubric10 and rubric20
  - Generates comprehensive 16-row comparison (4 GCs × 4 approaches)
  - Calculates GC averages and overall method averages
  - Outputs both TSV and Markdown formats

- Makefile target: make gen-gc-approach-table
  - Runs generator script
  - Creates data/evaluation_llm/gc_approach_comparison.{tsv,md}

Output files:
- gc_approach_comparison.tsv: Tab-separated for analysis in Excel/R/Python
- gc_approach_comparison.md: Human-readable markdown table

Table format (30 rows):
- 4 rows per GC (one per approach) + 1 GC average = 5 rows × 4 GCs = 20 rows
- 4 overall method average rows
- 1 grand average row
- 5 blank separator rows
- Total: 30 rows
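
As a rough sketch (not the actual generate_gc_approach_comparison.py), the core of such a table is a pivot of the per-file scores by GC and approach; the column names below are assumptions about all_scores.csv.

```python
# Assumed columns in all_scores.csv: project (GC), method (approach), percent.
import pandas as pd

r10 = pd.read_csv("data/evaluation_llm/rubric10/all_scores.csv")
r20 = pd.read_csv("data/evaluation_llm/rubric20/all_scores.csv")

def pivot(df: pd.DataFrame) -> pd.DataFrame:
    # One row per GC, one column per approach, mean percent score per cell
    return df.pivot_table(index="project", columns="method",
                          values="percent", aggfunc="mean")

table = pivot(r10).join(pivot(r20), lsuffix="_r10", rsuffix="_r20")
table.to_csv("data/evaluation_llm/gc_approach_comparison.tsv", sep="\t")
```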

Usage:
  make gen-gc-approach-table

Results show:
- Best approach: claudecode_agent (74.0% R10, 62.5% R20)
- Best GC×Approach: AI_READI × claudecode_assistant (78.0% R10)
- Weakest: VOICE × gpt5 (2.0% R10, 0.0% R20)

New targets:
- evaluate-rubric10-all: Run Rubric10 hybrid evaluation on all 127 D4D files
- evaluate-rubric20-all: Run Rubric20 hybrid evaluation on all 127 D4D files
- evaluate-rubrics-all: Run both rubric evaluations
- evaluate-and-report: Complete pipeline (evaluate + generate comparison table)

Each target:
1. Runs batch_evaluate_rubric*_hybrid.py on all projects/methods
2. Generates summary reports with summarize_rubric*_results.py
3. Creates all output files in data/evaluation_llm/

Complete workflow:
  make evaluate-and-report

Or run individual steps:
  make evaluate-rubric10-all   # Just Rubric10
  make evaluate-rubric20-all   # Just Rubric20
  make gen-gc-approach-table   # Just generate table from existing results

This commit adds conversational Claude Code agents that perform semantic
quality evaluation of D4D datasheets with enhanced analysis capabilities.

New Files:
- .claude/agents/d4d-rubric10-semantic.md (576 lines)
  * Extends rubric10 with semantic analysis, correctness validation, consistency checking
  * DOI/grant number format validation with prefix plausibility
  * Cross-field consistency checks (human subjects → IRB approval)
  * Content accuracy assessment (ethics claims, deidentification methods)

- .claude/agents/d4d-rubric20-semantic.md (635 lines)
  * Extends rubric20 with same semantic enhancements
  * Adapted for 20-question format with 0-5 scoring scale
  * FAIR principle alignment checking
  * Category-specific semantic analysis

Updated Files:
- notes/RUBRIC_AGENT_USAGE.md (+264 lines)
  * Added comprehensive Semantic Evaluation Agents section
  * Usage examples comparing standard vs semantic evaluation
  * Detailed capability descriptions and when to use each agent
  * Example outputs showing semantic issue detection

Key Features:
✅ Semantic Understanding - Checks if content matches expected meaning
✅ Correctness Validation - DOI/grant number/RRID format + plausibility
✅ Consistency Checking - Cross-field logical relationships
✅ Content Accuracy - Ethics claims, funding patterns, temporal logic

Usage:
- "Evaluate [file] with rubric10-semantic"
- "Run semantic FAIR compliance check using rubric20-semantic"

Implementation:
- Conversational agents (no API calls required)
- Temperature: 0.0 (deterministic)
- Model: claude-sonnet-4-5-20250929
- Enhanced JSON output with semantic_analysis section

🤖 Generated with Claude Code

Update all four rubric agent definitions to specify D4D_Evaluation_Summary
schema compliance for batch evaluation outputs:

- .claude/agents/d4d-rubric10.md: Add batch summary section with
  element_performance structure (10 elements, max 50 points)
- .claude/agents/d4d-rubric20.md: Add batch summary section with
  category_performance structure (4 categories, max 84 points)
- .claude/agents/d4d-rubric10-semantic.md: Add batch summary with
  semantic_analysis_summary (issue tracking, consistency checks)
- .claude/agents/d4d-rubric20-semantic.md: Add batch summary with
  semantic_analysis_summary by category

All agents now specify:
- evaluation_summary.yaml conforming to EvaluationSummary class
- Required fields: overall_performance, method_comparison,
  project_comparison, element/category_performance
- Semantic agents include semantic_analysis_summary with issue
  breakdown, common consistency/correctness issues, and insights
- Additional outputs: all_scores.csv, summary_report.md

References: src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Generated summary reports for the rubric10-semantic evaluation of all 16
concatenated D4D files (15 successfully processed, 1 format incompatible).

Files added:
- data/evaluation_llm/rubric10_semantic/all_scores.csv - CSV with all scores
- data/evaluation_llm/rubric10_semantic/summary_report.md - Human-readable report
- scripts/generate_rubric10_semantic_summary_simple.py - Summary generator

Key findings:
- Average score: 28.2/50 (56.4%)
- Best performer: VOICE claudecode_agent - 46/50 (92.0%)
- Worst performer: CHORUS gpt5 - 7/50 (14.0%)
- By method: claudecode_agent leads with 66.7% average
- By project: VOICE scores highest with 75.0% average

Note: Simplified summary due to varying JSON output formats from different
evaluation agents. For detailed semantic analysis, refer to individual
evaluation JSON files in data/evaluation_llm/rubric10_semantic/concatenated/.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 54 out of 346 changed files in this pull request and generated 1 comment.



Copilot AI (Contributor) commented Dec 9, 2025

@realmarcin I've opened a new pull request, #102, to work on those changes. Once the pull request is ready, I'll request review from you.

```diff
@@ -76,6 +76,7 @@ classes:
     resources:
       range: Dataset
       multivalued: true
+      inlined_as_list: true
```

A contributor commented:

It may be feasible to just define resources as a slot and then include it as a slot wherever it's used, rather than having two different attributes with the same name

realmarcin and others added 11 commits December 9, 2025 17:48
Added comprehensive D4D YAML files generated using the claudecode_agent method:
- AI_READI_d4d.yaml (31 KB) - Score: 31/50 (62.0%)
- CHORUS_d4d.yaml (18 KB) - Score: 26/50 (52.0%)
- CM4AI_d4d.yaml (37 KB) - Score: 28/50 (56.0%)
- VOICE_d4d.yaml (68 KB) - Score: 46/50 (92.0%) - Best performer

These files were evaluated using rubric10-semantic and represent the
highest-scoring generation method (66.7% average) in the evaluation.

Generated: December 6, 2024 (VOICE updated December 6, 2024)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Generated human-readable HTML documentation for all 4 claudecode_agent
concatenated D4D files:

- AI_READI_d4d_human_readable.html (40 KB)
- CHORUS_d4d_human_readable.html (37 KB)
- CM4AI_d4d_human_readable.html (44 KB)
- VOICE_d4d_human_readable.html (59 KB)

Added rendering script:
- scripts/render_claudecode_agent_html.py

These HTML files provide accessible, human-readable documentation of the
D4D metadata for each project using the HumanReadableRenderer template.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added detailed JSON evaluation results for all 16 concatenated D4D files
(4 projects × 4 generation methods) using rubric10-semantic rubric.

Includes comprehensive semantic analysis with:
- Element-by-element scoring (10 elements, 50 sub-elements)
- Identifier validation (DOI, grant numbers, IRB, URLs)
- Consistency checking across related fields
- Completeness gap analysis (high/medium/low priority)
- Strengths and weaknesses documentation
- Actionable recommendations

Files: 16 evaluation JSON files (15 successful, 1 format incompatible)
Total size: ~450 KB of detailed semantic analysis

Evaluation conducted: December 9, 2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Generated human-readable HTML evaluation reports for all 4 claudecode_agent
concatenated D4D files:

- AI_READI_evaluation.html (60 KB) - Score: 31/50 (62.0%)
- CHORUS_evaluation.html (15 KB) - Score: 26/50 (52.0%)
- CM4AI_evaluation.html (11 KB) - Score: 28/50 (56.0%)
- VOICE_evaluation.html (12 KB) - Score: 46/50 (92.0%) - Best performer

Added rendering script:
- scripts/render_evaluation_html.py

HTML files include:
- Overall scores and metadata
- Element-by-element scoring with rationales
- Semantic analysis (identifier validation, consistency checks)
- Completeness gaps (high/medium/low priority)
- Strengths and weaknesses
- Actionable recommendations

Files are stored alongside the D4D documentation HTML in:
data/d4d_html/concatenated/claudecode_agent/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Updated LinkML schema modules with improvements:
- D4D_Base_import: Enhanced base classes and shared enums
- D4D_Collection: Improved collection process metadata
- D4D_Composition: Enhanced composition and structure fields
- D4D_Data_Governance: Updated governance and licensing
- D4D_Distribution: Improved distribution metadata
- D4D_Ethics: Enhanced ethics and human subjects fields
- D4D_Maintenance: Updated maintenance and versioning
- D4D_Motivation: Improved motivation and funding metadata
- D4D_Preprocessing: Enhanced preprocessing documentation
- D4D_Uses: Updated use cases and recommendations
- data_sheets_schema.yaml: Added new Dataset attributes
- data_sheets_schema_all.yaml: Regenerated merged schema

Regenerated artifacts:
- Python datamodel (data_sheets_schema.py)
- JSON-LD context
- JSON Schema
- OWL ontology

Total changes: 12 schema files, ~6,000 lines modified

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added comprehensive help sections for:
- Download & preprocess sources (new section)
- Interactive Claude Code approaches (Agent vs Assistant)
- Slash command documentation (/d4d-agent, /d4d-assistant)
- Enhanced extraction command descriptions

Updated prompt documentation:
- src/download/prompts/claudecode/README.md

Improves discoverability of D4D pipeline features and workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Updated all 4 claudecode method D4D files to align with enhanced schema:
- AI_READI_d4d.yaml
- CHORUS_d4d.yaml
- CM4AI_d4d.yaml
- VOICE_d4d.yaml

Removed obsolete file:
- CM4AI_d4d_regenerated.yaml

Changes reflect schema updates and improved metadata completeness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added slash commands for D4D workflows:
- /d4d-agent - Parallel agent-based D4D generation
- /d4d-assistant - In-session assistant-based D4D generation
- /d4d-webfetch - Web-based D4D extraction
- README.md - Slash command documentation

Added documentation:
- notes/D4D_AGENT_GITHUB_UNIFICATION.md - Agent workflow guide
- notes/RUBRIC10_EVALUATION_GUIDE.md - Evaluation rubric guide
- scripts/generate_rubric10_semantic_summary.py - Evaluation summary generator

These tools support the D4D metadata extraction and evaluation workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added claudecode_assistant concatenated D4D files:
- AI_READI_d4d.yaml
- CHORUS_d4d.yaml
- CM4AI_d4d.yaml
- VOICE_d4d.yaml

Updated extraction reports:
- data/raw/organized_extraction_report.md
- data/raw/organized_extraction_summary.json

The claudecode_assistant method represents in-session synthesis
with direct user interaction during generation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Convert external_resources from duplicated class attribute definitions
to a shared LinkML slot following best practices.

Changes:
- Add external_resources slot to D4D_Base_import.yaml with dcterms:references URI
- Update ExternalResource class to use slots + slot_usage (range: string)
- Update Dataset class to use slots + slot_usage (range: ExternalResource)
- Remove duplicate attribute definition from Dataset

Benefits:
- Single source of truth for external_resources semantics
- Follows LinkML best practices for slot reuse
- Easier maintenance with centralized slot_uri and properties
- No functional changes - maintains backward compatibility

The recursive structure still works correctly:
- Dataset.external_resources → list of ExternalResource objects
- ExternalResource.external_resources → list of URL strings

All tests pass and existing D4D YAML files validate successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Convert resources from duplicated class attribute definitions to a
shared LinkML slot, following the same pattern as external_resources.

Changes:
- Add resources slot to D4D_Base_import.yaml with range: Dataset
- Update DatasetCollection class to use slots + slot_usage
- Update Dataset class to use slots + slot_usage with nested description
- Remove duplicate attribute definitions

Benefits:
- Eliminates "Ambiguous attribute: resources" OWL generation warnings
- Single source of truth for resources semantics
- Follows LinkML best practices for slot reuse
- Consistent pattern with external_resources refactoring
- No functional changes - maintains backward compatibility

The recursive structure still works correctly:
- DatasetCollection.resources → list of Dataset objects
- Dataset.resources → list of nested Dataset objects (sub-resources)

All tests pass with no ambiguous attribute warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
realmarcin requested a review from Copilot on December 10, 2025 at 07:55

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 52 out of 544 changed files in this pull request and generated no new comments.



realmarcin merged commit 02021d0 into main on Dec 10, 2025 (3 checks passed).
realmarcin deleted the prompt-explore branch on December 10, 2025 at 07:59.
realmarcin restored the prompt-explore branch on December 10, 2025 at 20:25.