
Conversation

@realmarcin
Collaborator

No description provided.

realmarcin and others added 2 commits December 3, 2025 23:12
This commit adds four detailed analysis documents investigating the root
causes of D4D validation failures:

1. SCHEMA_COMPLIANCE_REPORT.md - Complete LinkML schema validation
   analysis confirming schemas are 100% compliant. Documents that all
   validation errors stem from missing 'id' fields in generated data,
   not schema design issues. Provides 4 solution options with pros/cons
   and phased implementation recommendations.

2. ROOT_CAUSE_ANALYSIS.md - Deep dive into why GPT-5 omits required
   'id' fields. Proves that GPT-5 correctly uses 'response' field (as
   per schema definition) but fails to generate IDs because prompts
   lack explicit ID generation guidance. Documents inheritance chain:
   Purpose/Task → DatasetProperty → NamedThing (where id is required).

3. PROMPT_COMPARISON.md - Side-by-side comparison of Aurelian agent
   (GPT-5) and Claude Code deterministic prompts. Shows prompts are
   similar but not identical, with neither explicitly specifying field
   names or ID generation patterns. Proves prompt differences are not
   the root cause of validation failures.

4. SCHEMA_FIXES_REPORT.md - Documents schema enhancements made to
   support nested dataset resources. Added 'resources' field to Dataset
   class with inlined_as_list support. Fixed VOICE missing top-level
   metadata. Shows remaining errors are data quality issues, not schema
   structure problems.

Key findings:
- Schemas validate successfully (114 warnings, 0 errors)
- 80 validation errors across 4 projects due to missing 'id' fields
- LLMs omit IDs because prompts say "only populate what you're sure about"
- Solution options range from quick fix (make ID optional) to systematic
  fix (post-process YAML to auto-generate IDs)
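As a loose illustration of the systematic option, a post-processing pass could walk the generated YAML and synthesize any missing identifiers. The slot names and CURIE prefix below are assumptions for the sketch, not the fix prescribed by the reports.

```python
# Sketch only: add synthetic ids to objects reached via id-requiring slots.
# "purposes"/"tasks" and the "d4d:" prefix are illustrative assumptions.
import uuid
import yaml

ID_REQUIRED_KEYS = {"purposes", "tasks"}

def add_missing_ids(node, parent_key=None):
    """Recursively add an id to dicts nested under id-requiring slots."""
    if isinstance(node, dict):
        if parent_key in ID_REQUIRED_KEYS and "id" not in node:
            node["id"] = f"d4d:{uuid.uuid4()}"
        for key, value in node.items():
            add_missing_ids(value, parent_key=key)
    elif isinstance(node, list):
        for item in node:
            add_missing_ids(item, parent_key=parent_key)

with open("AI_READI_d4d.yaml") as fh:          # hypothetical input path
    data = yaml.safe_load(fh)
add_missing_ids(data)
with open("AI_READI_d4d.yaml", "w") as fh:
    yaml.safe_dump(data, fh, sort_keys=False)
```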

These reports provide complete context for deciding on validation fix
strategy and serve as documentation for future D4D generation improvements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

This commit implements Phase 1 of making D4D prompts interchangeable between
the Aurelian agent (GPT-5) and Claude Code deterministic approaches.

New Infrastructure:

1. D4DPromptLoader class (src/download/prompt_loader.py)
   - Unified interface for loading D4D prompts from external files (see the sketch after this list)
   - Supports multiple prompt sets: "shared", "aurelian", "claude"
   - Configurable schema source with LOCAL as default (per user requirement)
   - Schema caching and SHA-256 hashing for provenance
   - Full metadata generation for reproducibility
   - Backward compatible helper functions

2. Unified Prompt Set (src/download/prompts/shared/)
   - d4d_system_prompt.txt - Best-of-both approach combining:
     * Aurelian's clarity and simplicity
     * Claude's detailed 17-point extraction checklist
     * Mode-agnostic (works for URL mode and content mode)
     * **Includes explicit ID generation guidance** (addresses validation issues!)

   - d4d_user_prompt_content_mode.txt - For pre-loaded content
   - d4d_user_prompt_url_mode.txt - For URL-based agent tool workflow
   - prompt_versions.yaml - Comprehensive version tracking with:
     * Version history and changelog
     * Compatibility matrix
     * Migration guide
     * Known issues tracking

3. Legacy Prompt Preservation (src/download/prompts/claude/)
   - Copied existing Claude prompts to claude/ subdirectory
   - Maintains backward compatibility
   - Allows gradual migration to shared prompts

4. Comprehensive Analysis Report (PROMPT_ARCHITECTURE_ANALYSIS.md)
   - 400+ line architectural analysis
   - Documents all compatibility differences
   - Provides detailed comparison of both approaches
   - Recommends migration strategies
   - Includes implementation guide for remaining phases
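
For orientation, here is a minimal sketch of the loader interface described in item 1 above; the class shape, method names, and default paths are assumptions, not the actual src/download/prompt_loader.py implementation.

```python
# Minimal sketch of a D4DPromptLoader-style interface (assumed API).
import hashlib
from pathlib import Path

class D4DPromptLoader:
    """Load a D4D prompt set from version-controlled files, with provenance."""

    def __init__(self, prompt_set: str = "shared",
                 prompts_dir: Path = Path("src/download/prompts"),
                 schema_path: Path | None = None):
        self.prompt_set = prompt_set
        self.prompts_dir = prompts_dir / prompt_set
        # Local schema file is the default source, for reproducibility.
        self.schema_path = schema_path or Path(
            "src/data_sheets_schema/schema/data_sheets_schema.yaml")
        self._schema_cache: str | None = None

    def load_system_prompt(self) -> str:
        return (self.prompts_dir / "d4d_system_prompt.txt").read_text()

    def load_schema(self) -> str:
        if self._schema_cache is None:
            self._schema_cache = self.schema_path.read_text()
        return self._schema_cache

    def provenance(self) -> dict:
        """SHA-256 hashes of prompt and schema, for run metadata."""
        def sha(text: str) -> str:
            return hashlib.sha256(text.encode()).hexdigest()
        return {
            "prompt_set": self.prompt_set,
            "system_prompt_sha256": sha(self.load_system_prompt()),
            "schema_sha256": sha(self.load_schema()),
        }
```

A caller would then record something like `D4DPromptLoader("shared").provenance()` alongside each extraction run.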

Key Features:

✅ Prompts work with both GPT-5 and Claude models
✅ Local schema source as default (reproducibility)
✅ ID generation guidance added to address validation failures
✅ Full provenance tracking with file hashing
✅ Backward compatible - existing code continues to work
✅ Mode-agnostic - supports both URL fetching and content embedding

Design Decisions:

- Schema source: Local file (default) for reproducibility
- Prompt storage: External version-controlled files
- Temperature: 0.0 recommended (deterministic)
- Execution modes: Both URL mode and content mode supported
- Backward compatibility: Maintained via legacy prompt sets

Benefits:

- Consistent prompts across all D4D extraction approaches
- Better prompt quality (combines best practices from both)
- Easier iteration and improvement of prompts
- Full reproducibility with local schema and version tracking
- Flexible execution modes for different use cases

Next Steps (Phases 2-6):

- Refactor Aurelian and Claude scripts to use D4DPromptLoader
- Add configuration options for prompt set selection
- Comprehensive testing
- Documentation updates

This is Phase 1 of a 6-phase plan. See PROMPT_ARCHITECTURE_ANALYSIS.md for
complete implementation roadmap.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Copilot AI (Contributor) left a comment


Pull request overview

This pull request introduces a unified prompt infrastructure for D4D (Datasheets for Datasets) metadata extraction, enabling prompt sharing between the Aurelian agent approach (GPT-5) and Claude Code deterministic approach. The changes include new shared prompt templates, a flexible prompt loader system, and comprehensive documentation analyzing the architectural differences and compatibility between the two approaches.

Key changes:

  • Created unified prompt templates compatible with both URL-based and content-based extraction modes
  • Implemented D4DPromptLoader class for flexible prompt and schema loading with provenance tracking
  • Added extensive documentation analyzing schema compliance, validation issues, and migration strategies

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| src/download/prompts/shared/prompt_versions.yaml | Version tracking configuration for unified D4D prompts with compatibility matrix and migration guidance |
| src/download/prompts/shared/d4d_user_prompt_url_mode.txt | User prompt template for URL-based D4D extraction mode |
| src/download/prompts/shared/d4d_user_prompt_content_mode.txt | User prompt template for content-embedded D4D extraction mode |
| src/download/prompts/shared/d4d_system_prompt.txt | Unified system prompt combining best practices from both Aurelian and Claude approaches |
| src/download/prompts/claude/d4d_concatenated_user_prompt.txt | Legacy Claude user prompt template (v1.0.0) |
| src/download/prompts/claude/d4d_concatenated_system_prompt.txt | Legacy Claude system prompt template (v1.0.0) |
| src/download/prompt_loader.py | Unified prompt loader infrastructure with schema injection, caching, and provenance tracking |
| data/d4d_concatenated/claudecode/SCHEMA_FIXES_REPORT.md | Analysis of schema validation issues and fixes applied to support nested dataset resources |
| data/d4d_concatenated/claudecode/SCHEMA_COMPLIANCE_REPORT.md | Comprehensive report on LinkML schema compliance and data validation status |
| data/d4d_concatenated/claudecode/ROOT_CAUSE_ANALYSIS.md | Root cause analysis of validation errors identifying missing ID fields as core issue |
| data/d4d_concatenated/claudecode/PROMPT_COMPARISON.md | Detailed comparison of Aurelian agent vs Claude Code deterministic prompts |
| data/d4d_concatenated/claudecode/PROMPT_ARCHITECTURE_ANALYSIS.md | Architectural analysis of prompt compatibility and implementation strategy for unification |


```python
def _get_default_schema_path(self) -> Path:
    """Get default local schema path."""
    # Navigate up from prompts dir to project root
    project_root = self.prompts_dir.parent.parent.parent
```

Copilot AI commented Dec 4, 2025

Path navigation logic may fail in different project structures. The code assumes a specific directory depth (parent.parent.parent) which is fragile. If the prompts directory is moved or the file structure changes, this will break. Consider using a more robust approach, such as searching upward for a marker file (e.g., pyproject.toml) or using an environment variable.
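
A sketch of the kind of fix being suggested, assuming pyproject.toml is a reliable marker for this repository's root:

```python
# Walk upward until a marker file is found instead of hard-coding the depth.
from pathlib import Path

def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the first ancestor of `start` (or `start` itself) containing `marker`."""
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no {marker} found above {start}")
```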


realmarcin (Collaborator, Author)


@copilot add a commit to this new pull request to apply changes based on this feedback

realmarcin and others added 21 commits December 5, 2025 16:42
New Claude Code agents:
- d4d-schema-expert.md: Schema knowledge and guidance
- d4d-validator.md: Validation workflows (linkml-validate, term-validator, reference-validator)
- d4d-mapper.md: Schema mapping with linkml-map

New hooks (warn-only mode):
- protect_schema_hook.py: Warns about editing auto-generated files
- validate_d4d_yaml_hook.py: Validates D4D YAML files after Edit/Write
- term_validator_hook.py: Validates ontology term references in schema files

New D4D extraction metadata schema:
- d4d_extract_process.yaml: LinkML schema for extraction metadata
- d4d_extract_metadata.py: Python utility class for building metadata
- Updated d4d_agent_wrapper.py to emit schema-conformant metadata
- Renamed process_d4d_deterministic.py to process_d4d_claude_API_temp0.py

Dependencies added:
- linkml-map ^0.3.8
- linkml-term-validator >=0.1.1 (Python >=3.10)
- linkml-reference-validator >=0.1.1 (Python >=3.10)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Schema changes:
- Add resources attribute to Dataset class for nested sub-resources
- Add inlined_as_list: true to resources fields for proper serialization

Updated D4D YAML files:
- AI_READI, CHORUS, CM4AI, VOICE (claudecode and gpt5 versions)

Regenerated artifacts:
- JSON Schema, JSON-LD context, OWL ontology
- Python datamodel

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Based on GitHub Actions workflow instructions, creates separate prompt
directories for different LLM approaches:

Claude Code Deterministic Assistant (src/download/prompts/claudecode/):
- d4d_deterministic_create.md - Creating new datasheets
- d4d_deterministic_edit.md - Editing existing datasheets
- README.md - Documentation

GPT-5 Assistant (src/download/prompts/gpt5/):
- d4d_assistant_create.md - Creating new datasheets
- d4d_assistant_edit.md - Editing existing datasheets
- README.md - Documentation

Key features:
- Minimal modifications from GitHub Actions workflow instructions
- Tool availability adapted for each environment
- Output locations: claudecode/ and gpt5/ directories
- Validation requirements preserved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

Implements comprehensive LLM-based quality evaluation using Claude Sonnet 4.5
to complement existing field-presence detection. Provides deep content
analysis with evidence-based scoring and actionable recommendations.

New Features:
- Two evaluation agents (d4d-rubric10, d4d-rubric20) for interactive use
- Batch evaluation via Python backend (evaluate_d4d_llm.py)
- Comparison tool for LLM vs presence-based evaluation
- Temperature: 0.0 for fully deterministic quality assessments
- Multi-format export (JSON, CSV, Markdown)

Components Added:

Agent Definitions:
- .claude/agents/d4d-rubric10.md - 10-element hierarchical rubric (50 pts)
- .claude/agents/d4d-rubric20.md - 20-question detailed rubric (84 pts)

Prompt Engineering:
- src/download/prompts/rubric10_system_prompt.md - Rubric10 LLM prompt
- src/download/prompts/rubric10_output_format.json - Expected JSON output
- src/download/prompts/rubric20_system_prompt.md - Rubric20 LLM prompt
- src/download/prompts/rubric20_output_format.json - Expected JSON output

Python Backend:
- src/evaluation/evaluate_d4d_llm.py - Main LLM evaluation script (700+ lines)
- src/evaluation/compare_evaluation_methods.py - Comparison analysis (300 lines)

Documentation:
- notes/LLM_EVALUATION.md - Complete methodology guide (450+ lines)
- notes/RUBRIC_AGENT_USAGE.md - Usage examples for Claude Code prompting
- CLAUDE.md - Updated with LLM evaluation section

Infrastructure:
- project.Makefile - Added 7 new targets for LLM evaluation

Key Design Decisions:
- Temperature 0.0 (not 0.5) for full determinism and reproducibility
- Complements (not replaces) existing presence-based evaluation
- Quality-based scoring assesses completeness, actionability, usefulness
- Same D4D file → Same quality score every time
- Enables reliable tracking of improvements over time

Usage:
  # Interactive (conversational)
  "Evaluate VOICE_d4d.yaml with rubric10"

  # Batch (Makefile)
  make evaluate-d4d-llm-both
  make compare-evaluations

Cost: ~$0.10-0.30 per file via Anthropic API
Time: ~30-60 seconds per evaluation
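
For reference, the reproducibility settings above (date-pinned model, temperature 0.0) amount to an Anthropic API call along these lines; the prompt handling and response parsing are simplified placeholders rather than the actual evaluate_d4d_llm.py code.

```python
# Simplified sketch of a deterministic rubric evaluation call (assumed wiring).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluate(system_prompt: str, d4d_yaml: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",  # date-pinned model
        temperature=0.0,                     # deterministic scoring
        max_tokens=4096,
        system=system_prompt,                # rubric10/rubric20 system prompt
        messages=[{"role": "user", "content": d4d_yaml}],
    )
    return response.content[0].text          # expected to be rubric JSON
```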

🤖 Generated with Claude Code

Implements comprehensive, reproducible batch evaluation system with
Make integration, dry-run capability, and clear documentation.

New Scripts (Committed to Repo):
- src/evaluation/batch_evaluate_concatenated.sh
  * Evaluates all concatenated D4D files (15 files)
  * Time: ~25 minutes, Cost: ~$6
  * Includes dry-run mode
  * Built-in cost/time estimates
  * User confirmation before proceeding

- src/evaluation/batch_evaluate_individual.sh
  * Evaluates all individual D4D files (85 files)
  * Time: ~2 hours, Cost: ~$34
  * Supports filtering by PROJECT or METHOD
  * WARNING prompts for long-running evaluations

Makefile Targets Added:
- make evaluate-d4d-llm-batch-concatenated   # Main target
- make evaluate-d4d-llm-batch-dry-run        # Preview mode
- make evaluate-d4d-llm-batch-individual     # All individual files
- make evaluate-d4d-llm-batch-individual-filtered PROJECT=X|METHOD=Y
- make evaluate-d4d-llm-batch-all           # Complete evaluation

Documentation Updates:
- .claude/agents/d4d-rubric10.md: Added Reproducibility & Requirements sections
- .claude/agents/d4d-rubric20.md: Added Reproducibility & Requirements sections
- CLAUDE.md: Updated with batch evaluation workflow and reproducibility notes

Reproducibility Features:
- Temperature: 0.0 (fully deterministic)
- Model: claude-sonnet-4-5-20250929 (date-pinned)
- Rubrics: Version-controlled (data/rubric/)
- Prompts: Version-controlled (src/download/prompts/)
- Scripts: Version-controlled (src/evaluation/)
- Same D4D file → Same quality score every time

Key Design Decisions:
- Separate scripts for concatenated vs individual files
- Dry-run capability to preview evaluations
- Cost/time estimates before running
- User confirmation prompts for expensive operations
- Clear error messages for missing ANTHROPIC_API_KEY
- Progress tracking with counters

Usage Examples:
  # Preview what would be evaluated
  make evaluate-d4d-llm-batch-dry-run

  # Run batch evaluation (requires API key)
  export ANTHROPIC_API_KEY=sk-ant-...
  make evaluate-d4d-llm-batch-concatenated

  # Evaluate specific project only
  make evaluate-d4d-llm-batch-individual-filtered PROJECT=VOICE

All scripts are now committed to the repository and documented in
CLAUDE.md and agent files for easy reproducibility.

🤖 Generated with Claude Code

- Update d4d-rubric10 and d4d-rubric20 agents to show conversational workflow as PRIMARY mode
- Clarify agents work like d4d-agent/d4d-assistant (no API key needed)
- Mark external batch scripts as OPTIONAL for CI/CD automation only
- Add reproducibility notes emphasizing conversational use within Claude Code
- Add 'How These Agents Work' section explaining conversational evaluation
- Clarify no API key required for normal use (you're already using Claude Code)
- Update Example 4 to show conversational batch evaluation workflow
- Add Q&A about API key requirements and reproducibility
- Move Makefile targets to 'Optional: External Automation' section
- Emphasize conversational batch evaluation as primary method
- Document temperature=0.0 and model pinning for reproducibility

This commit adds a complete evaluation system for assessing D4D (Datasheets
for Datasets) generation quality across multiple methods and projects.

## New Evaluation Scripts

- `scripts/batch_evaluate_rubric10_hybrid.py` - Evaluates D4Ds using 10-element
  hierarchical rubric (50 sub-elements, max 50 points, binary scoring)
- `scripts/batch_evaluate_rubric20_hybrid.py` - Evaluates D4Ds using 20-question
  detailed rubric (4 categories, max 84 points, mixed numeric/pass-fail scoring)
- `scripts/summarize_rubric10_results.py` - Generates summary reports for rubric10
- `scripts/summarize_rubric20_results.py` - Generates summary reports for rubric20
- `scripts/evaluate_all_d4ds_rubric10.py` - Batch evaluation utility

## Evaluation Results (127 files evaluated)

**Rubric10 Results:**
- Average: 13.4/50 (26.9%)
- Best: 35/50 (70%) - VOICE/claudecode_agent
- Claude Code methods: 54% avg vs GPT-5: 23% avg (2.4× advantage)

**Rubric20 Results:**
- Average: 18.8/84 (22.4%)
- Best: 63/84 (75%) - VOICE/claudecode_agent
- Claude Code methods: 51% avg vs GPT-5: 16% avg (3.1× advantage)

**Files Generated:**
- `data/evaluation_llm/rubric10/` - 127 evaluations (16 concatenated + 111 individual)
- `data/evaluation_llm/rubric20/` - 127 evaluations (16 concatenated + 111 individual)
- CSV summaries, markdown reports, structured YAML summaries
- Cross-rubric comparison analysis

## LinkML Schema

- `src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml` - Structured schema
  for evaluation summaries with classes for performance metrics, insights, and
  comparative analysis

## Key Findings

1. **Method Rankings (consistent across both rubrics):**
   - claudecode: 52.6% (best)
   - claudecode_agent: 29.6%
   - claudecode_assistant: 26.1%
   - gpt5: 19.6%

2. **Synthesis Advantage:** Concatenated files (multi-source synthesis) score
   2× higher than individual files

3. **Common Gaps:** Version control (weakest), ethics/privacy documentation,
   file-level metadata, technical detail

4. **Top Performers:** All >60% scores are concatenated files using Claude Code
   methods

## Documentation

- `data/evaluation_llm/STRUCTURED_EVALUATION_SUMMARY.md` - 24KB comprehensive
  analysis with all tables, comparisons, and insights
- `data/evaluation_llm/EVALUATION_COVERAGE.md` - Coverage report documenting
  what's been evaluated (91% complete - missing curated files)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

This commit updates both evaluation rubrics and their corresponding evaluation
scripts to align with the current D4D schema (v2.0), addressing field reference
mismatches and adding coverage for recently added schema fields.

## Changes to Rubrics

### data/rubric/rubric10.txt
- Add comprehensive field reference guide (96 lines) documenting all D4D schema fields
- Update all 10 elements with correct schema field names:
  - Element 1: Add resources, parent_datasets, related_datasets
  - Element 2: Add ip_restrictions, regulatory_restrictions, confidentiality_level
  - Element 3: Add intended_uses, prohibited_uses, discouraged_uses, variables
  - Element 4: Add participant_compensation, vulnerable_populations, participant_privacy
  - Element 5: Add variables, is_tabular, anomalies
  - Element 6: Add errata, was_derived_from
  - Element 7: Update to purposes, tasks, funders, creators
  - Element 8: Add collection_mechanisms, acquisition_methods, labeling_strategies
  - Element 9: Add known_biases, known_limitations, sensitive_elements, content_warnings
  - Element 10: Add citation, related_datasets with typed relationships
- Maintain max score: 50 points (10 elements × 5 sub-elements)

### data/rubric/rubric20.txt
- Add same comprehensive field reference guide
- Update all 20 questions with correct schema field names:
  - Q1: Add 9 new fields (resources, parent_datasets, variables, confidentiality_level, etc.)
  - Q4-Q5: Update to distribution_formats, format, media_type, bytes, instances
  - Q7: Add creators alongside funders
  - Q8: Expand to 9 fields for comprehensive ethics coverage
  - Q9: Add ip_restrictions, regulatory_restrictions, confidentiality_level
  - Q10: Update to format, encoding, conforms_to, conforms_to_schema
  - Q11-Q15: Update technical documentation fields
  - Q16-Q20: Add provenance, use guidance, and relationship fields
- Maintain max score: 84 points (16 numeric × 5 + 4 pass/fail × 1)

## Changes to Evaluation Scripts

### scripts/batch_evaluate_rubric10_hybrid.py
- Update RUBRIC10_ELEMENTS dictionary with all new schema fields
- Add new fields to all 10 elements' field lists for evaluation
- Maintain backward compatibility with old field names

### scripts/batch_evaluate_rubric20_hybrid.py
- Update RUBRIC20_QUESTIONS dictionary with all new schema fields
- Expand Q8 ethics coverage from 5 to 9 fields
- Add governance, provenance, and relationship fields across questions
- Maintain backward compatibility with old field names

## New Field Coverage

Added coverage for 20+ recently introduced schema fields:
- Human subjects: participant_compensation, vulnerable_populations, participant_privacy
- Data governance: ip_restrictions, regulatory_restrictions, confidentiality_level
- Hierarchical datasets: resources, parent_datasets, related_datasets
- Use guidance: intended_uses, prohibited_uses, discouraged_uses
- Data quality: anomalies, known_biases, known_limitations, variables

## Schema Alignment

All rubrics now reference actual schema class/attribute names instead of
non-existent dot-notation fields. Field references updated from patterns like
'access_and_licensing.access_policy' to correct schema fields like
'license_and_use_terms', 'ip_restrictions'.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

… v2.0

- Re-ran batch_evaluate_rubric10_hybrid.py on all D4D files
- Updated evaluation results reflect rubric10 changes:
  - New field reference guide with schema v2.0 alignment
  - Added 20+ recently added schema fields
  - Corrected field references from dot-notation to class names
- Evaluated 127 files total:
  - 16 concatenated D4D files (4 projects × 4 methods)
  - 111 individual D4D files (3 projects × 3 methods + varying counts)
- Results summary:
  - Average score: 14.7/50 (29.3%)
  - Best method: claudecode at 29.3/50 (58.7%)
  - Best project: CHORUS at 19.4/50 (38.9%)
  - Top performers: AI_READI_claudecode_assistant (78%), CM4AI_claudecode_agent (76%)
- Generated outputs:
  - all_scores.csv with comprehensive scoring data
  - summary_report.md with method/project comparisons
  - summary_table.md with tabular results
  - 127 JSON evaluation files with detailed assessments

… v2.0

- Re-ran batch_evaluate_rubric20_hybrid.py on all D4D files
- Updated evaluation results reflect rubric20 changes:
  - New field reference guide with schema v2.0 alignment
  - Expanded ethics coverage (9 fields in Q8)
  - Added governance fields (Q9)
  - Enhanced use guidance (Q18)
  - Updated hierarchical dataset relationships (Q20)
- Evaluated 127 files total:
  - 16 concatenated D4D files (4 projects × 4 methods)
  - 111 individual D4D files (3 projects × 3 methods + varying counts)
- Results summary:
  - Average score: 16.3/84 (19.4%)
  - Best method: claudecode at 38.7/84 (46.0%)
  - Best individual score: VOICE_claudecode_agent (69.0%)
  - Best project: CHORUS at 20.1/84 (23.9%)
- Rubric20 categories:
  - Cat1 (Structural): 7.2/20 avg (best)
  - Cat2 (Metadata): 3.1/20 avg
  - Cat3 (Technical): 2.7/25 avg (weakest)
  - Cat4 (FAIRness): 3.3/19 avg
- Generated outputs:
  - all_scores.csv with comprehensive scoring data
  - summary_report.md with method/project/category comparisons
  - summary_table.md with tabular results
  - 127 JSON evaluation files with detailed assessments

Issue: AI_READI files were being parsed incorrectly due to underscore in project name
- Project was extracted as 'AI' instead of 'AI_READI'
- Method was extracted as 'READI_claudecode' instead of 'claudecode'
- This caused AI_READI results to be excluded from summary reports

Root cause: Simple split on '_' doesn't handle project names with underscores

Fixes:
- Updated batch_evaluate_rubric10_hybrid.py: Added special handling for AI_READI
- Updated batch_evaluate_rubric20_hybrid.py: Added special handling for AI_READI
- Fixed all AI_READI evaluation JSON metadata (project/method fields)
- Regenerated summary tables with AI_READI now included

Results now show AI_READI claudecode concatenated scores:
- Rubric10: 34/50 (68.0%)
- Rubric20: 49/84 (58.3%)

Impact: AI_READI is now properly included in all evaluation summaries and comparisons
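
The committed fix special-cases AI_READI; a slightly more general version of the same idea, sketched below with an assumed `<PROJECT>_<method>` filename pattern, matches the prefix against the known project list before falling back to the plain underscore split.

```python
# Sketch: prefix-match against known project names instead of naive splitting.
KNOWN_PROJECTS = ["AI_READI", "CHORUS", "CM4AI", "VOICE"]

def parse_project_and_method(stem: str) -> tuple[str, str]:
    """Split e.g. 'AI_READI_claudecode' into ('AI_READI', 'claudecode')."""
    for project in sorted(KNOWN_PROJECTS, key=len, reverse=True):
        if stem.startswith(project + "_"):
            return project, stem[len(project) + 1:]
    project, _, method = stem.partition("_")   # previous behavior as fallback
    return project, method

assert parse_project_and_method("AI_READI_claudecode") == ("AI_READI", "claudecode")
```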

Features:
- Python script: scripts/generate_gc_approach_comparison.py
  - Loads evaluation results from rubric10 and rubric20
  - Generates comprehensive 16-row comparison (4 GCs × 4 approaches)
  - Calculates GC averages and overall method averages
  - Outputs both TSV and Markdown formats

- Makefile target: make gen-gc-approach-table
  - Runs generator script
  - Creates data/evaluation_llm/gc_approach_comparison.{tsv,md}

Output files:
- gc_approach_comparison.tsv: Tab-separated for analysis in Excel/R/Python
- gc_approach_comparison.md: Human-readable markdown table

Table format (30 rows):
- 4 rows per GC (one per approach) + 1 GC average = 5 rows × 4 GCs = 20 rows
- 4 overall method average rows
- 1 grand average row
- 5 blank separator rows
- Total: 30 rows
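
As a rough sketch (not the actual generate_gc_approach_comparison.py), the core of such a table is a pivot of the per-file scores by GC and approach; the column names below are assumptions about all_scores.csv.

```python
# Assumed columns in all_scores.csv: project (GC), method (approach), percent.
import pandas as pd

r10 = pd.read_csv("data/evaluation_llm/rubric10/all_scores.csv")
r20 = pd.read_csv("data/evaluation_llm/rubric20/all_scores.csv")

def pivot(df: pd.DataFrame) -> pd.DataFrame:
    # One row per GC, one column per approach, mean percent score per cell
    return df.pivot_table(index="project", columns="method",
                          values="percent", aggfunc="mean")

table = pivot(r10).join(pivot(r20), lsuffix="_r10", rsuffix="_r20")
table.to_csv("data/evaluation_llm/gc_approach_comparison.tsv", sep="\t")
```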

Usage:
  make gen-gc-approach-table

Results show:
- Best approach: claudecode_agent (74.0% R10, 62.5% R20)
- Best GC×Approach: AI_READI × claudecode_assistant (78.0% R10)
- Weakest: VOICE × gpt5 (2.0% R10, 0.0% R20)

New targets:
- evaluate-rubric10-all: Run Rubric10 hybrid evaluation on all 127 D4D files
- evaluate-rubric20-all: Run Rubric20 hybrid evaluation on all 127 D4D files
- evaluate-rubrics-all: Run both rubric evaluations
- evaluate-and-report: Complete pipeline (evaluate + generate comparison table)

Each target:
1. Runs batch_evaluate_rubric*_hybrid.py on all projects/methods
2. Generates summary reports with summarize_rubric*_results.py
3. Creates all output files in data/evaluation_llm/

Complete workflow:
  make evaluate-and-report

Or run individual steps:
  make evaluate-rubric10-all   # Just Rubric10
  make evaluate-rubric20-all   # Just Rubric20
  make gen-gc-approach-table   # Just generate table from existing results

This commit adds conversational Claude Code agents that perform semantic
quality evaluation of D4D datasheets with enhanced analysis capabilities.

New Files:
- .claude/agents/d4d-rubric10-semantic.md (576 lines)
  * Extends rubric10 with semantic analysis, correctness validation, consistency checking
  * DOI/grant number format validation with prefix plausibility
  * Cross-field consistency checks (human subjects → IRB approval)
  * Content accuracy assessment (ethics claims, deidentification methods)

- .claude/agents/d4d-rubric20-semantic.md (635 lines)
  * Extends rubric20 with same semantic enhancements
  * Adapted for 20-question format with 0-5 scoring scale
  * FAIR principle alignment checking
  * Category-specific semantic analysis

Updated Files:
- notes/RUBRIC_AGENT_USAGE.md (+264 lines)
  * Added comprehensive Semantic Evaluation Agents section
  * Usage examples comparing standard vs semantic evaluation
  * Detailed capability descriptions and when to use each agent
  * Example outputs showing semantic issue detection

Key Features:
✅ Semantic Understanding - Checks if content matches expected meaning
✅ Correctness Validation - DOI/grant number/RRID format + plausibility
✅ Consistency Checking - Cross-field logical relationships
✅ Content Accuracy - Ethics claims, funding patterns, temporal logic

Usage:
- "Evaluate [file] with rubric10-semantic"
- "Run semantic FAIR compliance check using rubric20-semantic"

Implementation:
- Conversational agents (no API calls required)
- Temperature: 0.0 (deterministic)
- Model: claude-sonnet-4-5-20250929
- Enhanced JSON output with semantic_analysis section

🤖 Generated with Claude Code

Update all four rubric agent definitions to specify D4D_Evaluation_Summary
schema compliance for batch evaluation outputs:

- .claude/agents/d4d-rubric10.md: Add batch summary section with
  element_performance structure (10 elements, max 50 points)
- .claude/agents/d4d-rubric20.md: Add batch summary section with
  category_performance structure (4 categories, max 84 points)
- .claude/agents/d4d-rubric10-semantic.md: Add batch summary with
  semantic_analysis_summary (issue tracking, consistency checks)
- .claude/agents/d4d-rubric20-semantic.md: Add batch summary with
  semantic_analysis_summary by category

All agents now specify:
- evaluation_summary.yaml conforming to EvaluationSummary class
- Required fields: overall_performance, method_comparison,
  project_comparison, element/category_performance
- Semantic agents include semantic_analysis_summary with issue
  breakdown, common consistency/correctness issues, and insights
- Additional outputs: all_scores.csv, summary_report.md

References: src/data_sheets_schema/schema/D4D_Evaluation_Summary.yaml
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Generated summary reports for the rubric10-semantic evaluation of all 16
concatenated D4D files (15 successfully processed, 1 format incompatible).

Files added:
- data/evaluation_llm/rubric10_semantic/all_scores.csv - CSV with all scores
- data/evaluation_llm/rubric10_semantic/summary_report.md - Human-readable report
- scripts/generate_rubric10_semantic_summary_simple.py - Summary generator

Key findings:
- Average score: 28.2/50 (56.4%)
- Best performer: VOICE claudecode_agent - 46/50 (92.0%)
- Worst performer: CHORUS gpt5 - 7/50 (14.0%)
- By method: claudecode_agent leads with 66.7% average
- By project: VOICE scores highest with 75.0% average

Note: Simplified summary due to varying JSON output formats from different
evaluation agents. For detailed semantic analysis, refer to individual
evaluation JSON files in data/evaluation_llm/rubric10_semantic/concatenated/.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 54 out of 346 changed files in this pull request and generated 1 comment.



Copilot AI (Contributor) commented Dec 9, 2025

@realmarcin I've opened a new pull request, #102, to work on those changes. Once the pull request is ready, I'll request review from you.

```diff
@@ -76,6 +76,7 @@ classes:
     resources:
       range: Dataset
       multivalued: true
+      inlined_as_list: true
```

A contributor commented:

It may be feasible to just define resources as a slot and then include it as a slot wherever it's used, rather than having two different attributes with the same name

realmarcin and others added 11 commits December 9, 2025 17:48
Added comprehensive D4D YAML files generated using the claudecode_agent method:
- AI_READI_d4d.yaml (31 KB) - Score: 31/50 (62.0%)
- CHORUS_d4d.yaml (18 KB) - Score: 26/50 (52.0%)
- CM4AI_d4d.yaml (37 KB) - Score: 28/50 (56.0%)
- VOICE_d4d.yaml (68 KB) - Score: 46/50 (92.0%) - Best performer

These files were evaluated using rubric10-semantic and represent the
highest-scoring generation method (66.7% average) in the evaluation.

Generated: December 6, 2024 (VOICE updated December 6, 2024)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Generated human-readable HTML documentation for all 4 claudecode_agent
concatenated D4D files:

- AI_READI_d4d_human_readable.html (40 KB)
- CHORUS_d4d_human_readable.html (37 KB)
- CM4AI_d4d_human_readable.html (44 KB)
- VOICE_d4d_human_readable.html (59 KB)

Added rendering script:
- scripts/render_claudecode_agent_html.py

These HTML files provide accessible, human-readable documentation of the
D4D metadata for each project using the HumanReadableRenderer template.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added detailed JSON evaluation results for all 16 concatenated D4D files
(4 projects × 4 generation methods) using rubric10-semantic rubric.

Includes comprehensive semantic analysis with:
- Element-by-element scoring (10 elements, 50 sub-elements)
- Identifier validation (DOI, grant numbers, IRB, URLs)
- Consistency checking across related fields
- Completeness gap analysis (high/medium/low priority)
- Strengths and weaknesses documentation
- Actionable recommendations

Files: 16 evaluation JSON files (15 successful, 1 format incompatible)
Total size: ~450 KB of detailed semantic analysis

Evaluation conducted: December 9, 2024

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Generated human-readable HTML evaluation reports for all 4 claudecode_agent
concatenated D4D files:

- AI_READI_evaluation.html (60 KB) - Score: 31/50 (62.0%)
- CHORUS_evaluation.html (15 KB) - Score: 26/50 (52.0%)
- CM4AI_evaluation.html (11 KB) - Score: 28/50 (56.0%)
- VOICE_evaluation.html (12 KB) - Score: 46/50 (92.0%) - Best performer

Added rendering script:
- scripts/render_evaluation_html.py

HTML files include:
- Overall scores and metadata
- Element-by-element scoring with rationales
- Semantic analysis (identifier validation, consistency checks)
- Completeness gaps (high/medium/low priority)
- Strengths and weaknesses
- Actionable recommendations

Files are stored alongside the D4D documentation HTML in:
data/d4d_html/concatenated/claudecode_agent/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Updated LinkML schema modules with improvements:
- D4D_Base_import: Enhanced base classes and shared enums
- D4D_Collection: Improved collection process metadata
- D4D_Composition: Enhanced composition and structure fields
- D4D_Data_Governance: Updated governance and licensing
- D4D_Distribution: Improved distribution metadata
- D4D_Ethics: Enhanced ethics and human subjects fields
- D4D_Maintenance: Updated maintenance and versioning
- D4D_Motivation: Improved motivation and funding metadata
- D4D_Preprocessing: Enhanced preprocessing documentation
- D4D_Uses: Updated use cases and recommendations
- data_sheets_schema.yaml: Added new Dataset attributes
- data_sheets_schema_all.yaml: Regenerated merged schema

Regenerated artifacts:
- Python datamodel (data_sheets_schema.py)
- JSON-LD context
- JSON Schema
- OWL ontology

Total changes: 12 schema files, ~6,000 lines modified

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added comprehensive help sections for:
- Download & preprocess sources (new section)
- Interactive Claude Code approaches (Agent vs Assistant)
- Slash command documentation (/d4d-agent, /d4d-assistant)
- Enhanced extraction command descriptions

Updated prompt documentation:
- src/download/prompts/claudecode/README.md

Improves discoverability of D4D pipeline features and workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Updated all 4 claudecode method D4D files to align with enhanced schema:
- AI_READI_d4d.yaml
- CHORUS_d4d.yaml
- CM4AI_d4d.yaml
- VOICE_d4d.yaml

Removed obsolete file:
- CM4AI_d4d_regenerated.yaml

Changes reflect schema updates and improved metadata completeness.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added slash commands for D4D workflows:
- /d4d-agent - Parallel agent-based D4D generation
- /d4d-assistant - In-session assistant-based D4D generation
- /d4d-webfetch - Web-based D4D extraction
- README.md - Slash command documentation

Added documentation:
- notes/D4D_AGENT_GITHUB_UNIFICATION.md - Agent workflow guide
- notes/RUBRIC10_EVALUATION_GUIDE.md - Evaluation rubric guide
- scripts/generate_rubric10_semantic_summary.py - Evaluation summary generator

These tools support the D4D metadata extraction and evaluation workflows.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Added claudecode_assistant concatenated D4D files:
- AI_READI_d4d.yaml
- CHORUS_d4d.yaml
- CM4AI_d4d.yaml
- VOICE_d4d.yaml

Updated extraction reports:
- data/raw/organized_extraction_report.md
- data/raw/organized_extraction_summary.json

The claudecode_assistant method represents in-session synthesis
with direct user interaction during generation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Convert external_resources from duplicated class attribute definitions
to a shared LinkML slot following best practices.

Changes:
- Add external_resources slot to D4D_Base_import.yaml with dcterms:references URI
- Update ExternalResource class to use slots + slot_usage (range: string)
- Update Dataset class to use slots + slot_usage (range: ExternalResource)
- Remove duplicate attribute definition from Dataset

Benefits:
- Single source of truth for external_resources semantics
- Follows LinkML best practices for slot reuse
- Easier maintenance with centralized slot_uri and properties
- No functional changes - maintains backward compatibility

The recursive structure still works correctly:
- Dataset.external_resources → list of ExternalResource objects
- ExternalResource.external_resources → list of URL strings

All tests pass and existing D4D YAML files validate successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Convert resources from duplicated class attribute definitions to a
shared LinkML slot, following the same pattern as external_resources.

Changes:
- Add resources slot to D4D_Base_import.yaml with range: Dataset
- Update DatasetCollection class to use slots + slot_usage
- Update Dataset class to use slots + slot_usage with nested description
- Remove duplicate attribute definitions

Benefits:
- Eliminates "Ambiguous attribute: resources" OWL generation warnings
- Single source of truth for resources semantics
- Follows LinkML best practices for slot reuse
- Consistent pattern with external_resources refactoring
- No functional changes - maintains backward compatibility

The recursive structure still works correctly:
- DatasetCollection.resources → list of Dataset objects
- Dataset.resources → list of nested Dataset objects (sub-resources)

All tests pass with no ambiguous attribute warnings.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
realmarcin requested a review from Copilot on December 10, 2025 at 07:55

Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 52 out of 544 changed files in this pull request and generated no new comments.



realmarcin merged commit 02021d0 into main on Dec 10, 2025 (3 checks passed).
realmarcin deleted the prompt-explore branch on December 10, 2025 at 07:59.
realmarcin restored the prompt-explore branch on December 10, 2025 at 20:25.