
Extract #46

Merged
realmarcin merged 23 commits into main from extract
Nov 1, 2025

Conversation

@realmarcin
Collaborator

No description provided.

realmarcin and others added 5 commits September 8, 2025 22:09
- Implemented validated D4D wrapper with comprehensive validation
- Added download success and project relevance validation
- Integrated OpenAI o3-mini model for automated D4D YAML generation
- Created file format preference system (txt > html > pdf > json)
- Added organized dataset extractor for Google Sheets processing
- Generated 8 complete D4D metadata files across 4 project columns
- Added dependencies: pyyaml, requests for data processing
- Created manual Claude Pro/Max workflow alternative
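The file format preference (txt > html > pdf > json) described above could be sketched as follows. This is a hypothetical helper for illustration, not the wrapper's actual code:

```python
# Illustrative sketch of the txt > html > pdf > json preference order;
# names are hypothetical, not taken from validated_d4d_wrapper.py.
FORMAT_PREFERENCE = ["txt", "html", "pdf", "json"]

def pick_preferred_file(paths):
    """Return the path whose extension ranks highest in FORMAT_PREFERENCE."""
    def rank(path):
        ext = path.rsplit(".", 1)[-1].lower()
        # Unknown extensions sort after all known ones
        return FORMAT_PREFERENCE.index(ext) if ext in FORMAT_PREFERENCE else len(FORMAT_PREFERENCE)
    return min(paths, key=rank) if paths else None

print(pick_preferred_file(["doc.pdf", "doc.html", "doc.json"]))  # doc.html
```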

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Generated D4D YAML files for all valid input documents using GPT-5
- Added comprehensive metadata files for each extraction with full provenance
- Updated validated_d4d_wrapper.py with metadata generation capabilities
- Includes input file details, schema versions, model info, validation results
- All outputs organized by project columns (AI_READI, CHORUS, CM4AI, VOICE)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Regenerated fairhub_d4d.yaml and physionet_b2ai-voice_1.1_d4d.yaml, which had failed due to YAML syntax errors
- Added corresponding metadata files for both fixed extractions
- Created fix_failed_extractions.py script with improved YAML validation
- All D4D files now have complete metadata coverage (8/8 files with metadata)

Fixed files:
- AI_READI/fairhub_d4d.yaml + metadata
- VOICE/physionet_b2ai-voice_1.1_d4d.yaml + metadata

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@realmarcin realmarcin requested a review from Copilot September 17, 2025 01:42
Contributor

Copilot AI left a comment


Pull Request Overview

This pull request extracts a comprehensive set of data download and processing tools for handling datasets from Google Sheets and various data repositories. The main purpose is to create a robust pipeline for downloading, validating, and processing dataset metadata from multiple sources.

Key changes include:

  • Implementation of validated D4D wrapper with comprehensive download and content validation
  • Creation of standalone processors for different use cases (Claude Pro/Max, API-based)
  • Addition of organized dataset extraction that maintains spreadsheet column structure

Reviewed Changes

Copilot reviewed 54 out of 58 changed files in this pull request and generated no comments.

Summary per file:

- src/download/validated_d4d_wrapper.py: Main validated wrapper with download success validation, project relevance checking, and D4D YAML generation
- src/download/standalone_d4d_wrapper.py: Simplified standalone processor using pydantic-ai without aurelian dependencies
- src/download/organized_dataset_extractor.py: Column-aware dataset extractor that organizes downloads by spreadsheet headers
- src/download/enhanced_sheet_downloader.py: Enhanced downloader with detailed analysis and file type detection
- src/download/claude_max_d4d_processor.py: Manual workflow processor for Claude Pro/Max users
- pyproject.toml: Added required dependencies for requests, pydantic-ai, beautifulsoup4, and pdfminer-six


realmarcin and others added 16 commits October 28, 2025 12:20
- Document validated D4D wrapper (recommended)
- Document basic D4D wrapper
- Add test script usage
- Include requirements and script locations
- Update CLAUDE.md with project context

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add `make test-modules` to validate all individual D4D module schemas
- Add `make lint-modules` to lint all individual D4D module schemas
- Update help menu with new commands
- Update CLAUDE.md with module validation workflow
- Document D4D Agent scripts and custom Makefile targets in CLAUDE.md

These targets enable faster validation during module development
without requiring full schema regeneration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…rings in HTML

Schema and Python Code:
- Confirmed Python datamodel uses None as default for optional fields
- Schema uses default_range: string with no ifabsent rules
- YAML data files use null or omit fields for missing values

HTML Rendering Updates:
- Updated src/html/human_readable_renderer.py to convert None → ""
  - format_value(): Added None check at top level
  - _format_string(): Returns "" instead of "Not specified"
  - _format_dict/list/table methods: Return "" instead of placeholder messages
  - All table formatters return empty strings for null values
- Updated src/renderer/yaml_renderer.py for consistent None → "" conversion
  - Added clarifying comments for HTML and PDF rendering

Documentation:
- Added comprehensive "Null/Empty Value Handling" section to CLAUDE.md
- Documented the pattern: None in schema/Python, empty strings in HTML
- Provided examples showing YAML → HTML conversion behavior

This ensures clean HTML output without "Not specified" or "-" placeholders
for missing values.
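The None-to-empty-string pattern described above might look roughly like this. This is an illustrative sketch of the behavior, not the renderer's exact code:

```python
def format_value(value):
    """Render a value for HTML output, mapping missing values to ""
    instead of placeholder text like "Not specified" or "-"."""
    if value is None:
        return ""
    if isinstance(value, list):
        return ", ".join(format_value(v) for v in value if v is not None)
    if isinstance(value, dict):
        return "; ".join(f"{k}: {format_value(v)}" for k, v in value.items() if v is not None)
    return str(value)

print(format_value(None))              # prints an empty line
print(format_value(["a", None, "b"]))  # a, b
```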

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Script Features (src/download/concatenate_documents.py):
- Concatenates documents from a directory in reproducible alphabetical order
- Includes table of contents with all file paths
- Adds headers with filename, path, and size for each document
- Handles multiple encodings (UTF-8, UTF-8-sig, latin-1)
- Supports filtering by file extension
- Supports recursive directory search
- Customizable separators between documents
- Optional headers and summary sections

Makefile Targets:
- make concat-docs INPUT_DIR=dir OUTPUT_FILE=file
  Generic concatenation with required parameters
  Optional: EXTENSIONS=".txt .md" RECURSIVE=true

- make concat-extracted
  Concatenates D4D YAML files from data/extracted_by_column/
  Creates one file per project column (AI_READI, CHORUS, CM4AI, VOICE)
  Output: data/concatenated/{column}_d4d.txt

- make concat-downloads
  Concatenates raw downloads from downloads_by_column/
  Creates one file per project column
  Output: data/concatenated/{column}_raw.txt

Documentation:
- Added comprehensive "Document Concatenation" section to CLAUDE.md
- Documented all command options and use cases
- Updated help menu in Makefile

Use Cases:
- Combine all downloaded dataset documentation for a project
- Create single input documents for LLM processing
- Merge documentation fragments into complete documents
- Aggregate logs or reports from multiple files

Tested with data/extracted_by_column/:
- AI_READI: 6 files → 18K
- CHORUS: 2 files → 4.0K
- CM4AI: 4 files → 6.4K
- VOICE: 4 files → 19K

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Script Features (src/download/process_concatenated_d4d.py):
- Processes concatenated documents created by concatenate_documents.py
- Uses aurelian's D4D agent to synthesize multiple D4D YAML files
- Intelligently merges complementary information from all sources
- Prefers detailed/specific information over generic content
- Keeps the most comprehensive descriptions
- Combines all relevant metadata sections
- Supports both single file and batch directory processing

Makefile Targets:
- make process-concat INPUT_FILE=data/concatenated/AI_READI_d4d.txt
  Process a single concatenated file with D4D agent
  Optional: OUTPUT_FILE=path/to/output.yaml MODEL=model-name

- make process-all-concat
  Process all concatenated files in data/concatenated/
  Creates synthesized D4D YAML files in data/synthesized/
  Output: data/synthesized/{PROJECT}_synthesized.yaml

Advanced System Prompt:
- Specialized for synthesizing multiple D4D YAML files
- Merges information from different documentation sources
- Creates comprehensive dataset documentation
- Handles metadata from concatenated files

Workflow Integration:
1. Download → Individual files per source
2. Extract → validated_d4d_wrapper.py generates D4D YAML per file
3. Concatenate → make concat-extracted merges files by project
4. Synthesize → make process-all-concat creates comprehensive D4D YAML

Documentation:
- Added "Process Concatenated Documents" section to CLAUDE.md
- Documented typical workflow and use cases
- Included script usage examples with uv run
- Updated help menu in Makefile

Requirements:
- Run from aurelian/ directory using: uv run python ../src/download/...
- Set ANTHROPIC_API_KEY environment variable
- Uses Claude 3.5 Sonnet by default
- Can specify alternative models with -m/--model flag

Output Structure:
- data/extracted_by_column/{PROJECT}/ - Individual D4D YAML files
- data/concatenated/{PROJECT}_d4d.txt - Concatenated documents
- data/synthesized/{PROJECT}_synthesized.yaml - Comprehensive synthesized D4D

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Changed from poetry run to cd aurelian && uv run
- Script now runs in correct aurelian environment with all dependencies
- Added check for aurelian directory existence
- Fixed relative paths for input/output files
- Both process-concat and process-all-concat targets updated

This fixes ModuleNotFoundError issues when running D4D agent on
concatenated documents.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Changed default model from Claude 3.5 Sonnet to OpenAI GPT-5
- Updated API key checks to support both OPENAI_API_KEY and ANTHROPIC_API_KEY
- Script now checks for appropriate API key based on model prefix
- Updated documentation to reflect GPT-5 as default

API Key Logic:
- openai: models require OPENAI_API_KEY
- anthropic: models require ANTHROPIC_API_KEY
- Other models check for either key
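The prefix-based key check described above could be sketched like this (hypothetical code mirroring the described behavior, not the script itself):

```python
import os

def check_api_key(model):
    """Verify an API key matching the model's provider prefix is set.
    Mirrors the described rules: openai: -> OPENAI_API_KEY,
    anthropic: -> ANTHROPIC_API_KEY, otherwise either key suffices."""
    if model.startswith("openai:"):
        required = ["OPENAI_API_KEY"]
    elif model.startswith("anthropic:"):
        required = ["ANTHROPIC_API_KEY"]
    else:
        required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
    if not any(os.environ.get(k) for k in required):
        raise SystemExit(f"Set one of {required} to use model {model!r}")
    return True
```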

Testing:
- Successfully processed AI_READI concatenated document with GPT-5
- Agent completed in 64.4s
- Generated valid 9.3K YAML with comprehensive synthesized metadata

Updated CLAUDE.md:
- Concatenated processor uses GPT-5 by default
- Both API keys documented

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Issue: VOICE processing was failing due to unquoted colons in string values
(e.g., "title: Bridge2AI-Voice: An ethically-sourced...")

Fixes:
1. Added automatic YAML fixing for unquoted strings containing colons
   - Detects lines like "field: value: with colon"
   - Automatically quotes such values to make valid YAML
   - Regex-based line-by-line processing

2. Enhanced error handling and debugging:
   - Added detailed progress messages at each step
   - Response length logging
   - Empty response detection
   - None value detection after parsing
   - Debug file output for invalid YAML
   - Full traceback printing for exceptions

3. Better validation:
   - Check for empty yaml_content before validation
   - Check for None after safe_load
   - Save invalid YAML to _debug.txt for troubleshooting
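The regex-based line-by-line quoting fix might look roughly like this. A sketch of the described approach, not the script's exact code:

```python
import re

def quote_colon_values(yaml_text):
    """Quote unquoted scalar values containing a colon, e.g.
    'title: Bridge2AI-Voice: An ethically-sourced...' becomes
    'title: "Bridge2AI-Voice: An ethically-sourced..."'."""
    fixed = []
    for line in yaml_text.splitlines():
        # Match "key: value" where value holds an extra colon and is
        # not already quoted or a flow collection
        m = re.match(r"^(\s*[\w-]+):\s+([^\"'\[{#\n][^#\n]*:[^#\n]*)$", line)
        if m:
            key, value = m.groups()
            value = value.strip().replace('"', '\\"')
            line = f'{key}: "{value}"'
        fixed.append(line)
    return "\n".join(fixed)
```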

Results:
✅ AI_READI: 9.0K YAML (64.4s)
✅ CHORUS: 805B YAML (37.3s)
✅ CM4AI: 1.9K YAML (39.3s)
✅ VOICE: 12K YAML (192.1s) - NOW WORKING!

All 4 concatenated documents successfully synthesized into comprehensive
D4D YAML metadata using GPT-5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds comprehensive project-level D4D datasheets synthesized from
multiple documents using GPT-5, along with GitHub Pages integration to make
them accessible alongside existing individual dataset datasheets.

Key additions:
1. Concatenated documents by project (AI-READI, CHORUS, CM4AI, VOICE)
2. GPT-5 synthesized D4D YAML files for each project
3. Human-readable HTML renderings of synthesized datasheets
4. GitHub Pages documentation page linking to all D4D examples
5. Updated Makefile to include HTML files in documentation builds

New files:
- src/html/render_concatenated.py: Script to render synthesized YAML to HTML
- src/docs/d4d_examples.md: GitHub Pages documentation for D4D examples
- data/concatenated/*.txt: Concatenated project documents
- data/concatenated/*_synthesized.yaml: GPT-5 synthesized D4D metadata
- src/html/output/concatenated/*_synthesized.html: Rendered HTML datasheets

Modified files:
- Makefile: Updated gendoc target to copy HTML files to docs/
- mkdocs.yml: Added D4D Examples page to navigation

The synthesized datasheets provide comprehensive project-level metadata by
combining information from multiple sources, offering a complementary view
to the individual dataset datasheets.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Renamed the three main D4D datasheet files in data/sheets/ from .txt to .yaml
to properly reflect their YAML format. This improves clarity and ensures
correct file type association.

Changes:
- Renamed D4D_-_AI-READI_FAIRHub_v3.txt → .yaml
- Renamed D4D_-_CM4AI_Dataverse_v3.txt → .yaml
- Renamed D4D_-_VOICE_PhysioNet_v3.txt → .yaml
- Updated process_text_files.py to reference .yaml files
- Updated src/html/README.md documentation
- Regenerated all HTML output files to ensure consistency

All HTML rendering scripts tested and working correctly with the renamed files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…son)

Renamed structured data files in data/GC_data_sheets/output/ to have correct
file extensions based on their actual format. This improves clarity and enables
proper syntax highlighting and file type association.

Changes:
- 3 files renamed to .yaml (YAML format)
  - D4D - AI-READI FAIRHub_raw.yaml
  - D4D - CM4AI_raw.yaml
  - D4D - Voice Health Data Nexus_raw.yaml

- 8 files renamed to .json (JSON format)
  - D4D - Minimal - CM4AI RO-Crate_raw.json
  - D4D Collection - AI-READI FAIRHub_raw.json
  - D4D Human - Voice - Health Data Nexus_raw.json
  - D4D Metadata - CM4AI biorxiv_raw.json
  - D4D Minimal - AI-READI FairHub_raw.json
  - D4D Minimal - Voice - Health Data Nexus_raw.json
  - D4D Minimal - Voice Physionet_raw.json
  - D4D Minimal -- CM4AI Dataverse beta_raw.json

Note: Files containing prompts/instructions remain as .txt files.
Non-raw files (formatted reports) also remain as .txt.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Renamed data/concatenated → data/sheets_concatenated and updated all file
naming conventions for better clarity and consistency across the project.

Directory changes:
- data/concatenated/ → data/sheets_concatenated/

File naming changes:
- *_d4d.txt → *_d4d_concatenated.txt (concatenated input files)
- *_synthesized.yaml → *_alldocs.yaml (synthesized from all docs)
- *_synthesized.html → *_alldocs.html (HTML renderings)

Updated references in:
- src/html/render_concatenated.py
- src/download/process_concatenated_d4d.py
- project.Makefile (all concat and process targets)
- CLAUDE.md (documentation and examples)
- src/docs/d4d_examples.md (renamed "Synthesized" → "All-Documents")

The new naming better reflects the purpose:
- "_concatenated.txt" clearly indicates these are merged input files
- "_alldocs.yaml" indicates these are synthesized from all available documents
- "sheets_concatenated" groups these files with other data sheets

All Makefile targets and scripts updated to use new paths and naming.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Removed all incomplete/intermediate .txt report files and renamed structured
data files to have clean extensions without _raw suffix.

Changes:
- Deleted 21 .txt files (incomplete reports and LLM prompts)
- Renamed 3 _raw.yaml → .yaml (complete D4D datasheets)
- Renamed 8 _raw.json → .json (D4D subset datasheets)

Final output directory contains only clean structured data files:
- 3 YAML files: Full D4D datasheets for AI-READI, CM4AI, VOICE
- 8 JSON files: D4D subsets (Collection, Minimal, Metadata, Human)

All files are now schema-compliant structured data without redundant suffixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit resolves YAML parsing failures in the D4D extraction pipeline
by improving the handling of unquoted strings containing colons (in titles,
URLs, dates, etc.) and refining download validation to avoid false positives.

Changes:
- Enhanced YAML fixing logic to properly quote all values containing colons
- Fixed overly aggressive "not found" validation that flagged legitimate documentation
- Added debug output for YAML validation failures
- Successfully processed previously failed dataverse_10.18130_V3_B35XWX dataset

Results:
- 9 total D4D extractions now successful (was 6, added 3 fixes)
- 1 remaining failure is a genuine Google Docs auth issue (not processable)
- All YAML parsing issues resolved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Updated aurelian submodule to commit 9f61790 on bridge2ai-d4d-extraction branch,
which includes D4D extraction tools, test files, and examples developed for
the Bridge2AI data-sheets-schema project.

Tools added to aurelian:
- Sheet downloader for metadata collection
- Batch D4D extraction with GPT-5
- D4D agent demo and test scripts
- Example D4D outputs (Iris, MIMIC-CXR)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit completes per-document D4D extraction for all processable input
documents and corrects naming conventions for concatenated input files.

Changes:

1. Fixed file selection logic to process JSON files separately
   - JSON files with different row numbers now generate separate D4D outputs
   - Previously, doi_row2.json and doi_row9.json were grouped together

2. Generated missing per-document D4D extractions:
   - AI_READI/doi_row9_d4d.yaml (Zenodo dataset)
   - AI_READI/fairhub_row10_d4d.yaml
   - CM4AI/doi_row3_d4d.yaml
   - VOICE/healthnexus_row13_d4d.yaml

3. Renamed row-specific D4D files for clarity:
   - doi_d4d.yaml → doi_row2_d4d.yaml (AI_READI)
   - fairhub_d4d.yaml → fairhub_row10/13_d4d.yaml (AI_READI)
   - doi_d4d.yaml → doi_row3_d4d.yaml (CM4AI)
   - healthnexus_d4d.yaml → healthnexus_row13_d4d.yaml (VOICE)

4. Fixed concatenated input file naming:
   - Renamed *_d4d_concatenated.txt → *_concatenated.txt
   - These are INPUT files (concatenated source documents), not D4D outputs
   - Updated Makefile and process_concatenated_d4d.py to use new naming

5. Removed old debug file and duplicate non-row-specific files

Results:
- 11 total per-document D4D extractions (was 9, added 2 missing)
- All processable input documents now have corresponding D4D records
- Clear naming distinction between inputs (*_concatenated.txt) and outputs (*_alldocs.yaml)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@monicacecilia monicacecilia self-requested a review October 31, 2025 15:59
Distinguish between two D4D generation approaches:
1. Automated LLM API Agents (batch processing)
2. Interactive Coding Agents (human oversight)

README.md:
- Add comprehensive section on D4D Metadata Generation
- Document Approach 1: API agents (wrappers, Aurelian library)
- Document Approach 2: Interactive agents (Claude Code, etc.)
- Add comparison table for when to use each approach
- Include examples for both approaches
- Define generation metadata standards

validated_d4d_wrapper.py:
- Enhance metadata header with generator information
- Add ISO 8601 timestamp format
- Include schema URL in header
- Add method (automated) to metadata
- Improve traceability of generated files

aurelian submodule:
- Update to include HTML/JSON file support
- Prompt duplication elimination
- Enhanced D4D agent capabilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@monicacecilia

This comment was marked as duplicate.

Contributor

@monicacecilia monicacecilia left a comment


@realmarcin, I think merging #45 created new conflicts that weren't there earlier today. Alerting you to review before merging.

Resolved conflicts by merging dependencies from both branches:
- Kept D4D extraction dependencies (requests, pydantic-ai, beautifulsoup4, pdfminer-six)
- Kept Synapse integration dependencies (pandas, synapseclient)
- Regenerated poetry.lock with all dependencies

Files merged:
- pyproject.toml: Combined dependencies from both branches
- poetry.lock: Regenerated to include all dependencies

New files from main:
- .github/workflows/d4d_to_synapse_table.yml
- utils/d4d_to_synapse_table.py
@realmarcin realmarcin merged commit 0982496 into main Nov 1, 2025
2 checks passed
@realmarcin realmarcin deleted the extract branch November 1, 2025 00:14