
Extract #46

Merged
realmarcin merged 23 commits into main from extract
Nov 1, 2025

Conversation

@realmarcin
Collaborator

No description provided.

realmarcin and others added 5 commits September 8, 2025 22:09
- Implemented validated D4D wrapper with comprehensive validation
- Added download success and project relevance validation
- Integrated OpenAI o3-mini model for automated D4D YAML generation
- Created file format preference system (txt > html > pdf > json)
- Added organized dataset extractor for Google Sheets processing
- Generated 8 complete D4D metadata files across 4 project columns
- Added dependencies: pyyaml, requests for data processing
- Created manual Claude Pro/Max workflow alternative
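The file format preference (txt > html > pdf > json) described above could be sketched as follows. This is a hypothetical helper for illustration, not the wrapper's actual code:

```python
# Illustrative sketch of the txt > html > pdf > json preference order;
# names are hypothetical, not taken from validated_d4d_wrapper.py.
FORMAT_PREFERENCE = ["txt", "html", "pdf", "json"]

def pick_preferred_file(paths):
    """Return the path whose extension ranks highest in FORMAT_PREFERENCE."""
    def rank(path):
        ext = path.rsplit(".", 1)[-1].lower()
        # Unknown extensions sort after all known ones
        return FORMAT_PREFERENCE.index(ext) if ext in FORMAT_PREFERENCE else len(FORMAT_PREFERENCE)
    return min(paths, key=rank) if paths else None

print(pick_preferred_file(["doc.pdf", "doc.html", "doc.json"]))  # doc.html
```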

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Generated D4D YAML files for all valid input documents using GPT-5
- Added comprehensive metadata files for each extraction with full provenance
- Updated validated_d4d_wrapper.py with metadata generation capabilities
- Includes input file details, schema versions, model info, validation results
- All outputs organized by project columns (AI_READI, CHORUS, CM4AI, VOICE)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Regenerated fairhub_d4d.yaml and physionet_b2ai-voice_1.1_d4d.yaml, which had failed due to YAML syntax errors
- Added corresponding metadata files for both fixed extractions
- Created fix_failed_extractions.py script with improved YAML validation
- All D4D files now have complete metadata coverage (8/8 files with metadata)

Fixed files:
- AI_READI/fairhub_d4d.yaml + metadata
- VOICE/physionet_b2ai-voice_1.1_d4d.yaml + metadata

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@realmarcin realmarcin requested a review from Copilot September 17, 2025 01:42
Contributor

Copilot AI left a comment


Pull Request Overview

This pull request extracts a comprehensive set of data download and processing tools for handling datasets from Google Sheets and various data repositories. The main purpose is to create a robust pipeline for downloading, validating, and processing dataset metadata from multiple sources.

Key changes include:

  • Implementation of validated D4D wrapper with comprehensive download and content validation
  • Creation of standalone processors for different use cases (Claude Pro/Max, API-based)
  • Addition of organized dataset extraction that maintains spreadsheet column structure

Reviewed Changes

Copilot reviewed 54 out of 58 changed files in this pull request and generated no comments.

Summary per file:

- src/download/validated_d4d_wrapper.py: Main validated wrapper with download success validation, project relevance checking, and D4D YAML generation
- src/download/standalone_d4d_wrapper.py: Simplified standalone processor using pydantic-ai without aurelian dependencies
- src/download/organized_dataset_extractor.py: Column-aware dataset extractor that organizes downloads by spreadsheet headers
- src/download/enhanced_sheet_downloader.py: Enhanced downloader with detailed analysis and file type detection
- src/download/claude_max_d4d_processor.py: Manual workflow processor for Claude Pro/Max users
- pyproject.toml: Added required dependencies for requests, pydantic-ai, beautifulsoup4, and pdfminer-six


realmarcin and others added 16 commits October 28, 2025 12:20
- Document validated D4D wrapper (recommended)
- Document basic D4D wrapper
- Add test script usage
- Include requirements and script locations
- Update CLAUDE.md with project context

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add `make test-modules` to validate all individual D4D module schemas
- Add `make lint-modules` to lint all individual D4D module schemas
- Update help menu with new commands
- Update CLAUDE.md with module validation workflow
- Document D4D Agent scripts and custom Makefile targets in CLAUDE.md

These targets enable faster validation during module development
without requiring full schema regeneration.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…rings in HTML

Schema and Python Code:
- Confirmed Python datamodel uses None as default for optional fields
- Schema uses default_range: string with no ifabsent rules
- YAML data files use null or omit fields for missing values

HTML Rendering Updates:
- Updated src/html/human_readable_renderer.py to convert None → ""
  - format_value(): Added None check at top level
  - _format_string(): Returns "" instead of "Not specified"
  - _format_dict/list/table methods: Return "" instead of placeholder messages
  - All table formatters return empty strings for null values
- Updated src/renderer/yaml_renderer.py for consistent None → "" conversion
  - Added clarifying comments for HTML and PDF rendering

Documentation:
- Added comprehensive "Null/Empty Value Handling" section to CLAUDE.md
- Documented the pattern: None in schema/Python, empty strings in HTML
- Provided examples showing YAML → HTML conversion behavior

This ensures clean HTML output without "Not specified" or "-" placeholders
for missing values.
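The None-to-empty-string pattern described above might look roughly like this. This is an illustrative sketch of the behavior, not the renderer's exact code:

```python
def format_value(value):
    """Render a value for HTML output, mapping missing values to ""
    instead of placeholder text like "Not specified" or "-"."""
    if value is None:
        return ""
    if isinstance(value, list):
        return ", ".join(format_value(v) for v in value if v is not None)
    if isinstance(value, dict):
        return "; ".join(f"{k}: {format_value(v)}" for k, v in value.items() if v is not None)
    return str(value)

print(format_value(None))              # prints an empty line
print(format_value(["a", None, "b"]))  # a, b
```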

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Script Features (src/download/concatenate_documents.py):
- Concatenates documents from a directory in reproducible alphabetical order
- Includes table of contents with all file paths
- Adds headers with filename, path, and size for each document
- Handles multiple encodings (UTF-8, UTF-8-sig, latin-1)
- Supports filtering by file extension
- Supports recursive directory search
- Customizable separators between documents
- Optional headers and summary sections

Makefile Targets:
- make concat-docs INPUT_DIR=dir OUTPUT_FILE=file
  Generic concatenation with required parameters
  Optional: EXTENSIONS=".txt .md" RECURSIVE=true

- make concat-extracted
  Concatenates D4D YAML files from data/extracted_by_column/
  Creates one file per project column (AI_READI, CHORUS, CM4AI, VOICE)
  Output: data/concatenated/{column}_d4d.txt

- make concat-downloads
  Concatenates raw downloads from downloads_by_column/
  Creates one file per project column
  Output: data/concatenated/{column}_raw.txt

Documentation:
- Added comprehensive "Document Concatenation" section to CLAUDE.md
- Documented all command options and use cases
- Updated help menu in Makefile

Use Cases:
- Combine all downloaded dataset documentation for a project
- Create single input documents for LLM processing
- Merge documentation fragments into complete documents
- Aggregate logs or reports from multiple files

Tested with data/extracted_by_column/:
- AI_READI: 6 files → 18K
- CHORUS: 2 files → 4.0K
- CM4AI: 4 files → 6.4K
- VOICE: 4 files → 19K

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Script Features (src/download/process_concatenated_d4d.py):
- Processes concatenated documents created by concatenate_documents.py
- Uses aurelian's D4D agent to synthesize multiple D4D YAML files
- Intelligently merges complementary information from all sources
- Prefers detailed/specific information over generic content
- Keeps the most comprehensive descriptions
- Combines all relevant metadata sections
- Supports both single file and batch directory processing

Makefile Targets:
- make process-concat INPUT_FILE=data/concatenated/AI_READI_d4d.txt
  Process a single concatenated file with D4D agent
  Optional: OUTPUT_FILE=path/to/output.yaml MODEL=model-name

- make process-all-concat
  Process all concatenated files in data/concatenated/
  Creates synthesized D4D YAML files in data/synthesized/
  Output: data/synthesized/{PROJECT}_synthesized.yaml

Advanced System Prompt:
- Specialized for synthesizing multiple D4D YAML files
- Merges information from different documentation sources
- Creates comprehensive dataset documentation
- Handles metadata from concatenated files

Workflow Integration:
1. Download → Individual files per source
2. Extract → validated_d4d_wrapper.py generates D4D YAML per file
3. Concatenate → make concat-extracted merges files by project
4. Synthesize → make process-all-concat creates comprehensive D4D YAML

Documentation:
- Added "Process Concatenated Documents" section to CLAUDE.md
- Documented typical workflow and use cases
- Included script usage examples with uv run
- Updated help menu in Makefile

Requirements:
- Run from aurelian/ directory using: uv run python ../src/download/...
- Set ANTHROPIC_API_KEY environment variable
- Uses Claude 3.5 Sonnet by default
- Can specify alternative models with -m/--model flag

Output Structure:
- data/extracted_by_column/{PROJECT}/ - Individual D4D YAML files
- data/concatenated/{PROJECT}_d4d.txt - Concatenated documents
- data/synthesized/{PROJECT}_synthesized.yaml - Comprehensive synthesized D4D

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Changed from poetry run to cd aurelian && uv run
- Script now runs in correct aurelian environment with all dependencies
- Added check for aurelian directory existence
- Fixed relative paths for input/output files
- Both process-concat and process-all-concat targets updated

This fixes ModuleNotFoundError issues when running D4D agent on
concatenated documents.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Changes:
- Changed default model from Claude 3.5 Sonnet to OpenAI GPT-5
- Updated API key checks to support both OPENAI_API_KEY and ANTHROPIC_API_KEY
- Script now checks for appropriate API key based on model prefix
- Updated documentation to reflect GPT-5 as default

API Key Logic:
- openai: models require OPENAI_API_KEY
- anthropic: models require ANTHROPIC_API_KEY
- Other models check for either key
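The prefix-based key check described above could be sketched like this (hypothetical code mirroring the described behavior, not the script itself):

```python
import os

def check_api_key(model):
    """Verify an API key matching the model's provider prefix is set.
    Mirrors the described rules: openai: -> OPENAI_API_KEY,
    anthropic: -> ANTHROPIC_API_KEY, otherwise either key suffices."""
    if model.startswith("openai:"):
        required = ["OPENAI_API_KEY"]
    elif model.startswith("anthropic:"):
        required = ["ANTHROPIC_API_KEY"]
    else:
        required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]
    if not any(os.environ.get(k) for k in required):
        raise SystemExit(f"Set one of {required} to use model {model!r}")
    return True
```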

Testing:
- Successfully processed AI_READI concatenated document with GPT-5
- Agent completed in 64.4s
- Generated valid 9.3K YAML with comprehensive synthesized metadata

Updated CLAUDE.md:
- Concatenated processor uses GPT-5 by default
- Both API keys documented

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Issue: VOICE processing was failing due to unquoted colons in string values
(e.g., "title: Bridge2AI-Voice: An ethically-sourced...")

Fixes:
1. Added automatic YAML fixing for unquoted strings containing colons
   - Detects lines like "field: value: with colon"
   - Automatically quotes such values to make valid YAML
   - Regex-based line-by-line processing

2. Enhanced error handling and debugging:
   - Added detailed progress messages at each step
   - Response length logging
   - Empty response detection
   - None value detection after parsing
   - Debug file output for invalid YAML
   - Full traceback printing for exceptions

3. Better validation:
   - Check for empty yaml_content before validation
   - Check for None after safe_load
   - Save invalid YAML to _debug.txt for troubleshooting
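The regex-based line-by-line quoting fix might look roughly like this. A sketch of the described approach, not the script's exact code:

```python
import re

def quote_colon_values(yaml_text):
    """Quote unquoted scalar values containing a colon, e.g.
    'title: Bridge2AI-Voice: An ethically-sourced...' becomes
    'title: "Bridge2AI-Voice: An ethically-sourced..."'."""
    fixed = []
    for line in yaml_text.splitlines():
        # Match "key: value" where value holds an extra colon and is
        # not already quoted or a flow collection
        m = re.match(r"^(\s*[\w-]+):\s+([^\"'\[{#\n][^#\n]*:[^#\n]*)$", line)
        if m:
            key, value = m.groups()
            value = value.strip().replace('"', '\\"')
            line = f'{key}: "{value}"'
        fixed.append(line)
    return "\n".join(fixed)
```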

Results:
✅ AI_READI: 9.0K YAML (64.4s)
✅ CHORUS: 805B YAML (37.3s)
✅ CM4AI: 1.9K YAML (39.3s)
✅ VOICE: 12K YAML (192.1s) - NOW WORKING!

All 4 concatenated documents successfully synthesized into comprehensive
D4D YAML metadata using GPT-5.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds comprehensive project-level D4D datasheets synthesized from
multiple documents using GPT-5, along with GitHub Pages integration to make
them accessible alongside existing individual dataset datasheets.

Key additions:
1. Concatenated documents by project (AI-READI, CHORUS, CM4AI, VOICE)
2. GPT-5 synthesized D4D YAML files for each project
3. Human-readable HTML renderings of synthesized datasheets
4. GitHub Pages documentation page linking to all D4D examples
5. Updated Makefile to include HTML files in documentation builds

New files:
- src/html/render_concatenated.py: Script to render synthesized YAML to HTML
- src/docs/d4d_examples.md: GitHub Pages documentation for D4D examples
- data/concatenated/*.txt: Concatenated project documents
- data/concatenated/*_synthesized.yaml: GPT-5 synthesized D4D metadata
- src/html/output/concatenated/*_synthesized.html: Rendered HTML datasheets

Modified files:
- Makefile: Updated gendoc target to copy HTML files to docs/
- mkdocs.yml: Added D4D Examples page to navigation

The synthesized datasheets provide comprehensive project-level metadata by
combining information from multiple sources, offering a complementary view
to the individual dataset datasheets.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Renamed the three main D4D datasheet files in data/sheets/ from .txt to .yaml
to properly reflect their YAML format. This improves clarity and ensures
correct file type association.

Changes:
- Renamed D4D_-_AI-READI_FAIRHub_v3.txt → .yaml
- Renamed D4D_-_CM4AI_Dataverse_v3.txt → .yaml
- Renamed D4D_-_VOICE_PhysioNet_v3.txt → .yaml
- Updated process_text_files.py to reference .yaml files
- Updated src/html/README.md documentation
- Regenerated all HTML output files to ensure consistency

All HTML rendering scripts tested and working correctly with the renamed files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…son)

Renamed structured data files in data/GC_data_sheets/output/ to have correct
file extensions based on their actual format. This improves clarity and enables
proper syntax highlighting and file type association.

Changes:
- 3 files renamed to .yaml (YAML format)
  - D4D - AI-READI FAIRHub_raw.yaml
  - D4D - CM4AI_raw.yaml
  - D4D - Voice Health Data Nexus_raw.yaml

- 8 files renamed to .json (JSON format)
  - D4D - Minimal - CM4AI RO-Crate_raw.json
  - D4D Collection - AI-READI FAIRHub_raw.json
  - D4D Human - Voice - Health Data Nexus_raw.json
  - D4D Metadata - CM4AI biorxiv_raw.json
  - D4D Minimal - AI-READI FairHub_raw.json
  - D4D Minimal - Voice - Health Data Nexus_raw.json
  - D4D Minimal - Voice Physionet_raw.json
  - D4D Minimal -- CM4AI Dataverse beta_raw.json

Note: Files containing prompts/instructions remain as .txt files.
Non-raw files (formatted reports) also remain as .txt.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Renamed data/concatenated → data/sheets_concatenated and updated all file
naming conventions for better clarity and consistency across the project.

Directory changes:
- data/concatenated/ → data/sheets_concatenated/

File naming changes:
- *_d4d.txt → *_d4d_concatenated.txt (concatenated input files)
- *_synthesized.yaml → *_alldocs.yaml (synthesized from all docs)
- *_synthesized.html → *_alldocs.html (HTML renderings)

Updated references in:
- src/html/render_concatenated.py
- src/download/process_concatenated_d4d.py
- project.Makefile (all concat and process targets)
- CLAUDE.md (documentation and examples)
- src/docs/d4d_examples.md (renamed "Synthesized" → "All-Documents")

The new naming better reflects the purpose:
- "_concatenated.txt" clearly indicates these are merged input files
- "_alldocs.yaml" indicates these are synthesized from all available documents
- "sheets_concatenated" groups these files with other data sheets

All Makefile targets and scripts updated to use new paths and naming.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Removed all incomplete/intermediate .txt report files and renamed structured
data files to have clean extensions without _raw suffix.

Changes:
- Deleted 21 .txt files (incomplete reports and LLM prompts)
- Renamed 3 _raw.yaml → .yaml (complete D4D datasheets)
- Renamed 8 _raw.json → .json (D4D subset datasheets)

Final output directory contains only clean structured data files:
- 3 YAML files: Full D4D datasheets for AI-READI, CM4AI, VOICE
- 8 JSON files: D4D subsets (Collection, Minimal, Metadata, Human)

All files are now schema-compliant structured data without redundant suffixes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit resolves YAML parsing failures in the D4D extraction pipeline
by improving the handling of unquoted strings containing colons (in titles,
URLs, dates, etc.) and refining download validation to avoid false positives.

Changes:
- Enhanced YAML fixing logic to properly quote all values containing colons
- Fixed overly aggressive "not found" validation that flagged legitimate documentation
- Added debug output for YAML validation failures
- Successfully processed previously failed dataverse_10.18130_V3_B35XWX dataset

Results:
- 9 total D4D extractions now successful (was 6, added 3 fixes)
- 1 remaining failure is a genuine Google Docs auth issue (not processable)
- All YAML parsing issues resolved

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Updated aurelian submodule to commit 9f61790 on bridge2ai-d4d-extraction branch,
which includes D4D extraction tools, test files, and examples developed for
the Bridge2AI data-sheets-schema project.

Tools added to aurelian:
- Sheet downloader for metadata collection
- Batch D4D extraction with GPT-5
- D4D agent demo and test scripts
- Example D4D outputs (Iris, MIMIC-CXR)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit completes per-document D4D extraction for all processable input
documents and corrects naming conventions for concatenated input files.

Changes:

1. Fixed file selection logic to process JSON files separately
   - JSON files with different row numbers now generate separate D4D outputs
   - Previously, doi_row2.json and doi_row9.json were grouped together

2. Generated missing per-document D4D extractions:
   - AI_READI/doi_row9_d4d.yaml (Zenodo dataset)
   - AI_READI/fairhub_row10_d4d.yaml
   - CM4AI/doi_row3_d4d.yaml
   - VOICE/healthnexus_row13_d4d.yaml

3. Renamed row-specific D4D files for clarity:
   - doi_d4d.yaml → doi_row2_d4d.yaml (AI_READI)
   - fairhub_d4d.yaml → fairhub_row10/13_d4d.yaml (AI_READI)
   - doi_d4d.yaml → doi_row3_d4d.yaml (CM4AI)
   - healthnexus_d4d.yaml → healthnexus_row13_d4d.yaml (VOICE)

4. Fixed concatenated input file naming:
   - Renamed *_d4d_concatenated.txt → *_concatenated.txt
   - These are INPUT files (concatenated source documents), not D4D outputs
   - Updated Makefile and process_concatenated_d4d.py to use new naming

5. Removed old debug file and duplicate non-row-specific files

Results:
- 11 total per-document D4D extractions (was 9, added 2 missing)
- All processable input documents now have corresponding D4D records
- Clear naming distinction between inputs (*_concatenated.txt) and outputs (*_alldocs.yaml)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@monicacecilia monicacecilia self-requested a review October 31, 2025 15:59
Distinguish between two D4D generation approaches:
1. Automated LLM API Agents (batch processing)
2. Interactive Coding Agents (human oversight)

README.md:
- Add comprehensive section on D4D Metadata Generation
- Document Approach 1: API agents (wrappers, Aurelian library)
- Document Approach 2: Interactive agents (Claude Code, etc.)
- Add comparison table for when to use each approach
- Include examples for both approaches
- Define generation metadata standards

validated_d4d_wrapper.py:
- Enhance metadata header with generator information
- Add ISO 8601 timestamp format
- Include schema URL in header
- Add method (automated) to metadata
- Improve traceability of generated files

aurelian submodule:
- Update to include HTML/JSON file support
- Prompt duplication elimination
- Enhanced D4D agent capabilities

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@monicacecilia

This comment was marked as duplicate.

Contributor

@monicacecilia monicacecilia left a comment


@realmarcin, I think merging #45 created new conflicts that weren't there earlier today. Alerting you to review before merging.

Resolved conflicts by merging dependencies from both branches:
- Kept D4D extraction dependencies (requests, pydantic-ai, beautifulsoup4, pdfminer-six)
- Kept Synapse integration dependencies (pandas, synapseclient)
- Regenerated poetry.lock with all dependencies

Files merged:
- pyproject.toml: Combined dependencies from both branches
- poetry.lock: Regenerated to include all dependencies

New files from main:
- .github/workflows/d4d_to_synapse_table.yml
- utils/d4d_to_synapse_table.py
@realmarcin realmarcin merged commit 0982496 into main Nov 1, 2025
2 checks passed
@realmarcin realmarcin deleted the extract branch November 1, 2025 00:14