Commit 0982496

Merge pull request #46 from bridge2ai/extract
Extract
2 parents: 33c1791 + e1b19f5

File tree: 203 files changed (+40681 / -5953 lines)


.DS_Store

6 KB
Binary file not shown.

CLAUDE.md

Lines changed: 302 additions & 25 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 30 additions & 1 deletion
```diff
@@ -47,7 +47,14 @@ help: status
 	@echo "make site -- makes site locally"
 	@echo "make install -- install dependencies"
 	@echo "make test -- runs tests"
+	@echo "make test-modules -- validate all D4D module schemas"
 	@echo "make lint -- perfom linting"
+	@echo "make lint-modules -- lint all D4D module schemas"
+	@echo "make concat-docs INPUT_DIR=dir OUTPUT_FILE=file -- concatenate documents from directory"
+	@echo "make concat-extracted -- concatenate extracted D4D documents by column"
+	@echo "make concat-downloads -- concatenate raw downloads by column"
+	@echo "make process-concat INPUT_FILE=file -- process concatenated doc with D4D agent"
+	@echo "make process-all-concat -- process all concatenated docs with D4D agent"
 	@echo "make testdoc -- builds docs and runs local test server"
 	@echo "make deploy -- deploys site"
 	@echo "make update -- updates linkml version"
@@ -124,12 +131,30 @@ test: test-schema test-python test-examples
 test-schema: $(SOURCE_SCHEMA_ALL)
 	$(RUN) gen-project ${GEN_PARGS} -d tmp $(SOURCE_SCHEMA_ALL)
 
+# Test individual D4D module schemas
+test-modules:
+	@echo "Validating all D4D module schemas..."
+	@for module in $(SOURCE_SCHEMA_DIR)D4D_*.yaml; do \
+		echo "Validating $$module..."; \
+		$(RUN) gen-project -d tmp $$module || exit 1; \
+	done
+	@echo "All D4D module schemas validated successfully!"
+
 test-python:
 	$(RUN) python -m unittest discover
 
 lint:
 	$(RUN) linkml-lint $(SOURCE_SCHEMA_PATH)
 
+# Lint all D4D module schemas
+lint-modules:
+	@echo "Linting all D4D module schemas..."
+	@for module in $(SOURCE_SCHEMA_DIR)D4D_*.yaml; do \
+		echo "Linting $$module..."; \
+		$(RUN) linkml-lint $$module || exit 1; \
+	done
+	@echo "All D4D module schemas linted successfully!"
+
 check-config:
 	@(grep my-datamodel about.yaml > /dev/null && printf "\n**Project not configured**:\n\n - Remember to edit 'about.yaml'\n\n" || exit 0)
 
@@ -168,7 +193,11 @@ $(DOCDIR):
 
 gendoc: $(DOCDIR)
 	cp $(SRC)/docs/*md $(DOCDIR) ; \
-	$(RUN) gen-doc ${GEN_DARGS} -d $(DOCDIR) $(SOURCE_SCHEMA_PATH)
+	$(RUN) gen-doc ${GEN_DARGS} -d $(DOCDIR) $(SOURCE_SCHEMA_PATH) ; \
+	mkdir -p $(DOCDIR)/html_output/concatenated ; \
+	cp -r $(SRC)/html/output/*.html $(DOCDIR)/html_output/ 2>/dev/null || true ; \
+	cp -r $(SRC)/html/output/concatenated/*.html $(DOCDIR)/html_output/concatenated/ 2>/dev/null || true ; \
+	cp $(SRC)/html/output/*.css $(DOCDIR)/html_output/ 2>/dev/null || true
 
 testdoc: gendoc serve
```
README.md

Lines changed: 196 additions & 0 deletions
@@ -23,6 +23,202 @@ We are also tracking related developments, such as augmented Datasheets for Data

Unchanged context: "Python datamodel", "* [tests/](tests/) - Python tests". Added section:

## D4D Metadata Generation

This repository supports two distinct approaches for generating D4D (Datasheets for Datasets) metadata from dataset documentation:

### Approach 1: Automated LLM API Agents 🤖

**Use when**: You need to batch-process many files automatically with minimal human intervention.

These are automated scripts that use LLM APIs (OpenAI/Anthropic) to extract D4D metadata from dataset documentation. The agents run autonomously and can process hundreds of files in batch mode.

#### 1.1 Validated D4D Wrapper (Recommended)

```bash
python src/download/validated_d4d_wrapper.py -i downloads_by_column -o data/extracted_by_column
```

**Features**:
- Validates that downloads succeeded
- Checks content relevance to projects
- Generates D4D YAML metadata via GPT-5
- Creates detailed validation reports
- Processes HTML, JSON, PDF, and text files
- Adds generation metadata to YAML headers

**Generated Metadata Includes**:

```yaml
# D4D Metadata extracted from: dataset_page.html
# Column: AI_READI
# Validation: Download ✅ success
# Relevance: ✅ relevant
# Generated: 2025-10-31 14:23:15
# Generator: validated_d4d_wrapper (GPT-5)
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
```
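The header convention above is straightforward to reproduce. A minimal sketch of a helper that builds such a header (`d4d_header` is a hypothetical name for illustration; the wrapper's actual field set and formatting may differ):

```python
from datetime import datetime, timezone

def d4d_header(source: str, column: str, generator: str, schema_url: str) -> str:
    """Build a generation-metadata comment block in the style shown above.

    Illustrative sketch only, not the wrapper's real implementation."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    lines = [
        f"# D4D Metadata extracted from: {source}",
        f"# Column: {column}",
        f"# Generated: {stamp}",
        f"# Generator: {generator}",
        f"# Schema: {schema_url}",
    ]
    return "\n".join(lines) + "\n"
```

Because every line is a `#` comment, the block can be prepended to the generated document and the result still parses as plain YAML.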
#### 1.2 Aurelian D4D Agent (Library Usage)

For integration into Python applications (note that `d4d_agent.run` is a coroutine, so it must be awaited inside an async context):

```python
import asyncio

from aurelian.agents.d4d.d4d_agent import d4d_agent
from aurelian.agents.d4d.d4d_config import D4DConfig

# Process multiple sources (URLs and local files)
sources = [
    "https://example.com/dataset",
    "/path/to/metadata.json",
    "/path/to/documentation.html",
]

async def main():
    config = D4DConfig()
    result = await d4d_agent.run(
        f"Extract metadata from: {', '.join(sources)}",
        deps=config,
    )
    print(result.data)  # D4D YAML output

asyncio.run(main())
```

**Supported File Types**: PDF, HTML, JSON, text/markdown (URLs and local files)
#### 1.3 Basic D4D Wrapper (Simpler Version)

```bash
python src/download/d4d_agent_wrapper.py -i downloads_by_column -o data/extracted_by_column
```

A simpler version without validation steps, suitable for clean input data.

**Requirements for API Agents**:
- Set the `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` environment variable
- Wrappers use GPT-5 by default (configurable)
- Files must be organized in column directories
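The API-key requirement can be checked before any batch run starts. A minimal sketch (`pick_llm_provider` is a hypothetical helper for illustration; the wrappers' real startup checks may differ):

```python
import os

def pick_llm_provider() -> str:
    """Return which LLM provider is configured, preferring Anthropic.

    Hypothetical helper illustrating the ANTHROPIC_API_KEY / OPENAI_API_KEY
    requirement; not part of the actual wrapper scripts."""
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("Set ANTHROPIC_API_KEY or OPENAI_API_KEY before running the wrappers")
```

Failing fast like this avoids discovering a missing credential halfway through a run over hundreds of files.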
---

### Approach 2: Interactive Coding Agents 👨‍💻

**Use when**: You need human oversight, domain expertise, or customized metadata extraction.

Use coding assistants such as **Claude Code**, **GitHub Copilot**, or **Cursor** to generate D4D metadata interactively. This approach provides human-in-the-loop quality control and domain-specific reasoning.

#### 2.1 Using Claude Code (Recommended)

**Step 1**: Provide the schema and dataset documentation to Claude Code:

```
Please generate D4D (Datasheets for Datasets) metadata for the dataset at:
https://example.com/dataset

Use the D4D schema at:
https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/src/ontogpt/templates/data_sheets_schema.yaml

Generate a complete YAML file following the schema structure.
```

**Step 2**: Claude Code will:
- Fetch the dataset documentation
- Analyze the content
- Generate structured D4D YAML
- Include reasoning about field mappings
- Iterate based on your feedback

**Generated Metadata Includes**:

```yaml
# D4D Metadata for: Example Dataset
# Generated: 2025-10-31
# Generator: Claude Code (claude-sonnet-4-5)
# Method: Interactive extraction with human oversight
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
# Reviewed by: [Your Name]
```

#### 2.2 Workflow Example

```bash
# 1. Start an interactive session with Claude Code
claude-code

# 2. Provide instructions
"Generate D4D metadata for datasets in downloads_by_column/AI_READI/
following the schema at [schema URL]"

# 3. Review and refine
# Claude Code will generate metadata and you can provide feedback:
# - "Add more detail to the preprocessing section"
# - "Include information from the supplementary materials"
# - "Ensure all required fields are populated"

# 4. Save validated output
# Output is saved with generation metadata in the YAML header
```
**Benefits of Interactive Approach**:
- ✅ Human oversight and quality control
- ✅ Domain expertise applied to field mapping
- ✅ Iterative refinement based on feedback
- ✅ Reasoning captured in the generation process
- ✅ Can handle complex, ambiguous documentation
- ✅ Better handling of edge cases

---

### Comparison: When to Use Each Approach

| Aspect | API Agents 🤖 | Interactive Coding Agents 👨‍💻 |
|--------|---------------|-------------------------------|
| **Speed** | Fast (batch processing) | Slower (interactive) |
| **Scale** | Hundreds of files | A few files at a time |
| **Quality** | Consistent, good | Variable, can be excellent |
| **Human oversight** | Minimal | Full |
| **Cost** | API costs × files | Time + API costs |
| **Best for** | Standardized docs | Complex/ambiguous docs |
| **Customization** | Limited | High |
| **Domain expertise** | Model knowledge only | Human + model knowledge |

### Recommended Workflow

**For large-scale extraction**:
1. Use API agents for initial batch processing
2. Use coding agents to review and refine difficult cases
3. Document any manual corrections

**For high-value datasets**:
1. Use coding agents with human oversight
2. Validate against domain expertise
3. Iterate until metadata is complete

---
### Generation Metadata Standards

Both approaches should include standardized generation metadata in YAML headers:

```yaml
# D4D Metadata for: [Dataset Name]
# Source: [URL or file path]
# Generated: [ISO 8601 timestamp]
# Generator: [Tool name and version/model]
# Method: [automated | interactive | hybrid]
# Schema: [D4D schema URL]
# Validator: [Name/email if human reviewed]
# Notes: [Any relevant generation notes]
```
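Because every header line follows the `# Key: value` comment convention, downstream tooling can recover the fields without a YAML parser. A minimal sketch (`parse_d4d_header` is a hypothetical helper, assuming the convention shown above):

```python
def parse_d4d_header(yaml_text: str) -> dict:
    """Extract 'Key: value' pairs from the leading comment block of a
    generated D4D YAML file.

    Sketch only; assumes the header convention above and stops at the
    first non-comment line (the start of the YAML body)."""
    meta = {}
    for line in yaml_text.splitlines():
        if not line.startswith("#"):
            break
        body = line.lstrip("#").strip()
        key, sep, value = body.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta
```

Splitting on the first colon only keeps URL values such as `# Schema: https://...` intact.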
### Script Locations

- **This repo**: https://github.com/bridge2ai/data-sheets-schema
- **API Agent Scripts**: [src/download/](src/download/)
  - Validated wrapper: `src/download/validated_d4d_wrapper.py`
  - Basic wrapper: `src/download/d4d_agent_wrapper.py`
- **Aurelian D4D Agent**: [aurelian/src/aurelian/agents/d4d/](aurelian/src/aurelian/agents/d4d/)
  - Agent: `d4d_agent.py`
  - Tools: `d4d_tools.py`
  - Config: `d4d_config.py`

Unchanged context: "## Developer Documentation", "<details>".

aurelian

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+Subproject commit f7fa963954cf5acba7ee51c46e4ea6f69bed5d2a
```

data/.DS_Store

6 KB
Binary file not shown.

data/GC_data_sheets/output/D4D - AI-READI FAIRHub.txt

Lines changed: 0 additions & 16 deletions
This file was deleted.

data/GC_data_sheets/output/D4D - AI-READI FAIRHub_raw.txt renamed to data/GC_data_sheets/output/D4D - AI-READI FAIRHub.yaml

File renamed without changes.
