Commit 0982496

Merge pull request #46 from bridge2ai/extract
Extract
2 parents: 33c1791 + e1b19f5

File tree: 203 files changed (+40681 / -5953 lines)


.DS_Store

6 KB
Binary file not shown.

CLAUDE.md

Lines changed: 302 additions & 25 deletions
Large diffs are not rendered by default.

Makefile

Lines changed: 30 additions & 1 deletion
```diff
@@ -47,7 +47,14 @@ help: status
 	@echo "make site -- makes site locally"
 	@echo "make install -- install dependencies"
 	@echo "make test -- runs tests"
+	@echo "make test-modules -- validate all D4D module schemas"
 	@echo "make lint -- perfom linting"
+	@echo "make lint-modules -- lint all D4D module schemas"
+	@echo "make concat-docs INPUT_DIR=dir OUTPUT_FILE=file -- concatenate documents from directory"
+	@echo "make concat-extracted -- concatenate extracted D4D documents by column"
+	@echo "make concat-downloads -- concatenate raw downloads by column"
+	@echo "make process-concat INPUT_FILE=file -- process concatenated doc with D4D agent"
+	@echo "make process-all-concat -- process all concatenated docs with D4D agent"
 	@echo "make testdoc -- builds docs and runs local test server"
 	@echo "make deploy -- deploys site"
 	@echo "make update -- updates linkml version"
@@ -124,12 +131,30 @@ test: test-schema test-python test-examples
 test-schema: $(SOURCE_SCHEMA_ALL)
 	$(RUN) gen-project ${GEN_PARGS} -d tmp $(SOURCE_SCHEMA_ALL)
 
+# Test individual D4D module schemas
+test-modules:
+	@echo "Validating all D4D module schemas..."
+	@for module in $(SOURCE_SCHEMA_DIR)D4D_*.yaml; do \
+		echo "Validating $$module..."; \
+		$(RUN) gen-project -d tmp $$module || exit 1; \
+	done
+	@echo "All D4D module schemas validated successfully!"
+
 test-python:
 	$(RUN) python -m unittest discover
 
 lint:
 	$(RUN) linkml-lint $(SOURCE_SCHEMA_PATH)
 
+# Lint all D4D module schemas
+lint-modules:
+	@echo "Linting all D4D module schemas..."
+	@for module in $(SOURCE_SCHEMA_DIR)D4D_*.yaml; do \
+		echo "Linting $$module..."; \
+		$(RUN) linkml-lint $$module || exit 1; \
+	done
+	@echo "All D4D module schemas linted successfully!"
+
 check-config:
 	@(grep my-datamodel about.yaml > /dev/null && printf "\n**Project not configured**:\n\n - Remember to edit 'about.yaml'\n\n" || exit 0)
 
@@ -168,7 +193,11 @@ $(DOCDIR):
 
 gendoc: $(DOCDIR)
 	cp $(SRC)/docs/*md $(DOCDIR) ; \
-	$(RUN) gen-doc ${GEN_DARGS} -d $(DOCDIR) $(SOURCE_SCHEMA_PATH)
+	$(RUN) gen-doc ${GEN_DARGS} -d $(DOCDIR) $(SOURCE_SCHEMA_PATH) ; \
+	mkdir -p $(DOCDIR)/html_output/concatenated ; \
+	cp -r $(SRC)/html/output/*.html $(DOCDIR)/html_output/ 2>/dev/null || true ; \
+	cp -r $(SRC)/html/output/concatenated/*.html $(DOCDIR)/html_output/concatenated/ 2>/dev/null || true ; \
+	cp $(SRC)/html/output/*.css $(DOCDIR)/html_output/ 2>/dev/null || true
 
 testdoc: gendoc serve
```
README.md

Lines changed: 196 additions & 0 deletions
@@ -23,6 +23,202 @@ We are also tracking related developments, such as augmented Datasheets for Data

Unchanged context: "Python datamodel", "* [tests/](tests/) - Python tests". Added section:

## D4D Metadata Generation

This repository supports two distinct approaches for generating D4D (Datasheets for Datasets) metadata from dataset documentation:

### Approach 1: Automated LLM API Agents 🤖

**Use when**: You need to batch-process many files automatically with minimal human intervention.

These are automated scripts that use LLM APIs (OpenAI/Anthropic) to extract D4D metadata from dataset documentation. The agents run autonomously and can process hundreds of files in batch mode.

#### 1.1 Validated D4D Wrapper (Recommended)

```bash
python src/download/validated_d4d_wrapper.py -i downloads_by_column -o data/extracted_by_column
```

**Features**:
- Validates that downloads succeeded
- Checks content relevance to projects
- Generates D4D YAML metadata via GPT-5
- Creates detailed validation reports
- Processes HTML, JSON, PDF, and text files
- Adds generation metadata to YAML headers

**Generated Metadata Includes**:

```yaml
# D4D Metadata extracted from: dataset_page.html
# Column: AI_READI
# Validation: Download ✅ success
# Relevance: ✅ relevant
# Generated: 2025-10-31 14:23:15
# Generator: validated_d4d_wrapper (GPT-5)
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
```
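The header convention above is straightforward to reproduce. A minimal sketch of a helper that builds such a header (`d4d_header` is a hypothetical name for illustration; the wrapper's actual field set and formatting may differ):

```python
from datetime import datetime, timezone

def d4d_header(source: str, column: str, generator: str, schema_url: str) -> str:
    """Build a generation-metadata comment block in the style shown above.

    Illustrative sketch only, not the wrapper's real implementation."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    lines = [
        f"# D4D Metadata extracted from: {source}",
        f"# Column: {column}",
        f"# Generated: {stamp}",
        f"# Generator: {generator}",
        f"# Schema: {schema_url}",
    ]
    return "\n".join(lines) + "\n"
```

Because every line is a `#` comment, the block can be prepended to the generated document and the result still parses as plain YAML.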
#### 1.2 Aurelian D4D Agent (Library Usage)

For integration into Python applications (note that `d4d_agent.run` is a coroutine, so it must be awaited inside an async context):

```python
import asyncio

from aurelian.agents.d4d.d4d_agent import d4d_agent
from aurelian.agents.d4d.d4d_config import D4DConfig

# Process multiple sources (URLs and local files)
sources = [
    "https://example.com/dataset",
    "/path/to/metadata.json",
    "/path/to/documentation.html",
]

async def main():
    config = D4DConfig()
    result = await d4d_agent.run(
        f"Extract metadata from: {', '.join(sources)}",
        deps=config,
    )
    print(result.data)  # D4D YAML output

asyncio.run(main())
```

**Supported File Types**: PDF, HTML, JSON, text/markdown (URLs and local files)
#### 1.3 Basic D4D Wrapper (Simpler Version)

```bash
python src/download/d4d_agent_wrapper.py -i downloads_by_column -o data/extracted_by_column
```

A simpler version without validation steps, suitable for clean input data.

**Requirements for API Agents**:
- Set the `ANTHROPIC_API_KEY` or `OPENAI_API_KEY` environment variable
- Wrappers use GPT-5 by default (configurable)
- Files must be organized in column directories
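The API-key requirement can be checked before any batch run starts. A minimal sketch (`pick_llm_provider` is a hypothetical helper for illustration; the wrappers' real startup checks may differ):

```python
import os

def pick_llm_provider() -> str:
    """Return which LLM provider is configured, preferring Anthropic.

    Hypothetical helper illustrating the ANTHROPIC_API_KEY / OPENAI_API_KEY
    requirement; not part of the actual wrapper scripts."""
    if os.environ.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("Set ANTHROPIC_API_KEY or OPENAI_API_KEY before running the wrappers")
```

Failing fast like this avoids discovering a missing credential halfway through a run over hundreds of files.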
---

### Approach 2: Interactive Coding Agents 👨‍💻

**Use when**: You need human oversight, domain expertise, or customized metadata extraction.

Use coding assistants such as **Claude Code**, **GitHub Copilot**, or **Cursor** to generate D4D metadata interactively. This approach provides human-in-the-loop quality control and domain-specific reasoning.

#### 2.1 Using Claude Code (Recommended)

**Step 1**: Provide the schema and dataset documentation to Claude Code:

```
Please generate D4D (Datasheets for Datasets) metadata for the dataset at:
https://example.com/dataset

Use the D4D schema at:
https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/src/ontogpt/templates/data_sheets_schema.yaml

Generate a complete YAML file following the schema structure.
```

**Step 2**: Claude Code will:
- Fetch the dataset documentation
- Analyze the content
- Generate structured D4D YAML
- Include reasoning about field mappings
- Iterate based on your feedback

**Generated Metadata Includes**:

```yaml
# D4D Metadata for: Example Dataset
# Generated: 2025-10-31
# Generator: Claude Code (claude-sonnet-4-5)
# Method: Interactive extraction with human oversight
# Schema: https://raw.githubusercontent.com/monarch-initiative/ontogpt/main/...
# Reviewed by: [Your Name]
```

#### 2.2 Workflow Example

```bash
# 1. Start an interactive session with Claude Code
claude-code

# 2. Provide instructions
"Generate D4D metadata for datasets in downloads_by_column/AI_READI/
following the schema at [schema URL]"

# 3. Review and refine
# Claude Code will generate metadata and you can provide feedback:
# - "Add more detail to the preprocessing section"
# - "Include information from the supplementary materials"
# - "Ensure all required fields are populated"

# 4. Save validated output
# Output is saved with generation metadata in the YAML header
```
**Benefits of Interactive Approach**:
- ✅ Human oversight and quality control
- ✅ Domain expertise applied to field mapping
- ✅ Iterative refinement based on feedback
- ✅ Reasoning captured in the generation process
- ✅ Can handle complex, ambiguous documentation
- ✅ Better handling of edge cases

---

### Comparison: When to Use Each Approach

| Aspect | API Agents 🤖 | Interactive Coding Agents 👨‍💻 |
|--------|---------------|-------------------------------|
| **Speed** | Fast (batch processing) | Slower (interactive) |
| **Scale** | Hundreds of files | A few files at a time |
| **Quality** | Consistent, good | Variable, can be excellent |
| **Human oversight** | Minimal | Full |
| **Cost** | API costs × files | Time + API costs |
| **Best for** | Standardized docs | Complex/ambiguous docs |
| **Customization** | Limited | High |
| **Domain expertise** | Model knowledge only | Human + model knowledge |

### Recommended Workflow

**For large-scale extraction**:
1. Use API agents for initial batch processing
2. Use coding agents to review and refine difficult cases
3. Document any manual corrections

**For high-value datasets**:
1. Use coding agents with human oversight
2. Validate against domain expertise
3. Iterate until metadata is complete

---
### Generation Metadata Standards

Both approaches should include standardized generation metadata in YAML headers:

```yaml
# D4D Metadata for: [Dataset Name]
# Source: [URL or file path]
# Generated: [ISO 8601 timestamp]
# Generator: [Tool name and version/model]
# Method: [automated | interactive | hybrid]
# Schema: [D4D schema URL]
# Validator: [Name/email if human reviewed]
# Notes: [Any relevant generation notes]
```
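Because every header line follows the `# Key: value` comment convention, downstream tooling can recover the fields without a YAML parser. A minimal sketch (`parse_d4d_header` is a hypothetical helper, assuming the convention shown above):

```python
def parse_d4d_header(yaml_text: str) -> dict:
    """Extract 'Key: value' pairs from the leading comment block of a
    generated D4D YAML file.

    Sketch only; assumes the header convention above and stops at the
    first non-comment line (the start of the YAML body)."""
    meta = {}
    for line in yaml_text.splitlines():
        if not line.startswith("#"):
            break
        body = line.lstrip("#").strip()
        key, sep, value = body.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta
```

Splitting on the first colon only keeps URL values such as `# Schema: https://...` intact.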
### Script Locations

- **This repo**: https://github.com/bridge2ai/data-sheets-schema
- **API Agent Scripts**: [src/download/](src/download/)
  - Validated wrapper: `src/download/validated_d4d_wrapper.py`
  - Basic wrapper: `src/download/d4d_agent_wrapper.py`
- **Aurelian D4D Agent**: [aurelian/src/aurelian/agents/d4d/](aurelian/src/aurelian/agents/d4d/)
  - Agent: `d4d_agent.py`
  - Tools: `d4d_tools.py`
  - Config: `d4d_config.py`

Unchanged context: "## Developer Documentation", "<details>".

aurelian

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+Subproject commit f7fa963954cf5acba7ee51c46e4ea6f69bed5d2a
```

data/.DS_Store

6 KB
Binary file not shown.

data/GC_data_sheets/output/D4D - AI-READI FAIRHub.txt

Lines changed: 0 additions & 16 deletions
This file was deleted.

data/GC_data_sheets/output/D4D - AI-READI FAIRHub_raw.txt renamed to data/GC_data_sheets/output/D4D - AI-READI FAIRHub.yaml

File renamed without changes.
