Skip to content

Conversation

@realmarcin
Copy link
Collaborator

Summary

Comprehensive optimization of the D4D schema's ontology grounding based on expert review feedback from Harry Caufield. Reduced from 40+ to 25+ standards by removing redundant, unused, and overly granular ontology mappings while maintaining essential semantic interoperability.

Changes Overview

Phase 1-3: Remove Unused Standards (Commit 4df82e6)

Removed 7 standards with minimal/unclear benefit:

  • VoID (Vocabulary of Interlinked Datasets)
  • W3C Formats Registry (redundant with IANA Media Types)
  • FOAF (Friend of a Friend) - replaced with schema.org
  • BIBO (Bibliographic Ontology)
  • OSLC (Open Services for Lifecycle Collaboration)
  • Frictionless Data Standards (initial removal)
  • XSD from VariableTypeEnum (kept in schema.org mappings)

Replaced PAV with Dublin Core:

  • 9 PAV usages → Dublin Core equivalents
  • version → dcterms:hasVersion
  • previousVersion → dcterms:isVersionOf
  • createdOn → dcterms:created
  • createdBy → dcterms:creator
  • lastUpdateOn → dcterms:modified

Removed CSVW from Variables module:

  • 6 CSVW usages removed (Column, name, datatype, primaryKey, null, dialect)

Phase 4: Reduce PROV-O to Essential Use (Commit 4df82e6)

  • Before: 21 PROV-O mappings across schema
  • After: 1 essential mapping (was_derived_from slot)
  • Removed: Agent roles (9 mappings in CreatorOrMaintainerEnum)
  • Rationale: Redundant with schema.org and Dublin Core

Phase 5-6: Remove Security Classification Mappings (Commit 4df82e6)

Removed from ConfidentialityLevelEnum:

  • ISO 27001 mappings (Public, Internal, Highly Confidential)
  • NIST SP 800-60 mappings (Low Impact, Moderate Impact, High Impact)
  • Traffic Light Protocol (TLP) mappings (TLP:CLEAR, TLP:GREEN, TLP:AMBER)

Rationale: Security classification is domain-specific; prescriptive mappings may not fit all use cases

Phase 9: Remove GDPR and EU AI Act - US-Centric Focus (Commit 4fc1f85)

Per Harry's feedback: "stay US-centric"

Removed from D4D_Data_Governance.yaml:

  • gdpr_compliant field from ExportControlRegulatoryRestrictions
  • eu_ai_act_risk_category field from ExportControlRegulatoryRestrictions
  • Entire AIActRiskEnum (42 lines):
    • minimal_risk (EUAIAct:Article50)
    • limited_risk (EUAIAct:Article50, EUAIAct:TitleIV)
    • high_risk (EUAIAct:Article6, EUAIAct:AnnexIII, EUAIAct:TitleIII)
    • unacceptable_risk (EUAIAct:Article5)

Updated descriptions:

  • ComplianceStatusEnum: "GDPR, HIPAA, EU AI Act" → "HIPAA, 45 CFR 46"
  • ExportControlRegulatoryRestrictions: Now references "HIPAA and other US regulations"
  • D4D_Base_import Composition: Removed GDPR reference
  • D4D_Human regulatory_compliance: "(45 CFR 46, GDPR, HIPAA)" → "(45 CFR 46, HIPAA)"

Phase 10: Complete Frictionless & CSVW Cleanup (Commit 4fc1f85)

Per Harry's feedback: "wouldn't worry about mapping to it" (Frictionless), "more granular than D4D needs" (CSVW)

Prefix removals:

  • frictionless: https://specs.frictionlessdata.io/ (from D4D_Base_import.yaml and data_sheets_schema.yaml)
  • csvw: http://www.w3.org/ns/csvw# (from D4D_Base_import.yaml and data_sheets_schema.yaml)

Mapping removals:

  • slot_uri: csvw:dialect from dialect slot in D4D_Base_import.yaml

Harry Caufield's Feedback Alignment

Recommendation Status Implementation
Remove FOAF ✅ Complete Phase 1: Replaced with schema.org
Remove BIBO ✅ Complete Phase 1: Removed (minimal usage)
Keep AIO ✅ Complete Retained: BiasTypeEnum (9 bias types)
Reduce PROV-O ✅ Complete Phase 4: 21→1 usages (kept derivation)
Keep DUO ✅ Complete Retained: DataUsePermissionEnum (22 types)
Keep CRediT ✅ Complete Retained: CreatorOrMaintainerEnum mappings
Remove GDPR ✅ Complete Phase 9: Removed gdpr_compliant field
Remove EU AI Act ✅ Complete Phase 9: Removed AIActRiskEnum entirely
Remove Frictionless ✅ Complete Phase 10: Removed prefixes
Remove CSVW ✅ Complete Phase 10: Removed prefixes + mappings

Result: 100% alignment with expert recommendations (10/10 items)

Impact Analysis

Standards Count

  • Before: 40+ standards, ontologies, and frameworks
  • After: 25+ standards (37.5% reduction)

Code Changes

Commit 4df82e6 (Phases 1-8):

  • 15 files changed
  • 1,104 insertions(+)
  • 1,338 deletions(-)

Commit 4fc1f85 (Phases 9-10):

  • 10 files changed
  • 750 insertions(+)
  • 995 deletions(-)

Total:

  • 25 files changed (with overlap)
  • 1,854 insertions(+)
  • 2,333 deletions(-)

Retained Essential Standards

Ontologies (5):

  1. AIO - Artificial Intelligence Ontology (bias taxonomy)
  2. DUO - Data Use Ontology (GA4GH permissions)
  3. PROV-O - Provenance (minimal use: derivation only)
  4. SKOS - Knowledge organization
  5. Biolink - Biomedical knowledge graphs

Metadata Standards (6):

  1. Dublin Core (dcterms) - Core metadata vocabulary
  2. DCAT - Data catalog vocabulary
  3. schema.org - Structured data markup
  4. DataCite - Dataset citation metadata
  5. QUDT - Units and quantities
  6. Bridge2AI Standards Registry

Regulatory Frameworks (US-Centric) (2):

  1. HIPAA - Health Insurance Portability and Accountability Act
  2. 45 CFR 46 - Common Rule for human subjects research

Attribution Standards (1):

  1. CRediT - Contributor Roles Taxonomy

Technical Standards (3):

  1. IANA Media Types
  2. SPDX Licenses
  3. W3C Recommendations (SHACL, basic)

Files Changed

Schema Files (7)

  • src/data_sheets_schema/schema/D4D_Data_Governance.yaml
  • src/data_sheets_schema/schema/D4D_Base_import.yaml
  • src/data_sheets_schema/schema/D4D_Human.yaml
  • src/data_sheets_schema/schema/data_sheets_schema.yaml
  • src/data_sheets_schema/schema/D4D_Maintenance.yaml
  • src/data_sheets_schema/schema/D4D_Variables.yaml
  • src/data_sheets_schema/schema/D4D_Composition.yaml

Generated Artifacts (4)

  • src/data_sheets_schema/datamodel/data_sheets_schema.py
  • project/jsonld/data_sheets_schema.jsonld
  • project/jsonschema/data_sheets_schema.schema.json
  • project/owl/data_sheets_schema.owl.ttl

Documentation (2)

  • docs/standards_alignment.md (comprehensive changelog added)
  • docs/enum_ontology_grounding_report.md (phase documentation)

Validation

All tests pass successfully:

✅ make test-modules    # All D4D modules validate
✅ make test-schema     # Full merged schema validates
✅ make gen-project     # All artifacts regenerated
✅ make test            # Complete test suite passes

No breaking changes - all removed fields were optional.

Benefits

  1. Reduced Complexity: Fewer external dependencies to track and maintain
  2. Improved Clarity: Removed redundant/overlapping mappings
  3. US-Centric Focus: Aligned with primary user base regulations
  4. Better Maintainability: Focused on high-impact, well-maintained standards
  5. Expert-Validated: 100% alignment with expert review feedback

Documentation

Comprehensive documentation added in commit 4df82e6:

  • docs/standards_alignment.md - Full standards inventory and rationale
  • docs/enum_ontology_grounding_report.md - Enumeration ontology analysis
  • Both documents include detailed Phase 1-11 changelogs

Related Issues

Implements recommendations from expert review discussion regarding ontology grounding optimization and standards alignment strategy.

🤖 Generated with Claude Code

realmarcin and others added 6 commits December 1, 2025 21:23
- Added XSD and schema.org mappings to VariableTypeEnum (13 data types):
  - Exact mappings to XSD types: integer, float, double, string, boolean, date, datetime
  - Broad mappings to schema.org: Integer, Float, Number, Text, Boolean, Date, DateTime
  - Categorical types mapped to schema.org: Text, ItemList, StructuredValue
  - Added xsd: prefix (http://www.w3.org/2001/XMLSchema#)

- Added DataCite and Dublin Core mappings to DatasetRelationshipTypeEnum (14 types):
  - Exact mappings to dcterms: isVersionOf, replaces, isReplacedBy, isRequiredBy,
    requires, isPartOf, hasPart, isReferencedBy, references
  - Broad mappings for: derives_from (dcterms:source), supplements, is_identical_to
  - Added prov:wasDerivedFrom mapping for derives_from
  - Aligned with DataCite Metadata Schema 4.6 RelationType controlled vocabulary

- Regenerated schema artifacts (merged YAML, Python model, JSON Schema, OWL, JSON-LD)

Completes HIGH and MEDIUM priority ontology grounding recommendations from
enumeration grounding report. All priority enumerations now mapped to standard
ontologies: AIO for bias, DUO for data use, ISO/NIST/TLP for confidentiality,
XSD/schema.org for variable types, DataCite/dcterms for dataset relationships.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Added PROV-O and schema.org mappings to CreatorOrMaintainerEnum (9 agent types):
  - Person/Agent types: researcher, industry → prov:Person, schema:Person
  - Organization types: academic_institution → schema:EducationalOrganization
  - Government agencies → schema:GovernmentOrganization
  - Commercial entities → schema:Corporation
  - Non-profits → schema:Organization
  - Generic agents: data_subject, third_party, crowdsourced → prov:Agent

- Added EU AI Act official references to AIActRiskEnum (4 risk categories):
  - minimal_risk → EUAIAct:Article50 (general transparency)
  - limited_risk → EUAIAct:Article50, TitleIV (transparency obligations)
  - high_risk → EUAIAct:Article6, AnnexIII, TitleIII (strict requirements)
  - unacceptable_risk → EUAIAct:Article5 (prohibited practices)
  - Updated regulation reference to Regulation (EU) 2024/1689
  - Added EUR-Lex official publication link

- Enhanced ComplianceStatusEnum with dcterms mappings:
  - compliant, partially_compliant → dcterms:conformsTo
  - Improved descriptions for all 5 status values
  - Clarified workflow nature and evolution over time

- Regenerated schema artifacts (merged YAML, Python model, JSON Schema, OWL, JSON-LD)

Completes ALL enumeration ontology grounding. All 14 enumerations now have
appropriate ontology mappings or justification for domain-specific values.
Schema fully grounded in established standards: AIO, DUO, ISO/NIST/TLP, XSD,
schema.org, DataCite, dcterms, PROV-O, EU AI Act.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Documents 40+ standards, ontologies, and frameworks
- Organized by 9 categories: ontologies, metadata, web standards,
  measurement, regulatory, security, research, versioning, linked data
- Includes detailed coverage of:
  - W3C standards (DCAT, PROV-O, SKOS, CSVW, XSD, SHACL)
  - Dublin Core and DataCite
  - Ontologies (AIO, DUO, PROV-O, SKOS, FOAF)
  - Regulatory frameworks (GDPR, HIPAA, EU AI Act, 45 CFR 46)
  - Security standards (ISO 27001, NIST SP 800-60, TLP)
  - Research infrastructure (CRediT, Bridge2AI)
- Standards coverage matrix by module
- Mapping statistics and interoperability benefits
- Compliance and regulatory coverage analysis
This commit implements comprehensive ontology grounding optimization based on
expert review feedback (Harry Caufield). Reduces schema dependencies from 40+
to 25+ standards while maintaining semantic clarity and improving
maintainability.

## Changes by Phase

### Phase 1: Remove unused/minimal standards
- Removed VoID (Vocabulary of Interlinked Datasets) - minimal usage
- Removed W3C Formats Registry - redundant with IANA Media Types
- Removed FOAF (Friend of a Friend) - replaced with schema.org
- Removed BIBO (Bibliographic Ontology) - minimal usage
- Removed OSLC (Open Services for Lifecycle Collaboration) - single usage
- Removed Frictionless Data Standards - unclear benefit
- Removed XSD from VariableTypeEnum (kept schema.org mappings)

### Phase 2: Replace PAV with Dublin Core (9 usages)
- version → dcterms:hasVersion
- previousVersion → dcterms:isVersionOf
- createdOn → dcterms:created
- createdBy → dcterms:creator
- lastUpdateOn → dcterms:modified
Files: D4D_Maintenance.yaml

### Phase 3: Remove CSVW from Variables module (6 usages)
- Removed: Column, name, datatype, primaryKey, null, dialect mappings
- Replaced with schema.org equivalents where needed
Files: D4D_Variables.yaml

### Phase 4: Reduce PROV-O to essential use (21→1 usages)
- Kept: was_derived_from slot (prov:wasDerivedFrom)
- Removed: 9 Agent/Person/Organization mappings from CreatorOrMaintainerEnum
- Removed: Redundant derivation mappings from modules
Files: D4D_Base_import.yaml, D4D_Collection.yaml, D4D_Composition.yaml,
       D4D_Preprocessing.yaml, D4D_Variables.yaml

### Phase 5-6: Remove security classification mappings
- Removed ISO 27001 mappings from ConfidentialityLevelEnum
- Removed NIST SP 800-60 mappings from ConfidentialityLevelEnum
- Removed Traffic Light Protocol (TLP) mappings from ConfidentialityLevelEnum
- Rationale: Security classification is domain-specific
Files: D4D_Data_Governance.yaml

### Phase 7: Update standards documentation
- Updated docs/standards_alignment.md with comprehensive changelog
- Updated docs/enum_ontology_grounding_report.md with optimization details
- Documented 40+→25+ standard reduction
- Added rationale for all removals

### Phase 8: Validation and testing
- Fixed missing AIO prefix in main schema (data_sheets_schema.yaml)
- All module tests pass (make test-modules)
- All schema tests pass (make test-schema)
- All Python tests pass (make test-python)
- Project artifacts regenerated (make gen-project)

## Files Modified
Schema modules:
- src/data_sheets_schema/schema/data_sheets_schema.yaml (added AIO prefix)
- src/data_sheets_schema/schema/D4D_Base_import.yaml (PROV-O reduction)
- src/data_sheets_schema/schema/D4D_Collection.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Composition.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Data_Governance.yaml (removed security mappings)
- src/data_sheets_schema/schema/D4D_Maintenance.yaml (PAV→Dublin Core)
- src/data_sheets_schema/schema/D4D_Preprocessing.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Variables.yaml (removed CSVW, PROV-O)

Documentation:
- docs/standards_alignment.md (updated with optimization changelog)
- docs/enum_ontology_grounding_report.md (updated with changes)

Generated artifacts:
- src/data_sheets_schema/datamodel/data_sheets_schema.py
- project/jsonld/data_sheets_schema.jsonld
- project/jsonschema/data_sheets_schema.schema.json
- project/owl/data_sheets_schema.owl.ttl

## Rationale
- Eliminate redundancy (PAV/Dublin Core, FOAF/schema.org overlap)
- Remove unclear/unused ontologies (VoID, BIBO, OSLC)
- Simplify maintenance (fewer external dependencies)
- Focus on high-impact standards (DUO, schema.org, Dublin Core, DCAT)
- Improve clarity (avoid prescriptive security mappings)

## Testing
✓ All module schemas validate
✓ Full merged schema validates
✓ Python unit tests pass
✓ Example data validation passes
✓ Generated artifacts complete

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…/CSVW cleanup

## Phase 9: Remove GDPR and EU AI Act (US-Centric Focus)

Based on Harry Caufield's recommendation to "stay US-centric", removed all
EU regulatory framework references:

**D4D_Data_Governance.yaml:**
- Removed `gdpr_compliant` field from ExportControlRegulatoryRestrictions
- Removed `eu_ai_act_risk_category` field from ExportControlRegulatoryRestrictions
- Removed entire `AIActRiskEnum` enum (42 lines):
  - minimal_risk (EUAIAct:Article50)
  - limited_risk (EUAIAct:Article50, EUAIAct:TitleIV)
  - high_risk (EUAIAct:Article6, EUAIAct:AnnexIII, EUAIAct:TitleIII)
  - unacceptable_risk (EUAIAct:Article5)
- Updated ComplianceStatusEnum description: "GDPR, HIPAA, EU AI Act" → "HIPAA, 45 CFR 46"
- Updated ExportControlRegulatoryRestrictions description to reference "HIPAA and other US regulations"

**D4D_Base_import.yaml:**
- Updated Composition subset description to remove GDPR reference
- Changed from "EU's General Data Protection Regulation (GDPR)" to "applicable data protection regulations"

**D4D_Human.yaml:**
- Updated regulatory_compliance examples from "(e.g., 45 CFR 46, GDPR, HIPAA)" → "(e.g., 45 CFR 46, HIPAA)"

**Impact:** Schema now focuses exclusively on US regulations (HIPAA, 45 CFR 46) as primary compliance frameworks.

## Phase 10: Complete Frictionless & CSVW Cleanup

Per Harry's feedback ("wouldn't worry about mapping to it" for Frictionless,
"more granular than D4D needs" for CSVW):

**Prefix removals:**
- Removed `frictionless: https://specs.frictionlessdata.io/` from both:
  - D4D_Base_import.yaml (line 21)
  - data_sheets_schema.yaml (line 20)
- Removed `csvw: http://www.w3.org/ns/csvw#` from both:
  - D4D_Base_import.yaml (line 15)
  - data_sheets_schema.yaml (line 14)

**Mapping removals:**
- Removed `slot_uri: csvw:dialect` from dialect slot in D4D_Base_import.yaml

**Impact:** Fully removed overly granular CSVW and uncertain Frictionless mappings as recommended.

## Documentation Updates

**docs/standards_alignment.md:**
- Added Phase 9-11 to changelog (lines 488-499)
- Documents GDPR/EU AI Act removal rationale
- Documents Frictionless/CSVW cleanup completion

**docs/enum_ontology_grounding_report.md:**
- Added Phase 9-10 notes (lines 412-419)
- Documents AIActRiskEnum removal
- Notes prefix cleanup impact

## Validation

- ✅ make test-modules: All D4D modules validate successfully
- ✅ make test-schema: Full merged schema validates
- ✅ make gen-project: All artifacts regenerated (Python, JSON Schema, OWL, JSON-LD)
- ✅ No breaking changes (all removed fields were optional)

## Files Changed (10 schema/doc files, 3 generated artifacts)

Schema files:
- src/data_sheets_schema/schema/D4D_Data_Governance.yaml (removed 3 fields, 1 enum, updated 2 descriptions)
- src/data_sheets_schema/schema/D4D_Base_import.yaml (removed 2 prefixes, 1 mapping, updated 1 description)
- src/data_sheets_schema/schema/D4D_Human.yaml (updated 1 description)
- src/data_sheets_schema/schema/data_sheets_schema.yaml (removed 2 prefixes)

Generated artifacts:
- src/data_sheets_schema/datamodel/data_sheets_schema.py (regenerated)
- project/jsonld/data_sheets_schema.jsonld (regenerated)
- project/jsonschema/data_sheets_schema.schema.json (regenerated)
- project/owl/data_sheets_schema.owl.ttl (regenerated)

Documentation:
- docs/standards_alignment.md (added Phase 9-11 changelog)
- docs/enum_ontology_grounding_report.md (added Phase 9-10 notes)

## Summary

This completes alignment with Harry Caufield's expert recommendations:
- ✅ GDPR removed ("stay US-centric")
- ✅ EU AI Act removed ("same as for GDPR")
- ✅ Frictionless prefixes removed ("wouldn't worry about mapping to it")
- ✅ CSVW fully removed ("more granular than D4D needs")

Result: Schema reduced from 40+ to 25+ standards with US-centric regulatory focus.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Resolved conflicts by regenerating artifacts from source schemas.

Changes merged from main:
- Updated GitHub Actions workflows to use Python 3.10, 3.11, 3.12
- Fixed Makefile test-examples target to use merged schema
- Removed obsolete example output files

Conflict resolution:
- Regenerated src/data_sheets_schema/datamodel/data_sheets_schema.py
- Regenerated project/jsonld/data_sheets_schema.jsonld
- Regenerated project/owl/data_sheets_schema.owl.ttl

All artifacts now consistent with reduced ontology mappings from
more-mappings branch plus Python 3.10+ support from main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@realmarcin realmarcin merged commit ace2df5 into main Dec 3, 2025
3 checks passed
@realmarcin realmarcin deleted the more-mappings branch December 3, 2025 06:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant