Optimize ontology grounding: reduce from 40+ to 25+ standards #97
Merged
- Added XSD and schema.org mappings to VariableTypeEnum (13 data types):
  - Exact mappings to XSD types: integer, float, double, string, boolean, date, datetime
  - Broad mappings to schema.org: Integer, Float, Number, Text, Boolean, Date, DateTime
  - Categorical types mapped to schema.org: Text, ItemList, StructuredValue
  - Added xsd: prefix (http://www.w3.org/2001/XMLSchema#)
- Added DataCite and Dublin Core mappings to DatasetRelationshipTypeEnum (14 types):
  - Exact mappings to dcterms: isVersionOf, replaces, isReplacedBy, isRequiredBy, requires, isPartOf, hasPart, isReferencedBy, references
  - Broad mappings for: derives_from (dcterms:source), supplements, is_identical_to
  - Added prov:wasDerivedFrom mapping for derives_from
  - Aligned with DataCite Metadata Schema 4.6 RelationType controlled vocabulary
- Regenerated schema artifacts (merged YAML, Python model, JSON Schema, OWL, JSON-LD)
Completes HIGH and MEDIUM priority ontology grounding recommendations from the enumeration grounding report. All priority enumerations are now mapped to standard ontologies: AIO for bias, DUO for data use, ISO/NIST/TLP for confidentiality, XSD/schema.org for variable types, DataCite/dcterms for dataset relationships.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Added PROV-O and schema.org mappings to CreatorOrMaintainerEnum (9 agent types):
  - Person/Agent types: researcher, industry → prov:Person, schema:Person
  - Organization types: academic_institution → schema:EducationalOrganization
  - Government agencies → schema:GovernmentOrganization
  - Commercial entities → schema:Corporation
  - Non-profits → schema:Organization
  - Generic agents: data_subject, third_party, crowdsourced → prov:Agent
- Added EU AI Act official references to AIActRiskEnum (4 risk categories):
  - minimal_risk → EUAIAct:Article50 (general transparency)
  - limited_risk → EUAIAct:Article50, TitleIV (transparency obligations)
  - high_risk → EUAIAct:Article6, AnnexIII, TitleIII (strict requirements)
  - unacceptable_risk → EUAIAct:Article5 (prohibited practices)
  - Updated regulation reference to Regulation (EU) 2024/1689
  - Added EUR-Lex official publication link
- Enhanced ComplianceStatusEnum with dcterms mappings:
  - compliant, partially_compliant → dcterms:conformsTo
  - Improved descriptions for all 5 status values
  - Clarified workflow nature and evolution over time
- Regenerated schema artifacts (merged YAML, Python model, JSON Schema, OWL, JSON-LD)
Completes ALL enumeration ontology grounding. All 14 enumerations now have appropriate ontology mappings or a justification for domain-specific values. Schema fully grounded in established standards: AIO, DUO, ISO/NIST/TLP, XSD, schema.org, DataCite, dcterms, PROV-O, EU AI Act.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Documents 40+ standards, ontologies, and frameworks
- Organized by 9 categories: ontologies, metadata, web standards, measurement, regulatory, security, research, versioning, linked data
- Includes detailed coverage of:
  - W3C standards (DCAT, PROV-O, SKOS, CSVW, XSD, SHACL)
  - Dublin Core and DataCite
  - Ontologies (AIO, DUO, PROV-O, SKOS, FOAF)
  - Regulatory frameworks (GDPR, HIPAA, EU AI Act, 45 CFR 46)
  - Security standards (ISO 27001, NIST SP 800-60, TLP)
  - Research infrastructure (CRediT, Bridge2AI)
- Standards coverage matrix by module
- Mapping statistics and interoperability benefits
- Compliance and regulatory coverage analysis
This commit implements comprehensive ontology grounding optimization based on
expert review feedback (Harry Caufield). Reduces schema dependencies from 40+
to 25+ standards while maintaining semantic clarity and improving
maintainability.
## Changes by Phase
### Phase 1: Remove unused/minimal standards
- Removed VoID (Vocabulary of Interlinked Datasets) - minimal usage
- Removed W3C Formats Registry - redundant with IANA Media Types
- Removed FOAF (Friend of a Friend) - replaced with schema.org
- Removed BIBO (Bibliographic Ontology) - minimal usage
- Removed OSLC (Open Services for Lifecycle Collaboration) - single usage
- Removed Frictionless Data Standards - unclear benefit
- Removed XSD from VariableTypeEnum (kept schema.org mappings)
### Phase 2: Replace PAV with Dublin Core (9 usages)
- version → dcterms:hasVersion
- previousVersion → dcterms:isVersionOf
- createdOn → dcterms:created
- createdBy → dcterms:creator
- lastUpdateOn → dcterms:modified
Files: D4D_Maintenance.yaml
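The Phase 2 substitutions listed above can be read as a lookup table. The following is a minimal sketch, not the code used in this PR: the five pairs are copied from the list, the `pav:` prefix on the left-hand terms is an assumption, and `PAV_TO_DCTERMS`, `remap_slot_uri`, and the sample slot dictionary are hypothetical names introduced for illustration.

```python
# Hypothetical helper: rewrite PAV slot URIs to their Dublin Core replacements.
# The pairs come from the Phase 2 list; the "pav:" prefix is an assumption.
PAV_TO_DCTERMS = {
    "pav:version": "dcterms:hasVersion",
    "pav:previousVersion": "dcterms:isVersionOf",
    "pav:createdOn": "dcterms:created",
    "pav:createdBy": "dcterms:creator",
    "pav:lastUpdateOn": "dcterms:modified",
}


def remap_slot_uri(slot: dict) -> dict:
    """Return a copy of a slot definition with any PAV slot_uri replaced."""
    updated = dict(slot)
    uri = updated.get("slot_uri")
    if uri in PAV_TO_DCTERMS:
        updated["slot_uri"] = PAV_TO_DCTERMS[uri]
    return updated


if __name__ == "__main__":
    # Illustrative slot definition, not taken from D4D_Maintenance.yaml.
    example = {"description": "Date of creation", "slot_uri": "pav:createdOn"}
    print(remap_slot_uri(example)["slot_uri"])
```

Slots whose `slot_uri` is not a PAV term pass through unchanged, so the helper could be applied to every slot in a module without filtering first.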
### Phase 3: Remove CSVW from Variables module (6 usages)
- Removed: Column, name, datatype, primaryKey, null, dialect mappings
- Replaced with schema.org equivalents where needed
Files: D4D_Variables.yaml
### Phase 4: Reduce PROV-O to essential use (21→1 usages)
- Kept: was_derived_from slot (prov:wasDerivedFrom)
- Removed: 9 Agent/Person/Organization mappings from CreatorOrMaintainerEnum
- Removed: Redundant derivation mappings from modules
Files: D4D_Base_import.yaml, D4D_Collection.yaml, D4D_Composition.yaml,
D4D_Preprocessing.yaml, D4D_Variables.yaml
### Phase 5-6: Remove security classification mappings
- Removed ISO 27001 mappings from ConfidentialityLevelEnum
- Removed NIST SP 800-60 mappings from ConfidentialityLevelEnum
- Removed Traffic Light Protocol (TLP) mappings from ConfidentialityLevelEnum
- Rationale: Security classification is domain-specific
Files: D4D_Data_Governance.yaml
### Phase 7: Update standards documentation
- Updated docs/standards_alignment.md with comprehensive changelog
- Updated docs/enum_ontology_grounding_report.md with optimization details
- Documented 40+→25+ standard reduction
- Added rationale for all removals
### Phase 8: Validation and testing
- Fixed missing AIO prefix in main schema (data_sheets_schema.yaml)
- All module tests pass (make test-modules)
- All schema tests pass (make test-schema)
- All Python tests pass (make test-python)
- Project artifacts regenerated (make gen-project)
## Files Modified
Schema modules:
- src/data_sheets_schema/schema/data_sheets_schema.yaml (added AIO prefix)
- src/data_sheets_schema/schema/D4D_Base_import.yaml (PROV-O reduction)
- src/data_sheets_schema/schema/D4D_Collection.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Composition.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Data_Governance.yaml (removed security mappings)
- src/data_sheets_schema/schema/D4D_Maintenance.yaml (PAV→Dublin Core)
- src/data_sheets_schema/schema/D4D_Preprocessing.yaml (removed PROV-O)
- src/data_sheets_schema/schema/D4D_Variables.yaml (removed CSVW, PROV-O)
Documentation:
- docs/standards_alignment.md (updated with optimization changelog)
- docs/enum_ontology_grounding_report.md (updated with changes)
Generated artifacts:
- src/data_sheets_schema/datamodel/data_sheets_schema.py
- project/jsonld/data_sheets_schema.jsonld
- project/jsonschema/data_sheets_schema.schema.json
- project/owl/data_sheets_schema.owl.ttl
## Rationale
- Eliminate redundancy (PAV/Dublin Core, FOAF/schema.org overlap)
- Remove unclear/unused ontologies (VoID, BIBO, OSLC)
- Simplify maintenance (fewer external dependencies)
- Focus on high-impact standards (DUO, schema.org, Dublin Core, DCAT)
- Improve clarity (avoid prescriptive security mappings)
## Testing
✓ All module schemas validate
✓ Full merged schema validates
✓ Python unit tests pass
✓ Example data validation passes
✓ Generated artifacts complete
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
…/CSVW cleanup
## Phase 9: Remove GDPR and EU AI Act (US-Centric Focus)
Based on Harry Caufield's recommendation to "stay US-centric", removed all
EU regulatory framework references:
**D4D_Data_Governance.yaml:**
- Removed `gdpr_compliant` field from ExportControlRegulatoryRestrictions
- Removed `eu_ai_act_risk_category` field from ExportControlRegulatoryRestrictions
- Removed entire `AIActRiskEnum` enum (42 lines):
- minimal_risk (EUAIAct:Article50)
- limited_risk (EUAIAct:Article50, EUAIAct:TitleIV)
- high_risk (EUAIAct:Article6, EUAIAct:AnnexIII, EUAIAct:TitleIII)
- unacceptable_risk (EUAIAct:Article5)
- Updated ComplianceStatusEnum description: "GDPR, HIPAA, EU AI Act" → "HIPAA, 45 CFR 46"
- Updated ExportControlRegulatoryRestrictions description to reference "HIPAA and other US regulations"
**D4D_Base_import.yaml:**
- Updated Composition subset description to remove GDPR reference
- Changed from "EU's General Data Protection Regulation (GDPR)" to "applicable data protection regulations"
**D4D_Human.yaml:**
- Updated regulatory_compliance examples from "(e.g., 45 CFR 46, GDPR, HIPAA)" → "(e.g., 45 CFR 46, HIPAA)"
**Impact:** Schema now focuses exclusively on US regulations (HIPAA, 45 CFR 46) as primary compliance frameworks.
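If existing records happen to populate the two fields removed in Phase 9, one way to bring them in line with the updated schema is to drop those keys before validation. This is a rough sketch under that assumption: the field names come from the list above, while `REMOVED_FIELDS`, `drop_removed_fields`, and the sample record are hypothetical and not part of this PR.

```python
# Hypothetical migration helper for records written against the pre-Phase 9 schema.
# Field names are taken from the Phase 9 notes; everything else is illustrative.
REMOVED_FIELDS = ("gdpr_compliant", "eu_ai_act_risk_category")


def drop_removed_fields(record: dict) -> dict:
    """Return a copy of a record without the fields removed in Phase 9."""
    return {key: value for key, value in record.items() if key not in REMOVED_FIELDS}


if __name__ == "__main__":
    # Illustrative record; "description" is an assumed field name.
    legacy = {
        "description": "Subject to HIPAA",
        "gdpr_compliant": True,
        "eu_ai_act_risk_category": "limited_risk",
    }
    print(drop_removed_fields(legacy))
```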
## Phase 10: Complete Frictionless & CSVW Cleanup
Per Harry's feedback ("wouldn't worry about mapping to it" for Frictionless,
"more granular than D4D needs" for CSVW):
**Prefix removals:**
- Removed `frictionless: https://specs.frictionlessdata.io/` from both:
- D4D_Base_import.yaml (line 21)
- data_sheets_schema.yaml (line 20)
- Removed `csvw: http://www.w3.org/ns/csvw#` from both:
- D4D_Base_import.yaml (line 15)
- data_sheets_schema.yaml (line 14)
**Mapping removals:**
- Removed `slot_uri: csvw:dialect` from dialect slot in D4D_Base_import.yaml
**Impact:** Fully removed overly granular CSVW and uncertain Frictionless mappings as recommended.
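A cleanup like this can be double-checked by scanning a loaded schema for leftover uses of the removed prefixes. The sketch below is one possible check, assuming the schema has already been parsed into plain Python dictionaries and lists; `REMOVED_PREFIXES`, `find_removed_prefix_uses`, and the sample schema are hypothetical and not part of the repository.

```python
# Hypothetical check: report prefix declarations and CURIEs that still use
# a prefix removed in Phase 10. Operates on an already-parsed schema (dicts/lists).
REMOVED_PREFIXES = {"csvw", "frictionless"}


def find_removed_prefix_uses(node, path=""):
    """Collect (path, value) pairs that still reference a removed prefix."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "prefixes" and isinstance(value, dict):
                # A prefix declaration counts as a hit if its name was removed.
                hits += [
                    (f"{child}.{name}", uri)
                    for name, uri in value.items()
                    if name in REMOVED_PREFIXES
                ]
            else:
                hits += find_removed_prefix_uses(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits += find_removed_prefix_uses(item, f"{path}[{index}]")
    elif isinstance(node, str):
        # Treat "prefix:local" strings as CURIEs; full URLs have prefix "http"/"https".
        prefix, sep, _ = node.partition(":")
        if sep and prefix in REMOVED_PREFIXES:
            hits.append((path, node))
    return hits


if __name__ == "__main__":
    # Illustrative pre-cleanup fragment, not the actual schema file contents.
    schema = {
        "prefixes": {"csvw": "http://www.w3.org/ns/csvw#", "schema": "http://schema.org/"},
        "slots": {"dialect": {"slot_uri": "csvw:dialect"}},
    }
    for location, value in find_removed_prefix_uses(schema):
        print(location, value)
```

An empty result for both `D4D_Base_import.yaml` and `data_sheets_schema.yaml` would be consistent with the removals described above.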
## Documentation Updates
**docs/standards_alignment.md:**
- Added Phase 9-11 to changelog (lines 488-499)
- Documents GDPR/EU AI Act removal rationale
- Documents Frictionless/CSVW cleanup completion
**docs/enum_ontology_grounding_report.md:**
- Added Phase 9-10 notes (lines 412-419)
- Documents AIActRiskEnum removal
- Notes prefix cleanup impact
## Validation
- ✅ make test-modules: All D4D modules validate successfully
- ✅ make test-schema: Full merged schema validates
- ✅ make gen-project: All artifacts regenerated (Python, JSON Schema, OWL, JSON-LD)
- ✅ No breaking changes (all removed fields were optional)
## Files Changed (6 schema/doc files, 4 generated artifacts)
Schema files:
- src/data_sheets_schema/schema/D4D_Data_Governance.yaml (removed 3 fields, 1 enum, updated 2 descriptions)
- src/data_sheets_schema/schema/D4D_Base_import.yaml (removed 2 prefixes, 1 mapping, updated 1 description)
- src/data_sheets_schema/schema/D4D_Human.yaml (updated 1 description)
- src/data_sheets_schema/schema/data_sheets_schema.yaml (removed 2 prefixes)
Generated artifacts:
- src/data_sheets_schema/datamodel/data_sheets_schema.py (regenerated)
- project/jsonld/data_sheets_schema.jsonld (regenerated)
- project/jsonschema/data_sheets_schema.schema.json (regenerated)
- project/owl/data_sheets_schema.owl.ttl (regenerated)
Documentation:
- docs/standards_alignment.md (added Phase 9-11 changelog)
- docs/enum_ontology_grounding_report.md (added Phase 9-10 notes)
## Summary
This completes alignment with Harry Caufield's expert recommendations:
- ✅ GDPR removed ("stay US-centric")
- ✅ EU AI Act removed ("same as for GDPR")
- ✅ Frictionless prefixes removed ("wouldn't worry about mapping to it")
- ✅ CSVW fully removed ("more granular than D4D needs")
Result: Schema reduced from 40+ to 25+ standards with US-centric regulatory focus.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Resolved conflicts by regenerating artifacts from source schemas.
Changes merged from main:
- Updated GitHub Actions workflows to use Python 3.10, 3.11, 3.12
- Fixed Makefile test-examples target to use merged schema
- Removed obsolete example output files
Conflict resolution:
- Regenerated src/data_sheets_schema/datamodel/data_sheets_schema.py
- Regenerated project/jsonld/data_sheets_schema.jsonld
- Regenerated project/owl/data_sheets_schema.owl.ttl
All artifacts are now consistent with the reduced ontology mappings from the more-mappings branch plus Python 3.10+ support from main.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Summary
Comprehensive optimization of the D4D schema's ontology grounding based on expert review feedback from Harry Caufield. Reduced from 40+ to 25+ standards by removing redundant, unused, and overly granular ontology mappings while maintaining essential semantic interoperability.
Changes Overview
Phase 1-3: Remove Unused Standards (Commit 4df82e6)
Removed 7 standards with minimal/unclear benefit:
Replaced PAV with Dublin Core:
Removed CSVW from Variables module:
Phase 4: Reduce PROV-O to Essential Use (Commit 4df82e6)
Phase 5-6: Remove Security Classification Mappings (Commit 4df82e6)
Removed from ConfidentialityLevelEnum:
Rationale: Security classification is domain-specific; prescriptive mappings may not fit all use cases
Phase 9: Remove GDPR and EU AI Act - US-Centric Focus (Commit 4fc1f85)
Per Harry's feedback: "stay US-centric"
Removed from D4D_Data_Governance.yaml:
- gdpr_compliant field from ExportControlRegulatoryRestrictions
- eu_ai_act_risk_category field from ExportControlRegulatoryRestrictions
- AIActRiskEnum (42 lines):
Updated descriptions:
Phase 10: Complete Frictionless & CSVW Cleanup (Commit 4fc1f85)
Per Harry's feedback: "wouldn't worry about mapping to it" (Frictionless), "more granular than D4D needs" (CSVW)
Prefix removals:
- frictionless: https://specs.frictionlessdata.io/ (from D4D_Base_import.yaml and data_sheets_schema.yaml)
- csvw: http://www.w3.org/ns/csvw# (from D4D_Base_import.yaml and data_sheets_schema.yaml)
Mapping removals:
- slot_uri: csvw:dialect from dialect slot in D4D_Base_import.yaml
Harry Caufield's Feedback Alignment
Result: 100% alignment with expert recommendations (10/10 items)
Impact Analysis
Standards Count
Code Changes
Commit 4df82e6 (Phases 1-8):
Commit 4fc1f85 (Phases 9-10):
Total:
Retained Essential Standards
Ontologies (5):
Metadata Standards (6):
Regulatory Frameworks (US-Centric) (2):
Attribution Standards (1):
Technical Standards (3):
Files Changed
Schema Files (7)
- src/data_sheets_schema/schema/D4D_Data_Governance.yaml
- src/data_sheets_schema/schema/D4D_Base_import.yaml
- src/data_sheets_schema/schema/D4D_Human.yaml
- src/data_sheets_schema/schema/data_sheets_schema.yaml
- src/data_sheets_schema/schema/D4D_Maintenance.yaml
- src/data_sheets_schema/schema/D4D_Variables.yaml
- src/data_sheets_schema/schema/D4D_Composition.yaml
Generated Artifacts (4)
- src/data_sheets_schema/datamodel/data_sheets_schema.py
- project/jsonld/data_sheets_schema.jsonld
- project/jsonschema/data_sheets_schema.schema.json
- project/owl/data_sheets_schema.owl.ttl
Documentation (2)
- docs/standards_alignment.md (comprehensive changelog added)
- docs/enum_ontology_grounding_report.md (phase documentation)
Validation
All tests pass successfully:
No breaking changes - all removed fields were optional.
Benefits
Documentation
Comprehensive documentation added in commit 4df82e6:
- docs/standards_alignment.md - Full standards inventory and rationale
- docs/enum_ontology_grounding_report.md - Enumeration ontology analysis
Related Issues
Implements recommendations from expert review discussion regarding ontology grounding optimization and standards alignment strategy.
🤖 Generated with Claude Code