Rdf ingestion mvp #15741
base: master
Linear: ING-1308
✅ Meticulous spotted 0 visual differences across 976 screens tested. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit bfdc088.
Bundle Report
Changes will increase total bundle size by 3.34kB (0.01%) ⬆️. This is within the configured threshold ✅
Affected bundle: datahub-react-web-esm
83de506 to 4ba2ba2
- Add RDF ingestion source for glossary terms, domains, and relationships
- Streamlined architecture: extractors return DataHub AST directly
- Removed unnecessary abstraction layers (RDF AST, converters where not needed)
- Support for SKOS, OWL, and other RDF vocabularies
- Comprehensive test coverage with 128 passing tests
- UI integration for RDF source configuration

- Remove `build_relationship_mcps()` method from `GlossaryTermMCPBuilder`
- Update tests to use `RelationshipMCPBuilder` directly
- Clean separation: glossary_term handles terms, relationship handles relationships
- Refactor `_generate_workunits_from_ast` to reduce complexity

…ance
- Enhance error handling in `RDFSource` to provide actionable messages for missing files, malformed RDF, and invalid formats.
- Implement unit tests to verify error handling behavior and ensure graceful degradation.
- Update glossary term URN generation to use dot notation for hierarchical paths.
- Improve logging for large file processing and ensure consistent URN formats across glossary nodes and terms.
- Refactor methods to yield MCPs for memory efficiency during processing.

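The dot-notation URN change in this commit can be pictured with a small standalone sketch. It is not the PR's `urn_generator.py`: the helper names are hypothetical, the host part of the IRI is ignored for simplicity, and literal dots inside a segment are percent-encoded so they cannot be mistaken for hierarchy separators. Only the `urn:li:glossaryTerm:` and `urn:li:glossaryNode:` prefixes are taken from DataHub itself. The generator at the end mirrors the "yield for memory efficiency" idea from the same commit.

```python
from typing import Iterable, Iterator
from urllib.parse import quote, urlparse


def iri_to_segments(iri: str) -> list[str]:
    """Split an IRI into path and fragment segments (hypothetical helper).

    Assumption: the host is dropped, so only the path hierarchy survives.
    """
    parsed = urlparse(iri)
    segments = [part for part in parsed.path.split("/") if part]
    if parsed.fragment:
        segments.append(parsed.fragment)
    return segments


def _encode(segment: str) -> str:
    # Percent-encode reserved characters, then dots, so "." only ever
    # appears in the URN as a hierarchy separator.
    return quote(segment, safe="").replace(".", "%2E")


def glossary_term_urn(iri: str) -> str:
    """Build a dot-separated glossary term URN from an IRI."""
    return "urn:li:glossaryTerm:" + ".".join(_encode(s) for s in iri_to_segments(iri))


def glossary_node_urn(iri: str) -> str:
    """Build the parent node URN: same scheme, without the final term segment."""
    return "urn:li:glossaryNode:" + ".".join(
        _encode(s) for s in iri_to_segments(iri)[:-1]
    )


def iter_term_urns(iris: Iterable[str]) -> Iterator[str]:
    """Yield URNs one at a time instead of building a full list in memory."""
    for iri in iris:
        yield glossary_term_urn(iri)
```

Because node and term URNs share one encoding routine, a term's URN always begins with its parent node's path, which is one way to keep the two formats consistent.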
…uidance
- Added helper text for various RDF source fields including source, format, extensions, recursive processing, environment, and dialect.
- These enhancements aim to provide clearer instructions and examples for users configuring RDF ingestion settings.

- Introduced new RDF platform entry in `capability_summary.json` with detailed capabilities including deletion detection, tags, ownership, lineage, data profiling, domains, descriptions, and platform instance support.
- Each capability includes a description and support status to enhance clarity for users configuring RDF ingestion.

- Changed the warning filter to ignore specific SQLAlchemy warnings.
- Added a new dependency for RDF support in the setup configuration.

…ters
- Introduced a new documentation file for the RDF ingestion source.
- Updated type hints across various classes to use `Optional` for context parameters, enhancing code clarity and type safety.
- Adjusted method signatures in `EntityExtractor`, `EntityConverter`, `EntityMCPBuilder`, and related classes to reflect these changes.

- Included "rdf" in both base development and full test development requirements in setup.py to ensure proper support for RDF ingestion.
This reverts commit 48a0118.
- Added unit tests for duplicate term definition handling, ensuring correct extraction behavior for same URIs and properties.
- Implemented comprehensive validation tests for RDF source configuration, covering required fields, type checks, and value constraints.
- Introduced connection testing unit tests to verify functionality and error handling for various scenarios, including file existence and RDF format validation.
- Developed edge case tests to handle scenarios like empty files, circular relationships, and special characters in paths.
- Enhanced error handling tests to ensure actionable feedback for file not found, invalid format, and unsupported extensions.

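As a rough illustration of the three error cases these tests target (file not found, invalid format, unsupported extension), the sketch below returns an actionable message for each. The function name, the extension table, and the message wording are assumptions made for this example and are not taken from `RDFSource`. The format names are the parser names rdflib uses.

```python
from pathlib import Path
from typing import Optional

# Assumed extension-to-format table; format names follow rdflib's parsers.
EXTENSION_FORMATS = {
    ".ttl": "turtle",
    ".rdf": "xml",
    ".owl": "xml",
    ".jsonld": "json-ld",
    ".n3": "n3",
    ".nt": "nt",
}
SUPPORTED_FORMATS = sorted(set(EXTENSION_FORMATS.values()))


def check_rdf_file(path: str, rdf_format: Optional[str] = None) -> Optional[str]:
    """Return an actionable error message, or None if the input looks loadable."""
    if rdf_format is not None and rdf_format not in SUPPORTED_FORMATS:
        options = ", ".join(SUPPORTED_FORMATS)
        return f"Invalid format '{rdf_format}'. Use one of: {options}."

    file_path = Path(path)
    if not file_path.exists():
        return (
            f"File not found: '{path}'. "
            "Check the 'source' value in the recipe and the working directory."
        )

    # Only guess from the extension when no explicit format was given.
    if (
        rdf_format is None
        and file_path.is_file()
        and file_path.suffix.lower() not in EXTENSION_FORMATS
    ):
        options = ", ".join(sorted(EXTENSION_FORMATS))
        return (
            f"Unsupported extension '{file_path.suffix}'. "
            f"Set 'format' explicitly or use one of: {options}."
        )
    return None
```

Returning a message rather than raising keeps the check easy to reuse from both a connection test and the ingestion run; a real source would more likely record it in its report.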
- Updated `test_duplicate_handling.py` to use `URIRef` and `Literal` for RDF terms, enhancing readability and maintainability.
- Modified `test_rdf_config.py` to specify type for `config_dict`, improving type safety.
- Adjusted `test_rdf_source_errors.py` to refine error assertion logic for better clarity in failure messages.

…enhance glossary term extraction
- Added `extract_custom_properties` method to `FIBODialect` for extracting FIBO-specific properties from URIs.
- Updated `GlossaryTermExtractor` to utilize dialect-specific logic for identifying glossary terms and extracting custom properties.
- Refactored glossary term extraction to delegate type checking and property extraction to the dialect, improving modularity and maintainability.
- Enhanced RDF source configuration to ensure dialect instance is always provided for consistent behavior across dialects.

…iguration
- Introduced `include_provisional` option in `RDFSourceConfig` to control the inclusion of provisional/work-in-progress terms.
- Updated `FIBODialect` to filter out provisional terms based on the new configuration.
- Enhanced `DialectRouter` to pass the `include_provisional` setting to dialect instances.
- Modified `RDFSource` to ensure the `include_provisional` setting is correctly utilized during dialect initialization.

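The effect of `include_provisional` can be sketched without any RDF machinery. Everything below other than the option name is hypothetical: the `Term` and `DialectSettings` classes, the maturity labels, the default of `False`, and the choice to keep terms that carry no maturity level are all assumptions for illustration, not the behavior of `FIBODialect`.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Term:
    """Hypothetical stand-in for an extracted glossary term."""

    uri: str
    maturity: Optional[str] = None  # e.g. "Release", "Provisional", or None


@dataclass
class DialectSettings:
    """Hypothetical settings object handed to a dialect."""

    include_provisional: bool = False  # assumed default, not confirmed by the PR


def filter_terms(terms: Iterable[Term], settings: DialectSettings) -> Iterator[Term]:
    """Drop provisional terms unless the setting opts in.

    Assumption: terms without a maturity level are always kept.
    """
    for term in terms:
        if term.maturity == "Provisional" and not settings.include_provisional:
            continue
        yield term


terms = [
    Term("ex:Released", "Release"),
    Term("ex:Draft", "Provisional"),
    Term("ex:Unlabelled"),
]
kept = [t.uri for t in filter_terms(terms, DialectSettings())]
```

With the assumed default, `kept` holds the released and unlabelled terms only; passing `DialectSettings(include_provisional=True)` keeps all three.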
…ndividual support
- Introduced integration tests for FIBO dialect to validate inclusion and exclusion of provisional terms based on configuration.
- Added unit tests for FIBO dialect to ensure proper filtering of glossary terms by maturity level.
- Implemented unit tests for NamedIndividual extraction, verifying support across different dialects and ensuring correct handling of SKOS properties.
- Enhanced RDF source configuration tests to validate the `include_provisional` setting and its default behavior.

…visional filtering
- Introduced a new RDF file `example_fibo_maturity.ttl` containing various maturity levels for terms, including released, provisional, and no maturity.
- Created a corresponding YAML configuration file `test_example_recipe.yml` to test the filtering of provisional terms based on the `include_provisional` setting.

4ba2ba2 to 0238c98
…e warning filters
- Modified the `_get_dialect` method in `GlossaryTermExtractor` to provide a default `GenericDialect` when no dialect is specified, enhancing flexibility for testing and usage.
- Updated warning filters in `setup.cfg` to ignore specific SQLAlchemy warnings, improving log clarity.
- Adjusted unit tests to reflect changes in dialect handling, ensuring accurate behavior when no dialect is provided.

….Class and RDFS.Class
- Updated the logic in `GenericDialect` to exclude ontology construct types while allowing OWL.Class and RDFS.Class to coexist with SKOS.Concept, enhancing compatibility with RDF standards.

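One way to read this rule is as a pair of set checks over a resource's `rdf:type` values. The sketch below is an interpretation, not the `GenericDialect` code: the function name is hypothetical, the list of "ontology construct types" is an assumed selection, and treating `rdfs:Class` as tolerated but not sufficient on its own is also an assumption. The IRIs themselves are the standard SKOS, OWL, RDFS, and RDF namespace terms.

```python
SKOS_CONCEPT = "http://www.w3.org/2004/02/skos/core#Concept"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"

# Assumed set of ontology construct types that disqualify a resource.
ONTOLOGY_CONSTRUCTS = frozenset(
    {
        "http://www.w3.org/2002/07/owl#Ontology",
        "http://www.w3.org/2002/07/owl#ObjectProperty",
        "http://www.w3.org/2002/07/owl#DatatypeProperty",
        "http://www.w3.org/2002/07/owl#AnnotationProperty",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
    }
)

# Types that mark a resource as a glossary term in this sketch.
TERM_MARKERS = frozenset({SKOS_CONCEPT, OWL_CLASS})


def looks_like_glossary_term(rdf_types: set[str]) -> bool:
    """Decide from a resource's rdf:type values whether it is a glossary term.

    rdfs:Class may appear alongside a marker type without disqualifying the
    resource, but it is not treated as a marker by itself (assumption).
    """
    if rdf_types & ONTOLOGY_CONSTRUCTS:
        return False
    return bool(rdf_types & TERM_MARKERS)
```

A resource typed as `skos:Concept`, `owl:Class`, and `rdfs:Class` at once passes the check, while one typed as both `owl:Class` and `owl:Ontology` does not.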
- Updated the rdflib dependency in setup.py to specify a version range of >=6.0.0,<7.0.0, ensuring compatibility with existing RDF handling features.
- Updated the rdflib dependency in setup.py to specify an exact version of 6.3.2, ensuring compatibility with existing RDF handling features and preventing potential issues with future releases.
…er, and URN generator
- Introduced new unit tests for various edge cases in RDF dialects (Generic, FIBO, Default), including handling of empty graphs, missing labels, and special characters.
- Added tests for the RDF loader to cover format validation, file handling, URL loading, and zip file scenarios.
- Implemented edge case tests for the URN generator, focusing on IRI parsing and platform normalization.
- Enhanced overall test coverage to ensure robustness and reliability of RDF processing components.

… requests_file
- Modified the RDF plugin dependencies in `setup.py` to add specific versions of requests (2.32.5) and requests_file (3.0.1) alongside rdflib (6.3.2), ensuring compatibility and stability for RDF processing.

…n in documentation generation
- Added checks and warnings for missing platforms and plugins when processing README and documentation files, ensuring robustness in the documentation generation process.
- Improved logging to provide clearer feedback when encountering issues with platform or plugin names during the generation of custom documentation.

Summary
This PR introduces a new RDF ingestion source for DataHub, enabling ingestion of RDF/OWL ontologies (Turtle, RDF/XML, JSON-LD, N3, N-Triples) with a focus on business glossaries. The source extracts glossary terms, term hierarchies, and relationships from RDF files using standard vocabularies like SKOS, OWL, and RDFS.
What's New
Core Features
- … `type: rdf`) - Native DataHub plugin for RDF/OWL ontologies
- … `skos:Concept` and `owl:Class` to DataHub GlossaryTerms
- … `skos:broader` and `skos:narrower` relationships as `isRelatedTerms`
- … `stateful_ingestion` config
- … `platform_instance` config

Architecture
- … `test_connection()` for connection validation

Capabilities
The source supports the following DataHub capabilities:
- … `skos:broader` and `skos:narrower`
- … `stateful_ingestion.enabled: true`
- … `platform_instance` config
- … `skos:definition` or `rdfs:comment`)

Testing
Test Coverage
- … `export_only`, `skip_export`)

Test Files
- `tests/unit/rdf/` - Unit tests for individual components
- `tests/integration/rdf/` - Integration tests with golden file validation

Documentation
User Documentation
- `docs/sources/rdf/rdf.md` - Comprehensive user guide (489 lines)

Recipe Examples
- `docs/sources/rdf/rdf_recipe.yml` - Example recipes for basic and stateful ingestion

Integration Test Documentation
- `tests/integration/rdf/README.md` - Detailed guide for running integration tests

Configuration Example
source:
  type: rdf
  config:
    source: ./glossary.ttl
    format: turtle
    environment: PROD
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
    export_only:
      - glossary

Files Changed
Technical Notes
Security & Performance
Code Quality
New Files
- `src/datahub/ingestion/source/rdf/ingestion/rdf_source.py` - Main source implementation
- `src/datahub/ingestion/source/rdf/core/rdf_loader.py` - RDF loading utilities with security
- `src/datahub/ingestion/source/rdf/core/urn_generator.py` - URN generation with encoding
- `src/datahub/ingestion/source/rdf/entities/base.py` - Base interfaces for entity processing
- `src/datahub/ingestion/source/rdf/entities/registry.py` - Thread-safe entity registry
- `docs/sources/rdf/rdf.md` - User documentation
- `docs/sources/rdf/rdf_recipe.yml` - Recipe examples
- `tests/integration/rdf/test_rdf_source.py` - Integration tests
- `tests/unit/rdf/` - Unit tests (multiple files)

Modified Files
- `setup.py` - Added RDF source to entry points (line 862)

Breaking Changes
None - This is a new feature addition with no breaking changes to existing functionality.
Support Status
The RDF source is marked as INCUBATING (`SupportStatus.INCUBATING`), indicating it's ready for community adoption but may have minor version changes in future releases based on feedback.

Checklist
- … `setup.py`
- … `@platform_name`, `@config_class`, `@support_status`)
- `test_connection()` implemented