
Commit b6fc1ee

Merge pull request #268 from sciknoworg/dev
add ontolearner Dublin Core metadata exporter
2 parents 15c4ed1 + ad102ed commit b6fc1ee

9 files changed: +287 -10 lines changed


.github/workflows/python-publish.yml

Lines changed: 15 additions & 0 deletions
@@ -37,3 +37,18 @@ jobs:
       - name: Publish to PyPI
         run: |
           poetry publish --no-interaction --no-ansi
+
+      # 🔹 NEW STEP: Generate metadata after publishing
+      - name: Generate Dublin Core metadata
+        run: |
+          mkdir -p metadata
+          poetry run python -c "from ontolearner import OntoLearnerMetadataExporter; OntoLearnerMetadataExporter().export('metadata/ontolearner-metadata.rdf')"
+
+      # 🔹 Commit metadata back to repo
+      - name: Commit and push metadata
+        run: |
+          git config --global user.name "github-actions[bot]"
+          git config --global user.email "github-actions[bot]@users.noreply.github.com"
+          git add metadata/
+          git commit -m ":bookmark: Update metadata after release"
+          git push origin HEAD:main

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -185,6 +185,7 @@ or GitHub repository:
    ontologizer/ontology_modularization
    ontologizer/ontology_hosting
    ontologizer/new_ontologies
+   ontologizer/metadata
 
 .. toctree::
    :maxdepth: 1
Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
+Metadata
+=============================
+
+.. note::
+
+    OntoLearner metadata is generated automatically on GitHub under the `metadata/ <https://github.com/sciknoworg/OntoLearner/tree/main/metadata>`_ directory. For ``ontolearner > 1.3.1``, it is also available for download with each release on the `Releases <https://github.com/sciknoworg/OntoLearner/releases>`_ page.
+
+.. hint::
+
+    The metadata release is fully automated through CI/CD, so the file is regenerated with each PyPI release.
+
+.. sidebar:: OntoLearner Metadata Exporter Features
+
+    - Generates `Dublin Core metadata <https://www.dublincore.org/specifications/dublin-core/dces/>`_ for each ontology in the library
+    - Creates a top-level ``Collection`` resource for OntoLearner
+    - Supports RDF/XML serialization in a clean, human-readable format
+    - Uses a custom ``ontologizer`` namespace for ontology-specific resources
+
+
+The ``OntoLearnerMetadataExporter`` is a utility class for generating **Dublin Core (DCMI) metadata** for all ontologies benchmarked in the OntoLearner library. It collects essential metadata for each ontology: title and description, creator/authors, license, format, version, last-updated date, domain and category, and download URL. It also generates a **top-level collection resource** that describes the entire OntoLearner benchmarking suite. The output is a **pretty-printed RDF/XML file** compatible with standard semantic web tools and parsers.
+
+
+**Example RDF structure:**
+
+.. code-block:: xml
+
+    <rdf:RDF
+        xmlns:dc="http://purl.org/dc/elements/1.1/"
+        xmlns:dcterms="http://purl.org/dc/terms/"
+        xmlns:ontologizer="https://ontolearner.readthedocs.io/ontologizer/ontology_modularization.html#"
+        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
+
+      <!-- Top-level collection -->
+      <ontologizer:Collection rdf:about="https://ontolearner.readthedocs.io/benchmarking/">
+        <dc:title>OntoLearner Benchmark Ontologies</dc:title>
+        <dc:description>This Dublin Core metadata collection describes ontologies benchmarked in OntoLearner. It includes information such as title, creator, format, license, and version.</dc:description>
+        <dc:creator>OntoLearner Team</dc:creator>
+        <dcterms:license>MIT License</dcterms:license>
+        <dcterms:hasVersion>1.4.0</dcterms:hasVersion>
+      </ontologizer:Collection>
+
+      <!-- Individual ontology metadata -->
+      <ontologizer:Ontology rdf:about="https://ontolearner.readthedocs.io/benchmarking/medicine/ncit.html">
+        <dc:identifier>NCIt</dc:identifier>
+        <dcterms:title>NCI Thesaurus (NCIt)</dcterms:title>
+        <dcterms:description>NCI Thesaurus (NCIt) is a reference terminology that includes broad coverage of the cancer domain...</dcterms:description>
+        <dcterms:format>OWL</dcterms:format>
+        <dcterms:date>2023-10-19</dcterms:date>
+        <dcterms:license>Creative Commons 4.0</dcterms:license>
+        <dcterms:source>https://terminology.tib.eu/ts/ontologies/NCIT</dcterms:source>
+        <dcterms:subject>Medicine</dcterms:subject>
+        <dcterms:subject>Cancer, Oncology</dcterms:subject>
+        <dcterms:hasVersion>24.04e</dcterms:hasVersion>
+      </ontologizer:Ontology>
+
+    </rdf:RDF>
+
+
+Properties
+-------------------------------------
+The following table summarizes the key **Dublin Core metadata properties** captured for each ontology in OntoLearner. It gives a quick overview of each ontology’s identifier, title, description, authorship, format, license, domain, and version, so that users can understand and reference the ontologies consistently.
+
+.. list-table:: **OntoLearner Metadata Properties**
+    :header-rows: 0
+    :widths: 40 40 40
+
+    * - **Property**
+      - **Example**
+      - **Description**
+    * - ``dc:identifier``
+      - NCIt
+      - Ontology ID
+    * - ``dcterms:title``
+      - NCI Thesaurus (NCIt)
+      - Ontology full name
+    * - ``dcterms:description``
+      - See above example RDF structure
+      - Detailed ontology description
+    * - ``dcterms:creator``
+      - NCI
+      - Creator / author
+    * - ``dcterms:format``
+      - OWL
+      - Ontology format
+    * - ``dcterms:date``
+      - 2023-10-19
+      - Last updated
+    * - ``dcterms:license``
+      - Creative Commons 4.0
+      - License information
+    * - ``dcterms:source``
+      - URL
+      - Download or reference URL
+    * - ``dcterms:subject``
+      - Medicine
+      - Domain or category
+    * - ``dcterms:hasVersion``
+      - 24.04e
+      - Ontology version
+
+The following shows the benchmark collection entry. Its ``dcterms:hasVersion`` value is the OntoLearner library version with which the metadata was released.
+
+.. code-block:: xml
+
+    <ontologizer:Collection rdf:about="https://ontolearner.readthedocs.io/benchmarking/">
+      <dc:title>OntoLearner Benchmark Ontologies</dc:title>
+      <dc:description>This Dublin Core metadata collection describes ontologies benchmarked in OntoLearner. It includes information such as title, creator, format, license, and version.</dc:description>
+      <dc:creator>OntoLearner Team</dc:creator>
+      <dcterms:license>MIT License</dcterms:license>
+      <dcterms:hasVersion>1.4.0</dcterms:hasVersion>
+    </ontologizer:Collection>
+
+Exporter
+--------------------
+
+``OntoLearnerMetadataExporter`` is included in the OntoLearner library, so you can generate the metadata file and store it locally.
+
+.. code-block:: python
+
+    from ontolearner import OntoLearnerMetadataExporter
+
+    # Initialize exporter
+    exporter = OntoLearnerMetadataExporter()
+
+    # Export metadata to RDF/XML
+    exporter.export("ontolearner-metadata.rdf")
+
+The above code outputs:
+
+- **File:** ``ontolearner-metadata.rdf``
+- **Format:** Pretty-printed RDF/XML
+- **Content:** metadata for each ontology
+
+The top-level collection describes the entire OntoLearner benchmark, while each ontology entry includes detailed metadata using Dublin Core and DCTERMS properties.
+
+.. hint::
+
+    **Namespace Bindings:** The exporter uses the following namespaces in the RDF output:
+
+    - ``dc``: ``http://purl.org/dc/elements/1.1/``
+    - ``dcterms``: ``http://purl.org/dc/terms/``
+    - ``ontologizer``: ``https://ontolearner.readthedocs.io/ontologizer/ontology_modularization.html#``
+    - ``rdf``: ``http://www.w3.org/1999/02/22-rdf-syntax-ns#``
+
+.. note::
+
+    - The **Collection resource** always appears first in the RDF/XML output.
+    - Individual ontologies are serialized as ``ontologizer:Ontology`` resources.
+    - The ``export()`` method automatically reads the OntoLearner library version from the ``VERSION`` file.
+    - The RDF/XML output is compatible with standard semantic web tools like **Protégé**, **RDFLib**, and **Apache Jena**.
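
Reading the exported file back is a quick way to check the structure documented above. The following is a minimal sketch using only Python's standard library. It assumes the output uses the typed-node layout of the documented example (``ontologizer:Ontology`` elements directly under ``rdf:RDF``); the ``SAMPLE`` string and the ``list_ontologies`` helper are illustrative and not part of OntoLearner.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the documented namespace bindings.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ontologizer": "https://ontolearner.readthedocs.io/ontologizer/ontology_modularization.html#",
}

# Illustrative sample (assumption): a trimmed copy of the documented example.
SAMPLE = """<rdf:RDF
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:ontologizer="https://ontolearner.readthedocs.io/ontologizer/ontology_modularization.html#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <ontologizer:Collection rdf:about="https://ontolearner.readthedocs.io/benchmarking/">
    <dc:title>OntoLearner Benchmark Ontologies</dc:title>
  </ontologizer:Collection>
  <ontologizer:Ontology rdf:about="https://ontolearner.readthedocs.io/benchmarking/medicine/ncit.html">
    <dc:identifier>NCIt</dc:identifier>
    <dcterms:title>NCI Thesaurus (NCIt)</dcterms:title>
    <dcterms:subject>Medicine</dcterms:subject>
    <dcterms:subject>Cancer, Oncology</dcterms:subject>
  </ontologizer:Ontology>
</rdf:RDF>
"""


def list_ontologies(xml_text):
    """Return one dict per ontologizer:Ontology element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    about_attr = f"{{{NS['rdf']}}}about"  # ElementTree spells rdf:about as {uri}about
    results = []
    for node in root.findall("ontologizer:Ontology", NS):
        results.append({
            "uri": node.get(about_attr),
            "identifier": node.findtext("dc:identifier", namespaces=NS),
            "title": node.findtext("dcterms:title", namespaces=NS),
            # dcterms:subject appears twice: once for domain, once for category.
            "subjects": [s.text for s in node.findall("dcterms:subject", NS)],
        })
    return results


ontologies = list_ontologies(SAMPLE)
for entry in ontologies:
    print(entry["identifier"], entry["subjects"])
```

For a real export, the same helper could be called on the text of the generated file. An RDF-aware parser such as RDFLib would be the more robust choice if the serialization layout ever changes.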

metadata/metadata-exporter.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+
+from ontolearner import OntoLearnerMetadataExporter
+
+exporter = OntoLearnerMetadataExporter()
+
+exporter.export("metadata.rdf")

ontolearner/VERSION

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+1.3.1

ontolearner/__init__.py

Lines changed: 5 additions & 2 deletions
@@ -11,8 +11,10 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+from pathlib import Path
 
-__version__ = "1.3.1"
+# Load version from VERSION file
+__version__ = (Path(__file__).parent / "VERSION").read_text().strip()
 
 import logging
 from ontolearner import (ontology,
@@ -22,7 +24,7 @@
                          tools,
                          data_structure)
 from .ontology import * # noqa
-from ._ontology import AutoOntology
+from ._ontology import AutoOntology, OntoLearnerMetadataExporter
 from .learner import (AutoLLMLearner,
                       AutoRetrieverLearner,
                       AutoRAGLearner,
@@ -38,6 +40,7 @@
 __all__ = [
     "AutoLLMLearner",
     "AutoOntology",
+    "OntoLearnerMetadataExporter",
     "AutoRetrieverLearner",
     "AutoRAGLearner",
     "StandardizedPrompting",
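
The change above makes the ``VERSION`` file the single source for ``__version__``. A minimal sketch of that pattern follows, run against a temporary directory that stands in for the package directory; the ``read_version`` helper is hypothetical and only mirrors the one-line expression in the diff.

```python
import tempfile
from pathlib import Path


def read_version(package_dir):
    """Return the version string stored in <package_dir>/VERSION (hypothetical helper)."""
    # strip() drops the trailing newline that most editors append to the file.
    return (Path(package_dir) / "VERSION").read_text(encoding="utf-8").strip()


# A throwaway directory stands in for the installed package directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "VERSION").write_text("1.3.1\n", encoding="utf-8")
    version = read_version(tmp)

print(version)
```

One consequence of this design is that the file has to be shipped with the package: if ``VERSION`` is missing at import time, ``read_text()`` raises ``FileNotFoundError``.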

ontolearner/_ontology.py

Lines changed: 98 additions & 2 deletions
@@ -13,6 +13,11 @@
 # limitations under the License.
 
 import inspect
+import os
+import xml.etree.ElementTree as ET
+from rdflib import Graph, URIRef, Literal, Namespace, RDF
+from xml.dom import minidom
+
 import ontolearner.ontology as ontology_module
 from .base import BaseOntology
 
@@ -38,12 +43,12 @@ class AutoOntology:
     Example:
         >>> auto_onto = AutoOntology("AgrO")
         >>> print(type(auto_onto))
-        <class 'ontolearner.ontology.AgrO'>
+        >>> <class 'ontolearner.ontology.AgrO'>
 
     If no class matches "unknownontology":
         >>> auto_onto = AutoOntology("unknownontology")
         >>> print(type(auto_onto))
-        <class 'ontolearner.base.BaseOntology'>
+        >>> <class 'ontolearner.base.BaseOntology'>
     """
     def __new__(self, ontology_id) -> BaseOntology:
         for name, obj in inspect.getmembers(ontology_module):
@@ -53,3 +58,94 @@ def __new__(self, ontology_id) -> BaseOntology:
                 if str(obj).split("'")[-2].split(".")[-1].lower() == ontology_id.lower():
                     return instance
         return BaseOntology()
+
+
+
+class OntoLearnerMetadataExporter:
+    """Generates Dublin Core metadata for ontology classes."""
+    def __init__(self):
+        self.format: str = "pretty-xml"
+
+    def get_url(self, domain, ontology_id):
+        return f"https://ontolearner.readthedocs.io/benchmarking/{domain.lower().replace(' ', '_')}/{ontology_id.lower()}.html"
+
+    def export(self, path: str = "DCMI-Metadata.rdf"):
+        DC = Namespace("http://purl.org/dc/elements/1.1/")
+        DCTERMS = Namespace("http://purl.org/dc/terms/")
+        ONTOLOGIZER = Namespace("https://ontolearner.readthedocs.io/ontologizer/ontology_modularization.html#")
+
+        g_head = Graph()
+        g_head.bind("dc", DC)
+        g_head.bind("dcterms", DCTERMS)
+        g_head.bind("ontologizer", ONTOLOGIZER)
+
+        collection_uri = URIRef("https://ontolearner.readthedocs.io/benchmarking/")
+        g_head.add((collection_uri, RDF.type, ONTOLOGIZER.Collection))
+        g_head.add((collection_uri, DC.title, Literal("OntoLearner Benchmark Ontologies")))
+        g_head.add((collection_uri, DC.description, Literal(
+            "This Dublin Core metadata collection describes ontologies benchmarked in OntoLearner. "
+            "It includes information such as title, creator, format, license, and version."
+        )))
+        g_head.add((collection_uri, DC.creator, Literal("OntoLearner Team")))
+        g_head.add((collection_uri, DCTERMS.license, Literal("MIT License")))
+        g_head.add((collection_uri, DCTERMS.hasVersion,
+                    Literal(open(os.path.join(os.path.dirname(__file__), 'VERSION')).read().strip())))
+
+        g_body = Graph()
+        g_body.bind("dc", DC)
+        g_body.bind("dcterms", DCTERMS)
+        g_body.bind("ontologizer", ONTOLOGIZER)
+
+        for name, obj in inspect.getmembers(ontology_module):
+            if inspect.isclass(obj) and name != "BaseOntology":
+                if hasattr(obj, 'load') and callable(getattr(obj, 'load')) and hasattr(obj, 'ontology_id'):
+                    onto = obj()
+                    uri = URIRef(self.get_url(onto.domain, onto.ontology_id))
+                    g_body.add((uri, RDF.type, ONTOLOGIZER.Ontology))
+                    g_body.add((uri, DC.identifier, Literal(onto.ontology_id)))
+                    g_body.add((uri, DCTERMS.title, Literal(onto.ontology_full_name)))
+                    g_body.add((uri, DCTERMS.description, Literal(onto.__doc__.replace("\n", " "))))
+                    if onto.creator:
+                        g_body.add((uri, DCTERMS.creator, Literal(onto.creator)))
+                    if onto.format:
+                        g_body.add((uri, DCTERMS['format'], Literal(onto.format)))
+                    if onto.last_updated:
+                        g_body.add((uri, DCTERMS.date, Literal(onto.last_updated)))
+                    if onto.license:
+                        g_body.add((uri, DCTERMS.license, Literal(onto.license)))
+                    if onto.download_url:
+                        g_body.add((uri, DCTERMS.source, Literal(onto.download_url)))
+                    if onto.domain:
+                        g_body.add((uri, DCTERMS.subject, Literal(onto.domain)))
+                    if onto.category:
+                        g_body.add((uri, DCTERMS.subject, Literal(onto.category)))
+                    if onto.version:
+                        g_body.add((uri, DCTERMS.hasVersion, Literal(onto.version)))
+
+        head_xml = g_head.serialize(format=self.format)
+        body_xml = g_body.serialize(format=self.format)
+        nsmap = {
+            "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
+            "dc": "http://purl.org/dc/elements/1.1/",
+            "dcterms": "http://purl.org/dc/terms/",
+            "ontologizer": str(ONTOLOGIZER),
+        }
+        for p, u in nsmap.items():
+            ET.register_namespace(p, u)
+
+        head_root = ET.fromstring(head_xml)
+        body_root = ET.fromstring(body_xml)
+
+        rdf_tag = f'{{{nsmap["rdf"]}}}RDF'
+        merged_root = ET.Element(rdf_tag)
+        for child in list(head_root):
+            merged_root.append(child)
+        for child in list(body_root):
+            merged_root.append(child)
+
+        rough_str = ET.tostring(merged_root, encoding="utf-8")
+        reparsed = minidom.parseString(rough_str)
+        pretty_str = reparsed.toprettyxml(indent=" ", encoding="utf-8").decode("utf-8")
+        pretty_str = "\n".join([line for line in pretty_str.splitlines() if line.strip()])
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(pretty_str)
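
The ``export()`` method serializes the collection and the ontologies as two separate graphs and then joins the two XML documents, presumably so that the collection entry is guaranteed to come first in the output. The joining step can be sketched in isolation with the standard library alone. In this sketch the ``wrap`` and ``merge_rdf_xml`` helpers and the ``example.org`` URIs are illustrative assumptions, not part of OntoLearner.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"


def wrap(about, title):
    """Build a tiny RDF/XML document describing one resource (sample data)."""
    return (
        f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">\n'
        f'  <rdf:Description rdf:about="{about}">\n'
        f'    <dc:title>{title}</dc:title>\n'
        f'  </rdf:Description>\n'
        f'</rdf:RDF>'
    )


def merge_rdf_xml(head_xml, body_xml):
    """Concatenate the children of two rdf:RDF documents, head first."""
    # Registering prefixes keeps rdf:/dc: in the output instead of ns0:/ns1:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    merged = ET.Element(f"{{{RDF_NS}}}RDF")
    for document in (head_xml, body_xml):
        for child in list(ET.fromstring(document)):
            merged.append(child)
    rough = ET.tostring(merged, encoding="utf-8")
    pretty = minidom.parseString(rough).toprettyxml(indent="  ", encoding="utf-8")
    # toprettyxml leaves whitespace-only lines where the inputs had indentation.
    return "\n".join(
        line for line in pretty.decode("utf-8").splitlines() if line.strip()
    )


merged = merge_rdf_xml(
    wrap("https://example.org/collection", "Collection title"),
    wrap("https://example.org/ontology", "Ontology title"),
)
print(merged)
```

Because the head document's children are appended before the body's, their relative order in the merged file is fixed regardless of how each graph was serialized.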

pyproject.toml

Lines changed: 9 additions & 5 deletions
@@ -1,8 +1,6 @@
 [tool.poetry]
 name = "OntoLearner"
-
-version = "1.3.1"
-
+version = "0.0.0" # placeholder, will be replaced automatically
 description = "OntoLearner: A Modular Python Library for Ontology Learning with LLMs."
 authors = ["Hamed Babaei Giglou <hamedbabaeigiglou@gmail.com>", "Andrei C. Aioanei <andrei.c.aioanei@gmail.com>"]
 license = "MIT License"
@@ -40,6 +38,12 @@ wheel = "*"
 twine = "*"
 pytest = "*"
 
+[tool.poetry-dynamic-versioning]
+enable = true
+vcs = "git"
+style = "semver"
+pattern = "tag"
+
 [build-system]
-requires = ["poetry-core>=1.0.0"]
-build-backend = "poetry.core.masonry.api"
+requires = ["poetry-core>=1.0.0", "poetry-dynamic-versioning>=1.4.0"]
+build-backend = "poetry_dynamic_versioning.backend"

setup.py

Lines changed: 2 additions & 1 deletion
@@ -1,11 +1,12 @@
 from setuptools import setup, find_packages
+import os
 
 with open("README.md", encoding="utf-8") as f:
     long_description = f.read()
 
 setup(
     name="OntoLearner",
-    version="1.3.1",
+    version=open(os.path.join(os.path.dirname(__file__), 'ontolearner/VERSION')).read().strip(),
     author="Hamed Babaei Giglou, Andrei C. Aioanei",
     author_email="hamedbabaeigiglou@gmail.com, andrei.c.aioanei@gmail.com",
     description="OntoLearner: A Modular Python Library for Ontology Learning with LLMs.",
