diff --git a/CHANGELOG.md b/CHANGELOG.md
index dcc138f..9b6d2fe 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,10 @@
 ## Changelog

+### v1.4.10 (December 8, 2025)
+- add complexity score
+- add documentation for metrics
+- bug fixes in Ontologizer
+
 ### v1.4.9 (December 8, 2025)
 - add retriever collection
 - add documentation for retrievers
diff --git a/CITATION.cff b/CITATION.cff
index da0e687..38ec1b0 100644
--- a/CITATION.cff
+++ b/CITATION.cff
@@ -31,5 +31,5 @@ keywords:
 - Large Language Models
 - Text-to-ontology
 license: MIT
-version: 1.4.9
+version: 1.4.10
 date-released: '2025'
diff --git a/docs/source/index.rst b/docs/source/index.rst
index b659d23..15d6520 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -186,6 +186,7 @@ or GitHub repository:
    ontologizer/ontology_hosting
    ontologizer/new_ontologies
    ontologizer/metadata
+   ontologizer/metrics

 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/ontologizer/metrics.rst b/docs/source/ontologizer/metrics.rst
new file mode 100644
index 0000000..0b7baf5
--- /dev/null
+++ b/docs/source/ontologizer/metrics.rst
@@ -0,0 +1,110 @@
+Metrics
+=================
+
+.. sidebar:: Metric Space
+
+    There is a dedicated Hugging Face space for `OntoLearner Benchmark Metrics `_ with analyses and live plots.
+
+The ``Analyzer`` class in OntoLearner provides a unified interface for computing **ontology metrics**, which fall into two main categories: **Topology Metrics**, which capture the structural characteristics of the ontology graph, and **Dataset Metrics**, which assess the quality and distribution of the extracted learning datasets. Additionally, a **complexity score** can be derived from these metrics to summarize overall ontology richness and complexity.
+
+Topology Metrics
+----------------
+
+Topology metrics describe the structure and organization of an ontology. The ``Analyzer`` computes the following key metrics:
+
+- **Total nodes** (``total_nodes``): Total number of nodes in the ontology graph.
+- **Total edges** (``total_edges``): Total number of edges representing relations between nodes.
+- **Root nodes** (``num_root_nodes``): Nodes with no incoming edges, representing top-level concepts.
+- **Leaf nodes** (``num_leaf_nodes``): Nodes with no outgoing edges, representing bottom-level concepts.
+- **Classes** (``num_classes``): Number of distinct ontology classes.
+- **Properties** (``num_properties``): Number of distinct properties (object or datatype properties).
+- **Individuals** (``num_individuals``): Number of instances associated with classes.
+- **Depth metrics**:
+
+  - ``max_depth``: Maximum hierarchical depth in the ontology.
+  - ``min_depth``: Minimum hierarchical depth.
+  - ``avg_depth``: Average hierarchical depth across all nodes.
+  - ``depth_variance``: Variance of the depth distribution.
+
+- **Breadth metrics**:
+
+  - ``max_breadth``: Maximum number of nodes at any single hierarchy level.
+  - ``min_breadth``: Minimum number of nodes at any hierarchy level.
+  - ``avg_breadth``: Average number of nodes per hierarchy level.
+  - ``breadth_variance``: Variance of the breadth distribution.
+
+Dataset Metrics
+---------------
+
+Dataset metrics evaluate the characteristics of machine-learning datasets extracted from the ontology. These metrics include:
+
+- **Number of term-type mappings** (``num_term_types``): Number of terms associated with types.
+- **Number of taxonomic (is-a) relations** (``num_taxonomic_relations``): Count of hierarchical relations.
+- **Number of non-taxonomic relations** (``num_non_taxonomic_relations``): Count of semantic associations outside the hierarchy.
+- **Average terms per type** (``avg_terms``): Measures dataset balance across classes.
+
+Complexity Score
+----------------
+
+The **complexity score** combines topology and dataset metrics into a single normalized score in ``[0, 1]``. First, metrics are **log-normalized** and weighted by category:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 50 25
+
+   * - Metric Category
+     - Example Metrics
+     - Weight
+   * - Graph structure
+     - ``total_nodes``, ``total_edges``, ``num_root_nodes``, ``num_leaf_nodes``
+     - 0.3
+   * - Knowledge coverage
+     - ``num_classes``, ``num_properties``, ``num_individuals``
+     - 0.25
+   * - Hierarchy
+     - ``max_depth``, ``min_depth``, ``avg_depth``, ``depth_variance``
+     - 0.10
+   * - Breadth
+     - ``max_breadth``, ``min_breadth``, ``avg_breadth``, ``breadth_variance``
+     - 0.20
+   * - Dataset (LLMs4OL)
+     - ``num_term_types``, ``num_taxonomic_relations``, ``num_non_taxonomic_relations``, ``avg_terms``
+     - 0.15
+
+Next, the weighted values are averaged and passed through a **logistic function** to map the final complexity score into ``[0, 1]`` (the exact formula is given at the end of this page).
+
+Example Usage
+-------------
+
+Here is a simple example demonstrating how to compute metrics and the complexity score for an ontology:
+
+.. code-block:: python
+
+    from ontolearner.tools import Analyzer
+    from ontolearner.ontology import Wine
+
+    # Step 1 — Load ontology
+    ontology = Wine()
+    ontology.build_graph()
+
+    # Step 2 — Create the analyzer
+    analyzer = Analyzer()
+
+    # Step 3 — Compute topology and dataset metrics
+    topology_metrics = analyzer.compute_topology_metrics(ontology)
+    dataset_metrics = analyzer.compute_dataset_metrics(ontology)
+
+    # Step 4 — Compute overall complexity score
+    complexity_score = analyzer.compute_complexity_score(
+        topology_metrics=topology_metrics,
+        dataset_metrics=dataset_metrics
+    )
+
+    # Step 5 — Display results
+    print("Topology Metrics:", topology_metrics)
+    print("Dataset Metrics:", dataset_metrics)
+    print("Ontology Complexity Score:", complexity_score)
+
+This workflow allows ontology engineers and researchers to **quantify structural quality, dataset richness, and overall complexity**, providing actionable insights for ontology evaluation, benchmarking, and improvement.
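+
+For reference, here is a compact restatement of the computation in ``Analyzer.compute_complexity_score`` with its default parameters ``a = 0.4`` and ``b = 6.0`` (omitting the small ``eps`` guard):
+
+.. math::
+
+   S = \frac{\sum_{m} w_m \, \ln(1 + x_m)}{\sum_{m} w_m},
+   \qquad
+   \text{score} = \frac{1}{1 + e^{-a (S - b)}}
+
+where :math:`x_m` is the raw value of metric :math:`m` and :math:`w_m` is its category weight from the table above.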
diff --git a/examples/complexity_score.py b/examples/complexity_score.py
new file mode 100644
index 0000000..a6ed05f
--- /dev/null
+++ b/examples/complexity_score.py
@@ -0,0 +1,24 @@
+from ontolearner.tools import Analyzer
+from ontolearner.ontology import Wine
+
+# Step 1 — Load ontology
+ontology = Wine()
+ontology.build_graph()
+
+# Step 2 — Create the analyzer
+analyzer = Analyzer()
+
+# Step 3 — Compute topology and dataset metrics
+topology_metrics = analyzer.compute_topology_metrics(ontology)
+dataset_metrics = analyzer.compute_dataset_metrics(ontology)
+
+# Step 4 — Compute overall complexity score
+complexity_score = analyzer.compute_complexity_score(
+    topology_metrics=topology_metrics,
+    dataset_metrics=dataset_metrics
+)
+
+# Step 5 — Display results
+print("Topology Metrics:", topology_metrics)
+print("Dataset Metrics:", dataset_metrics)
+print("Ontology Complexity Score:", complexity_score)
diff --git a/ontolearner/VERSION b/ontolearner/VERSION
index 4ea2b1f..ac9f79c 100644
--- a/ontolearner/VERSION
+++ b/ontolearner/VERSION
@@ -1 +1 @@
-1.4.9
+1.4.10
diff --git a/ontolearner/base/ontology.py b/ontolearner/base/ontology.py
index 1429828..77a0662 100644
--- a/ontolearner/base/ontology.py
+++ b/ontolearner/base/ontology.py
@@ -372,7 +372,7 @@ def _update_metrics_space(self, metrics_file_path: Path, metrics: OntologyMetric
         # Save updated metrics
         df.to_excel(metrics_file_path, index=False)

-    def is_valid_label(label: str) -> Any:
+    def is_valid_label(self, label: str) -> Any:
         invalids = ['root', 'thing']
         if label.lower() in invalids:
             return None
@@ -522,7 +522,7 @@ def check_if_class(self, entity):
                 return True
         return False

-    def _is_anonymous_id(label: str) -> bool:
+    def _is_anonymous_id(self, label: str) -> bool:
         """Check if a label represents an anonymous class identifier."""
         if not label:
             return True
diff --git a/ontolearner/tools/analyzer.py b/ontolearner/tools/analyzer.py
index 4a88f74..f49cb18 100644
--- a/ontolearner/tools/analyzer.py
+++ b/ontolearner/tools/analyzer.py
@@ -14,6 +14,7 @@
 import logging
 import time
+import numpy as np
 from abc import ABC
 from rdflib import RDF, RDFS, OWL
 from collections import defaultdict

@@ -186,6 +187,56 @@ def compute_topology_metrics(ontology: BaseOntology) -> TopologyMetrics:

         return metrics

+    @staticmethod
+    def compute_complexity_score(
+        topology_metrics: TopologyMetrics,
+        dataset_metrics: DatasetMetrics,
+        a: float = 0.4,
+        b: float = 6.0,
+        eps: float = 1e-12
+    ) -> float:
+        """
+        Compute a single normalized complexity score for an ontology.
+
+        This function combines structural topology metrics and dataset quality metrics
+        into a weighted aggregate score, then applies a logistic transformation to
+        normalize it to the range [0, 1]. The score reflects overall ontology complexity,
+        considering graph structure, hierarchy, breadth, coverage, and dataset richness.
+
+        Args:
+            topology_metrics (TopologyMetrics): Precomputed structural metrics of the ontology graph.
+            dataset_metrics (DatasetMetrics): Precomputed metrics of the extracted learning datasets.
+            a (float, optional): Steepness parameter for the logistic normalization function. Default is 0.4.
+            b (float, optional): Centering parameter for the logistic function; should be tuned to match the scale of the aggregated metrics. Default is 6.0.
+            eps (float, optional): Small epsilon to prevent numerical issues in the logistic computation. Default is 1e-12.
+
+        Returns:
+            float: Normalized complexity score in [0, 1], where higher values indicate more complex ontologies.
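+
+        Example:
+            Illustrative sketch; assumes ``ontology`` has been loaded and its graph
+            built as in ``examples/complexity_score.py``::
+
+                analyzer = Analyzer()
+                topology = analyzer.compute_topology_metrics(ontology)
+                dataset = analyzer.compute_dataset_metrics(ontology)
+                score = analyzer.compute_complexity_score(topology, dataset)
+                assert 0.0 <= score <= 1.0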
+
+        Notes:
+            - Weights are assigned to different metric categories: graph metrics, coverage metrics, hierarchy metrics,
+              breadth metrics, and dataset metrics (term types, taxonomic, and non-taxonomic relations).
+            - Metrics are log-normalized before weighting to reduce scale differences.
+            - The logistic transformation ensures the final score is bounded and interpretable.
+        """
+        # Define metric categories with their weights (weights sum to 1.0)
+        metric_categories = {
+            0.3: ["total_nodes", "total_edges", "num_root_nodes", "num_leaf_nodes"],
+            0.25: ["num_classes", "num_properties", "num_individuals"],
+            0.10: ["max_depth", "min_depth", "avg_depth", "depth_variance"],
+            0.20: ["max_breadth", "min_breadth", "avg_breadth", "breadth_variance"],
+            0.15: ["num_term_types", "num_taxonomic_relations", "num_non_taxonomic_relations", "avg_terms"]
+        }
+        # Flatten categories into per-metric weights and an ordered list of metric names
+        weights = {metric: weight for weight, metrics in metric_categories.items() for metric in metrics}
+        metrics = [metric for _, metric_list in metric_categories.items() for metric in metric_list]
+        # Merge topology and dataset metrics into a single name -> value mapping
+        onto_metrics = {**topology_metrics.__dict__, **dataset_metrics.__dict__}
+        # Log-normalize each available metric and apply its category weight
+        norm_weighted_values = [np.log1p(onto_metrics[m]) * weights[m] for m in metrics if m in onto_metrics]
+        total_weight = sum(weights[m] for m in metrics if m in onto_metrics)
+        weighted_sum = sum(norm_weighted_values) / total_weight if total_weight > 0 else 0.0
+        # Logistic transformation bounds the weighted average in (0, 1)
+        complexity_score = float(1.0 / (1.0 + np.exp(-a * (weighted_sum - b) + eps)))
+        return complexity_score
+
+
     @staticmethod
     def compute_dataset_metrics(ontology: BaseOntology) -> DatasetMetrics:
         """