| 1 | +Knowledge Graph Embedding |
| 2 | +================================ |
| 3 | + |
| 4 | +Graph Embeddings |
| 5 | +--------------------------------- |
| 6 | + |
| 7 | +Ontology alignment involves finding correspondences between entities in different ontologies. OntoAligner addresses this challenge by leveraging **Knowledge Graph Embedding (KGE)** models. The core idea of KGE is to represent the entities (such as classes, properties, and individuals) and relations of an ontology as **low-dimensional vectors** in a continuous vector space. These numerical representations (embeddings) are learned so that the semantic relationships of the original ontology are preserved geometrically in the embedding space. |
| 8 | + |
| 9 | +.. hint:: |
| 10 | + |
| 11 | + **Why KGE for Alignment?** |
| 12 | + |
| 13 | + 1) *Semantic Preservation*: KGE models aim to capture the meaning and relationships of entities in their vector representations. |
| 14 | + 2) *Scalability*: Working with numerical vectors can be more efficient for large-scale comparison than symbolic matching. |
| 15 | + 3) *Similarity Measurement*: Once entities are embedded, their semantic similarity can be easily measured (e.g., using cosine similarity). |
| 16 | + |
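The similarity measurement mentioned in the hint can be sketched in plain Python. This is a minimal illustration, not OntoAligner's implementation: the function name ``cosine_similarity`` and the three-dimensional vectors are made up for the example, and real embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for illustration only.
source_entity = [1.0, 0.0, 0.0]
target_candidates = {
    "close_match": [2.0, 0.0, 0.0],  # same direction as the source vector
    "unrelated": [0.0, 1.0, 0.0],    # orthogonal to the source vector
}

# Pick the target whose embedding points in the most similar direction.
best = max(
    target_candidates,
    key=lambda name: cosine_similarity(source_entity, target_candidates[name]),
)
print(best)
```

Cosine similarity ignores vector length, so ``close_match`` scores 1.0 even though its vector is twice as long as the source vector.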
| 17 | + |
| 18 | +OntoAligner's KGE-based alignment process involves several key components that work in sequence. The following figure shows these components as they appear within ``GraphEmbeddingsAligner``. |
| 19 | + |
| 20 | +.. raw:: html |
| 21 | + |
| 22 | + <div align="center"> |
| 23 | + <img src="https://raw.githubusercontent.com/sciknoworg/OntoAligner/refs/heads/dev/docs/source/img/kge.jpg" width="80%"/> |
| 24 | + </div> |
| 25 | + |
| 26 | + |
| 27 | +Usage |
| 28 | +------------ |
| 29 | + |
| 30 | +.. sidebar:: |
| 31 | + |
| 32 | + Full code is available in the `OntoAligner repository <https://github.com/sciknoworg/OntoAligner/blob/main/examples/kge.py>`_. |
| 33 | + |
| 34 | + |
| 35 | +This section walks through ontology alignment with KGEs and the OntoAligner library, step by step. By the end, you’ll understand how to preprocess data, encode ontologies, generate alignments, evaluate results, and save the outputs in XML and JSON formats. |
| 36 | + |
| 37 | + |
| 38 | +.. tab:: ➡️ 1: Parser |
| 39 | + |
| 40 | + The first step is to prepare the ontology data for the KGE model. The **Parser** transforms raw ontology information into a structured format suitable for KGE models. |
| 41 | + |
| 42 | + .. code-block:: python |
| 43 | + |
| 44 | + from ontoaligner.ontology import GraphTripleOMDataset |
| 45 | + |
| 46 | + task = GraphTripleOMDataset() |
| 47 | + task.ontology_name = "Mouse-Human" |
| 48 | + print("task:", task) |
| 49 | + # >>> task: Track: GraphTriple, Source-Target sets: Mouse-Human |
| 50 | + |
| 51 | + dataset = task.collect( |
| 52 | + source_ontology_path="assets/mouse-human/source.xml", |
| 53 | + target_ontology_path="assets/mouse-human/target.xml", |
| 54 | + reference_matching_path="assets/mouse-human/reference.xml" |
| 55 | + ) |
| 56 | + print("dataset key-values:", dataset.keys()) |
| 57 | + # >>> dataset key-values: dict_keys(['dataset-info', 'source', 'target', 'reference']) |
| 58 | + |
| 59 | + print("Sample source ontology:", dataset['source'][0]) |
| 60 | + |
| 61 | + This prints a sample from the source ontology with the following structure: |
| 62 | + |
| 63 | + .. code-block:: javascript |
| 64 | + |
| 65 | + [ |
| 66 | + { |
| 67 | + 'subject': ('http://mouse.owl#MA_0000143', 'tonsil'), |
| 68 | + 'predicate': ('http://www.w3.org/1999/02/22-rdf-syntax-ns#type', 'type'), |
| 69 | + 'object': ('http://www.w3.org/2002/07/owl#Class', 'Class'), |
| 70 | + 'subject_is_class': True, |
| 71 | + 'object_is_class': False |
| 72 | + }, |
| 73 | + ... |
| 74 | + ] |
| 75 | + :: |
| 76 | + |
| 77 | +.. tab:: ➡️ 2: Encoder |
| 78 | + |
| 79 | + Once the source and target ontologies are parsed, the ``GraphTripleEncoder`` creates a triplet representation of each. The triplets are in ``[(Subject Label, Predicate Label, Object Label), ... ]`` format, which is the standard input for KGE models. |
| 80 | + |
| 81 | + .. code-block:: python |
| 82 | + |
| 83 | + from ontoaligner.encoder import GraphTripleEncoder |
| 84 | + |
| 85 | + encoder = GraphTripleEncoder() |
| 86 | + encoded_dataset = encoder(**dataset) |
| 87 | + :: |
| 88 | + |
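The relationship between the parser output and the triplet format described in step 2 can be sketched as follows. This is an illustration under assumptions rather than the code of ``GraphTripleEncoder``: the helper ``to_label_triplets`` is hypothetical, and the record layout is taken from the sample shown in step 1.

```python
def to_label_triplets(records):
    """Reduce parser-style records to (subject, predicate, object) label triplets.

    Each record is assumed to hold (IRI, label) pairs under the keys
    'subject', 'predicate', and 'object', as in the step 1 sample.
    """
    triplets = []
    for record in records:
        _, subject_label = record["subject"]
        _, predicate_label = record["predicate"]
        _, object_label = record["object"]
        triplets.append((subject_label, predicate_label, object_label))
    return triplets


# One record copied from the parser sample in step 1.
sample_records = [
    {
        "subject": ("http://mouse.owl#MA_0000143", "tonsil"),
        "predicate": ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "type"),
        "object": ("http://www.w3.org/2002/07/owl#Class", "Class"),
        "subject_is_class": True,
        "object_is_class": False,
    },
]

triplets = to_label_triplets(sample_records)
print(triplets)
# [('tonsil', 'type', 'Class')]
```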
| 89 | +.. tab:: ➡️ 3: Aligner |
| 90 | + |
| 91 | + |
| 92 | + After the triplets are generated, they are fed into the KGE model, the core engine that learns low-dimensional embeddings for all entities and relations present in the triplets. Here we use ``ConvEAligner``, the KGE-based aligner in the OntoAligner library that is built on `ConvE <https://aaai.org/papers/11573-convolutional-2d-knowledge-graph-embeddings/>`_. It encapsulates the entire process, from data ingestion and embedding learning to alignment prediction. |
| 93 | + |
| 94 | + .. code-block:: python |
| 95 | + |
| 96 | + from ontoaligner.aligner import ConvEAligner |
| 97 | + |
| 98 | + kge_params = { |
| 99 | + 'device': 'cpu', # str: Device to use for training ('cpu' or 'cuda') |
| 100 | + 'embedding_dim': 300, # int: Dimensionality of learned embeddings |
| 101 | + 'num_epochs': 50, # int: Number of training epochs |
| 102 | + 'train_batch_size': 128, # int: Number of positive triplets per training batch |
| 103 | + 'eval_batch_size': 64, # int: Number of triplets per evaluation batch |
| 104 | + 'num_negs_per_pos': 5, # int: Number of negative samples per positive triplet |
| 105 | + 'random_seed': 42, # int: Seed for reproducibility |
| 106 | + } |
| 107 | + |
| 108 | + aligner = ConvEAligner(**kge_params) |
| 109 | + |
| 110 | + matchings = aligner.generate(input_data=encoded_dataset) |
| 111 | + |
| 112 | + .. note:: |
| 113 | + |
| 114 | + The ``.generate`` method first trains the KGE model and then performs the matching. |
| 115 | + |
| 116 | + :: |
| 117 | + |
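The ``num_negs_per_pos`` parameter in step 3 controls how many corrupted (negative) triplets are paired with each true triplet during training. The sketch below shows one common scheme, replacing the tail entity at random. It is an assumption-laden illustration: the helper ``corrupt_tails`` and the entity list are hypothetical, and the sampler actually used during training may corrupt heads as well or filter negatives differently.

```python
import random


def corrupt_tails(triple, entities, num_negs_per_pos, rng):
    """Build negative triplets by swapping the tail for a different entity."""
    head, relation, tail = triple
    candidates = [entity for entity in entities if entity != tail]
    return [
        (head, relation, rng.choice(candidates))
        for _ in range(num_negs_per_pos)
    ]


rng = random.Random(42)  # fixed seed, mirroring 'random_seed' in kge_params

# Hypothetical entity vocabulary, invented for illustration.
entities = ["tonsil", "Class", "lymphoid tissue", "organ"]
positive = ("tonsil", "type", "Class")

negatives = corrupt_tails(positive, entities, num_negs_per_pos=5, rng=rng)
for negative in negatives:
    print(negative)
```

The model is then trained to score the positive triplet higher than each of its negatives.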
| 118 | +.. tab:: ➡️ 4: Post-Process |
| 119 | + |
| 120 | + This step post-processes the predicted matchings: it can filter them with a similarity-score threshold and apply cardinality-based processing. The result can then be evaluated against the reference dataset to compare performance before and after post-processing. |
| 121 | + |
| 122 | + .. code-block:: python |
| 123 | + |
| 124 | + from ontoaligner.postprocess import graph_postprocessor |
| 125 | + |
| 126 | + processed_matchings = graph_postprocessor(predicts=matchings, threshold=0.5) |
| 127 | + |
| 128 | + :: |
| 129 | + |
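The two ideas behind step 4, score thresholding and cardinality handling, can be sketched in plain Python. This is not the implementation of ``graph_postprocessor``: the helpers are hypothetical, the ``source``/``target``/``score`` keys are an assumed record layout, and the IRIs are invented.

```python
def filter_by_threshold(matchings, threshold):
    """Keep only matchings whose score reaches the threshold."""
    return [m for m in matchings if m["score"] >= threshold]


def keep_best_per_source(matchings):
    """Keep, for each source entity, only its highest-scoring target."""
    best = {}
    for m in matchings:
        current = best.get(m["source"])
        if current is None or m["score"] > current["score"]:
            best[m["source"]] = m
    return list(best.values())


# Hypothetical predictions, invented for illustration.
predicted = [
    {"source": "http://example.org/source#s1", "target": "http://example.org/target#t1", "score": 0.91},
    {"source": "http://example.org/source#s1", "target": "http://example.org/target#t2", "score": 0.62},
    {"source": "http://example.org/source#s2", "target": "http://example.org/target#t3", "score": 0.35},
]

filtered = filter_by_threshold(predicted, threshold=0.5)
one_to_one = keep_best_per_source(filtered)
print(one_to_one)
```

Here the threshold removes the 0.35 prediction, and the cardinality step keeps only the stronger of the two candidates for ``s1``.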
| 130 | +.. tab:: ➡️ 5: Evaluate and Export |
| 131 | + |
| 132 | + The following code compares the generated alignments with the reference matchings, both before and after post-processing. The matchings can then be saved in XML or JSON format for further analysis or use; either export option works. |
| 133 | + |
| 134 | + .. code-block:: python |
| 135 | + |
| 136 | + from ontoaligner.utils import metrics |
| 137 | + |
| 138 | + evaluation = metrics.evaluation_report(predicts=matchings, references=dataset['reference']) |
| 139 | + print("Matching Evaluation Report:\n", evaluation) |
| 140 | + |
| 141 | + evaluation = metrics.evaluation_report(predicts=processed_matchings, references=dataset['reference']) |
| 142 | + print("Matching Evaluation Report -- after post-processing:\n", evaluation) |
| 143 | + |
| 144 | + |
| 145 | + .. tab:: 📄 <> Export matchings to XML |
| 146 | + |
| 147 | + :: |
| 148 | + |
| 149 | + from ontoaligner.utils import xmlify |
| 150 | + |
| 151 | + xml_str = xmlify.xml_alignment_generator(matchings=processed_matchings) |
| 152 | + with open("matchings.xml", "w", encoding="utf-8") as xml_file: |
| 153 | + xml_file.write(xml_str) |
| 154 | + |
| 155 | + .. tab:: 🧾 {} Export matchings to JSON |
| 156 | + |
| 157 | + :: |
| 158 | + |
| | + import json |
| | + |
| 159 | + with open("matchings.json", "w", encoding="utf-8") as json_file: |
| 160 | + json.dump(processed_matchings, json_file, indent=4, ensure_ascii=False) |
| 161 | + :: |
| 162 | + |
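Comparing predicted matchings with reference matchings, as in step 5, is usually summarized by precision, recall, and F1 over sets of (source, target) pairs. The sketch below shows that computation under assumptions: the helper ``precision_recall_f1`` is hypothetical, the pairs are invented, and the exact fields reported by ``metrics.evaluation_report`` may differ.

```python
def precision_recall_f1(predicted_pairs, reference_pairs):
    """Set-based precision, recall, and F1 for (source, target) pairs."""
    predicted = set(predicted_pairs)
    reference = set(reference_pairs)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical pairs, invented for illustration.
predicted_pairs = [("s1", "t1"), ("s2", "t2"), ("s3", "t9")]
reference_pairs = [("s1", "t1"), ("s2", "t2"), ("s4", "t4"), ("s5", "t5")]

precision, recall, f1 = precision_recall_f1(predicted_pairs, reference_pairs)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```

Two of the three predictions are correct (precision 2/3) and two of the four reference pairs are found (recall 1/2), which gives an F1 of 4/7.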
| 163 | + |
| 164 | + |
| 165 | + |
| 166 | + |
| 167 | + |
| 168 | + |
| 169 | +KGE Aligners |
| 170 | +---------------------- |
| 171 | + |
| 172 | + |
| 173 | + |
| 174 | +The ``ontoaligner.aligner.graph`` module provides a suite of graph embedding-based aligners built on top of popular KGE models. These aligners leverage link prediction objectives and low-dimensional vector spaces to learn semantic representations of entities, facilitating accurate ontology alignment even across heterogeneous structures. Each aligner wraps a specific KGE model implemented through the PyKEEN framework, allowing plug-and-play integration and consistent similarity scoring across models. Some models include custom similarity functions to better capture semantic distance in complex embedding spaces (e.g., complex numbers or quaternions). |
| 175 | + |
| 176 | +The following table lists the available KGE aligners: |
| 177 | + |
| 178 | +.. list-table:: |
| 179 | + :widths: 20 70 10 |
| 180 | + :header-rows: 1 |
| 181 | + |
| 182 | + * - Aligner Name |
| 183 | + - Description |
| 184 | + - Link |
| 185 | + |
| 186 | + * - ``ConvEAligner`` |
| 187 | + - Based on ConvE, which uses 2D convolutions over reshaped entity and relation embeddings to model complex interactions. |
| 188 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L17-L18>`_ |
| 189 | + * - ``TransDAligner`` |
| 190 | + - Based on TransD, which constructs relation-specific projection matrices dynamically from both entity and relation vectors. |
| 191 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L21-L22>`_ |
| 192 | + * - ``TransEAligner`` |
| 193 | + - Based on TransE, a translation-based model that learns embeddings where :math:`h + r \approx t`. |
| 194 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L25-L26>`_ |
| 195 | + * - ``TransFAligner`` |
| 196 | + - Based on TransF, which enables flexible translations for complex relations without increasing model complexity. |
| 197 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L29-L230>`_ |
| 198 | + * - ``TransHAligner`` |
| 199 | + - Based on TransH, which projects entities onto relation-specific hyperplanes before translation. |
| 200 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L33-L234>`_ |
| 201 | + * - ``TransRAligner`` |
| 202 | + - Based on TransR, which embeds entities and relations in separate spaces using relation-specific projections. |
| 203 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L37-L38>`_ |
| 204 | + * - ``DistMultAligner`` |
| 205 | + - Based on DistMult, a bilinear model that uses diagonal matrices for efficient relational modeling. |
| 206 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L41-L42>`_ |
| 207 | + * - ``ComplExAligner`` |
| 208 | + - Based on ComplEx, which uses complex-valued embeddings to model symmetric and antisymmetric relations; includes a custom similarity function using real parts of complex dot products. |
| 209 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L45-L49>`_ |
| 210 | + * - ``HolEAligner`` |
| 211 | + - Based on HolE, which combines compositional and holographic representations using circular correlation. |
| 212 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L51-L52>`_ |
| 213 | + * - ``RotatEAligner`` |
| 214 | + - Based on RotatE, which models relations as rotations in complex space and supports rich relational patterns; includes a similarity override. |
| 215 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L55-L60>`_ |
| 216 | + * - ``SimplEAligner`` |
| 217 | + - Based on SimplE, which learns dependent embeddings for each entity and supports fully expressive factorization. |
| 218 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L62-L63>`_ |
| 219 | + * - ``CrossEAligner`` |
| 220 | + - Based on CrossE, which learns both general and triple-specific embeddings to capture bidirectional interactions. |
| 221 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L66-L67>`_ |
| 222 | + * - ``BoxEAligner`` |
| 223 | + - Based on BoxE, which models relations as boxes in vector space to support hierarchies and logical rules. |
| 224 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L70-L71>`_ |
| 225 | + * - ``CompGCNAligner`` |
| 226 | + - Based on CompGCN, a graph convolutional network designed for multi-relational graphs using composition operations. |
| 227 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L74-L75>`_ |
| 228 | + * - ``MuREAligner`` |
| 229 | + - Based on MuRE, the Euclidean counterpart of the hyperbolic MuRP model, which applies relation-specific transformations (scaling the head entity and translating the tail entity) to entity embeddings. |
| 230 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L78-L79>`_ |
| 231 | + * - ``QuatEAligner`` |
| 232 | + - Based on QuatE, which uses quaternion embeddings and custom similarity logic to model expressive 4D rotations and relational structure. |
| 233 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L82-L133>`_ |
| 234 | + * - ``SEAligner`` |
| 235 | + - Based on SE (Structured Embedding), which projects head and tail entities with two relation-specific matrices and compares the projected vectors. |
| 236 | + - `Source <https://github.com/sciknoworg/OntoAligner/blob/main/ontoaligner/aligner/kge/models.py#L134-L135>`_ |
| 237 | + |
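To make the translation idea in the ``TransEAligner`` row concrete, the sketch below scores a triplet by how closely ``h + r`` lands on ``t``. It is a minimal illustration with hand-picked two-dimensional vectors, not code from PyKEEN or OntoAligner: the function ``transe_score`` is hypothetical, and the choice of the L2 norm is an assumption made for the example.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    Higher (closer to zero) means the triplet is more plausible.
    """
    squared = sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    return -math.sqrt(squared)


# Hypothetical embeddings chosen so that head + relation equals good_tail.
head = [1.0, 0.0]
relation = [0.0, 1.0]
good_tail = [1.0, 1.0]
bad_tail = [3.0, 1.0]

plausible = transe_score(head, relation, good_tail)
implausible = transe_score(head, relation, bad_tail)
print(plausible > implausible)
# True
```

The other aligners in the table replace this scoring function with their own (for example convolutions in ConvE or rotations in RotatE), while the training objective of ranking true triplets above corrupted ones stays broadly the same.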
| 238 | +To use a KGE-based aligner: |
| 239 | + |
| 240 | +.. code-block:: python |
| 241 | + |
| 242 | + from ontoaligner.aligner import TransEAligner |
| 243 | + |
| 244 | + aligner = TransEAligner() |
| 245 | + |
| 246 | + matchings = aligner.generate(input_data=...) |
| 247 | + |
| 248 | +If the desired model is not available in OntoAligner, subclass ``GraphEmbeddingAligner`` and set its ``model`` attribute: |
| 249 | + |
| 250 | +.. code-block:: python |
| 251 | + |
| 252 | + from ontoaligner.aligner.graph import GraphEmbeddingAligner |
| 253 | + |
| 254 | + class CustomKGEAligner(GraphEmbeddingAligner): |
| 255 | + model = "RESCAL" |
| 256 | + |
| 257 | + aligner = CustomKGEAligner() |
| 258 | + matchings = aligner.generate(input_data=...) |
| 259 | + |
| 260 | + |
| 261 | +Here, ``RESCAL`` is the name of the KGE model used by the custom aligner. |
| 262 | + |
| 263 | +.. note:: |
| 264 | + |
| 265 | + For the list of possible models, see `PyKEEN > Models <https://pykeen.readthedocs.io/en/latest/reference/models.html#classes>`_. |