
Commit 672f39d

add kge-retrieval & MacOS install issue #67

2 parents f6c1570 + dc7dfe2

File tree: 8 files changed, +225 −12 lines

build.sh

Lines changed: 0 additions & 2 deletions
This file was deleted.

docs/source/aligner/kge.rst

Lines changed: 76 additions & 0 deletions
@@ -4,6 +4,11 @@ Knowledge Graph Embedding
 Graph Embeddings
 ---------------------------------
 
+.. sidebar:: **Reference:**
+
+    `OntoAligner Meets Knowledge Graph Embedding Aligners <https://arxiv.org/abs/2509.26417>`_
+
+
 Ontology alignment involves finding correspondences between entities in different ontologies. OntoAligner addresses this challenge by leveraging **Knowledge Graph Embedding (KGE)** models. The core idea of KGE is to represent entities (like classes, properties, individuals) and relations within an ontology as **low-dimensional vectors** in a continuous vector space. These numerical representations (embeddings) are learned to preserve semantic relationships from the original ontology geometrically in the embedding space.
 
 .. hint::
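
As an aside to the paragraph above: the geometric comparison amounts to placing source and target entities in the same vector space and scoring pairs with a similarity function. A minimal illustrative sketch (toy tensors and sizes, not part of this commit) using the same dot-product similarity the aligner's code relies on:

    import torch

    torch.manual_seed(0)
    source_embeddings = torch.randn(3, 8)   # 3 source entities, 8-dim embeddings (toy sizes)
    target_embeddings = torch.randn(4, 8)   # 4 target entities

    # Entry [i, j] scores source entity i against target entity j; shape (3, 4)
    similarity_matrix = torch.matmul(source_embeddings, target_embeddings.T)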
@@ -263,3 +268,74 @@ Here ``RESCAL`` is our custom KGE model.
 .. note::
 
     For possible models please take a look at `PyKEEN > Models <https://pykeen.readthedocs.io/en/latest/reference/models.html#classes>`_.
+
+KGE Retriever
+----------------------
+
+.. sidebar:: Key Parameters:
+
+    - ``retriever``: boolean
+    - ``top_k``: integer
+
+In addition to one-to-one alignments, OntoAligner also supports retriever-based alignment. When retriever mode is enabled (``retriever=True``), the aligner returns the top-k candidate target entities for each source entity, along with their similarity scores (similar to the retriever aligner). This mode is useful if you want to build downstream candidate-filtering pipelines, apply human-in-the-loop validation, or integrate with reranking modules (e.g., LLMs or supervised classifiers).
+
+Here is an example of how to use a KGE aligner as a retriever model:
+
+.. code-block:: python
+
+    from ontoaligner.aligner import TransEAligner
+
+    # Enable retriever mode and request top-3 candidates per source entity
+    aligner = TransEAligner(retriever=True, top_k=3)
+
+    matchings = aligner.generate(input_data=encoded_dataset)
+
+.. list-table::
+    :widths: 20 80
+    :header-rows: 1
+
+    * - Mode
+      - Description
+
+    * - **KGE Default mode**
+      - The default for KGE aligners is ``retriever=False``, which produces **one-to-one** alignments: each source entity is matched to the single most similar target entity.
+    * - **KGE Retriever mode**
+      - Setting ``retriever=True`` produces **one-to-many** alignments: each source entity is matched to multiple target entities. Example output:
+
+
+.. tab:: ➡️ KGE Retriever Mode Example output
+
+    ::
+
+        [
+            {
+                "source": "http://mouse.owl#MA_0000143",
+                "target-cands": [
+                    "http://human.owl#HBA_0000214",
+                    "http://human.owl#HBA_0000762",
+                    "http://human.owl#HBA_0000891"
+                ],
+                "score-cands": [0.87, 0.82, 0.77]
+            },
+            ...
+        ]
+
+
+.. tab:: ➡️ KGE Default Mode Example output
+
+    ::
+
+        {
+            'source': 'http://mouse.owl#MA_0000143',
+            'target': 'http://human.owl#HBA_0000214',
+            'score': 0.87
+        }
+
+
+.. note::
+
+    Consider reading the following section next:
+
+    * `Package Reference > Aligners <../package_reference/aligners.html>`_
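
For the downstream candidate-filtering pipelines mentioned in the new section, a hedged sketch (the helper name and threshold are illustrative, not part of this commit) that trims retriever-mode output by score before handing candidates to a reranker:

    def filter_candidates(matchings, min_score=0.8):
        """Keep only candidate targets whose similarity score clears the threshold."""
        filtered = []
        for match in matchings:
            kept = [(target, score)
                    for target, score in zip(match["target-cands"], match["score-cands"])
                    if score >= min_score]
            filtered.append({
                "source": match["source"],
                "target-cands": [target for target, _ in kept],
                "score-cands": [score for _, score in kept],
            })
        return filtered

The input structure mirrors the KGE Retriever Mode example output shown above.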

docs/source/package_reference/aligners.rst

Lines changed: 1 addition & 1 deletion
@@ -63,4 +63,4 @@ KGE Aligners
     :members:
     :undoc-members:
     :show-inheritance:
-    :special-members:
+    :special-members: __init__, generate

ontoaligner/aligner/graph/graph.py

Lines changed: 40 additions & 6 deletions
@@ -52,11 +52,13 @@ class GraphEmbeddingAligner(BaseOMModel):
 
     def __init__(self,
                  device: str='cpu',
+                 retriever: bool = False,
                  embedding_dim: int=300,
                  num_epochs: int=50,
                  train_batch_size: int=128,
                  eval_batch_size: int=64,
                  num_negs_per_pos: int=5,
+                 top_k: int=5,
                  random_seed: int=42):
         """
         Initializes the GraphEmbeddingAligner with training configuration.
@@ -71,11 +73,13 @@ def __init__(self,
             random_seed (int): Random seed for reproducibility.
         """
         super().__init__(device=device,
+                         retriever=retriever,
                          embedding_dim=embedding_dim,
                          num_epochs=num_epochs,
                          train_batch_size=train_batch_size,
                          eval_batch_size=eval_batch_size,
                          num_negs_per_pos=num_negs_per_pos,
+                         top_k=top_k,
                          random_seed=random_seed)
 
     def fit(self, triplets: List):
@@ -108,6 +112,7 @@ def fit(self, triplets: List):
     def _similarity_matrix(self, source_onto_tensor, target_onto_tensor):
         return torch.matmul(source_onto_tensor, target_onto_tensor.T)
 
+
     def predict(self, source_onto: Dict, target_onto: Dict):
         """
         Aligns entities from the source ontology to entities in the target ontology
@@ -134,16 +139,45 @@ def predict(self, source_onto: Dict, target_onto: Dict):
 
         similarity_matrix = self._similarity_matrix(source_onto_tensor, target_onto_tensor) # shape: (n1, n2)
 
-        best_scores, best_indices = similarity_matrix.max(dim=1)
+        if self.kwargs['retriever']:
+            matches = self._retriever_predict(similarity_matrix=similarity_matrix,
+                                              source_ent2iri=source_ent2iri,
+                                              target_ent2iri=target_ent2iri,
+                                              source_ents=source_ents,
+                                              target_ents=target_ents,
+                                              top_k=self.kwargs['top_k'])
+        else:
+            matches = self._predict(similarity_matrix=similarity_matrix,
+                                    source_ent2iri=source_ent2iri,
+                                    target_ent2iri=target_ent2iri,
+                                    source_ents=source_ents,
+                                    target_ents=target_ents)
+        return matches
 
-        matches = [
-            {
+    def _retriever_predict(self, similarity_matrix, source_ent2iri, target_ent2iri, source_ents, target_ents, top_k):
+        best_scores, best_indices = similarity_matrix.topk(k=top_k, dim=1)
+        matches = []
+        for i, src in enumerate(source_ents):
+            target_cands, score_cands = [], []
+            for j in range(top_k):
+                target_cands.append(target_ent2iri[target_ents[best_indices[i][j].item()]])
+                score_cands.append(best_scores[i][j].item())
+            matches.append({
+                "source": source_ent2iri[src],
+                "target-cands": target_cands,
+                "score-cands": score_cands
+            })
+        return matches
+
+    def _predict(self, similarity_matrix, source_ent2iri, target_ent2iri, source_ents, target_ents):
+        best_scores, best_indices = similarity_matrix.max(dim=1)
+        matches = []
+        for index in range(len(source_ents)):
+            matches.append({
                 "source": source_ent2iri[source_ents[index]],
                 "target": target_ent2iri[target_ents[best_indices[index].item()]],
                 "score": best_scores[index].item()
-            }
-            for index in range(len(source_ents))
-        ]
+            })
         return matches
 
     def get_embeddings(self):
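
A small illustrative sketch (toy matrix, not part of the commit) of the two reductions the new helpers build on, torch.Tensor.max for the default one-to-one mode and torch.Tensor.topk for retriever mode:

    import torch

    similarity_matrix = torch.tensor([[0.87, 0.82, 0.77],
                                      [0.10, 0.95, 0.30]])

    # Default mode (_predict): one best target per source row -> shapes (2,), (2,)
    best_scores, best_indices = similarity_matrix.max(dim=1)

    # Retriever mode (_retriever_predict): top-k targets per source row -> shapes (2, 2), (2, 2)
    topk_scores, topk_indices = similarity_matrix.topk(k=2, dim=1)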

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ openai = "1.56.0"
 rank_bm25 = "0.2.2"
 huggingface-hub = "^0.34.4"
 sentence-transformers = "^5.1.0"
-bitsandbytes = "^0.45.1"
+bitsandbytes = { version = ">=0.45.1,<1.0.0", markers = "platform_system == 'Linux'" }
 pykeen = "1.11.1"
 
 [tool.poetry.dev-dependencies]

requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ openai==1.56.0
 rank_bm25==0.2.2
 huggingface-hub==0.34.4
 sentence-transformers==5.1.0
-bitsandbytes==0.45.1
+bitsandbytes>=0.45.1,<0.46.0; platform_system == "Linux"
 pykeen==1.11.1
 ruff
 pre-commit

setup.py

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@
         "torch>=2.8.0,<3.0.0",
         "transformers>=4.56.0,<5.0.0",
         "huggingface-hub>=0.34.4,<1.0.0",
-        "bitsandbytes>=0.45.1,<1.0.0",
+        "bitsandbytes>=0.45.1,<1.0.0; platform_system == 'Linux'",
         "pykeen==1.11.1"
     ],
     classifiers=[
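
With bitsandbytes restricted to Linux across all three dependency files, code that imports it would need to tolerate its absence on macOS. A hedged sketch of one way to guard the import (an assumption for illustration, not code from this commit):

    import importlib.util

    # bitsandbytes is only installed on Linux after this change
    if importlib.util.find_spec("bitsandbytes") is not None:
        import bitsandbytes as bnb
    else:
        bnb = None  # e.g. macOS: features that need bitsandbytes stay disabled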

tests/aligners/test_kge.py

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
+import pytest
+from ontoaligner.aligner.graph import GraphEmbeddingAligner
+
+@pytest.fixture
+def toy_retriever_ontologies():
+    source_onto = {
+        "entity2iri": {
+            "s1": "http://source.org/1",
+            "s2": "http://source.org/2",
+        },
+        "triplets": [
+            ("s1", "relatedTo", "s2")
+        ],
+    }
+
+    target_onto = {
+        "entity2iri": {
+            "t1": "http://target.org/1",
+            "t2": "http://target.org/2",
+            "t3": "http://target.org/3",
+        },
+        "triplets": [
+            ("t1", "relatedTo", "t2"),
+            ("t2", "relatedTo", "t3"),
+        ],
+    }
+
+    return source_onto, target_onto
+
+
+def test_retriever_topk_output(toy_retriever_ontologies):
+    source_onto, target_onto = toy_retriever_ontologies
+
+    class DummyAligner(GraphEmbeddingAligner):
+        model = "TransE"  # keep it lightweight for test
+
+    # retriever=True → returns top-k candidates
+    aligner = DummyAligner(retriever=True, top_k=2, num_epochs=1, embedding_dim=16)
+
+    results = aligner.generate([source_onto, target_onto])
+
+    # Check output format
+    assert isinstance(results, list)
+    assert all("source" in match for match in results)
+    assert all("target-cands" in match for match in results)
+    assert all("score-cands" in match for match in results)
+
+    # Check top-k constraint
+    for match in results:
+        assert len(match["target-cands"]) == 2
+        assert len(match["score-cands"]) == 2
+        assert all(isinstance(score, float) for score in match["score-cands"])
+
+
+@pytest.fixture
+def toy_kge_ontologies():
+    source_onto = {
+        "entity2iri": {
+            "s1": "http://source.org/1",
+            "s2": "http://source.org/2",
+        },
+        "triplets": [
+            ("s1", "relatedTo", "s2")
+        ],
+    }
+
+    target_onto = {
+        "entity2iri": {
+            "t1": "http://target.org/1",
+            "t2": "http://target.org/2",
+        },
+        "triplets": [
+            ("t1", "relatedTo", "t2")
+        ],
+    }
+
+    return source_onto, target_onto
+
+
+def test_kge_aligner_output(toy_kge_ontologies):
+    source_onto, target_onto = toy_kge_ontologies
+
+    class DummyAligner(GraphEmbeddingAligner):
+        model = "TransE"  # keep it light for testing
+
+    # retriever=False → one-to-one mapping
+    aligner = DummyAligner(retriever=False, num_epochs=1, embedding_dim=16)
+
+    results = aligner.generate([source_onto, target_onto])
+
+    # Check output format
+    assert isinstance(results, list)
+    assert all(isinstance(match, dict) for match in results)
+
+    # Check required keys
+    for match in results:
+        assert "source" in match
+        assert "target" in match
+        assert "score" in match
+
+        # Check value types
+        assert isinstance(match["source"], str)
+        assert isinstance(match["target"], str)
+        assert isinstance(match["score"], float)
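
To exercise the new tests locally, one option (assuming pytest is installed, as the module itself requires) is to invoke pytest programmatically:

    import pytest

    # Run only the new KGE aligner tests; equivalent to `pytest tests/aligners/test_kge.py`
    pytest.main(["tests/aligners/test_kge.py"])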
