Commit 152c194

Adding Retrievers (PR #292)
2 parents c0f2bbf + 9e8a151 commit 152c194

21 files changed

+1558
-8
lines changed

CHANGELOG.md

Lines changed: 8 additions & 1 deletion
@@ -1,6 +1,13 @@
11
## Changelog
22

3-
### v1.4.8 (November 3, 2025)
3+
### v1.4.9 (December 8, 2025)
4+
- add retriever collection
5+
- add documentation for retrievers
6+
- minor bug fixes in docs
7+
- add unittest for retrievers
8+
- add new requirements (`gensim`)
9+
10+
### v1.4.8 (December 3, 2025)
411
- add alexbeck, rwthdbis, sbunlp, and skhnlp learners
512
- add documentation for learners
613
- minor bug fixes

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -31,5 +31,5 @@ keywords:
3131
- Large Language Models
3232
- Text-to-ontology
3333
license: MIT
34-
version: 1.4.7
34+
version: 1.4.9
3535
date-released: '2025'

docs/source/learners/llms4ol_challenge/sbunlp_learner.rst

Lines changed: 2 additions & 2 deletions
@@ -13,11 +13,11 @@ The team participated in the LLMs4OL 2025 Shared Task, which included four subta
1313

1414
.. note::
1515

16-
Read more about the model at `RWTH-DBIS at LLMs4OL 2024 Tasks A and B Knowledge-Enhanced Domain-Specific Continual Learning and Prompt-Tuning of Large Language Models for Ontology Learning <https://www.tib-op.org/ojs/index.php/ocp/article/view/2491>`_.
16+
Read more about the model at `SBU-NLP at LLMs4OL 2025 Tasks A, B, and C: Stage-Wise Ontology Construction Through LLMs Without any Training Procedure <https://www.tib-op.org/ojs/index.php/ocp/article/view/2887>`_.
1717

1818
.. hint::
1919

20-
The original implementation is available at `https://github.com/MouYongli/LLMs4OL <https://github.com/MouYongli/LLMs4OL>`_ repository.
20+
The original implementation is available at `https://github.com/rarahnamoun/LLMs4OL-Challenge-ISWC-2025 <https://github.com/rarahnamoun/LLMs4OL-Challenge-ISWC-2025>`_ repository.
2121

2222

2323

docs/source/learners/retrieval.rst

Lines changed: 254 additions & 0 deletions
@@ -127,3 +127,257 @@ Similar to LLM learner, Retrieval Learner is also callable via streamlined ``Lea
127127
128128
.. hint::
129129
See `Learning Tasks <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_ for possible tasks within Learners.
130+
131+
Customization
132+
-----------------------
133+
134+
You can easily customize ``AutoRetrieverLearner`` by providing your own base retriever.
135+
136+
**Example:**
137+
138+
.. code-block:: python
139+
140+
from ontolearner.learner import AutoRetrieverLearner
141+
from ontolearner.learner.retriever import NgramRetriever
142+
143+
# Create a custom retriever (default is AutoRetriever)
144+
retriever_model = NgramRetriever()
145+
146+
# Use it as the base retriever in the learner
147+
learner = AutoRetrieverLearner(base_retriever=retriever_model)
148+
149+
# Load a model for retrieval or augmentation
150+
learner.load(model_id='...')
151+
152+
153+
.. note::
154+
155+
- The ``base_retriever`` must implement the ``AutoRetriever`` interface.
156+
- You can use any compatible retriever, e.g., ``NgramRetriever``, ``Word2VecRetriever``,
157+
or your own custom retriever.
158+
- This allows combining semantic, n-gram, or hybrid retrieval pipelines easily.
159+
160+
161+
Retriever Collection
162+
--------------------------
163+
164+
NgramRetriever
165+
~~~~~~~~~~~~~~~~~~~~~~~
166+
167+
168+
.. sidebar:: **Supported vectorizers**
169+
170+
- ``count``: Converts a collection of text documents to a matrix of token counts.
171+
- ``tfidf``: Converts a collection of raw documents to a matrix of TF-IDF features.
172+
173+
174+
The ``NgramRetriever`` is a simple, interpretable text retriever based on traditional n-gram vectorization methods such as `CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_ and `TfidfVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_. It ranks documents by the cosine similarity of their n-gram vectors, which makes it useful for baseline retrieval, keyword matching, or small-scale text search. The following code shows how to import ``NgramRetriever`` and load the desired vectorizer with custom arguments.
175+
176+
.. code-block:: python
177+
178+
from ontolearner.learner import AutoRetrieverLearner
179+
from ontolearner.learner.retriever import NgramRetriever
180+
181+
retriever = NgramRetriever(ngram_range=(1,2), stop_words='english')
182+
183+
learner = AutoRetrieverLearner(base_retriever=retriever)
184+
185+
learner.load(model_id="tfidf") # or "count"
186+
187+
.. note::
188+
189+
For the supported arguments, refer to `scikit-learn > TfidfVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_ or `scikit-learn > CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
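To make the ranking idea concrete, here is a minimal standalone sketch of n-gram count vectors ranked by cosine similarity. It is plain Python for illustration only and is not the ``NgramRetriever`` implementation; the helper names ``ngram_counts``, ``cosine``, and ``rank`` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, ngram_range=(1, 2)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents, top_k=2):
    """Return the top_k documents most similar to the query."""
    query_vec = ngram_counts(query)
    scored = [(cosine(query_vec, ngram_counts(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


documents = ["red wine", "white wine", "wine region", "grape variety"]
top = rank("red wine", documents, top_k=2)
print(top)
```

A document identical to the query scores 1.0, while documents that only share the unigram ``wine`` score lower. TF-IDF weighting would additionally down-weight n-grams that occur in many documents.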
190+
191+
Word2VecRetriever
192+
~~~~~~~~~~~~~~~~~~~~~~~
193+
194+
.. sidebar:: How to Download Word2Vec?
195+
196+
Download the pre-trained Word2Vec vectors from `GoogleNews-vectors-negative300.bin.gz <https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g>`_ and pass the local path to ``.load(...)``.
197+
198+
The `Word2Vec <https://arxiv.org/abs/1301.3781>`_ retriever encodes documents and queries using pre-trained word embeddings. Each document is represented by the average of its word vectors, and retrieval is done via cosine similarity between query vectors and document vectors. The following code shows how to use ``Word2VecRetriever`` inside a learner:
199+
200+
.. code-block:: python
201+
202+
from ontolearner.learner import AutoRetrieverLearner
203+
from ontolearner.learner.retriever import Word2VecRetriever
204+
205+
retriever = Word2VecRetriever()
206+
207+
learner = AutoRetrieverLearner(base_retriever=retriever)
208+
209+
learner.load(model_id="path/to/word2vec.bin") # Load pre-trained Word2Vec vectors
210+
211+
.. note::
212+
213+
Learn more about Word2Vec at `https://www.tensorflow.org/text/tutorials/word2vec <https://www.tensorflow.org/text/tutorials/word2vec>`_
214+
215+
GloveRetriever
216+
~~~~~~~~~~~~~~~~~~~~~~~
217+
.. sidebar:: How to Download GloVe?
218+
219+
Download the desired GloVe vectors from `https://nlp.stanford.edu/projects/glove/ <https://nlp.stanford.edu/projects/glove/>`_ and pass the local path to ``.load(...)``.
220+
221+
`GloVe <https://nlp.stanford.edu/projects/glove/>`_ is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. The ``GloveRetriever`` retrieves with GloVe vectors, as shown below:
222+
223+
224+
.. code-block:: python
225+
226+
from ontolearner.learner import AutoRetrieverLearner
227+
from ontolearner.learner.retriever import GloveRetriever
228+
229+
retriever = GloveRetriever()
230+
231+
learner = AutoRetrieverLearner(base_retriever=retriever)
232+
233+
learner.load(model_id="path/to/glove.txt") # Load pre-trained GloVe vectors
234+
235+
236+
.. hint::
237+
238+
In both the **Word2Vec** and **GloVe** retrievers, a word in the input text that is not in the embedding vocabulary is ignored.
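The averaging and out-of-vocabulary behaviour described for the embedding retrievers can be sketched as follows. This is an illustration with toy 2-dimensional vectors and hypothetical helper names (``average_vector``, ``cosine``), not the library's code. Returning ``None`` when no word is in the vocabulary is an assumption of this sketch.

```python
import math

# Toy embeddings; real Word2Vec or GloVe vectors have hundreds of dimensions.
EMBEDDINGS = {
    "red": [1.0, 0.0],
    "wine": [0.0, 1.0],
    "grape": [0.2, 0.8],
}


def average_vector(text, embeddings):
    """Average the vectors of in-vocabulary words; unknown words are skipped."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None  # assumption of this sketch: no known word, no vector
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_vec = average_vector("red wine", EMBEDDINGS)
oov_vec = average_vector("red zinfandel wine", EMBEDDINGS)  # "zinfandel" is skipped
similarity = cosine(doc_vec, EMBEDDINGS["grape"])
print(doc_vec, oov_vec, similarity)
```

Because the unknown word contributes nothing, ``"red zinfandel wine"`` and ``"red wine"`` end up with the same averaged vector.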
239+
240+
.. note::
241+
242+
Refer to the GloVe paper at `GloVe: Global Vectors for Word Representation <https://aclanthology.org/D14-1162/>`_ to learn more about this model.
243+
244+
CrossEncoderRetriever
245+
~~~~~~~~~~~~~~~~~~~~~~~
246+
247+
248+
.. sidebar:: Cross-Encoder Models
249+
250+
A collection of publicly available cross-encoder models is hosted at `🤗 Sentence Transformers - Cross-Encoders <https://huggingface.co/cross-encoder>`_.
251+
252+
253+
Until now, the OntoLearner ``AutoRetriever`` (the base retriever for ``AutoRetrieverLearner``) has used a Bi-Encoder architecture for retrieval. It is important to understand the difference between Bi-Encoders and Cross-Encoders, which the following diagram illustrates:
254+
255+
.. raw:: html
256+
257+
<div align="center">
258+
<img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/Bi_vs_Cross-Encoder.jpg" alt="Bi-Encoder vs Cross-Encoder " width="40%"/>
259+
</div>
260+
<br>
261+
262+
263+
A Bi-Encoder produces a sentence embedding for a given sentence. Sentences A and B are passed independently to a BERT model, which yields the sentence embeddings u and v. These embeddings can then be compared using cosine similarity. In contrast, a Cross-Encoder receives both sentences simultaneously and outputs a value between 0 and 1 that indicates the similarity of the input pair. A Cross-Encoder does not produce a sentence embedding, and individual sentences cannot be passed to it (Reference: `Sentence-BERT > Cross-Encoder <https://sbert.net/examples/cross_encoder/applications/README.html>`_).
264+
265+
266+
In OntoLearner, ``CrossEncoderRetriever`` is a hybrid dense retriever: a Bi-Encoder performs fast candidate retrieval and a Cross-Encoder reranks the candidates more accurately. This provides an efficient and accurate alternative to pure Cross-Encoder or pure Bi-Encoder approaches. To use ``CrossEncoderRetriever``, follow these steps:
267+
268+
269+
.. code-block:: python
270+
271+
from ontolearner.learner import AutoRetrieverLearner
272+
from ontolearner.learner.retriever import CrossEncoderRetriever
273+
274+
retriever = CrossEncoderRetriever(bi_encoder_model_id='Qwen/Qwen3-Embedding-8B') # pass the bi-encoder model ID used in the first-stage
275+
276+
learner = AutoRetrieverLearner(base_retriever=retriever)
277+
278+
learner.load(model_id="cross-encoder/ms-marco-MiniLM-L12-v2") # Model ID for the CrossEncoder (reranking model) here!
279+
# When .load(...) is called, both the bi-encoder and cross-encoder models are loaded.
280+
281+
282+
.. note::
283+
284+
Learn more about Retrieve and Rerank approach at `Sentence Transformers > Usage > Retrieve & Re-Rank <https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html>`_.
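The two-stage retrieve-and-rerank flow can be sketched without any model, using two stand-in scoring functions. Everything here is hypothetical and for illustration only: ``overlap_score`` plays the role of the fast bi-encoder, ``exact_phrase_score`` plays the role of the slower cross-encoder, and ``retrieve_then_rerank`` is not part of OntoLearner.

```python
def retrieve_then_rerank(query, documents, fast_score, slow_score,
                         first_stage_k=3, top_k=1):
    """Stage 1 keeps first_stage_k candidates by fast_score;
    stage 2 reorders only those candidates by slow_score."""
    candidates = sorted(documents, key=lambda d: fast_score(query, d),
                        reverse=True)[:first_stage_k]
    reranked = sorted(candidates, key=lambda d: slow_score(query, d),
                      reverse=True)
    return reranked[:top_k]


def overlap_score(query, doc):
    """Cheap stand-in for a bi-encoder: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def exact_phrase_score(query, doc):
    """Stand-in for a cross-encoder: inspects the pair jointly and
    rewards documents that contain the query as a phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0


documents = ["wine that is red", "a dry red wine", "white wine", "grape variety"]
best = retrieve_then_rerank("red wine", documents, overlap_score,
                            exact_phrase_score, first_stage_k=2, top_k=1)
print(best)
```

The expensive scorer runs only on the ``first_stage_k`` candidates rather than on every document, which is where the efficiency of the hybrid approach comes from.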
285+
286+
LLMAugmentedRetriever
287+
~~~~~~~~~~~~~~~~~~~~~~~~
288+
The LLM-augmented retriever improves retrieval quality by expanding each query into multiple augmented variants using an LLM (e.g., GPT-4). The following diagram shows how the LLM-augmented retriever operates compared with the usual retrieval approach.
289+
290+
291+
.. raw:: html
292+
293+
<div align="center">
294+
<img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/llm-augmenter.jpg" alt="LLM Augmented Retriever " width="80%"/>
295+
</div>
296+
<br>
297+
298+
There are two usage modes:
299+
300+
**1. Online augmentation (using LLMAugmenterGenerator):** This mode calls the LLM directly to generate augmentation candidates.
301+
302+
.. code-block:: python
303+
304+
# Step 1 — Create the generator
305+
from ontolearner.learner.retriever import LLMAugmenterGenerator
306+
llm_augmenter_generator = LLMAugmenterGenerator(model_id='gpt-4.1-mini', token = '...', top_n_candidate=10)
307+
308+
# Step 2 — Generate augmentations for a dataset
309+
tasks = ['term-typing', 'taxonomy-discovery', 'non-taxonomic-re']
310+
augments = {"config": llm_augmenter_generator.get_config()}
311+
for task in tasks:
312+
    augments[task] = llm_augmenter_generator.augment(data, task=task)
313+
314+
# Step 3 — Save augmentations
315+
from ontolearner.utils import save_json
316+
save_json("augment.json", augments)
317+
318+
The generator output is saved so that the LLM does not have to be called repeatedly, which would otherwise lead to expensive API usage and long waiting times. Once stored, the output can be reused in the next stage.
319+
320+
**2. Offline augmentation (recommended for large experiments):** Instead of calling the LLM repeatedly, you load the previously saved augmentations.
321+
322+
323+
.. code-block:: python
324+
325+
# Step 1 — Load augmenter
326+
from ontolearner.learner.retriever import LLMAugmenter
327+
augmenter = LLMAugmenter("augment.json")
328+
329+
# Step 2 — Attach it to the retriever
330+
from ontolearner.learner.retriever import LLMAugmentedRetriever
331+
from ontolearner.learner import LLMAugmentedRetrieverLearner
332+
333+
base_retriever = LLMAugmentedRetriever()
334+
learner = LLMAugmentedRetrieverLearner(base_retriever=base_retriever)
335+
learner.set_augmenter(augmenter)
336+
learner.load(model_id="Qwen/Qwen3-Embedding-8B") # path to desired retriever model.
337+
338+
Here, ``LLMAugmentedRetrieverLearner`` is the high-level wrapper that orchestrates loading a retriever model, attaching the ``LLMAugmentedRetriever``, automatically applying LLM-based query expansion during training and prediction, computing the ground truth, and returning predictions.
339+
340+
341+
342+
.. list-table:: Summary of Components:
343+
:header-rows: 1
344+
:widths: 25 75
345+
346+
* - Component
347+
- Purpose
348+
* - ``LLMAugmenterGenerator``
349+
- Calls an LLM (GPT-4, GPT-3.5, etc.) to generate augmentation data.
350+
* - ``LLMAugmenter``
351+
- Loads offline augmentations (``augment.json``).
352+
* - ``LLMAugmentedRetriever``
353+
- Expands each query using augmentations before retrieval.
354+
* - ``LLMAugmentedRetrieverLearner``
355+
- Applies the learner pipeline using the augmented retriever.
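As a rough illustration of query expansion with cached augmentations, the sketch below retrieves with the original query plus its stored variants and keeps the best score per document. The cache format (query mapped to a list of variants), the max-score merging, and all function names are assumptions made for this example; the actual ``augment.json`` layout and merging strategy in OntoLearner may differ.

```python
import json


def expand_query(query, augmentations):
    """Return the original query followed by its stored variants, deduplicated."""
    variants = [query] + augmentations.get(query, [])
    return list(dict.fromkeys(variants))


def overlap_score(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def retrieve_with_expansion(query, documents, augmentations, score, top_k=2):
    """Score every document against every query variant and keep the best score."""
    best = {}
    for variant in expand_query(query, augmentations):
        for doc in documents:
            best[doc] = max(best.get(doc, 0.0), score(variant, doc))
    ranked = sorted(documents, key=lambda d: best[d], reverse=True)
    return ranked[:top_k]


# Hypothetical cached augmentations, as they might be read from a JSON file.
cached = json.loads('{"merlot": ["red wine grape", "bordeaux grape variety"]}')
documents = ["cheese", "wine region", "grape variety"]

results = retrieve_with_expansion("merlot", documents, cached, overlap_score)
plain = retrieve_with_expansion("merlot", documents, {}, overlap_score)
print(results, plain)
```

Without expansion the bare query ``merlot`` matches nothing, so the ranking is arbitrary. With the cached variants, the documents about grapes and wine move to the top.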
356+
357+
.. rubric:: Example: Using LLMAugmentedRetrieverLearner for Taxonomy Discovery
358+
359+
.. code-block:: python
360+
361+
from ontolearner.learner.retriever import LLMAugmenterGenerator, LLMAugmentedRetriever, LLMAugmenter
362+
from ontolearner import LLMAugmentedRetrieverLearner, Wine, train_test_split, evaluation_report
363+
364+
ontology = Wine()
365+
ontology.load()
366+
ontological_data = ontology.extract()
367+
train_data, test_data = train_test_split(ontological_data, test_size=0.2, random_state=42)
368+
369+
task="taxonomy-discovery"
370+
371+
llm_augmenter_generator = LLMAugmenterGenerator(model_id='gpt-4.1-mini', token = 'your_openai_token', top_n_candidate=10)
372+
augments = {"config": llm_augmenter_generator.get_config()}
373+
augments[task] = llm_augmenter_generator.augment(ontological_data, task=task)
374+
375+
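# Create the learner before attaching the augmentations (constructors as in the offline example)
learner = LLMAugmentedRetrieverLearner(base_retriever=LLMAugmentedRetriever())
learner.set_augmenter(augments)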
376+
learner.load(model_id="Qwen/Qwen3-Embedding-8B")
377+
378+
# Train, Predict, and Evaluate
379+
learner.fit(train_data, task=task)
380+
predictions = learner.predict(test_data, task=task)
381+
truth = learner.tasks_ground_truth_former(test_data, task=task)
382+
metrics = evaluation_report(truth, predictions, task=task)
383+
print(metrics)

ontolearner/VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
1-
1.4.8
1+
1.4.9

ontolearner/learner/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
1313
# limitations under the License.
1414

1515
from .llm import AutoLLMLearner, FalconLLM, MistralLLM
16-
from .retriever import AutoRetrieverLearner
16+
from .retriever import AutoRetrieverLearner, LLMAugmentedRetrieverLearner
1717
from .rag import AutoRAGLearner
1818
from .prompt import StandardizedPrompting
1919
from .label_mapper import LabelMapper
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
1+
# Copyright (c) 2025 SciKnowOrg
2+
#
3+
# Licensed under the MIT License (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# https://opensource.org/licenses/MIT
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
from .crossencoder import CrossEncoderRetriever
16+
from .embedding import GloveRetriever, Word2VecRetriever
17+
from .ngram import NgramRetriever
18+
from .learner import AutoRetrieverLearner, LLMAugmentedRetrieverLearner
19+
from .llm_retriever import LLMAugmenterGenerator, LLMAugmenter, LLMAugmentedRetriever
