📝 add retriever collection docs

HamedBabaei · HamedBabaei · commit d0f0ff3d0c3a · 2025-12-08T13:21:00.000+01:00
diff --git a/docs/source/learners/retrieval.rst b/docs/source/learners/retrieval.rst
@@ -127,3 +127,257 @@ Similar to LLM learner, Retrieval Learner is also callable via streamlined ``Lea
 
 .. hint::
     See `Learning Tasks <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_ for possible tasks within Learners.
+
+Customization
+-----------------------
+
+You can easily customize ``AutoRetrieverLearner`` by providing your own base retriever.
+
+**Example:**
+
+.. code-block:: python
+
+    from ontolearner.learner import AutoRetrieverLearner
+    from ontolearner.learner.retriever import NgramRetriever
+
+    # Create a custom retriever (default is AutoRetriever)
+    retriever_model = NgramRetriever()
+
+    # Use it as the base retriever in the learner
+    learner = AutoRetrieverLearner(base_retriever=retriever_model)
+
+    # Load a model for retrieval or augmentation
+    learner.load(model_id='...')
+
+
+.. note::
+
+	- The ``base_retriever`` must implement the ``AutoRetriever`` interface.
+	- You can use any compatible retriever, e.g., ``NgramRetriever``, ``Word2VecRetriever``,
+	  or your own custom retriever.
+	- This allows combining semantic, n-gram, or hybrid retrieval pipelines easily.
+
+
+Retriever Collection
+--------------------------
+
+NgramRetriever
+~~~~~~~~~~~~~~~~~~~~~~~
+
+
+.. sidebar:: **Supported vectorizers**
+
+	- ``count``: Converts a collection of text documents to a matrix of token counts.
+	- ``tfidf``: Converts a collection of raw documents to a matrix of TF-IDF features.
+
+
+The ``NgramRetriever`` is a simple, interpretable text retriever based on traditional n-gram vectorization methods, such as `CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_ and `TfidfVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_. It ranks documents by the cosine similarity of their n-gram vectors, which makes it useful for baseline retrieval, keyword matching, or small-scale text search tasks. The following code shows how to import ``NgramRetriever`` and load the desired vectorizer with custom arguments.
+
+.. code-block:: python
+
+    from ontolearner.learner import AutoRetrieverLearner
+    from ontolearner.learner.retriever import NgramRetriever
+
+    retriever = NgramRetriever(ngram_range=(1,2), stop_words='english')
+
+    learner = AutoRetrieverLearner(base_retriever=retriever)
+
+    learner.load(model_id="tfidf")  # or "count"
+
+.. note::
+
+	For the available arguments, refer to `scikit-learn > TfidfVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html>`_ or `scikit-learn > CountVectorizer <https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html>`_.
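
The ranking idea described above can be sketched without any dependencies. This is only an illustration of n-gram count vectors compared by cosine similarity, not the ``NgramRetriever`` implementation (which builds on the scikit-learn vectorizers); the helper names ``ngrams``, ``cosine``, and ``rank`` are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, ngram_range=(1, 2)):
    """Count word n-grams for every n in the inclusive ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents, top_k=3):
    """Return (document, score) pairs sorted by descending similarity."""
    query_vec = ngrams(query)
    scored = [(doc, cosine(query_vec, ngrams(doc))) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


documents = ["red wine", "white wine", "sparkling water"]
ranking = rank("red wine", documents)
```

Here the exact match ranks first, the document sharing only the unigram "wine" ranks second, and the document with no shared n-grams scores zero. A TF-IDF vectorizer would additionally down-weight n-grams that occur in many documents.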
+
+Word2VecRetriever
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. sidebar:: How to Download Word2Vec?
+
+	Download the pre-trained Word2Vec vectors from `GoogleNews-vectors-negative300.bin.gz <https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g>`_, then pass the path to the file to ``.load(...)``.
+
+The `Word2Vec <https://arxiv.org/abs/1301.3781>`_ retriever encodes documents and queries using pre-trained word embeddings. Each document is represented by the average of its word vectors, and retrieval is done via cosine similarity between query vectors and document vectors. The following code shows how to use ``Word2VecRetriever`` inside a learner:
+
+.. code-block:: python
+
+    from ontolearner.learner import AutoRetrieverLearner
+    from ontolearner.learner.retriever import Word2VecRetriever
+
+    retriever = Word2VecRetriever()
+
+    learner = AutoRetrieverLearner(base_retriever=retriever)
+
+    learner.load(model_id="path/to/word2vec.bin")  # Load pre-trained Word2Vec vectors
+
+.. note::
+
+	Learn more about Word2Vec at `https://www.tensorflow.org/text/tutorials/word2vec <https://www.tensorflow.org/text/tutorials/word2vec>`_
+
+GloveRetriever
+~~~~~~~~~~~~~~~~~~~~~~~
+.. sidebar:: How to Download GloVe?
+
+	Download the desired GloVe model from `https://nlp.stanford.edu/projects/glove/ <https://nlp.stanford.edu/projects/glove/>`_, then pass the path to the file to ``.load(...)``.
+
+`GloVe <https://nlp.stanford.edu/projects/glove/>`_ is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Here, the ``GloveRetriever`` builds on a pre-trained GloVe model, as shown in the following example:
+
+
+.. code-block:: python
+
+    from ontolearner.learner import AutoRetrieverLearner
+    from ontolearner.learner.retriever import GloveRetriever
+
+    retriever = GloveRetriever()
+
+    learner = AutoRetrieverLearner(base_retriever=retriever)
+
+    learner.load(model_id="path/to/glove.txt")  # Load pre-trained GloVe vectors
+
+
+.. hint::
+
+	In both the **Word2Vec** and **GloVe** retrievers, a word that is not in the embedding vocabulary is ignored.
+
+.. note::
+
+	Refer to the GloVe paper at `GloVe: Global Vectors for Word Representation <https://aclanthology.org/D14-1162/>`_ to learn more about this model.
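
The averaging step shared by the Word2Vec and GloVe retrievers, including the handling of out-of-vocabulary words, can be sketched as follows. The two-dimensional ``EMBEDDINGS`` table and the helpers ``embed`` and ``cosine`` are hypothetical stand-ins for illustration; they are not part of OntoLearner, and real pre-trained vectors have hundreds of dimensions.

```python
import math

# Toy 2-dimensional embeddings used only for this illustration.
EMBEDDINGS = {
    "red": [1.0, 0.0],
    "wine": [0.0, 1.0],
    "white": [0.8, 0.2],
}


def embed(text, embeddings=EMBEDDINGS):
    """Average the vectors of in-vocabulary words; return None if there are none."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# "zinfandel" is out of vocabulary, so it does not change the averaged vector.
with_oov = embed("red zinfandel wine")
without_oov = embed("red wine")
similarity = cosine(without_oov, embed("wine"))
```

A text made up entirely of unknown words has no vector at all in this sketch (``embed`` returns ``None``); how the actual retrievers treat that case is not specified here.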
+
+CrossEncoderRetriever
+~~~~~~~~~~~~~~~~~~~~~~~
+
+
+.. sidebar:: Cross-Encoder Models
+
+	Collections of publicly available cross-encoder models are available at: `🤗 Sentence Transformers - Cross-Encoders <https://huggingface.co/cross-encoder>`_.
+
+
+Until now, the OntoLearner ``AutoRetriever`` (the base retriever for ``AutoRetrieverLearner``) has used a Bi-Encoder architecture for retrieval. It is important to understand the difference between Bi-Encoders and Cross-Encoders. The following diagram shows the differences:
+
+.. raw:: html
+
+   <div align="center">
+     <img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/Bi_vs_Cross-Encoder.jpg" alt="Bi-Encoder vs Cross-Encoder " width="40%"/>
+   </div>
+   <br>
+
+
+Bi-Encoders produce a sentence embedding for a given sentence. Sentences A and B are passed independently to BERT, which results in the sentence embeddings u and v. These sentence embeddings can then be compared using cosine similarity. In contrast, a Cross-Encoder receives both sentences simultaneously and outputs a value between 0 and 1 indicating the similarity of the input sentence pair. A Cross-Encoder does not produce a sentence embedding, and individual sentences cannot be passed to it (Reference: `Sentence-BERT > Cross-Encoder <https://sbert.net/examples/cross_encoder/applications/README.html>`_).
+
+
+In OntoLearner, the ``CrossEncoderRetriever`` is a hybrid dense retriever that combines a Bi-Encoder for fast candidate retrieval with a Cross-Encoder for accurate reranking. This provides an efficient and accurate alternative to pure Cross-Encoder or pure Bi-Encoder approaches. To use ``CrossEncoderRetriever``, follow these steps:
+
+
+.. code-block:: python
+
+    from ontolearner.learner import AutoRetrieverLearner
+    from ontolearner.learner.retriever import CrossEncoderRetriever
+
+    retriever = CrossEncoderRetriever(bi_encoder_model_id='Qwen/Qwen3-Embedding-8B')  # bi-encoder model ID used in the first stage
+
+    learner = AutoRetrieverLearner(base_retriever=retriever)
+
+    # Model ID of the Cross-Encoder (reranking model).
+    # When .load(...) is called, both the bi-encoder and cross-encoder models are loaded.
+    learner.load(model_id="cross-encoder/ms-marco-MiniLM-L12-v2")
+
+
+.. note::
+
+	Learn more about the Retrieve & Re-Rank approach at `Sentence Transformers > Usage > Retrieve & Re-Rank <https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html>`_.
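
The two-stage flow described above (a cheap first pass over all documents, then a more expensive rescoring of a short candidate list) can be sketched with plain functions. This is an assumption-laden illustration, not the ``CrossEncoderRetriever`` code: ``token_overlap`` stands in for bi-encoder similarity, ``phrase_aware_score`` stands in for a cross-encoder, and all names are hypothetical.

```python
def retrieve_and_rerank(query, documents, fast_score, pair_score, top_k=3):
    """Two-stage search: cheap scoring over all documents, costly scoring on the top_k."""
    # Stage 1 (bi-encoder role): score every document with the cheap function.
    candidates = sorted(documents, key=lambda d: fast_score(query, d), reverse=True)[:top_k]
    # Stage 2 (cross-encoder role): rescore only the shortlisted candidates.
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)


def token_overlap(query, doc):
    """Stand-in for bi-encoder similarity: number of shared tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_aware_score(query, doc):
    """Stand-in for a cross-encoder: overlap plus a bonus for an exact phrase match."""
    return token_overlap(query, doc) + (1 if query.lower() in doc.lower() else 0)


documents = ["wine red and dry", "a dry red wine", "sparkling water", "white wine"]
results = retrieve_and_rerank("red wine", documents, token_overlap, phrase_aware_score, top_k=2)
```

The first stage cannot tell the two top candidates apart because both share two tokens with the query. The second stage looks at the query and the document together and promotes the one containing the exact phrase, which is the kind of refinement a reranker is meant to provide.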
+
+LLMAugmentedRetriever
+~~~~~~~~~~~~~~~~~~~~~~~~
+The LLM-Augmented retriever improves retrieval quality by expanding each query into multiple augmented variants using an LLM (e.g., GPT-4). The following diagram shows how the LLM-Augmented retriever operates in comparison to the usual retrieval approach.
+
+
+.. raw:: html
+
+   <div align="center">
+     <img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/llm-augmenter.jpg" alt="LLM Augmented Retriever " width="80%"/>
+   </div>
+   <br>
+
+There are two usage modes:
+
+**1. Online augmentation (using LLMAugmenterGenerator):** This mode calls the LLM directly to generate augmentation candidates.
+
+.. code-block:: python
+
+	# Step 1 — Create the generator
+	from ontolearner.learner.retriever import LLMAugmenterGenerator
+	llm_augmenter_generator = LLMAugmenterGenerator(model_id='gpt-4.1-mini', token='...', top_n_candidate=10)
+
+	# Step 2 — Generate augmentations for a dataset
+	tasks = ['term-typing', 'taxonomy-discovery', 'non-taxonomic-re']
+	augments = {"config": llm_augmenter_generator.get_config()}
+	# `data` is the extracted ontological data, e.g. the output of ontology.extract()
+	for task in tasks:
+	    augments[task] = llm_augmenter_generator.augment(data, task=task)
+
+	# Step 3 — Save augmentations
+	from ontolearner.utils import save_json
+	save_json("augment.json", augments)
+
+Saving the generated augmentations avoids repeated calls to the LLM, which can lead to expensive API usage and long waiting times. Once the generator output is stored, it can be reused in the next stage.
+
+**2. Offline augmentation (recommended for large experiments):** Instead of calling the LLM repeatedly, you load the previously saved augmentations.
+
+
+.. code-block:: python
+
+	# Step 1 — Load augmenter
+	from ontolearner.learner.retriever import LLMAugmenter
+	augmenter = LLMAugmenter("augment.json")
+
+	# Step 2 — Attach it to the retriever
+	from ontolearner.learner.retriever import LLMAugmentedRetriever
+	from ontolearner.learner import LLMAugmentedRetrieverLearner
+
+	base_retriever = LLMAugmentedRetriever()
+	learner = LLMAugmentedRetrieverLearner(base_retriever=base_retriever)
+	learner.set_augmenter(augmenter)
+	learner.load(model_id="Qwen/Qwen3-Embedding-8B") # path to desired retriever model.
+
+Here, ``LLMAugmentedRetrieverLearner`` is the high-level wrapper that orchestrates the whole pipeline: loading a retriever model, attaching the ``LLMAugmentedRetriever``, automatically applying LLM-based query expansion during training and prediction, computing the ground truth, and returning predictions.
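
Query expansion itself can be illustrated with a small dependency-free sketch. How ``LLMAugmentedRetriever`` merges the results of the expanded queries is not described here, so the max-score merge below is an assumption, and ``augmented_retrieve``, ``token_overlap``, and the ``augmentations`` mapping are hypothetical names.

```python
def token_overlap(query, doc):
    """Toy relevance score: number of shared tokens."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def augmented_retrieve(query, documents, augmentations, score=token_overlap, top_k=2):
    """Score documents against the query and its stored variants, keeping the best score."""
    variants = [query] + augmentations.get(query, [])
    best = {}
    for variant in variants:
        for doc in documents:
            best[doc] = max(best.get(doc, 0), score(variant, doc))
    ranked = sorted(documents, key=lambda d: best[d], reverse=True)
    return ranked[:top_k]


# Hypothetical offline augmentations, standing in for stored LLM output.
augmentations = {"merlot": ["red wine", "grape variety"]}
documents = ["white wine", "red wine", "grape juice", "mineral water"]

expanded = augmented_retrieve("merlot", documents, augmentations)
```

On its own, the query "merlot" shares no tokens with any document, so a purely lexical scorer has nothing to rank on. With the stored variants, "red wine" becomes the top result, which is the effect query expansion aims for.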
+
+
+
+.. list-table:: Summary of Components:
+   :header-rows: 1
+   :widths: 25 75
+
+   * - Component
+     - Purpose
+   * - ``LLMAugmenterGenerator``
+     - Calls an LLM (GPT-4, GPT-3.5, etc.) to generate augmentation data.
+   * - ``LLMAugmenter``
+     - Loads offline augmentations (``augment.json``).
+   * - ``LLMAugmentedRetriever``
+     - Expands each query using augmentations before retrieval.
+   * - ``LLMAugmentedRetrieverLearner``
+     - Applies the learner pipeline using the augmented retriever.
+
+.. rubric:: Example: Using LLMAugmentedRetrieverLearner for Taxonomy Discovery
+
+.. code-block:: python
+
+	from ontolearner.learner.retriever import LLMAugmenterGenerator, LLMAugmentedRetriever, LLMAugmenter
+	from ontolearner import LLMAugmentedRetrieverLearner, Wine, train_test_split, evaluation_report
+	from ontolearner.utils import save_json
+
+	# Load the ontology and split the extracted data
+	ontology = Wine()
+	ontology.load()
+	ontological_data = ontology.extract()
+	train_data, test_data = train_test_split(ontological_data, test_size=0.2, random_state=42)
+
+	task = "taxonomy-discovery"
+
+	# Generate the augmentations once and store them for reuse
+	llm_augmenter_generator = LLMAugmenterGenerator(model_id='gpt-4.1-mini', token='your_openai_token', top_n_candidate=10)
+	augments = {"config": llm_augmenter_generator.get_config()}
+	augments[task] = llm_augmenter_generator.augment(ontological_data, task=task)
+	save_json("augment.json", augments)
+
+	# Build the learner and attach the stored augmentations, as in the offline mode above
+	augmenter = LLMAugmenter("augment.json")
+	base_retriever = LLMAugmentedRetriever()
+	learner = LLMAugmentedRetrieverLearner(base_retriever=base_retriever)
+	learner.set_augmenter(augmenter)
+	learner.load(model_id="Qwen/Qwen3-Embedding-8B")
+
+	# Train, Predict, and Evaluate
+	learner.fit(train_data, task=task)
+	predictions = learner.predict(test_data, task=task)
+	truth = learner.tasks_ground_truth_former(test_data, task=task)
+	metrics = evaluation_report(truth, predictions, task=task)
+	print(metrics)