3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -197,13 +197,14 @@ or GitHub repository:
learning_tasks/text2onto

.. toctree::
:maxdepth: 1
:maxdepth: 4
:caption: Learner Models
:hidden:

learners/llm
learners/retrieval
learners/rag
learners/llms4ol

.. toctree::
:maxdepth: 4
Binary file added docs/source/learners/images/alexbek-learner.png
Binary file added docs/source/learners/images/challenge-logo.png
81 changes: 81 additions & 0 deletions docs/source/learners/llms4ol.rst
@@ -0,0 +1,81 @@

.. sidebar:: Challenge Series Websites

* `1st LLMs4OL @ ISWC 2024 <https://sites.google.com/view/llms4ol>`_
* `2nd LLMs4OL @ ISWC 2025 <https://sites.google.com/view/llms4ol2025>`_


.. raw:: html

<div align="center">
<img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/challenge-logo.png" alt="challenge-logo" width="10%"/>
</div>

LLMs4OL Challenge
==================================================================================================================




LLMs4OL is a community initiative, co-located with the International Semantic Web Conference (ISWC), that explores the potential of Large Language Models (LLMs) for Ontology Learning (OL), a vital process for enriching the web with structured knowledge and improving interoperability. By leveraging LLMs, the challenge aims to advance understanding and innovation in OL, in line with the Semantic Web's goal of a more intelligent and user-friendly web.


.. list-table::
:widths: 20 20 60
:header-rows: 1

* - **Edition**
- **Task**
- **Description**
* - ``LLMs4OL'25``
- **Text2Onto**
- Extract ontological terms and types from unstructured text.

**ID**: ``text-to-onto``

**Info**: This task focuses on extracting foundational elements (Terms and Types) from unstructured text documents to build the initial structure of an ontology. It involves recognizing domain-relevant vocabulary (Term Extraction, SubTask 1) and categorizing it appropriately (Type Extraction, SubTask 2). It bridges the gap between natural language and structured knowledge representation.

**Example**: **COVID-19** is a term of the type **Disease**.
* - ``LLMs4OL'24``, ``LLMs4OL'25``
- **Term Typing**
- Discover the generalized type for a lexical term.

**ID**: ``term-typing``

**Info**: The process of assigning a generalized type to each lexical term involves mapping lexical items to their most appropriate semantic categories or ontological classes. For example, in the biomedical domain, the term ``aspirin`` should be classified under ``Pharmaceutical Drug``. This task is crucial for organizing extracted terms into structured ontologies and improving knowledge reuse.

**Example**: Assign the type ``"disease"`` to the term ``"myocardial infarction"``.
* - ``LLMs4OL'24``, ``LLMs4OL'25``
- **Taxonomy Discovery**
- Discover the taxonomic hierarchy between type pairs.

**ID**: ``taxonomy-discovery``

**Info**: Taxonomy discovery focuses on identifying hierarchical relationships between types, enabling the construction of taxonomic structures (i.e., ``is-a`` relationships). Given a pair of terms or types, the task determines whether one is a subclass of the other. For example, discovering that ``Sedan is a subclass of Car`` contributes to structuring domain knowledge in a way that supports reasoning and inferencing in ontology-driven applications.

**Example**: Recognize that ``"lung cancer"`` is a subclass of ``"cancer"``, which is a subclass of ``"disease"``.
* - ``LLMs4OL'24``, ``LLMs4OL'25``
- **Non-Taxonomic Relation Extraction**
- Identify non-taxonomic, semantic relations between types.

**ID**: ``non-taxonomic-re``

**Info**: This task aims to extract non-hierarchical (non-taxonomic) semantic relations between concepts in an ontology. Unlike taxonomy discovery, which deals with is-a relationships, this task focuses on other meaningful associations such as part-whole (part-of), causal (causes), functional (used-for), and associative (related-to) relationships. For example, in a medical ontology, discovering that ``Aspirin treats Headache`` adds valuable relational knowledge that enhances the utility of an ontology.

**Example**: Identify that *"virus"* ``causes`` *"infection"* or *"aspirin"* ``treats`` *"headache"*.
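
The sketch below shows how single instances of the four tasks above could be represented as plain Python data. The field names are purely illustrative and do not reflect the official challenge submission format.

.. code-block:: python

# Illustrative task instances; keys are hypothetical, not the official format.
text2onto_example = {
    "document": "COVID-19 spread rapidly in 2020.",
    "terms": ["COVID-19"],
    "types": ["Disease"],
}
term_typing_example = {"term": "myocardial infarction", "types": ["disease"]}
taxonomy_example = {"parent": "cancer", "child": "lung cancer", "is_a": True}
non_taxonomic_example = {"head": "aspirin", "relation": "treats", "tail": "headache"}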


.. note::

* Proceedings of the 1st LLMs4OL Challenge @ ISWC 2024 are available at `https://www.tib-op.org/ojs/index.php/ocp/issue/view/169 <https://www.tib-op.org/ojs/index.php/ocp/issue/view/169>`_
* Proceedings of the 2nd LLMs4OL Challenge @ ISWC 2025 are available at `https://www.tib-op.org/ojs/index.php/ocp/issue/view/185 <https://www.tib-op.org/ojs/index.php/ocp/issue/view/185>`_

.. toctree::
:maxdepth: 1
:caption: LLMs4OL Challenge Series Participant Learners
:titlesonly:

llms4ol_challenge/rwthdbis_learner
llms4ol_challenge/skhnlp_learner
llms4ol_challenge/alexbek_learner
llms4ol_challenge/sbunlp_learner
252 changes: 252 additions & 0 deletions docs/source/learners/llms4ol_challenge/alexbek_learner.rst
@@ -0,0 +1,252 @@
Alexbek Learner
================

.. sidebar:: Alexbek Learner Examples

* Text2Onto: `llm_learner_alexbek_text2onto.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_text2onto.py>`_
* Term Typing: `llm_learner_alexbek_rf_term_typing.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_rf_term_typing.py>`_
* Taxonomy Discovery: `llm_learner_alexbek_cross_attn_taxonomy_discovery.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_cross_attn_taxonomy_discovery.py>`_

The Alexbek team presented a comprehensive system addressing Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the full ontology construction pipeline: term extraction, term typing, and taxonomy discovery. Their approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling, each tailored to the demands of the respective task.

.. note::

Read more about the model at `Alexbek at LLMs4OL 2025 Tasks A, B, and C: Heterogeneous LLM Methods for Ontology Learning (Few-Shot Prompting, Ensemble Typing, and Attention-Based Taxonomies) <https://www.tib-op.org/ojs/index.php/ocp/article/view/2899>`_.

.. hint::

The original implementation is available in the `LLMs4OL-Challenge-Alexbek <https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek>`_ repository.

Overview
---------------------------------

.. raw:: html

<div align="center">
<img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/alexbek-learner.png" alt="Alexbek Team" width="90%"/>
</div>
<br>

For **Task A (Text2Onto)**, they jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data is reformulated into a correspondence between documents, terms, and types, while test-time inference leverages semantically similar training examples. This single-pass method requires no model fine-tuning and uses lexical augmentation.

For **Task B (Term Typing)**, which involves assigning types to given terms, they adopt a dual strategy. In the few-shot setting (for domains with labeled training data), they reuse the RAG scheme with few-shot prompting. In the zero-shot or label-scarce setting, they use a classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting (e.g., via random forests or RAG-style retrieval).

For **Task C (Taxonomy Discovery)**, they cast taxonomy discovery as graph inference: using embeddings of type labels, they train a lightweight cross-attention layer to predict *is-a* relations by approximating a soft adjacency matrix.

Methodological Summary:

1. **Retrieval-Augmented Text2Onto.** Training data is restructured into document–term–type correspondences. At inference time, the system retrieves semantically similar training examples and feeds them, together with the query document, into a small generative LLM to jointly predict candidate terms and their types.

2. **Hybrid Term Typing.**

* **Random-Forest Variant.** Uses dense text embeddings (and optionally graph-based features from the ontology) as input to a random-forest classifier, producing multi-label type assignments per term.
* **RAG-Based Variant.** Combines a bi-encoder retriever with a generative LLM: for each query term, top-*k* labeled examples are retrieved and concatenated into the prompt. The LLM then predicts types in a structured format (e.g., JSON), which are parsed and evaluated.

3. **Cross-Attention Taxonomy Discovery.** Type labels (or term representations) are embedded using a sentence-transformer model and passed through a lightweight cross-attention layer. The resulting network approximates a soft adjacency matrix over types and is trained to distinguish positive (true parent–child) from negative (corrupted) edges.
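
To make the cross-attention step in item 3 above concrete, the snippet below is a minimal, untrained sketch of how a single cross-attention layer over type-label embeddings yields a soft adjacency matrix. Model names, dimensions, and the toy labels are illustrative; they do not reproduce the team's actual architecture or training loop.

.. code-block:: python

import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Toy type labels; in practice these come from the ontology's type inventory.
types = ["disease", "cancer", "lung cancer", "car", "sedan"]

# Embed the labels (all-MiniLM-L6-v2 produces 384-dimensional vectors).
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = torch.tensor(encoder.encode(types)).unsqueeze(0)   # (1, n_types, 384)

# A single cross-attention layer: its attention weights act as a soft
# adjacency matrix over types and would be trained on known is-a edges.
attn = nn.MultiheadAttention(embed_dim=384, num_heads=8, batch_first=True)
_, soft_adjacency = attn(emb, emb, emb)

print(soft_adjacency.squeeze(0).shape)   # (n_types, n_types) attention scores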


Term Typing (Random-Forest)
---------------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For term typing, we use GeoNames as an example ontology. Labeled term–type pairs are extracted and split into train and test sets.

.. code-block:: python

from ontolearner import GeoNames, train_test_split

# Load the GeoNames ontology and extract labeled term-typing data
ontology = GeoNames()
ontology.load()
data = ontology.extract()

# Split the labeled term-typing data into train and test sets
train_data, test_data = train_test_split(
data,
test_size=0.2,
random_state=42,
)

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
The available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

task = "term-typing"

We first configure the Alexbek random-forest learner.
This learner builds features from text embeddings (and optionally graph structure) and trains a random-forest classifier for term typing.

.. code-block:: python

from ontolearner.learner.term_typing import AlexbekRFLearner

rf_learner = AlexbekRFLearner(
device="cpu", # switch to "cuda" if available
batch_size=16,
max_length=512, # max tokenizer length for embedding inputs
threshold=0.30, # probability cutoff for assigning each type
use_graph_features=True # set False for pure text-based features
)
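
Conceptually, the random-forest variant reduces to embedding each term and fitting a multi-label classifier on those embeddings. The stand-alone sketch below illustrates that idea on toy data; it is not the learner's actual implementation, and the optional graph-based features are omitted.

.. code-block:: python

from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy labeled terms; real training data comes from the extracted ontology.
terms = ["aspirin", "myocardial infarction", "lung cancer", "ibuprofen"]
labels = [["Pharmaceutical Drug"], ["Disease"], ["Disease"], ["Pharmaceutical Drug"]]

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
X = encoder.encode(terms)                 # dense text embeddings
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)             # multi-label indicator matrix

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X, Y)                             # multi-output random forest

pred = clf.predict(encoder.encode(["paracetamol"]))
print(mlb.inverse_transform(pred))        # predicted type set for the new term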

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

from ontolearner import evaluation_report
# Fit the RF-based learner on the training split
rf_learner.fit(train_data, task=task)

# Predict types for the held-out test terms
predicts = rf_learner.predict(test_data, task=task)

# Build gold labels and evaluate
truth = rf_learner.tasks_ground_truth_former(data=test_data, task=task)
metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
print(metrics)

Term Typing (RAG-based)
-----------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

The RAG-based term-typing setup also uses GeoNames. We again load the ontology and split labeled term–type instances into train and test sets.

.. code-block:: python

from ontolearner import GeoNames, train_test_split

ontology = GeoNames()
ontology.load()
data = ontology.extract()

# Extract labeled items and split into train/test sets for evaluation
train_data, test_data = train_test_split(
data,
test_size=0.2,
random_state=42,
)

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
The available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

task = "term-typing"

Next, we configure a Retrieval-Augmented Generation (RAG) term-typing classifier.
An encoder retrieves top-k similar training examples, and a generative LLM predicts types conditioned on the query term plus retrieved examples.

.. code-block:: python

from ontolearner.learner.term_typing import AlexbekRAGLearner

rag_learner = AlexbekRAGLearner(
llm_model_id="Qwen/Qwen2.5-0.5B-Instruct",
retriever_model_id="sentence-transformers/all-MiniLM-L6-v2",
device="cuda", # or "cpu"
top_k=3,
max_new_tokens=256,
output_dir="./results/",
)

# Load the underlying LLM and retriever for RAG-based term typing
rag_learner.load(llm_id=rag_learner.llm_model_id)
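
Internally, the retriever selects the top-k labeled terms most similar to the query, and the generative model receives them as in-context examples. The sketch below shows roughly what such a prompt could look like; the template and example data are illustrative, not the learner's exact prompt.

.. code-block:: python

# Illustrative prompt assembly; the learner's actual template may differ.
retrieved = [
    {"term": "river", "types": ["stream"]},
    {"term": "mountain pass", "types": ["pass"]},
    {"term": "lagoon", "types": ["lake"]},
]
query_term = "fjord"

examples = "\n".join(
    f'Term: {ex["term"]}\nTypes: {ex["types"]}' for ex in retrieved
)
prompt = (
    "Assign ontological types to the term, using the examples as guidance.\n"
    f"{examples}\n"
    f"Term: {query_term}\n"
    'Answer as JSON, e.g. {"types": ["..."]}:'
)
print(prompt)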

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

from ontolearner import evaluation_report

# Index the training data for retrieval and prepare prompts
rag_learner.fit(train_data, task=task)

# Predict types for the held-out test terms
predicts = rag_learner.predict(test_data, task=task)

# Build gold labels and evaluate
truth = rag_learner.tasks_ground_truth_former(data=test_data, task=task)
metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
print(metrics)


Taxonomy Discovery
------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For taxonomy discovery, we again use the GeoNames ontology. It exposes parent–child relations that can be embedded and fed to a cross-attention model.

.. code-block:: python

from ontolearner import GeoNames, train_test_split

ontology = GeoNames()
ontology.load()
data = ontology.extract()

train_data, test_data = train_test_split(
data,
test_size=0.2,
random_state=42,
)

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
The available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

task = "taxonomy-discovery"

Next, we configure the Alexbek cross-attention learner.
It uses embeddings of type labels and a lightweight cross-attention layer to predict *is-a* relations.

.. code-block:: python

from ontolearner import AlexbekCrossAttnLearner

cross_learner = AlexbekCrossAttnLearner(
embedding_model="sentence-transformers/all-MiniLM-L6-v2",
device="cpu",
num_heads=8,
lr=5e-5,
weight_decay=0.01,
num_epochs=1,
batch_size=256,
neg_ratio=1.0,
output_dir="./results/crossattn/",
seed=42,
)

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

from ontolearner import evaluation_report

# Train the cross-attention model on taxonomic edges
cross_learner.fit(train_data, task=task)

# Predict taxonomic relations on the test set
predicts = cross_learner.predict(test_data, task=task)

# Build gold labels and evaluate
truth = cross_learner.tasks_ground_truth_former(data=test_data, task=task)
metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
print(metrics)