
Commit 695da63

✨ add learners with documentations #288 (dev)
2 parents 675c5f1 + ac0c232 commit 695da63

35 files changed (+8676, -2 lines)

docs/source/index.rst

Lines changed: 2 additions & 1 deletion
@@ -197,13 +197,14 @@ or GitHub repository:
    learning_tasks/text2onto

 .. toctree::
-   :maxdepth: 1
+   :maxdepth: 4
    :caption: Learner Models
    :hidden:

    learners/llm
    learners/retrieval
    learners/rag
+   learners/llms4ol

 .. toctree::
    :maxdepth: 4

docs/source/learners/llms4ol.rst

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@

.. sidebar:: Challenge Series Websites

    * `1st LLMs4OL @ ISWC 2024 <https://sites.google.com/view/llms4ol>`_
    * `2nd LLMs4OL @ ISWC 2025 <https://sites.google.com/view/llms4ol2025>`_


.. raw:: html

    <div align="center">
      <img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/challenge-logo.png" alt="challenge-logo" width="10%"/>
    </div>

LLMs4OL Challenge
==================================================================================================================

LLMs4OL is a community development initiative co-located with the International Semantic Web Conference (ISWC). It explores the potential of Large Language Models (LLMs) for Ontology Learning (OL), a vital process for enriching the web with structured knowledge and improving interoperability. By leveraging LLMs, the challenge aims to advance understanding and innovation in OL, in line with the Semantic Web's goal of a more intelligent and user-friendly web.


.. list-table::
    :widths: 20 20 60
    :header-rows: 1

    * - **Edition**
      - **Task**
      - **Description**
    * - ``LLMs4OL'25``
      - **Text2Onto**
      - Extract ontological terms and types from unstructured text.

        **ID**: ``text-to-onto``

        **Info**: This task focuses on extracting foundational elements (Terms and Types) from unstructured text documents to build the initial structure of an ontology. It involves recognizing domain-relevant vocabulary (Term Extraction, SubTask 1) and categorizing it appropriately (Type Extraction, SubTask 2). It bridges the gap between natural language and structured knowledge representation.

        **Example**: **COVID-19** is a term of the type **Disease**.
    * - ``LLMs4OL'24``, ``LLMs4OL'25``
      - **Term Typing**
      - Discover the generalized type for a lexical term.

        **ID**: ``term-typing``

        **Info**: Assigning a generalized type to each lexical term means mapping lexical items to their most appropriate semantic categories or ontological classes. For example, in the biomedical domain, the term ``aspirin`` should be classified under ``Pharmaceutical Drug``. This task is crucial for organizing extracted terms into structured ontologies and improving knowledge reuse.

        **Example**: Assign the type ``"disease"`` to the term ``"myocardial infarction"``.
    * - ``LLMs4OL'24``, ``LLMs4OL'25``
      - **Taxonomy Discovery**
      - Discover the taxonomic hierarchy between type pairs.

        **ID**: ``taxonomy-discovery``

        **Info**: Taxonomy discovery focuses on identifying hierarchical relationships between types, enabling the construction of taxonomic structures (i.e., ``is-a`` relationships). Given a pair of terms or types, the task determines whether one is a subclass of the other. For example, discovering that ``Sedan is a subclass of Car`` contributes to structuring domain knowledge in a way that supports reasoning and inference in ontology-driven applications.

        **Example**: Recognize that ``"lung cancer"`` is a subclass of ``"cancer"``, which is a subclass of ``"disease"``.
    * - ``LLMs4OL'24``, ``LLMs4OL'25``
      - **Non-Taxonomic Relation Extraction**
      - Identify non-taxonomic, semantic relations between types.

        **ID**: ``non-taxonomic-re``

        **Info**: This task aims to extract non-hierarchical (non-taxonomic) semantic relations between concepts in an ontology. Unlike taxonomy discovery, which deals with ``is-a`` relationships, this task focuses on other meaningful associations such as part-whole (part-of), causal (causes), functional (used-for), and associative (related-to) relationships. For example, in a medical ontology, discovering that ``Aspirin treats Headache`` adds relational knowledge that enhances the utility of an ontology.

        **Example**: Identify that *"virus"* ``causes`` *"infection"* or *"aspirin"* ``treats`` *"headache"*.
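
To make three of the task definitions above concrete, the following sketch encodes the table's own examples as plain Python data. The class names, field names, and the ``ancestors`` helper are hypothetical and chosen only for illustration; they are not OntoLearner's data model.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class TermTyping:
        """A lexical term together with its generalized types."""
        term: str
        types: tuple


    @dataclass(frozen=True)
    class TaxonomyEdge:
        """A directed is-a edge: ``child`` is a subclass of ``parent``."""
        child: str
        parent: str


    @dataclass(frozen=True)
    class NonTaxonomicRelation:
        """A non-hierarchical relation between two types."""
        head: str
        relation: str
        tail: str


    # One illustrative record (or list of records) per task ID.
    examples = {
        "term-typing": TermTyping("myocardial infarction", ("disease",)),
        "taxonomy-discovery": [
            TaxonomyEdge("lung cancer", "cancer"),
            TaxonomyEdge("cancer", "disease"),
        ],
        "non-taxonomic-re": NonTaxonomicRelation("aspirin", "treats", "headache"),
    }


    def ancestors(type_label, edges):
        """Follow is-a edges upward and return every ancestor in order."""
        parents = {edge.child: edge.parent for edge in edges}
        chain = []
        while type_label in parents:
            type_label = parents[type_label]
            chain.append(type_label)
        return chain


    print(ancestors("lung cancer", examples["taxonomy-discovery"]))

The ``ancestors`` helper shows why taxonomy edges are stored as directed pairs: chaining them recovers the full hierarchy from the individual ``is-a`` decisions.
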


.. note::

    * Proceedings of the 1st LLMs4OL Challenge @ ISWC 2024 are available at `https://www.tib-op.org/ojs/index.php/ocp/issue/view/169 <https://www.tib-op.org/ojs/index.php/ocp/issue/view/169>`_
    * Proceedings of the 2nd LLMs4OL Challenge @ ISWC 2025 are available at `https://www.tib-op.org/ojs/index.php/ocp/issue/view/185 <https://www.tib-op.org/ojs/index.php/ocp/issue/view/185>`_

.. toctree::
    :maxdepth: 1
    :caption: LLMs4OL Challenge Series Participant Learners
    :titlesonly:

    llms4ol_challenge/rwthdbis_learner
    llms4ol_challenge/skhnlp_learner
    llms4ol_challenge/alexbek_learner
    llms4ol_challenge/sbunlp_learner
Lines changed: 252 additions & 0 deletions
@@ -0,0 +1,252 @@
Alexbek Learner
================

.. sidebar:: Alexbek Learner Examples

    * Text2Onto: `llm_learner_alexbek_text2onto.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_text2onto.py>`_
    * Term Typing: `llm_learner_alexbek_rf_term_typing.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_rf_term_typing.py>`_
    * Taxonomy Discovery: `llm_learner_alexbek_cross_attn_taxonomy_discovery.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_cross_attn_taxonomy_discovery.py>`_

The Alexbek team presented a system for Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the ontology construction pipeline: term extraction, term typing, and taxonomy discovery. Their approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling, each tailored to the demands of the respective task.

.. note::

    Read more about the model in `Alexbek at LLMs4OL 2025 Tasks A, B, and C: Heterogeneous LLM Methods for Ontology Learning (Few-Shot Prompting, Ensemble Typing, and Attention-Based Taxonomies) <https://www.tib-op.org/ojs/index.php/ocp/article/view/2899>`_.

.. hint::

    The original implementation is available in the `https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek <https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek>`_ repository.

Overview
---------------------------------

.. raw:: html

    <div align="center">
      <img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/alexbek-learner.png" alt="Alexbek Team" width="90%"/>
    </div>
    <br>

For **Task A (Text2Onto)**, they jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data is reformulated into a correspondence between documents, terms, and types, while test-time inference leverages semantically similar training examples. This single-pass method requires no model fine-tuning and leverages lexical augmentation.

For **Task B (Term Typing)**, which involves assigning types to given terms, they adopt a dual strategy. In the few-shot setting (for domains with labeled training data), they reuse the RAG scheme with few-shot prompting. In the zero-shot or label-scarce setting, they use a classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting (e.g., via random forests or RAG-style retrieval).

For **Task C (Taxonomy Discovery)**, they model taxonomy discovery as graph inference. Using embeddings of type labels, they train a lightweight cross-attention layer to predict *is-a* relations by approximating a soft adjacency matrix.

Methodological Summary:

1. **Retrieval-Augmented Text2Onto.** Training data is restructured into document–term–type correspondences. At inference time, the system retrieves semantically similar training examples and feeds them, together with the query document, into a small generative LLM to jointly predict candidate terms and their types.

2. **Hybrid Term Typing.**

   * **Random-Forest Variant.** Uses dense text embeddings (and optionally graph-based features from the ontology) as input to a random-forest classifier, producing multi-label type assignments per term.
   * **RAG-Based Variant.** Combines a bi-encoder retriever with a generative LLM: for each query term, top-*k* labeled examples are retrieved and concatenated into the prompt. The LLM then predicts types in a structured format (e.g., JSON), which are parsed and evaluated.

3. **Cross-Attention Taxonomy Discovery.** Type labels (or term representations) are embedded using a sentence-transformer model and passed through a lightweight cross-attention layer. The resulting network approximates a soft adjacency matrix over types and is trained to distinguish positive (true parent–child) from negative (corrupted) edges.
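
The soft adjacency matrix from the cross-attention item above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming scaled dot-product scores passed through a sigmoid. The toy 2-d embeddings and the fixed projection matrices ``w_query`` and ``w_key`` are hypothetical stand-ins for sentence-transformer embeddings and learned weights, and the actual learner's architecture may differ.

.. code-block:: python

    import math


    def matvec(matrix, vector):
        """Multiply a matrix (list of rows) by a vector."""
        return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


    def soft_adjacency(embeddings, w_query, w_key):
        """Score every ordered pair (i, j) with a value in (0, 1).

        Entry [i][j] is sigmoid(q_i . k_j / sqrt(d)), where q_i and k_j are
        the projected embeddings of type i and type j.
        """
        queries = [matvec(w_query, e) for e in embeddings]
        keys = [matvec(w_key, e) for e in embeddings]
        scale = math.sqrt(len(w_query))  # d = projected dimension
        return [
            [
                1.0 / (1.0 + math.exp(-sum(q * k for q, k in zip(qi, kj)) / scale))
                for kj in keys
            ]
            for qi in queries
        ]


    # Toy embeddings for three type labels (hypothetical values).
    embeddings = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
    identity = [[1.0, 0.0], [0.0, 1.0]]  # untrained placeholder projections

    adjacency = soft_adjacency(embeddings, identity, identity)
    for row in adjacency:
        print([round(value, 3) for value in row])

With identical query and key projections the score matrix is symmetric, so separate projections are what would let such a model represent the direction of an *is-a* edge.
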


Term Typing (Random-Forest)
---------------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For term typing, we use GeoNames as an example ontology. Labeled term–type pairs are extracted and split into train and test sets.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    # Load the GeoNames ontology and extract labeled term-typing data
    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    # Split the labeled term-typing data into train and test sets
    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
Available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

    task = "term-typing"

We first configure the Alexbek random-forest learner.
This learner builds features from text embeddings (and optionally graph structure) and trains a random-forest classifier for term typing.

.. code-block:: python

    from ontolearner.learner.term_typing import AlexbekRFLearner

    rf_learner = AlexbekRFLearner(
        device="cpu",             # switch to "cuda" if available
        batch_size=16,
        max_length=512,           # max tokenizer length for embedding inputs
        threshold=0.30,           # probability cutoff for assigning each type
        use_graph_features=True,  # set False for pure text-based features
    )
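
To illustrate what a probability cutoff such as ``threshold=0.30`` means for multi-label typing, here is a minimal sketch. The ``assign_types`` function is hypothetical, and both the ``>=`` comparison and the fallback to the single best label are assumptions made for the example, not documented behavior of ``AlexbekRFLearner``.

.. code-block:: python

    def assign_types(probabilities, threshold=0.30):
        """Return every label whose probability reaches the cutoff.

        ``probabilities`` maps type labels to per-label scores, such as those
        a multi-label classifier might produce for one term.
        """
        ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
        selected = [label for label, score in ranked if score >= threshold]
        # Assumed fallback: never return an empty prediction for a term.
        if not selected and ranked:
            selected = [ranked[0][0]]
        return selected


    print(assign_types({"city": 0.62, "river": 0.31, "mountain": 0.07}))
    print(assign_types({"city": 0.20, "river": 0.10}))

A lower threshold tends to assign more types per term (higher recall), while a higher one tends to assign fewer (higher precision).
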

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Fit the RF-based learner on the training split
    rf_learner.fit(train_data, task=task)

    # Predict types for the held-out test terms
    predicts = rf_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = rf_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)
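
For readers who want intuition about what an evaluation of multi-label type predictions can measure, the sketch below computes micro-averaged precision, recall, and F1 over sets of labels. It is a generic illustration: the ``micro_prf`` function is hypothetical, and the exact metrics returned by ``evaluation_report`` are not specified here.

.. code-block:: python

    def micro_prf(y_true, y_pred):
        """Micro-averaged precision, recall, and F1 over label sets."""
        tp = fp = fn = 0
        for gold, pred in zip(y_true, y_pred):
            gold, pred = set(gold), set(pred)
            tp += len(gold & pred)   # labels predicted and correct
            fp += len(pred - gold)   # labels predicted but wrong
            fn += len(gold - pred)   # gold labels that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        return {"precision": precision, "recall": recall, "f1": f1}


    gold_labels = [["city"], ["river", "stream"]]
    predicted_labels = [["city", "town"], ["river"]]
    print(micro_prf(gold_labels, predicted_labels))
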

Term Typing (RAG-based)
-----------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

The RAG-based term-typing setup also uses GeoNames. We again load the ontology and split labeled term–type instances into train and test sets.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    # Extract labeled items and split into train/test sets for evaluation
    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
Available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

    task = "term-typing"

Next, we configure a Retrieval-Augmented Generation (RAG) term-typing classifier.
An encoder retrieves the top-*k* most similar training examples, and a generative LLM predicts types conditioned on the query term plus the retrieved examples.

.. code-block:: python

    from ontolearner.learner.term_typing import AlexbekRAGLearner

    rag_learner = AlexbekRAGLearner(
        llm_model_id="Qwen/Qwen2.5-0.5B-Instruct",
        retriever_model_id="sentence-transformers/all-MiniLM-L6-v2",
        device="cuda",  # or "cpu"
        top_k=3,
        max_new_tokens=256,
        output_dir="./results/",
    )

    # Load the underlying LLM and retriever for RAG-based term typing
    rag_learner.load(llm_id=rag_learner.llm_model_id)
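
The retrieve-then-prompt step described for the RAG learner can be sketched without any model downloads. The example below is a toy stand-in: it swaps the sentence-transformer retriever for bag-of-words cosine similarity, and the function names, the prompt wording, and the type labels are all hypothetical rather than taken from ``AlexbekRAGLearner``.

.. code-block:: python

    import json
    import math
    from collections import Counter


    def embed(text):
        """Toy embedding: lowercase bag-of-words counts."""
        return Counter(text.lower().split())


    def cosine(a, b):
        """Cosine similarity between two bag-of-words vectors."""
        dot = sum(count * b[token] for token, count in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


    def retrieve(query, train_pairs, top_k=3):
        """Return the top_k (term, types) pairs most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(
            train_pairs,
            key=lambda pair: cosine(query_vec, embed(pair[0])),
            reverse=True,
        )
        return ranked[:top_k]


    def build_prompt(query, examples):
        """Concatenate retrieved examples and the query into a few-shot prompt."""
        parts = ["Assign types to the term. Answer with a JSON list."]
        for term, types in examples:
            parts.append(f"Term: {term}\nTypes: {json.dumps(types)}")
        parts.append(f"Term: {query}\nTypes:")
        return "\n\n".join(parts)


    train_pairs = [
        ("Lake Geneva", ["lake"]),
        ("Rhine river", ["stream"]),
        ("Mount Everest", ["mountain"]),
        ("Lake Victoria", ["lake"]),
    ]
    neighbours = retrieve("Lake Tahoe", train_pairs, top_k=2)
    print(build_prompt("Lake Tahoe", neighbours))

In a real pipeline the prompt would be passed to the generative LLM, and the JSON list in its answer would be parsed into the predicted types.
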

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Index the training data for retrieval and prepare prompts
    rag_learner.fit(train_data, task=task)

    # Predict types for the held-out test terms
    predicts = rag_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = rag_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)


Taxonomy Discovery
------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For taxonomy discovery, we again use the GeoNames ontology. It exposes parent–child relations that can be embedded and fed to a cross-attention model.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
Available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

    task = "taxonomy-discovery"

Next, we configure the Alexbek cross-attention learner.
It uses embeddings of type labels and a lightweight cross-attention layer to predict *is-a* relations.

.. code-block:: python

    from ontolearner import AlexbekCrossAttnLearner

    cross_learner = AlexbekCrossAttnLearner(
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
        num_heads=8,
        lr=5e-5,
        weight_decay=0.01,
        num_epochs=1,
        batch_size=256,
        neg_ratio=1.0,
        output_dir="./results/crossattn/",
        seed=42,
    )
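
The ``neg_ratio`` setting suggests that training pairs each true parent–child edge with sampled negative edges. The sketch below shows one plausible way to do this, assuming negatives are random ordered pairs of distinct types that are not known positives. The ``sample_negatives`` function and its sampling strategy are assumptions for illustration, not the learner's documented procedure.

.. code-block:: python

    import random


    def sample_negatives(positive_edges, neg_ratio=1.0, seed=42):
        """Sample corrupted (child, parent) pairs that are not true edges."""
        rng = random.Random(seed)
        positives = set(positive_edges)
        nodes = sorted({node for edge in positives for node in edge})

        # Never ask for more negatives than there are candidate pairs.
        max_possible = len(nodes) * (len(nodes) - 1) - len(positives)
        target = min(int(len(positives) * neg_ratio), max_possible)

        negatives = set()
        while len(negatives) < target:
            child, parent = rng.sample(nodes, 2)  # two distinct types
            if (child, parent) not in positives:
                negatives.add((child, parent))
        return sorted(negatives)


    edges = [
        ("lung cancer", "cancer"),
        ("cancer", "disease"),
        ("flu", "disease"),
    ]
    negative_edges = sample_negatives(edges, neg_ratio=1.0)
    print(negative_edges)

With ``neg_ratio=1.0`` this yields one negative per positive edge, which keeps the two classes balanced during training.
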

Learn and Predict
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Train the cross-attention model on taxonomic edges
    cross_learner.fit(train_data, task=task)

    # Predict taxonomic relations on the test set
    predicts = cross_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = cross_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)
