Alexbek Learner
================

.. sidebar:: Alexbek Learner Examples

    * Text2Onto: `llm_learner_alexbek_text2onto.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_text2onto.py>`_
    * Term Typing: `llm_learner_alexbek_rf_term_typing.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_rf_term_typing.py>`_
    * Taxonomy Discovery: `llm_learner_alexbek_cross_attn_taxonomy_discovery.py <https://github.com/sciknoworg/OntoLearner/blob/main/examples/llm_learner_alexbek_cross_attn_taxonomy_discovery.py>`_

The team presented a comprehensive system for addressing Tasks A, B, and C of the LLMs4OL 2025 challenge, which together span the full ontology construction pipeline: term extraction, typing, and taxonomy discovery. Their approach combines retrieval-augmented prompting, zero-shot classification, and attention-based graph modeling, each tailored to the demands of its task.

.. note::

    Read more about the model at `Alexbek at LLMs4OL 2025 Tasks A, B, and C: Heterogeneous LLM Methods for Ontology Learning (Few-Shot Prompting, Ensemble Typing, and Attention-Based Taxonomies) <https://www.tib-op.org/ojs/index.php/ocp/article/view/2899>`_.

.. hint::

    The original implementation is available in the `LLMs4OL-Challenge-Alexbek <https://github.com/BelyaevaAlex/LLMs4OL-Challenge-Alexbek>`_ repository.

Overview
--------

.. raw:: html

    <div align="center">
        <img src="https://raw.githubusercontent.com/sciknoworg/OntoLearner/refs/heads/dev/docs/source/learners/images/alexbek-learner.png" alt="Alexbek Team" width="90%"/>
    </div>
    <br>

For **Task A (Text2Onto)**, they jointly extract domain-specific terms and their ontological types using a retrieval-augmented generation (RAG) pipeline. Training data is reformulated into a correspondence between documents, terms, and types, while test-time inference leverages semantically similar training examples. This single-pass method requires no model fine-tuning and uses lexical augmentation.

For **Task B (Term Typing)**, which involves assigning types to given terms, they adopt a dual strategy. In the few-shot setting (for domains with labeled training data), they reuse the RAG scheme with few-shot prompting. In the zero-shot or label-scarce setting, they use a classifier that combines cosine similarity scores from multiple embedding models using confidence-based weighting (e.g., via random forests or RAG-style retrieval).

For **Task C (Taxonomy Discovery)**, they model the task as graph inference. Using embeddings of type labels, they train a lightweight cross-attention layer to predict *is-a* relations by approximating a soft adjacency matrix.

Methodological Summary:

1. **Retrieval-Augmented Text2Onto.** Training data is restructured into document–term–type correspondences. At inference time, the system retrieves semantically similar training examples and feeds them, together with the query document, into a small generative LLM to jointly predict candidate terms and their types (see the retrieval sketch after this list).

2. **Hybrid Term Typing.**

   * **Random-Forest Variant.** Uses dense text embeddings (and optionally graph-based features from the ontology) as input to a random-forest classifier, producing multi-label type assignments per term.
   * **RAG-Based Variant.** Combines a bi-encoder retriever with a generative LLM: for each query term, top-*k* labeled examples are retrieved and concatenated into the prompt. The LLM then predicts types in a structured format (e.g., JSON), which are parsed and evaluated.

3. **Cross-Attention Taxonomy Discovery.** Type labels (or term representations) are embedded using a sentence-transformer model and passed through a lightweight cross-attention layer. The resulting network approximates a soft adjacency matrix over types and is trained to distinguish positive (true parent–child) from negative (corrupted) edges.

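To make the Text2Onto recipe concrete, the sketch below assembles a retrieval-augmented prompt with a sentence-transformers retriever. The documents, terms, and type codes are illustrative stand-ins, and the prompt format is an assumption for exposition, not the packaged implementation.

.. code-block:: python

    from sentence_transformers import SentenceTransformer, util

    # Toy (document, terms, types) correspondences standing in for training data
    train_docs = [
        ("Lakes and reservoirs are inland water bodies.", ["lake", "reservoir"], ["H.LK", "H.RSV"]),
        ("A mountain pass connects two valleys.", ["mountain pass"], ["T.PASS"]),
    ]
    query_doc = "Glaciers feed many alpine streams."

    retriever = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    doc_embs = retriever.encode([d for d, _, _ in train_docs], convert_to_tensor=True)
    query_emb = retriever.encode(query_doc, convert_to_tensor=True)

    # Retrieve the most similar training documents as few-shot context
    hits = util.semantic_search(query_emb, doc_embs, top_k=2)[0]
    examples = "\n\n".join(
        f"Document: {train_docs[h['corpus_id']][0]}\n"
        f"Terms: {train_docs[h['corpus_id']][1]}\n"
        f"Types: {train_docs[h['corpus_id']][2]}"
        for h in hits
    )

    # The assembled prompt would be passed to a small instruct LLM in a single pass
    prompt = (
        "Extract domain-specific terms and their ontological types as JSON.\n\n"
        f"{examples}\n\nDocument: {query_doc}\n"
    )
    print(prompt)

Because the few-shot examples are selected per query document, the method adapts to new domains without any fine-tuning.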

Term Typing (Random-Forest)
---------------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For term typing, we use GeoNames as an example ontology. Labeled term–type pairs are extracted and split into train and test sets.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    # Load the GeoNames ontology and extract labeled term-typing data
    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    # Split the labeled term-typing data into train and test sets
    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

Before defining the learner, choose the ontology learning task to perform.
The available tasks are described in `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_.
The task IDs are: ``term-typing``, ``taxonomy-discovery``, ``non-taxonomic-re``.

.. code-block:: python

    task = "term-typing"

We first configure the Alexbek random-forest learner.
This learner builds features from text embeddings (and optionally graph structure) and trains a random-forest classifier for term typing.

.. code-block:: python

    from ontolearner.learner.term_typing import AlexbekRFLearner

    rf_learner = AlexbekRFLearner(
        device="cpu",             # switch to "cuda" if available
        batch_size=16,
        max_length=512,           # max tokenizer length for embedding inputs
        threshold=0.30,           # probability cutoff for assigning each type
        use_graph_features=True,  # set False for pure text-based features
    )

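For intuition about ``threshold``, the following self-contained sketch shows multi-label typing with one-vs-rest random forests over term embeddings. The feature dimensions, the random label matrix, and the use of scikit-learn's ``MultiOutputClassifier`` are illustrative assumptions, not the learner's internals.

.. code-block:: python

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multioutput import MultiOutputClassifier

    rng = np.random.default_rng(42)
    X = rng.normal(size=(100, 384))        # stand-in for 384-d term embeddings
    Y = rng.integers(0, 2, size=(100, 5))  # 5 candidate types, multi-label targets

    clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42))
    clf.fit(X, Y)

    # predict_proba returns one (n_samples, 2) array per type; keep P(type present)
    probs = np.stack([p[:, 1] for p in clf.predict_proba(X)], axis=1)
    assigned = probs >= 0.30               # mirrors the learner's `threshold`
    print(assigned[:3])

A term receives every type whose estimated probability clears the cutoff, so lowering the threshold trades precision for recall.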

Learn and Predict
~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Fit the RF-based learner on the training split
    rf_learner.fit(train_data, task=task)

    # Predict types for the held-out test terms
    predicts = rf_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = rf_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)

Term Typing (RAG-based)
-----------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

The RAG-based term-typing setup also uses GeoNames. We again load the ontology and split labeled term–type instances into train and test sets.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    # Extract labeled items and split into train/test sets for evaluation
    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

As in the previous section, select the task ID, again ``term-typing`` (see `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_ for the full list).

.. code-block:: python

    task = "term-typing"

Next, we configure a Retrieval-Augmented Generation (RAG) term-typing classifier.
An encoder retrieves the top-*k* most similar training examples, and a generative LLM predicts types conditioned on the query term plus the retrieved examples.

.. code-block:: python

    from ontolearner.learner.term_typing import AlexbekRAGLearner

    rag_learner = AlexbekRAGLearner(
        llm_model_id="Qwen/Qwen2.5-0.5B-Instruct",
        retriever_model_id="sentence-transformers/all-MiniLM-L6-v2",
        device="cuda",  # or "cpu"
        top_k=3,
        max_new_tokens=256,
        output_dir="./results/",
    )

    # Load the underlying LLM and retriever for RAG-based term typing
    rag_learner.load(llm_id=rag_learner.llm_model_id)

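The RAG variant expects the LLM to answer in a structured format that can be parsed back into type labels. Below is a hedged sketch of such parsing; the exact prompt and output schema used by the learner are internal details, so the JSON shape here is an assumption.

.. code-block:: python

    import json
    import re

    # Example raw completion from the LLM (illustrative)
    raw = 'Here are the types: {"term": "Lake Baikal", "types": ["H.LK"]}'

    # Grab the first JSON object in the completion and parse it defensively
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    types = json.loads(match.group(0)).get("types", []) if match else []
    print(types)  # ['H.LK']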

Learn and Predict
~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Index the training data for retrieval and prepare prompts
    rag_learner.fit(train_data, task=task)

    # Predict types for the held-out test terms
    predicts = rag_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = rag_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)

Taxonomy Discovery
------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For taxonomy discovery, we again use the GeoNames ontology. It exposes parent–child relations that can be embedded and fed to a cross-attention model.

.. code-block:: python

    from ontolearner import GeoNames, train_test_split

    ontology = GeoNames()
    ontology.load()
    data = ontology.extract()

    train_data, test_data = train_test_split(
        data,
        test_size=0.2,
        random_state=42,
    )

Initialize Learner
~~~~~~~~~~~~~~~~~~

As before, select the task ID, this time ``taxonomy-discovery`` (see `LLMs4OL Paradigms <https://ontolearner.readthedocs.io/learning_tasks/llms4ol.html>`_ for the full list).

.. code-block:: python

    task = "taxonomy-discovery"

Next, we configure the Alexbek cross-attention learner.
It uses embeddings of type labels and a lightweight cross-attention layer to predict *is-a* relations.

.. code-block:: python

    from ontolearner import AlexbekCrossAttnLearner

    cross_learner = AlexbekCrossAttnLearner(
        embedding_model="sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
        num_heads=8,
        lr=5e-5,
        weight_decay=0.01,
        num_epochs=1,
        batch_size=256,
        neg_ratio=1.0,
        output_dir="./results/crossattn/",
        seed=42,
    )

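For intuition about the architecture, here is a minimal sketch of the cross-attention idea: attention weights over type-label embeddings act as a soft adjacency matrix whose entries score candidate *is-a* edges. The layer shapes and the use of ``nn.MultiheadAttention`` are assumptions for illustration, not the learner's exact internals.

.. code-block:: python

    import torch
    import torch.nn as nn

    num_types, dim = 10, 384                     # MiniLM embeddings are 384-d
    type_embs = torch.randn(1, num_types, dim)   # stand-in label embeddings (batched)

    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
    _, weights = attn(type_embs, type_embs, type_embs,
                      need_weights=True, average_attn_weights=True)

    # (num_types, num_types): entry [i, j] scores a candidate edge i -> j
    soft_adjacency = weights.squeeze(0)
    print(soft_adjacency.shape)

During training, weights for true parent–child pairs are pushed up while corrupted negative pairs (their proportion controlled by ``neg_ratio``) are pushed down.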

Learn and Predict
~~~~~~~~~~~~~~~~~

.. code-block:: python

    from ontolearner import evaluation_report

    # Train the cross-attention model on taxonomic edges
    cross_learner.fit(train_data, task=task)

    # Predict taxonomic relations on the test set
    predicts = cross_learner.predict(test_data, task=task)

    # Build gold labels and evaluate
    truth = cross_learner.tasks_ground_truth_former(data=test_data, task=task)
    metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
    print(metrics)