Merged
23 commits
a63f6b2
add rwth_dbis learner models
Krishna-Rani-t Oct 22, 2025
1645709
added skhnlp learner models
Krishna-Rani-t Oct 29, 2025
844de4f
adding sbunlp learner models
Krishna-Rani-t Oct 29, 2025
be80e73
alexbek learner models
Krishna-Rani-t Nov 3, 2025
1abbbc9
added changes for taxonomy discovery and term typing
Krishna-Rani-t Nov 10, 2025
ec23135
removing changes from __init__.py files
Krishna-Rani-t Nov 11, 2025
2d49d94
Changes removed from requirements.txt
Krishna-Rani-t Nov 11, 2025
df6513d
updated __init__.py files and dependencies
Krishna-Rani-t Nov 12, 2025
47fd865
Update pyproject.toml
Krishna-Rani-t Nov 12, 2025
f658055
removed unnecessary changes from __init__.py
Krishna-Rani-t Nov 13, 2025
6323993
Merge branch 'learner-dev' into dev
Krishna-Rani-t Dec 20, 2025
6bdb68b
Merge pull request #296 from sciknoworg/dev
Krishna-Rani-t Dec 20, 2025
b843218
Add Text2onto learner models with documentation
Krishna-Rani-t Dec 21, 2025
686b3d2
:recycle: minor refactoring
HamedBabaei Jan 2, 2026
1dc59b3
Merge remote-tracking branch 'origin/dev' into dev
HamedBabaei Jan 2, 2026
a333d8e
:recycle: refactor augmented learners
HamedBabaei Jan 3, 2026
f219154
:sparkles: added OS compatibility test CI/CD
HamedBabaei Jan 3, 2026
a94c529
:sparkles: added text2onto based learners from challenge (PR #297)
HamedBabaei Jan 5, 2026
76d07db
:bookmark: v1.4.11
HamedBabaei Jan 5, 2026
883b254
:bug: bitsandbytes version fix
HamedBabaei Jan 5, 2026
1bd76fe
:test_tube:
HamedBabaei Jan 5, 2026
6838b82
:test_tube:
HamedBabaei Jan 5, 2026
da8e5b0
:test_tube:
HamedBabaei Jan 5, 2026
49 changes: 49 additions & 0 deletions .github/workflows/test-os-compatibility.yml
@@ -0,0 +1,49 @@
name: Cross-platform Compatibility Tests

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  os-compatibility-tests:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
        python-version: ["3.10"]

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install Poetry
        shell: bash
        run: |
          curl -sSL https://install.python-poetry.org | python -
          echo "$HOME/.local/bin" >> $GITHUB_PATH
          echo "$APPDATA/Python/Scripts" >> $GITHUB_PATH

      - name: Configure Poetry and install plugin
        shell: bash
        run: |
          poetry --version
          poetry config virtualenvs.create false
          poetry self add "poetry-dynamic-versioning[plugin]"

      - name: Install dependencies
        shell: bash
        run: |
          poetry install --no-interaction --no-ansi

      - name: Run tests
        shell: bash
        run: |
          poetry run pytest
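
The `Install Poetry` step appends two directories to `$GITHUB_PATH`, presumably because the installer puts the `poetry` executable in a different place on Windows than on Linux and macOS, so only one of the two entries is expected to exist on a given runner. The sketch below mirrors that choice in Python. The helper `poetry_bin_dir` and its parameters are hypothetical and are not part of the workflow; the two paths are copied from the step above.

```python
from pathlib import PurePosixPath, PureWindowsPath


def poetry_bin_dir(system: str, home: str, appdata: str = "") -> str:
    """Return the PATH entry from the workflow that applies to `system`.

    `system` follows platform.system() naming: "Linux", "Darwin", "Windows".
    This helper is illustrative only; the workflow simply appends both paths.
    """
    if system == "Windows":
        # Mirrors: echo "$APPDATA/Python/Scripts" >> $GITHUB_PATH
        return str(PureWindowsPath(appdata) / "Python" / "Scripts")
    # Mirrors: echo "$HOME/.local/bin" >> $GITHUB_PATH
    return str(PurePosixPath(home) / ".local" / "bin")


linux_dir = poetry_bin_dir("Linux", home="/home/runner")
windows_dir = poetry_bin_dir(
    "Windows", home="", appdata="C:/Users/runner/AppData/Roaming"
)
print(linux_dir)
print(windows_dir)
```

Appending both entries unconditionally, as the workflow does, avoids a per-OS branch in the YAML at the cost of one unused PATH entry per runner.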
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
## Changelog

### v1.4.11 (January 5, 2026)
- Add `text2onto` component for challenge learners with their documentation.
- Code refactoring
- OS compatibility CI/CD

### v1.4.10 (December 8, 2025)
- add complexity score
- add documentation for metrics
2 changes: 1 addition & 1 deletion docs/source/learners/llms4ol.rst
@@ -31,7 +31,7 @@ LLMs4OL is a community development initiative collocated with the International
- **Text2Onto**
- Extract ontological terms and types from unstructured text.

- **ID**: ``text-to-onto``
+ **ID**: ``text2onto``

**Info**: This task focuses on extracting foundational elements (Terms and Types) from unstructured text documents to build the initial structure of an ontology. It involves recognizing domain-relevant vocabulary (Term Extraction, SubTask 1) and categorizing it appropriately (Type Extraction, SubTask 2). It bridges the gap between natural language and structured knowledge representation.

144 changes: 144 additions & 0 deletions docs/source/learners/llms4ol_challenge/alexbek_learner.rst
@@ -250,3 +250,147 @@ Learn and Predict
truth = cross_learner.tasks_ground_truth_former(data=test_data, task=task)
metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
print(metrics)

Text2Onto
------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and then generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).

.. code-block:: python

    import os
    import dspy

    # Ontology loader/manager
    from ontolearner.ontology import OM

    # Text2Onto utilities: synthetic generation + dataset splitting
    from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter

    # ---- DSPy -> Ollama (LiteLLM-style) ----
    LLM_MODEL_ID = "ollama/llama3.2:3b" # use your pulled Ollama model
    LLM_API_KEY = "NA" # local Ollama doesn't use a key
    LLM_BASE_URL = "http://localhost:11434" # default Ollama endpoint

    dspy_llm = dspy.LM(
        model=LLM_MODEL_ID,
        cache=True,
        max_tokens=4000,
        temperature=0,
        api_key=LLM_API_KEY,
        base_url=LLM_BASE_URL,
    )
    dspy.configure(lm=dspy_llm)

    # ---- Synthetic generation configuration ----
    pseudo_sentence_batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
    max_worker_count_for_llm_calls = int(os.getenv("TEXT2ONTO_WORKERS", "1"))

    text2onto_synthetic_generator = SyntheticGenerator(
        batch_size=pseudo_sentence_batch_size,
        worker_count=max_worker_count_for_llm_calls,
    )

    # ---- Load ontology and extract structured data ----
    ontology = OM()
    ontology.load()
    ontological_data = ontology.extract()

    print(f"term types: {len(ontological_data.term_typings)}")
    print(f"taxonomic relations: {len(ontological_data.type_taxonomies.taxonomies)}")
    print(f"non-taxonomic relations: {len(ontological_data.type_non_taxonomic_relations.non_taxonomies)}")

    # ---- Generate synthetic Text2Onto samples ----
    synthetic_data = text2onto_synthetic_generator.generate(
        ontological_data=ontological_data,
        topic=ontology.domain,
    )
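
The example reads `TEXT2ONTO_BATCH` and `TEXT2ONTO_WORKERS` with a bare `int(os.getenv(...))`, which raises an unhelpful error on a malformed value and accepts zero or negative numbers. A slightly more defensive reader could look like the sketch below. The helper `env_int` and its `minimum` parameter are hypothetical additions for illustration and are not part of OntoLearner; the example sets one variable in-process only so that it can run on its own.

```python
import os


def env_int(name: str, default: int, minimum: int = 1) -> int:
    """Read an integer from the environment, falling back to `default`."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value


# Demonstration values only: one variable set, one left unset.
os.environ["TEXT2ONTO_BATCH"] = "25"
os.environ.pop("TEXT2ONTO_WORKERS", None)

batch_size = env_int("TEXT2ONTO_BATCH", default=10)
worker_count = env_int("TEXT2ONTO_WORKERS", default=1)
print(batch_size, worker_count)
```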

Split Synthetic Data
~~~~~~~~~~~~~~~~~~~~

We split the synthetic dataset into train/val/test sets using ``SyntheticDataSplitter``.
Each split is a dict with keys:

- ``documents``
- ``terms``
- ``types``
- ``terms2docs``
- ``terms2types``

.. code-block:: python

    splitter = SyntheticDataSplitter(
        synthetic_data=synthetic_data,
        onto_name=ontology.ontology_id,
    )

    train_data, val_data, test_data = splitter.train_test_val_split(
        train=0.8,
        val=0.0,
        test=0.2,
    )

    print("TRAIN sizes:")
    print(" documents:", len(train_data.get("documents", [])))
    print(" terms:", len(train_data.get("terms", [])))
    print(" types:", len(train_data.get("types", [])))
    print(" terms2docs:", len(train_data.get("terms2docs", {})))
    print(" terms2types:", len(train_data.get("terms2types", {})))

    print("TEST sizes:")
    print(" documents:", len(test_data.get("documents", [])))
    print(" terms:", len(test_data.get("terms", [])))
    print(" types:", len(test_data.get("types", [])))
    print(" terms2docs:", len(test_data.get("terms2docs", {})))
    print(" terms2types:", len(test_data.get("terms2types", {})))
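
The two blocks of `print` calls above repeat the same five lookups per split. One way to condense them is a small helper that returns the size of every documented key. This is a sketch: `split_sizes` is a hypothetical helper, and the values in `toy_split` are invented to show the dict layout, since the actual item schema produced by `SyntheticDataSplitter` is not shown here.

```python
SPLIT_KEYS = ("documents", "terms", "types", "terms2docs", "terms2types")


def split_sizes(split: dict) -> dict:
    """Return the number of entries under each documented key (0 if missing)."""
    return {key: len(split.get(key, [])) for key in SPLIT_KEYS}


# Invented example data; only the five top-level keys come from the docs.
toy_split = {
    "documents": [{"id": "d1", "text": "A cat is a mammal."}],
    "terms": ["cat", "mammal"],
    "types": ["Animal"],
    "terms2docs": {"cat": ["d1"], "mammal": ["d1"]},
    "terms2types": {"cat": ["Animal"]},
}

sizes = split_sizes(toy_split)
for key, count in sizes.items():
    print(f"{key}: {count}")
```

With the real splits, the same call would be `split_sizes(train_data)` and `split_sizes(test_data)`.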

Initialize Learner
~~~~~~~~~~~~~~~~~~

We configure a retrieval-augmented few-shot learner for the Text2Onto task.
The learner retrieves relevant synthetic examples and uses an LLM to predict structured outputs.

.. code-block:: python

    from ontolearner.learner.text2onto import AlexbekRAGFewShotLearner

    text2onto_learner = AlexbekRAGFewShotLearner(
        llm_model_id="Qwen/Qwen2.5-0.5B-Instruct",
        retriever_model_id="sentence-transformers/all-MiniLM-L6-v2",
        device="cpu", # set "cuda" if available
        top_k=3,
        max_new_tokens=256,
        use_tfidf=True,
    )
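
To make the retrieval step concrete, the sketch below picks the `top_k` most similar training examples for a query document. It is a stand-in, not the learner's implementation: it scores examples with cosine similarity over raw term counts in pure Python, whereas the configured learner relies on a sentence-transformer retriever and, with `use_tfidf=True`, TF-IDF signals. The function names and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenization; deliberately simplistic."""
    return text.lower().split()


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, examples: list[str], top_k: int = 3) -> list[str]:
    """Return the `top_k` examples most similar to `query`."""
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(ex))), idx)
        for idx, ex in enumerate(examples)
    ]
    # Highest similarity first; ties fall back to the original order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [examples[idx] for _, idx in scored[:top_k]]


examples = [
    "aspirin is a drug",
    "paris is a city",
    "ibuprofen is a drug",
]
nearest = retrieve_top_k("naproxen is a painkiller drug", examples, top_k=2)
print(nearest)
```

The retrieved examples would then be placed in the prompt as few-shot demonstrations before the LLM is asked about the query document.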

Learn and Predict
~~~~~~~~~~~~~~~~~

We run the end-to-end pipeline (train -> predict -> evaluate) with ``LearnerPipeline`` using the ``text2onto`` task id.

.. code-block:: python

    from ontolearner import LearnerPipeline

    task = "text2onto"

    pipe = LearnerPipeline(
        llm=text2onto_learner,
        llm_id="Qwen/Qwen2.5-0.5B-Instruct",
        ontologizer_data=False,
    )

    outputs = pipe(
        train_data=train_data,
        test_data=test_data,
        task=task,
        evaluate=True,
        ontologizer_data=False,
    )

    print("Metrics:", outputs.get("metrics"))
    print("Elapsed time:", outputs.get("elapsed_time"))
146 changes: 146 additions & 0 deletions docs/source/learners/llms4ol_challenge/sbunlp_learner.rst
@@ -31,6 +31,8 @@ Methodological Summary:

- For **Taxonomy Discovery**, the focus was on detecting parent–child relationships between ontology terms. Due to the relational nature of this task, batch prompting was employed to efficiently handle multiple type pairs per inference, enabling the model to consider several candidate relations jointly.

- For **Text2Onto**, the objective was to extract the building blocks of an ontology from text: identifying candidate terms in documents, assigning types to those terms, and producing supporting mappings such as term–document and term–type associations. In OntoLearner, this is implemented by first generating synthetic pseudo-documents from an ontology (using an LLM-backed synthetic generator), then applying the SBU-NLP prompting strategy to infer structured outputs without any fine-tuning. Dataset splitting and optional Ontologizer-style processing support reproducible evaluation and artifact generation.

Term Typing
-----------------------

@@ -179,3 +181,147 @@ Learn and Predict
# Evaluate taxonomy discovery performance
metrics = evaluation_report(y_true=truth, y_pred=predicts, task=task)
print(metrics)

Text2Onto
------------------

Loading Ontological Data
~~~~~~~~~~~~~~~~~~~~~~~~

For the Text2Onto task, we load an ontology (via ``OM``), extract its structured content, and generate synthetic pseudo-sentences using an LLM-backed generator (DSPy + Ollama in this example).

.. code-block:: python

    import os
    import dspy

    # Import ontology loader/manager and Text2Onto utilities
    from ontolearner.ontology import OM
    from ontolearner.text2onto import SyntheticGenerator, SyntheticDataSplitter

    # ---- DSPy -> Ollama (LiteLLM-style) ----
    LLM_MODEL_ID = "ollama/llama3.2:3b"
    LLM_API_KEY = "NA" # local Ollama doesn't use a key
    LLM_BASE_URL = "http://localhost:11434" # default Ollama endpoint

    dspy_llm = dspy.LM(
        model=LLM_MODEL_ID,
        cache=True,
        max_tokens=4000,
        temperature=0,
        api_key=LLM_API_KEY,
        base_url=LLM_BASE_URL,
    )
    dspy.configure(lm=dspy_llm)

    # ---- Synthetic generation configuration ----
    batch_size = int(os.getenv("TEXT2ONTO_BATCH", "10"))
    worker_count = int(os.getenv("TEXT2ONTO_WORKERS", "1"))

    text2onto_synthetic_generator = SyntheticGenerator(
        batch_size=batch_size,
        worker_count=worker_count,
    )

    # ---- Load ontology and extract structured data ----
    ontology = OM()
    ontology.load()
    ontological_data = ontology.extract()

    # Optional sanity checks to verify what was extracted from the ontology
    print(f"term types: {len(ontological_data.term_typings)}")
    print(f"taxonomic relations: {len(ontological_data.type_taxonomies.taxonomies)}")
    print(f"non-taxonomic relations: {len(ontological_data.type_non_taxonomic_relations.non_taxonomies)}")

    # ---- Generate synthetic Text2Onto samples ----
    synthetic_data = text2onto_synthetic_generator.generate(
        ontological_data=ontological_data,
        topic=ontology.domain,
    )

Split Synthetic Data
~~~~~~~~~~~~~~~~~~~~

We split the synthetic dataset into train/val/test sets using ``SyntheticDataSplitter``.
Each split is a dict with keys:

- ``documents``
- ``terms``
- ``types``
- ``terms2docs``
- ``terms2types``

.. code-block:: python

    splitter = SyntheticDataSplitter(
        synthetic_data=synthetic_data,
        onto_name=ontology.ontology_id,
    )

    train_data, val_data, test_data = splitter.train_test_val_split(
        train=0.8,
        val=0.0,
        test=0.2,
    )

    print("TRAIN sizes:")
    print(" documents:", len(train_data.get("documents", [])))
    print(" terms:", len(train_data.get("terms", [])))
    print(" types:", len(train_data.get("types", [])))
    print(" terms2docs:", len(train_data.get("terms2docs", {})))
    print(" terms2types:", len(train_data.get("terms2types", {})))

    print("TEST sizes:")
    print(" documents:", len(test_data.get("documents", [])))
    print(" terms:", len(test_data.get("terms", [])))
    print(" types:", len(test_data.get("types", [])))
    print(" terms2docs:", len(test_data.get("terms2docs", {})))
    print(" terms2types:", len(test_data.get("terms2types", {})))
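
With `train=0.8`, `val=0.0`, `test=0.2`, the validation split is expected to be empty. A quick way to predict the split sizes before running the pipeline is sketched below. The rounding rule (truncate train and val, give the remainder to test) is an assumption chosen for illustration; `SyntheticDataSplitter` may round differently, so treat the numbers as approximate.

```python
def expected_split_sizes(
    total: int, train: float, val: float, test: float
) -> tuple[int, int, int]:
    """Estimate (train, val, test) counts for `total` items.

    Assumed rounding: truncate train and val, assign the remainder to test.
    """
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("train, val and test fractions must sum to 1.0")
    n_train = int(total * train)
    n_val = int(total * val)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test


estimate = expected_split_sizes(total=50, train=0.8, val=0.0, test=0.2)
print(estimate)
```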

Initialize Learner
~~~~~~~~~~~~~~~~~~

We configure the SBU-NLP few-shot learner for the Text2Onto task.
This learner uses an LLM to produce predictions from the synthetic Text2Onto-style samples.

.. code-block:: python

    from ontolearner.learner.text2onto import SBUNLPFewShotLearner

    text2onto_learner = SBUNLPFewShotLearner(
        llm_model_id="Qwen/Qwen2.5-0.5B-Instruct",
        device="cpu", # set "cuda" if available
        max_new_tokens=256,
        output_dir="./results/",
    )
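
Few-shot prompting, as used by this learner, means showing the LLM a handful of solved examples before the document it should process. The sketch below assembles such a prompt for term extraction. The instruction wording, the `Document:`/`Terms:` layout, and the function `build_few_shot_prompt` are all hypothetical; the prompt template actually used by `SBUNLPFewShotLearner` is not shown in this diff.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, list[str]]], document: str
) -> str:
    """Build a prompt from (document, terms) demonstrations plus a new document."""
    lines = ["Extract the ontology terms mentioned in each document."]
    for text, terms in examples:
        lines.append(f"Document: {text}")
        lines.append(f"Terms: {', '.join(terms)}")
    # The final entry is left open so the model completes the term list.
    lines.append(f"Document: {document}")
    lines.append("Terms:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    examples=[("Aspirin relieves pain.", ["aspirin", "pain"])],
    document="Insulin regulates glucose.",
)
print(prompt)
```

Because no weights are updated, the quality of the output depends on the demonstrations chosen and on how strictly the completion is parsed back into a list.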

Learn and Predict
~~~~~~~~~~~~~~~~~

We run the end-to-end pipeline (train -> predict -> evaluate) with ``LearnerPipeline`` using the ``text2onto`` task id.

.. code-block:: python

    from ontolearner import LearnerPipeline

    task = "text2onto"

    pipe = LearnerPipeline(
        llm=text2onto_learner,
        llm_id="Qwen/Qwen2.5-0.5B-Instruct",
        ontologizer_data=False,
    )

    outputs = pipe(
        train_data=train_data,
        test_data=test_data,
        task=task,
        evaluate=True,
        ontologizer_data=True,
    )

    print("Metrics:", outputs.get("metrics"))
    print("Elapsed time:", outputs.get("elapsed_time"))

    # Print all returned outputs (often includes predictions/artifacts/logs)
    print(outputs)
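
Printing the whole `outputs` dict is handy for a first look, but persisting it makes runs easier to compare. The sketch below writes a dict to JSON and reads it back. The helper `save_outputs`, the file name, and the metric keys in `toy_outputs` are assumptions for illustration; only the `metrics` and `elapsed_time` keys appear in the example above.

```python
import json
import tempfile
from pathlib import Path


def save_outputs(outputs: dict, path: Path) -> Path:
    """Write `outputs` as indented JSON, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps the dump from failing on values json cannot encode natively.
    path.write_text(json.dumps(outputs, indent=2, default=str), encoding="utf-8")
    return path


# Invented values standing in for a real pipeline result.
toy_outputs = {"metrics": {"f1": 0.5}, "elapsed_time": 1.25}

target = Path(tempfile.mkdtemp()) / "results" / "text2onto.json"
saved = save_outputs(toy_outputs, target)
reloaded = json.loads(saved.read_text(encoding="utf-8"))
print(reloaded)
```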