```
├── data
├── demo_nb
├── scripts
├── EDA
├── evaluation
├── src
│   ├── models
│   │   ├── __init__.py
│   │   ├── ontology_models.py
│   │   ├── ontology_mapper_rag.py
│   │   ├── ontology_mapper_lm.py
│   │   ├── ontology_mapper_st.py
│   │   └── method_model.yaml
│   ├── Engine
│   │   ├── ontology_mapping_engine.py
│   │   └── schema_mapping_engine.py
│   ├── CustomLogger
│   ├── KnowledgeDb
│   │   ├── faiss_sqlite_pipeline.py
│   │   └── db_clients
│   │       ├── nci_db.py
│   │       └── umls_db.py
│   └── Plotter
├── setup.py
├── readme.md
└── LICENSE
```

In order to use the schema and/or ontology mapping functionality in MetaHarmonizer, please follow the steps below.
- Clone the repository:
  ```bash
  git clone https://github.com/shbrief/MetaHarmonizer
  ```
- Create a conda environment:
  ```bash
  conda create -n demo_env python=3.10 -y
  ```
- Activate the environment:
  ```bash
  conda activate demo_env
  ```
- Upgrade pip, then install the dependencies:
  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- The datasets in this repository are encrypted to prevent contamination of the gold standard.
- For ontology mapping, you must provide:
  - A list of `query_terms`
  - A list of `corpus_terms`
- For schema mapping, provide a clinical metadata file.
- The schema mapping dictionary is available in the `/data` folder.
⚠️ You will not be able to use the encrypted demo datasets without authorization, but you can supply your own query and corpus lists.
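Since the encrypted demo datasets require authorization, you can build the query and corpus lists from your own data. A minimal sketch, assuming a hypothetical metadata table with a `disease` column (the file name, column name, and corpus terms here are illustrative, not part of MetaHarmonizer):

```python
import pandas as pd

# Hypothetical clinical metadata; in practice load your own file,
# e.g. pd.read_csv("my_metadata.tsv", sep="\t")
meta = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "disease": ["lung adenocarcinoma", "Lung Adenocarcinoma",
                "colorectal cancer", "colorectal cancer"],
})

# Query terms: unique, lightly normalized values from the column to map
query_terms = sorted(meta["disease"].str.strip().str.lower().unique())

# Corpus terms: the ontology labels to match against
# (a tiny hand-written stand-in for a real NCIT/UMLS term list)
corpus_terms = ["Lung Adenocarcinoma", "Colorectal Carcinoma", "Melanoma"]

print(query_terms)  # ['colorectal cancer', 'lung adenocarcinoma']
```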
- Ontology Mapping

```python
## Go into the correct directory by specifying the user_path where MetaHarmonizer was cloned
%cd <user_path>/MetaHarmonizer/

## Import required packages
import nest_asyncio
import pandas as pd
from importlib import reload

## Allow nested event loops (needed inside notebooks)
nest_asyncio.apply()

## Import the models/engine for ontology mapping
from src.Engine import get_ontology_engine
from src.models import ontology_mapper_st as om_st
from src.models import ontology_mapper_lm as om_lm
from src.models import ontology_mapper_rag as om_rag
from src.models import ontology_mapper_bi_encoder as om_bi

## The reload() calls are optional, useful only if you are editing the code live in a notebook
reload(om_st)
reload(om_lm)
reload(om_rag)
reload(om_bi)

OntoMapEngine = get_ontology_engine()

## Import useful utilities
from src.models.calc_stats import CalcStats  # for calculating accuracy (testing)
from src.utils.cleanup_vector_store import cleanup_vector_store  # for cleaning up the vector store

## Initialize the engine (query_list, large_corpus_list, and cura_map must be defined beforehand)
other_params = {"test_or_prod": "test"}
onto_engine_large = OntoMapEngine(method='sap-bert',
                                  category='disease',
                                  topk=5,
                                  query=query_list,
                                  corpus=large_corpus_list,
                                  cura_map=cura_map,
                                  om_strategy='lm',
                                  **other_params)

## Run the ontology mapping
lm_sapbert_disease_top5_result = onto_engine_large.run()

## For more examples, see demo_nb/ontology_mapper_workflow.ipynb
```
  Parameters that can be changed in the model:
  - `query` (list): list of query terms (one or many)
  - `corpus` (list): list of corpus terms to match against
  - `query` (df): DataFrame of queries, used for query enrichment in the `rag_bie` strategy
  - `corpus` (df): DataFrame of the corpus, used for concept retrieval in the `rag`/`rag_bie` strategies
  - `om_strategy` (str): four strategies are available:
    - `lm`: uses the [CLS] token to capture the embedding representation. The [CLS] embedding is computed in a more intricate way, taking into account both its own token/position embeddings and the surrounding context.
    - `st`: sentence-transformer based strategy using the model's default embedding method.
    - `rag`: combines retrieval from a knowledge database (e.g., FAISS + SQLite) with embedding-based similarity. Useful when the query requires additional context from a large corpus.
    - `rag_bie`: RAG with bi-encoder query enrichment. Still under development; may be merged into `rag` in future releases.
  - `method` (str): string key selecting one of the transformer models listed in the `method_model.yaml` mapping file.
  - `topk` (int): number of top matches to return for each query term in the query list.
  - `other_params` (dict): a kwargs-style dictionary that currently only takes a value for the key `test_or_prod`. If more parameters are added to the model in the future, they will go into this dictionary.
  - `cura_map` (dict): dictionary of paired query and ontology terms, used for evaluation or testing in the 'test' environment.

  Output: a DataFrame containing the top `topk` matches (5 in the example above) for each query term and their scores.
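Accuracy in 'test' mode is handled by `CalcStats`; conceptually, scoring top-k matches against `cura_map` looks like the sketch below. The result layout here (a query-to-ranked-matches dict) is a simplified stand-in, not the engine's actual output format:

```python
# Toy top-k results: each query mapped to its ranked candidate matches
results = {
    "lung adenocarcinoma": ["Lung Adenocarcinoma", "Adenocarcinoma"],
    "colorectal cancer": ["Colon Carcinoma", "Colorectal Carcinoma"],
}

# cura_map pairs each query with its curated ontology term
cura_map = {
    "lung adenocarcinoma": "Lung Adenocarcinoma",
    "colorectal cancer": "Colorectal Carcinoma",
}

def topk_accuracy(results, cura_map, k=1):
    """Fraction of queries whose curated term appears in the top-k matches."""
    hits = sum(cura_map[q] in matches[:k] for q, matches in results.items())
    return hits / len(results)

print(topk_accuracy(results, cura_map, k=1))  # 0.5
print(topk_accuracy(results, cura_map, k=2))  # 1.0
```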
- Schema Mapping

```python
from src.Engine import get_schema_engine

SchemaMapEngine = get_schema_engine()

# Initialize the engine
engine = SchemaMapEngine(
    clinical_data_path=YOUR_QUERY_FILE,
    mode="manual",  # Options: "auto" or "manual"
    top_k=5,
)

# Run Stages 1, 2 & 3 (and 4 if mode="auto")
engine.run_schema_mapping()

# (Optional) Run Stage 4 after manual review
engine.run_stage3_from_manual("path_to_stage3_results.csv")
```
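One of the matching methods the engine reports is `fuzzy` (see the output columns below). Its idea can be illustrated with the standard library's `difflib` — a simplified stand-in; the engine's actual matcher, normalization, and thresholds may differ:

```python
import difflib

# Candidate schema fields (an illustrative stand-in for the schema mapping dictionary)
schema_fields = ["age_at_diagnosis", "sex", "tumor_stage", "treatment_type"]

def fuzzy_top_k(query, candidates, k=3):
    """Return the k candidates most similar to query, with similarity scores in [0, 1]."""
    scored = [(c, difflib.SequenceMatcher(None, query.lower(), c.lower()).ratio())
              for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

matches = fuzzy_top_k("Tumor Stage", schema_fields, k=2)
print(matches[0][0])  # 'tumor_stage'
```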
  Parameters that can be changed in the model:
  - `clinical_data_path` (str): path to the clinical dataset (TSV or CSV).
  - `mode` (str):
    - `"auto"` → automatically proceed to Stage 4 if Stage 3 confidence is low
    - `"manual"` → output Stage 3 results for review; Stage 4 must be triggered manually
  - `top_k` (int): number of top matches returned for each column.
  Output:
  - CSV file: results saved to `data/schema_mapping_eval/` with suffix:
    - `_schema_map_auto.csv` for auto mode
    - `_schema_map_manual.csv` for manual mode
    - `_schema_map_stage3.csv` for Stage 3 results
  - Columns:
    - `query`
    - `stage` (stage1, stage2, stage3)
    - `method` (dict, fuzzy, numeric, alias, bert, freq)
    - `match{i}`, `match{i}_score`, `match{i}_source` (for top-k matches)
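Once the CSV is written, low-confidence matches can be pulled out for manual review with pandas. A sketch assuming the column layout above; the inline DataFrame and the 0.80 threshold are made up for illustration:

```python
import pandas as pd

# Stand-in for pd.read_csv("data/schema_mapping_eval/..._schema_map_manual.csv")
df = pd.DataFrame({
    "query":         ["Age", "Gender", "TumorGrade"],
    "stage":         ["stage1", "stage1", "stage3"],
    "method":        ["dict", "fuzzy", "bert"],
    "match1":        ["age_at_diagnosis", "sex", "tumor_grade"],
    "match1_score":  [0.98, 0.91, 0.62],
    "match1_source": ["dict", "alias", "bert"],
})

# Flag rows whose top match falls below a (made-up) confidence threshold
needs_review = df[df["match1_score"] < 0.80]
print(needs_review["query"].tolist())  # ['TumorGrade']
```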
The demo notebooks are located in the `/demo_nb` folder.
| Topic | Links | Resource Type |
|---|---|---|
| Review paper on all pretrained biomedical BERT models | Link | paper |
| Review of deep learning approaches for biomedical entity recognition | Link | paper |
| Comprehensive Review of pre-trained foundation models | Link | paper |
| KERMIT Knowledge graphs | Link | paper |
| LLMs4OM (Uses RAG Framework for matching concepts) | Link | paper |
| DeepOnto | Link | computational_tool |
| Text2Onto | Link | computational_tool |
| SapBert | Link | computational_tool |
| Ontology mapping with LLMs | Link | computational_tool |
| Exploring LLMs for ontology alignment | Link | computational_tool |
| Ontology alignment evaluation initiative | Link | dataset |
| Commonly used dataset for benchmarking of new methods | Link | dataset |
| NCIT Ontologies | Link | dataset |
| ML Friendly datasets for equivalence and subsumption mapping | Link | dataset |
| Positive and Negative Sampling Strategies for Representation Learning in Semantic Search | Link | blog |
| How to train sentence transformers | Link | blog |