# Named Entity Recognition for Biomedical Text
Fine-tuned BiomedBERT on the BC5CDR corpus to detect Chemical and Disease entities in clinical and biomedical text.
- Overview
- Project Structure
- Dataset
- Model
- Installation
- Training
- Evaluation
- Inference
- Error Analysis
- ONNX Export & Benchmarking
- REST API
- Web Application
- Docker
- Tests
- Configuration
- Contributing
- License
- Acknowledgements
## Overview

This project implements an end-to-end pipeline for biomedical Named Entity Recognition (NER), covering data preparation, model training, evaluation, error analysis, ONNX optimisation, a production-ready REST API, and a static web frontend — all containerised with Docker.
The model identifies two entity types from clinical literature:
| Tag | Description |
|---|---|
| Chemical | Drugs, compounds, and chemical substances |
| Disease | Diseases, disorders, symptoms, and medical conditions |
Key results on the BC5CDR test set (5 865 samples):
| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Chemical | 91.74% | 94.22% | 92.96% | 9 692 |
| Disease | 75.61% | 90.02% | 82.17% | 2 772 |
| Overall | 87.72% | 93.29% | 90.42% | — |
## Project Structure

```text
Medical-NER/
├── api/                        # Flask REST API
│   ├── main.py                 # Application factory, routes, middleware
│   └── schemas.py              # Pydantic request / response models
├── app/                        # Static web frontend (HTML/CSS/JS)
│   ├── index.html
│   ├── app.js
│   ├── style.css
│   └── assets/
├── config/
│   └── config.yaml             # Single-source training configuration
├── data/
│   ├── raw/                    # Raw downloads (auto-populated)
│   └── processed/              # Tokenized examples for inspection
├── export/
│   └── onnx_export.py          # ONNX export + latency benchmarking
├── notebooks/
│   └── exploration.py          # Data exploration notebook
├── outputs/
│   ├── logs/                   # TensorBoard event files
│   ├── models/                 # Checkpoints and best model
│   │   └── best/               # Final production checkpoint
│   ├── onnx/                   # Exported ONNX model + benchmark JSON
│   └── results/                # Evaluation & error analysis JSON
├── scripts/
│   ├── train.py                # CLI: training
│   ├── evaluate.py             # CLI: evaluation
│   ├── predict.py              # CLI: single-text / batch inference
│   └── analyze_errors.py       # CLI: error analysis
├── src/
│   ├── data/
│   │   ├── dataset.py          # BC5CDR loading, tokenization, label alignment
│   │   ├── download.py         # Dataset download & statistics
│   │   ├── preprocessing.py    # Text normalisation utilities
│   │   └── augmentation.py     # Data augmentation strategies
│   ├── evaluation/
│   │   ├── evaluator.py        # Entity-level P/R/F1 with per-type breakdown
│   │   └── error_analysis.py   # FP, FN, boundary, and negation error analysis
│   ├── inference/
│   │   └── predict.py          # NERPredictor class (model → entity spans)
│   ├── models/
│   │   ├── ner_model.py        # BiomedBERT + classification head factory
│   │   └── layers.py           # Custom layers (CRF, attention pooling)
│   ├── training/
│   │   ├── trainer.py          # HuggingFace Trainer orchestration
│   │   └── metrics.py          # seqeval-based compute_metrics callback
│   └── utils/
│       ├── helpers.py          # Seed, config loading, device detection
│       └── logger.py           # Logging configuration
├── tests/
│   ├── test_model.py           # Model architecture & label mapping tests
│   ├── test_dataset.py         # Dataset loading & tokenization tests
│   └── test_inference.py       # Inference pipeline & entity decoding tests
├── .github/
│   └── workflows/
│       └── pages.yml           # GitHub Pages deployment for the web app
├── Dockerfile                  # Production API container
├── docker-compose.yml          # One-command deployment
├── requirements.txt            # Full development dependencies
├── requirements-api.txt        # Minimal API deployment dependencies
├── environment.yml             # Conda environment specification
└── setup.py                    # Package metadata
```
## Dataset

BC5CDR (BioCreative V Chemical Disease Relation) is a manually annotated corpus of 1 500 PubMed articles with gold-standard Chemical and Disease mentions in IOB2 format.
| Split | Samples | Avg. tokens | Chemical spans | Disease spans |
|---|---|---|---|---|
| Train | 4 560 | ~25 | 5 203 | 4 182 |
| Validation | 4 581 | ~25 | 5 347 | 4 244 |
| Test | 5 865 | ~26 | 9 692 | 2 772 |
- Source: tner/bc5cdr on HuggingFace
- Original paper: Li et al., BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database, 2016
- License: The BC5CDR corpus is distributed for research purposes by the BioCreative organisers. Refer to the BioCreative terms for usage conditions.
The dataset is downloaded automatically on first run via the HuggingFace datasets library. No manual setup is required.
## Model

The backbone is BiomedBERT (`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`), a BERT model pre-trained from scratch on PubMed abstracts and PubMed Central full-text articles. A linear token-classification head maps each subword representation to one of five IOB2 tags:
`O` · `B-Chemical` · `I-Chemical` · `B-Disease` · `I-Disease`
During tokenization, only the first subword piece of each word receives the original label; continuation subwords and special tokens are assigned -100 so the cross-entropy loss ignores them.
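The alignment rule can be sketched as a small helper over the `word_ids()` sequence that a fast tokenizer produces for each encoding (a hypothetical illustration, not the project's actual `dataset.py`):

```python
def align_labels(word_ids, word_labels):
    """Map word-level label ids onto subword positions.

    Special tokens (word id None) and continuation subwords get -100,
    so the cross-entropy loss ignores them; only the first subword of
    each word keeps the original label.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(-100)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "naloxone reverses hypotension" -> [CLS] nal ##oxone reverses hypo ##tension [SEP]
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 3]  # B-Chemical, O, B-Disease
print(align_labels(word_ids, word_labels))
# → [-100, 1, -100, 0, 3, -100, -100]
```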
The fine-tuned model is hosted on the HuggingFace Hub: zaky17/medical-ner-model
## Installation

Using conda:

```bash
conda env create -f environment.yml
conda activate medical-ner
pip install -e .
```

Or using a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
```

Verify the installation:

```bash
python -c "from src.models.ner_model import build_model; print('OK')"
```

## Training

Training is orchestrated by the HuggingFace Trainer with the following defaults (all configurable via `config/config.yaml` or CLI flags):
| Hyperparameter | Value |
|---|---|
| Backbone | BiomedBERT (uncased) |
| Learning rate | 3e-5 |
| Optimiser | AdamW |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Epochs | 5 |
| Batch size | 16 |
| Max sequence length | 512 |
| Mixed precision | FP16 (CUDA only) |
| Early stopping | patience = 3 on val F1 |
```bash
# Default run (reads config/config.yaml)
python scripts/train.py

# Override hyperparameters from the command line
python scripts/train.py --lr 5e-5 --epochs 10 --batch-size 32
```

Checkpoints are saved to `outputs/models/`. The best model (by validation F1) is automatically saved to `outputs/models/best/`. TensorBoard logs are written to `outputs/logs/`.
```bash
tensorboard --logdir outputs/logs
```

## Evaluation

Run entity-level evaluation on any split using the saved checkpoint:

```bash
python scripts/evaluate.py --checkpoint outputs/models/best
python scripts/evaluate.py --checkpoint outputs/models/best --split validation
```

Results are printed to stdout and saved as JSON to `outputs/results/eval_<split>.json`. Metrics are computed with seqeval at the entity level (strict matching) with a per-type breakdown.
Test set results:

```text
             precision    recall  f1-score   support

    Chemical    0.9174    0.9422    0.9296      9692
     Disease    0.7561    0.9002    0.8217      2772

   micro avg    0.8772    0.9329    0.9042     12464
```
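For reference, the micro average pools true positives, false positives, and false negatives across entity types before computing precision/recall/F1, rather than averaging the per-class scores. A sketch with illustrative counts (not the model's real confusion counts):

```python
def micro_prf(counts):
    """counts: {label: (tp, fp, fn)} pooled into one (precision, recall, f1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = micro_prf({"Chemical": (8, 2, 1), "Disease": (4, 2, 3)})
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.75 0.75 0.75
```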
## Inference

Command line:

```bash
# Single sentence
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --input "Aspirin can reduce the risk of heart disease."

# From a text file (one sentence per line)
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --file data/raw/samples.txt \
    --output results.json
```

Python:

```python
from src.inference.predict import NERPredictor

predictor = NERPredictor(checkpoint_dir="outputs/models/best")
entities = predictor.predict("Metformin is used to treat type 2 diabetes.")
for e in entities:
    print(f"[{e.label}] {e.text} (chars {e.start}–{e.end})")
```

Output:

```text
[Chemical] Metformin (chars 0–9)
[Disease] type 2 diabetes (chars 27–42)
```
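The span merging that a predictor like this performs — collapsing `B-`/`I-` tags into entities — can be illustrated at the token level. A simplified sketch, not the project's actual `_decode_entities`:

```python
def decode_entities(tags):
    """Collapse an IOB2 tag sequence into (label, start, end) token spans."""
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if start is not None:          # close any span still open
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O" and start is not None:
            spans.append((label, start, i))
            start = label = None
        # an I- tag matching the open label simply extends the span
    if start is not None:                  # flush a span ending at the sequence end
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-Chemical", "O", "O", "O", "B-Disease", "I-Disease"]
print(decode_entities(tags))
# → [('Chemical', 0, 1), ('Disease', 4, 6)]
```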
## Error Analysis

A dedicated script performs four types of analysis on model predictions:
- False positives — predicted entities with no matching gold span
- False negatives — gold entities missed by the model
- Boundary errors — partial overlap between predicted and gold spans
- Negation errors — entities incorrectly tagged in negated contexts (e.g. "no evidence of diabetes")
```bash
python scripts/analyze_errors.py --checkpoint outputs/models/best --top-n 20
```

Results are written to `outputs/results/error_analysis.json`.
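The distinction between a boundary error and a plain false positive comes down to span overlap. A simplified sketch of that check (hypothetical code, not the project's `error_analysis.py`):

```python
def classify_prediction(pred, gold_spans):
    """Classify a predicted (label, start, end) span against gold spans.

    Exact match -> "TP"; partial overlap with a same-label gold span ->
    "boundary"; no same-label overlap at all -> "FP".
    """
    for gold in gold_spans:
        if pred == gold:
            return "TP"
        same_label = pred[0] == gold[0]
        overlaps = pred[1] < gold[2] and gold[1] < pred[2]
        if same_label and overlaps:
            return "boundary"
    return "FP"

gold = [("Disease", 31, 44)]
print(classify_prediction(("Disease", 31, 36), gold))   # → boundary
print(classify_prediction(("Chemical", 0, 7), gold))    # → FP
```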
## ONNX Export & Benchmarking

The fine-tuned model can be exported to ONNX for faster CPU inference with ONNX Runtime:

```bash
python -m export.onnx_export --checkpoint outputs/models/best
```

This exports the model to `outputs/onnx/model.onnx` and runs a latency benchmark (100 samples, CPU):
| Metric | PyTorch | ONNX Runtime | Speedup |
|---|---|---|---|
| Mean (ms) | 38.53 | 15.08 | 2.56× |
| Median (ms) | 38.62 | 14.88 | 2.59× |
| P90 (ms) | 42.72 | 16.32 | 2.62× |
| P99 (ms) | 46.81 | 17.15 | 2.73× |
Full results are saved to outputs/onnx/benchmark_results.json.
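The percentile rows can be reproduced from raw latency samples; a nearest-rank sketch (the benchmark script may use a different percentile definition):

```python
import statistics

def summarize_latencies(samples_ms):
    """Mean/median/P90/P99 summary in the shape of the table above."""
    s = sorted(samples_ms)
    rank = lambda p: s[min(len(s) - 1, round(p / 100 * (len(s) - 1)))]
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "p90": rank(90),
        "p99": rank(99),
    }

print(summarize_latencies(list(range(1, 101))))
# → {'mean': 50.5, 'median': 50.5, 'p90': 90, 'p99': 99}
```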
## REST API

A production-ready Flask API serves the model behind a `/predict` endpoint.
- Rate limiting (60 req/min per client on `/predict`, 120 req/min globally)
- CORS enabled for cross-origin requests
- Pydantic input validation (max 10 000 characters)
- Request-ID and response-time headers
- Structured JSON error responses (400, 404, 413, 415, 422, 429, 500)
- Health check endpoint at `/health`
- 1 MB request body limit
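The per-client rule can be pictured as a sliding-window counter. A minimal sketch of the idea — not the API's actual implementation, which may rely on a rate-limiting library:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client key."""

    def __init__(self, limit=60, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client]
        while q and now - q[0] >= self.window:
            q.popleft()                 # drop requests outside the window
        if len(q) < self.limit:
            q.append(now)
            return True
        return False                    # would map to an HTTP 429 response

limiter = SlidingWindowLimiter(limit=2, window=60.0)
print(limiter.allow("1.2.3.4", now=0))   # → True
print(limiter.allow("1.2.3.4", now=1))   # → True
print(limiter.allow("1.2.3.4", now=2))   # → False
print(limiter.allow("1.2.3.4", now=61))  # → True
```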
Start the API:

```bash
python -m api.main --checkpoint outputs/models/best --port 8000
```

`GET /health` returns:

```json
{ "status": "ok", "model_loaded": true }
```

`POST /predict` request:

```json
{ "text": "Aspirin can reduce the risk of heart disease." }
```

Response:

```json
{
  "text": "Aspirin can reduce the risk of heart disease.",
  "entities": [
    { "text": "Aspirin", "label": "Chemical", "start": 0, "end": 7 },
    { "text": "heart disease", "label": "Disease", "start": 31, "end": 44 }
  ]
}
```

## Web Application

A lightweight static frontend lets users paste or upload clinical text (`.txt` / `.pdf`) and view annotated results with colour-coded entity highlights.
The app is deployed to GitHub Pages via the workflow in .github/workflows/pages.yml and communicates with the hosted API.
Local usage: open app/index.html in a browser (the API URL is configured in app/app.js).
## Docker

The Dockerfile builds a minimal CPU-only image that downloads the model from the HuggingFace Hub at build time.
```bash
docker compose up --build
```

The API will be available at `http://localhost:8000`.
| Environment variable | Default | Description |
|---|---|---|
| `CHECKPOINT_DIR` | `outputs/models/best` | Path to the model directory |
| `DEVICE` | `cpu` | Inference device (`cpu` / `cuda`) |
| `PORT` | `8000` | Port the API listens on |
The docker-compose.yml mounts ./outputs/models as a read-only volume so you can swap checkpoints without rebuilding the image. A health check pings /health every 30 seconds.
## Tests

The test suite covers the model architecture, dataset pipeline, and inference logic:
| Module | What it tests |
|---|---|
| `test_model.py` | Label constants, `build_model` output shape, config round-trip, save/load |
| `test_dataset.py` | Tokenization shape, label alignment (`-100` placement), no train/test leakage |
| `test_inference.py` | `Entity` dataclass, `_decode_entities` span merging, offset correctness |
```bash
# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ -v --cov=src --cov-report=term-missing
```

## Configuration

All training and model parameters live in `config/config.yaml`. CLI flags in `scripts/train.py` override any YAML value. The configuration is loaded via `src/utils/helpers.load_config()`.
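The YAML-vs-CLI precedence can be sketched as a simple merge in which a flag left unset (`None`) falls back to the YAML value — an illustration of the override rule, not the actual `load_config` helper:

```python
def merge_config(yaml_cfg, cli_overrides):
    """CLI flags win over YAML values; None means the flag was not given."""
    merged = dict(yaml_cfg)
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged

cfg = merge_config(
    {"learning_rate": 3e-5, "epochs": 5, "batch_size": 16},
    {"learning_rate": 5e-5, "epochs": None, "batch_size": 32},
)
print(cfg)
# → {'learning_rate': 5e-05, 'epochs': 5, 'batch_size': 32}
```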
Full config reference:
```yaml
project:
  name: medical-ner
  seed: 42

data:
  raw_dir: data/raw
  processed_dir: data/processed
  max_seq_length: 512
  label_list: [O, B-Chemical, I-Chemical, B-Disease, I-Disease]

model:
  name: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  num_labels: 5
  dropout: 0.1

training:
  learning_rate: 3.0e-5
  epochs: 5
  batch_size: 16
  weight_decay: 0.01
  warmup_steps: 500
  fp16: true
  gradient_accumulation_steps: 1
  early_stopping_patience: 3
  output_dir: outputs/models
  logging_dir: outputs/logs
  save_total_limit: 2
  log_every_n_steps: 50

inference:
  device: auto
  batch_size: 32
```

## Contributing

Contributions are welcome, whether it's a bug fix, a new feature, better docs, or a fresh idea.
1. Fork the repository
2. Create a branch for your feature or fix: `git checkout -b feature/my-change`
3. Make your changes and run the test suite: `pytest tests/ -v`
4. Commit with a clear message and push to your fork
5. Open a pull request against `main` describing what you changed and why
## License

This project is licensed under the MIT License.
Note: the BC5CDR dataset used for training is distributed by the BioCreative organisers under its own research-use terms. The model weights are derived from that data. See the Dataset section for details.
## Acknowledgements

- BiomedBERT — Gu et al., Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, ACM CHIL 2021
- BC5CDR — Li et al., BioCreative V CDR task corpus, Database, 2016
- seqeval — entity-level evaluation for sequence labelling
- HuggingFace Transformers — model training and inference backbone
- ONNX Runtime — optimised inference on CPU

