Medical NER

Named Entity Recognition for Biomedical Text

Fine-tuned BiomedBERT on the BC5CDR corpus to detect Chemical and Disease entities in clinical and biomedical text.



Web application screenshot

Overview

This project implements an end-to-end pipeline for biomedical Named Entity Recognition (NER), covering data preparation, model training, evaluation, error analysis, ONNX optimisation, a production-ready REST API, and a static web frontend — all containerised with Docker.

The model identifies two entity types from clinical literature:

| Tag | Description |
|---|---|
| Chemical | Drugs, compounds, and chemical substances |
| Disease | Diseases, disorders, symptoms, and medical conditions |

Key results on the BC5CDR test set (5 865 samples):

| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Chemical | 91.74% | 94.22% | 92.96% | 9 692 |
| Disease | 75.61% | 90.02% | 82.17% | 2 772 |
| Overall | 87.72% | 93.29% | 90.42% | 12 464 |

Project Structure

Medical-NER/
├── api/                        # Flask REST API
│   ├── main.py                 #   Application factory, routes, middleware
│   └── schemas.py              #   Pydantic request / response models
├── app/                        # Static web frontend (HTML/CSS/JS)
│   ├── index.html
│   ├── app.js
│   ├── style.css
│   └── assets/
├── config/
│   └── config.yaml             # Single-source training configuration
├── data/
│   ├── raw/                    # Raw downloads (auto-populated)
│   └── processed/              # Tokenized examples for inspection
├── export/
│   └── onnx_export.py          # ONNX export + latency benchmarking
├── notebooks/
│   └── exploration.py          # Data exploration notebook
├── outputs/
│   ├── logs/                   # TensorBoard event files
│   ├── models/                 # Checkpoints and best model
│   │   └── best/               #   Final production checkpoint
│   ├── onnx/                   # Exported ONNX model + benchmark JSON
│   └── results/                # Evaluation & error analysis JSON
├── scripts/
│   ├── train.py                # CLI: training
│   ├── evaluate.py             # CLI: evaluation
│   ├── predict.py              # CLI: single-text / batch inference
│   └── analyze_errors.py       # CLI: error analysis
├── src/
│   ├── data/
│   │   ├── dataset.py          # BC5CDR loading, tokenization, label alignment
│   │   ├── download.py         # Dataset download & statistics
│   │   ├── preprocessing.py    # Text normalisation utilities
│   │   └── augmentation.py     # Data augmentation strategies
│   ├── evaluation/
│   │   ├── evaluator.py        # Entity-level P/R/F1 with per-type breakdown
│   │   └── error_analysis.py   # FP, FN, boundary, and negation error analysis
│   ├── inference/
│   │   └── predict.py          # NERPredictor class (model → entity spans)
│   ├── models/
│   │   ├── ner_model.py        # BiomedBERT + classification head factory
│   │   └── layers.py           # Custom layers (CRF, attention pooling)
│   ├── training/
│   │   ├── trainer.py          # HuggingFace Trainer orchestration
│   │   └── metrics.py          # seqeval-based compute_metrics callback
│   └── utils/
│       ├── helpers.py          # Seed, config loading, device detection
│       └── logger.py           # Logging configuration
├── tests/
│   ├── test_model.py           # Model architecture & label mapping tests
│   ├── test_dataset.py         # Dataset loading & tokenization tests
│   └── test_inference.py       # Inference pipeline & entity decoding tests
├── .github/
│   └── workflows/
│       └── pages.yml           # GitHub Pages deployment for the web app
├── Dockerfile                  # Production API container
├── docker-compose.yml          # One-command deployment
├── requirements.txt            # Full development dependencies
├── requirements-api.txt        # Minimal API deployment dependencies
├── environment.yml             # Conda environment specification
└── setup.py                    # Package metadata

Dataset

BC5CDR (BioCreative V Chemical Disease Relation) is a manually annotated corpus of 1 500 PubMed articles with gold-standard Chemical and Disease mentions in IOB2 format.

| Split | Samples | Avg. tokens | Chemical spans | Disease spans |
|---|---|---|---|---|
| Train | 4 560 | ~25 | 5 203 | 4 182 |
| Validation | 4 581 | ~25 | 5 347 | 4 244 |
| Test | 5 865 | ~26 | 9 692 | 2 772 |
  • Source: tner/bc5cdr on HuggingFace
  • Original paper: Li et al., BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database, 2016
  • License: The BC5CDR corpus is distributed for research purposes by the BioCreative organisers. Refer to the BioCreative terms for usage conditions.

The dataset is downloaded automatically on first run via the HuggingFace datasets library. No manual setup is required.


Model

The backbone is BiomedBERT (microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext), a BERT model pre-trained on the full text of biomedical papers from PubMed. A linear token-classification head maps each subword representation to one of five IOB2 tags:

O  ·  B-Chemical  ·  I-Chemical  ·  B-Disease  ·  I-Disease

During tokenization, only the first subword piece of each word receives the original label; continuation subwords and special tokens are assigned -100 so the cross-entropy loss ignores them.
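This first-subword alignment rule can be sketched in a few lines. The sketch assumes a HuggingFace-style `word_ids()` mapping (`None` for special tokens); `align_labels` is an illustrative helper, not necessarily the repository's exact function.

```python
# Sketch of first-subword label alignment: the first piece of each word
# keeps the word's label, continuation pieces and special tokens get -100
# so cross-entropy ignores them.
IGNORE_INDEX = -100

def align_labels(word_ids, word_labels):
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:                # [CLS], [SEP], padding
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word:     # first subword of a word
            aligned.append(word_labels[word_id])
        else:                              # continuation subword
            aligned.append(IGNORE_INDEX)
        previous_word = word_id
    return aligned

# "Metformin" -> ["met", "##form", "##in"]: only "met" keeps B-Chemical (1)
word_ids = [None, 0, 0, 0, 1, None]        # [CLS] met ##form ##in is [SEP]
labels = [1, 0]                            # B-Chemical, O
print(align_labels(word_ids, labels))      # [-100, 1, -100, -100, 0, -100]
```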

The fine-tuned model is hosted on the HuggingFace Hub: zaky17/medical-ner-model


Installation

Conda (recommended)

conda env create -f environment.yml
conda activate medical-ner
pip install -e .

pip

python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .

Verify

python -c "from src.models.ner_model import build_model; print('OK')"

Training

Training is orchestrated by the HuggingFace Trainer with the following defaults (all configurable via config/config.yaml or CLI flags):

| Hyperparameter | Value |
|---|---|
| Backbone | BiomedBERT (uncased) |
| Learning rate | 3e-5 |
| Optimiser | AdamW |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Epochs | 5 |
| Batch size | 16 |
| Max sequence length | 512 |
| Mixed precision | FP16 (CUDA only) |
| Early stopping | patience = 3 on validation F1 |

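The early-stopping rule (patience = 3 on validation F1) amounts to a small best-score tracker; a minimal sketch, with `EarlyStopper` as a hypothetical stand-in for the Trainer's callback:

```python
# Stop once validation F1 has failed to improve for `patience`
# consecutive evaluations.
class EarlyStopper:
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1):
        """Return True when training should stop."""
        if val_f1 > self.best:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=3)
for f1 in [0.85, 0.88, 0.87, 0.88, 0.88]:
    if stopper.step(f1):
        print("stopping early")       # fires on the third non-improvement
```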
# Default run (reads config/config.yaml)
python scripts/train.py

# Override hyperparameters from the command line
python scripts/train.py --lr 5e-5 --epochs 10 --batch-size 32

Checkpoints are saved to outputs/models/. The best model (by validation F1) is automatically saved to outputs/models/best/. TensorBoard logs are written to outputs/logs/.

tensorboard --logdir outputs/logs

TensorBoard training curves


Evaluation

Run entity-level evaluation on any split using the saved checkpoint:

python scripts/evaluate.py --checkpoint outputs/models/best
python scripts/evaluate.py --checkpoint outputs/models/best --split validation

Results are printed to stdout and saved as JSON to outputs/results/eval_<split>.json. Metrics are computed with seqeval at the entity level (strict matching) with a per-type breakdown.

Test set results:

              precision    recall  f1-score   support

    Chemical     0.9174    0.9422    0.9296      9692
     Disease     0.7561    0.9002    0.8217      2772

   micro avg     0.8772    0.9329    0.9042     12464
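Strict entity-level matching, as applied above, counts a prediction as correct only when both its span and its type match the gold annotation exactly. A minimal sketch of the idea; `extract_spans` is an illustrative helper, not the repository's evaluator:

```python
# Extract (start, end, type) spans from IOB2 tags, then score by exact
# span-and-type agreement.
def extract_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel closes any open span
        if tag.startswith("B-") or tag == "O":
            if start is not None:
                spans.append((start, i, etype))
                start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]        # tolerate I- without a B-
    return set(spans)

gold = ["B-Chemical", "O", "O", "B-Disease", "I-Disease"]
pred = ["B-Chemical", "O", "O", "B-Disease", "O"]
g, p = extract_spans(gold), extract_spans(pred)
tp = len(g & p)                              # truncated Disease span doesn't count
precision, recall = tp / len(p), tp / len(g)
print(precision, recall)                     # 0.5 0.5
```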

Inference

CLI

# Single sentence
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --input "Aspirin can reduce the risk of heart disease."

# From a text file (one sentence per line)
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --file data/raw/samples.txt \
    --output results.json

Python API

from src.inference.predict import NERPredictor

predictor = NERPredictor(checkpoint_dir="outputs/models/best")
entities = predictor.predict("Metformin is used to treat type 2 diabetes.")

for e in entities:
    print(f"[{e.label}] {e.text} (chars {e.start}–{e.end})")
[Chemical] Metformin (chars 0–9)
[Disease] type 2 diabetes (chars 27–42)
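Under the hood, decoding merges token-level IOB2 predictions and their character offsets into entity spans. A sketch of the idea, similar in spirit to the repository's `_decode_entities`; the `decode_entities` helper and its inputs are illustrative:

```python
# Merge IOB2 tags + (start, end) character offsets into entity dicts:
# B- opens a span, a same-type I- extends it, anything else closes it.
def decode_entities(text, tags, offsets):
    entities, current = [], None
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"label": tag[2:], "start": start, "end": end}
        elif tag.startswith("I-") and current and tag[2:] == current["label"]:
            current["end"] = end             # extend the running span
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    for e in entities:
        e["text"] = text[e["start"]:e["end"]]
    return entities

text = "Metformin is used to treat type 2 diabetes."
tags = ["B-Chemical", "O", "O", "O", "O", "B-Disease", "I-Disease", "I-Disease", "O"]
offsets = [(0, 9), (10, 12), (13, 17), (18, 20), (21, 26),
           (27, 31), (32, 33), (34, 42), (42, 43)]
print(decode_entities(text, tags, offsets))
```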

Error Analysis

A dedicated script performs four types of analysis on model predictions:

  1. False positives — predicted entities with no matching gold span
  2. False negatives — gold entities missed by the model
  3. Boundary errors — partial overlap between predicted and gold spans
  4. Negation errors — entities incorrectly tagged in negated contexts (e.g. "no evidence of diabetes")
python scripts/analyze_errors.py --checkpoint outputs/models/best --top-n 20

Results are written to outputs/results/error_analysis.json.
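The boundary-error category reduces to a span-overlap test: a predicted span that overlaps a same-type gold span without matching it exactly. A minimal sketch, with `is_boundary_error` as a hypothetical helper:

```python
# Spans are (start, end, label) tuples with exclusive end offsets.
def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def is_boundary_error(pred, gold_spans):
    """True if pred partially overlaps some same-type gold span."""
    return any(
        overlaps(pred, g) and pred[:2] != g[:2]
        for g in gold_spans
        if g[2] == pred[2]
    )

gold = [(31, 44, "Disease")]                          # "heart disease"
print(is_boundary_error((37, 44, "Disease"), gold))   # True: only "disease"
print(is_boundary_error((31, 44, "Disease"), gold))   # False: exact match
```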


ONNX Export & Benchmarking

The fine-tuned model can be exported to ONNX for faster CPU inference with ONNX Runtime:

python -m export.onnx_export --checkpoint outputs/models/best

This exports the model to outputs/onnx/model.onnx and runs a latency benchmark (100 samples, CPU):

| Metric | PyTorch | ONNX Runtime | Speedup |
|---|---|---|---|
| Mean (ms) | 38.53 | 15.08 | 2.56× |
| Median (ms) | 38.62 | 14.88 | 2.59× |
| P90 (ms) | 42.72 | 16.32 | 2.62× |
| P99 (ms) | 46.81 | 17.15 | 2.73× |

Full results are saved to outputs/onnx/benchmark_results.json.
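The benchmark methodology (mean / median / P90 / P99 over repeated runs) can be sketched as follows, here with a stand-in workload rather than the real PyTorch or ONNX Runtime session:

```python
import statistics
import time

def benchmark(fn, n_samples=100):
    """Time fn() n_samples times and report latency percentiles in ms."""
    latencies_ms = []
    for _ in range(n_samples):
        t0 = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    latencies_ms.sort()
    return {
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p90_ms": latencies_ms[int(0.90 * n_samples)],
        "p99_ms": latencies_ms[int(0.99 * n_samples)],
    }

# Stand-in for session.run(...) / model(**inputs)
results = benchmark(lambda: sum(i * i for i in range(10_000)))
print({k: round(v, 2) for k, v in results.items()})
```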


REST API

A production-ready Flask API serves the model behind a /predict endpoint.

Features

  • Rate limiting (60 req/min per client on /predict, 120/min globally)
  • CORS enabled for cross-origin requests
  • Pydantic input validation (max 10 000 characters)
  • Request-ID and response-time headers
  • Structured JSON error responses (400, 404, 413, 415, 422, 429, 500)
  • Health check endpoint at /health
  • 1 MB request body limit

Running locally

python -m api.main --checkpoint outputs/models/best --port 8000

Endpoints

GET /health

{ "status": "ok", "model_loaded": true }

POST /predict

Request:

{ "text": "Aspirin can reduce the risk of heart disease." }

Response:

{
  "text": "Aspirin can reduce the risk of heart disease.",
  "entities": [
    { "text": "Aspirin", "label": "Chemical", "start": 0, "end": 7 },
    { "text": "heart disease", "label": "Disease", "start": 31, "end": 44 }
  ]
}
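A Python client for this endpoint needs only the standard library. A sketch, assuming the API is running at the default local address; `predict` and `format_entities` are illustrative helpers:

```python
import json
import urllib.request

def predict(text, base_url="http://localhost:8000"):
    """POST the text to /predict and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{base_url}/predict",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def format_entities(body):
    """Render each entity in the response as a '[Label] text' line."""
    return [f"[{e['label']}] {e['text']}" for e in body["entities"]]

# With the API running locally:
# body = predict("Aspirin can reduce the risk of heart disease.")
# print("\n".join(format_entities(body)))
```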

Web Application

A lightweight static frontend lets users paste or upload clinical text (.txt / .pdf) and view annotated results with colour-coded entity highlights.

The app is deployed to GitHub Pages via the workflow in .github/workflows/pages.yml and communicates with the hosted API.

Local usage: open app/index.html in a browser (the API URL is configured in app/app.js).


Docker

The Dockerfile builds a minimal CPU-only image that downloads the model from the HuggingFace Hub at build time.

Build and run

docker compose up --build

The API will be available at http://localhost:8000.

Configuration

| Environment variable | Default | Description |
|---|---|---|
| CHECKPOINT_DIR | outputs/models/best | Path to the model directory |
| DEVICE | cpu | Inference device (cpu / cuda) |
| PORT | 8000 | Port the API listens on |

The docker-compose.yml mounts ./outputs/models as a read-only volume so you can swap checkpoints without rebuilding the image. A health check pings /health every 30 seconds.


Tests

The test suite covers the model architecture, dataset pipeline, and inference logic:

| Module | What it tests |
|---|---|
| test_model.py | Label constants, build_model output shape, config round-trip, save/load |
| test_dataset.py | Tokenization shape, label alignment (-100 placement), no train/test leakage |
| test_inference.py | Entity dataclass, _decode_entities span merging, offset correctness |

# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ -v --cov=src --cov-report=term-missing

Configuration

All training and model parameters live in config/config.yaml. CLI flags in scripts/train.py override any YAML value. The configuration is loaded via src/utils/helpers.load_config().

Full config reference
project:
  name: medical-ner
  seed: 42

data:
  raw_dir: data/raw
  processed_dir: data/processed
  max_seq_length: 512
  label_list: [O, B-Chemical, I-Chemical, B-Disease, I-Disease]

model:
  name: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  num_labels: 5
  dropout: 0.1

training:
  learning_rate: 3.0e-5
  epochs: 5
  batch_size: 16
  weight_decay: 0.01
  warmup_steps: 500
  fp16: true
  gradient_accumulation_steps: 1
  early_stopping_patience: 3
  output_dir: outputs/models
  logging_dir: outputs/logs
  save_total_limit: 2
  log_every_n_steps: 50

inference:
  device: auto
  batch_size: 32

Contributing

Contributions are welcome — whether it's a bug fix, a new feature, better docs, or a fresh idea.

  1. Fork the repository
  2. Create a branch for your feature or fix: git checkout -b feature/my-change
  3. Make your changes and run the test suite: pytest tests/ -v
  4. Commit with a clear message and push to your fork
  5. Open a pull request against main describing what you changed and why

License

This project is licensed under the MIT License.

Note: the BC5CDR dataset used for training is distributed by the BioCreative organisers under its own research-use terms. The model weights are derived from that data. See the Dataset section for details.

