# Named Entity Recognition for Biomedical Text
Fine-tuned BiomedBERT on the BC5CDR corpus to detect Chemical and Disease entities in clinical and biomedical text.
- Overview
- Project Structure
- Dataset
- Model
- Installation
- Training
- Evaluation
- Inference
- Error Analysis
- ONNX Export & Benchmarking
- REST API
- Web Application
- Docker
- Tests
- Configuration
- Contributing
- License
- Acknowledgements
## Overview

This project implements an end-to-end pipeline for biomedical Named Entity Recognition (NER), covering data preparation, model training, evaluation, error analysis, ONNX optimisation, a production-ready REST API, and a static web frontend — all containerised with Docker.
The model identifies two entity types from clinical literature:
| Tag | Description |
|---|---|
| Chemical | Drugs, compounds, and chemical substances |
| Disease | Diseases, disorders, symptoms, and medical conditions |
Key results on the BC5CDR test set (5 865 samples):
| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Chemical | 91.74% | 94.22% | 92.96% | 9 692 |
| Disease | 75.61% | 90.02% | 82.17% | 2 772 |
| Overall | 87.72% | 93.29% | 90.42% | — |
## Project Structure

```text
Medical-NER/
├── api/                        # Flask REST API
│   ├── main.py                 # Application factory, routes, middleware
│   └── schemas.py              # Pydantic request / response models
├── app/                        # Static web frontend (HTML/CSS/JS)
│   ├── index.html
│   ├── app.js
│   ├── style.css
│   └── assets/
├── config/
│   └── config.yaml             # Single-source training configuration
├── data/
│   ├── raw/                    # Raw downloads (auto-populated)
│   └── processed/              # Tokenized examples for inspection
├── export/
│   └── onnx_export.py          # ONNX export + latency benchmarking
├── notebooks/
│   └── exploration.py          # Data exploration notebook
├── outputs/
│   ├── logs/                   # TensorBoard event files
│   ├── models/                 # Checkpoints and best model
│   │   └── best/               # Final production checkpoint
│   ├── onnx/                   # Exported ONNX model + benchmark JSON
│   └── results/                # Evaluation & error analysis JSON
├── scripts/
│   ├── train.py                # CLI: training
│   ├── evaluate.py             # CLI: evaluation
│   ├── predict.py              # CLI: single-text / batch inference
│   └── analyze_errors.py       # CLI: error analysis
├── src/
│   ├── data/
│   │   ├── dataset.py          # BC5CDR loading, tokenization, label alignment
│   │   ├── download.py         # Dataset download & statistics
│   │   ├── preprocessing.py    # Text normalisation utilities
│   │   └── augmentation.py     # Data augmentation strategies
│   ├── evaluation/
│   │   ├── evaluator.py        # Entity-level P/R/F1 with per-type breakdown
│   │   └── error_analysis.py   # FP, FN, boundary, and negation error analysis
│   ├── inference/
│   │   └── predict.py          # NERPredictor class (model → entity spans)
│   ├── models/
│   │   ├── ner_model.py        # BiomedBERT + classification head factory
│   │   └── layers.py           # Custom layers (CRF, attention pooling)
│   ├── training/
│   │   ├── trainer.py          # HuggingFace Trainer orchestration
│   │   └── metrics.py          # seqeval-based compute_metrics callback
│   └── utils/
│       ├── helpers.py          # Seed, config loading, device detection
│       └── logger.py           # Logging configuration
├── tests/
│   ├── test_model.py           # Model architecture & label mapping tests
│   ├── test_dataset.py         # Dataset loading & tokenization tests
│   └── test_inference.py       # Inference pipeline & entity decoding tests
├── .github/
│   └── workflows/
│       └── pages.yml           # GitHub Pages deployment for the web app
├── Dockerfile                  # Production API container
├── docker-compose.yml          # One-command deployment
├── requirements.txt            # Full development dependencies
├── requirements-api.txt        # Minimal API deployment dependencies
├── environment.yml             # Conda environment specification
└── setup.py                    # Package metadata
```
## Dataset

BC5CDR (BioCreative V Chemical Disease Relation) is a manually annotated corpus of 1 500 PubMed articles with gold-standard Chemical and Disease mentions in IOB2 format.
| Split | Samples | Avg. tokens | Chemical spans | Disease spans |
|---|---|---|---|---|
| Train | 4 560 | ~25 | 5 203 | 4 182 |
| Validation | 4 581 | ~25 | 5 347 | 4 244 |
| Test | 5 865 | ~26 | 9 692 | 2 772 |
- Source: tner/bc5cdr on HuggingFace
- Original paper: Li et al., BioCreative V CDR task corpus: a resource for chemical disease relation extraction, Database, 2016
- License: The BC5CDR corpus is distributed for research purposes by the BioCreative organisers. Refer to the BioCreative terms for usage conditions.
The dataset is downloaded automatically on first run via the HuggingFace datasets library. No manual setup is required.
## Model

The backbone is BiomedBERT (`microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`), a BERT model pre-trained from scratch on PubMed abstracts and PubMed Central full-text articles. A linear token-classification head maps each subword representation to one of five IOB2 tags:
`O` · `B-Chemical` · `I-Chemical` · `B-Disease` · `I-Disease`
During tokenization, only the first subword piece of each word receives the original label; continuation subwords and special tokens are assigned -100 so the cross-entropy loss ignores them.
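The alignment rule can be sketched as a small helper over the `word_ids()` sequence that a fast tokenizer produces for each encoding (a hypothetical illustration, not the project's actual `dataset.py`):

```python
def align_labels(word_ids, word_labels):
    """Map word-level label ids onto subword positions.

    Special tokens (word id None) and continuation subwords get -100,
    so the cross-entropy loss ignores them; only the first subword of
    each word keeps the original label.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None or wid == prev:
            aligned.append(-100)
        else:
            aligned.append(word_labels[wid])
        prev = wid
    return aligned

# "naloxone reverses hypotension" -> [CLS] nal ##oxone reverses hypo ##tension [SEP]
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 3]  # B-Chemical, O, B-Disease
print(align_labels(word_ids, word_labels))
# → [-100, 1, -100, 0, 3, -100, -100]
```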
The fine-tuned model is hosted on the HuggingFace Hub: zaky17/medical-ner-model
## Installation

Using conda:

```bash
conda env create -f environment.yml
conda activate medical-ner
pip install -e .
```

Or using a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
pip install -e .
```

Verify the installation:

```bash
python -c "from src.models.ner_model import build_model; print('OK')"
```

## Training

Training is orchestrated by the HuggingFace Trainer with the following defaults (all configurable via `config/config.yaml` or CLI flags):
| Hyperparameter | Value |
|---|---|
| Backbone | BiomedBERT (uncased) |
| Learning rate | 3e-5 |
| Optimiser | AdamW |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Epochs | 5 |
| Batch size | 16 |
| Max sequence length | 512 |
| Mixed precision | FP16 (CUDA only) |
| Early stopping | patience = 3 on val F1 |
```bash
# Default run (reads config/config.yaml)
python scripts/train.py

# Override hyperparameters from the command line
python scripts/train.py --lr 5e-5 --epochs 10 --batch-size 32
```

Checkpoints are saved to `outputs/models/`. The best model (by validation F1) is automatically saved to `outputs/models/best/`. TensorBoard logs are written to `outputs/logs/`.
```bash
tensorboard --logdir outputs/logs
```

## Evaluation

Run entity-level evaluation on any split using the saved checkpoint:

```bash
python scripts/evaluate.py --checkpoint outputs/models/best
python scripts/evaluate.py --checkpoint outputs/models/best --split validation
```

Results are printed to stdout and saved as JSON to `outputs/results/eval_<split>.json`. Metrics are computed with seqeval at the entity level (strict matching) with a per-type breakdown.
Test set results:

```text
             precision    recall  f1-score   support

    Chemical    0.9174    0.9422    0.9296      9692
     Disease    0.7561    0.9002    0.8217      2772

   micro avg    0.8772    0.9329    0.9042     12464
```
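For reference, the micro average pools true positives, false positives, and false negatives across entity types before computing precision/recall/F1, rather than averaging the per-class scores. A sketch with illustrative counts (not the model's real confusion counts):

```python
def micro_prf(counts):
    """counts: {label: (tp, fp, fn)} pooled into one (precision, recall, f1)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = micro_prf({"Chemical": (8, 2, 1), "Disease": (4, 2, 3)})
print(round(p, 2), round(r, 2), round(f1, 2))
# → 0.75 0.75 0.75
```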
## Inference

Command line:

```bash
# Single sentence
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --input "Aspirin can reduce the risk of heart disease."

# From a text file (one sentence per line)
python -m scripts.predict \
    --checkpoint outputs/models/best \
    --file data/raw/samples.txt \
    --output results.json
```

Python:

```python
from src.inference.predict import NERPredictor

predictor = NERPredictor(checkpoint_dir="outputs/models/best")
entities = predictor.predict("Metformin is used to treat type 2 diabetes.")
for e in entities:
    print(f"[{e.label}] {e.text} (chars {e.start}–{e.end})")
```

Output:

```text
[Chemical] Metformin (chars 0–9)
[Disease] type 2 diabetes (chars 27–42)
```
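The span merging that a predictor like this performs — collapsing `B-`/`I-` tags into entities — can be illustrated at the token level. A simplified sketch, not the project's actual `_decode_entities`:

```python
def decode_entities(tags):
    """Collapse an IOB2 tag sequence into (label, start, end) token spans."""
    spans = []
    start = label = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label):
            if start is not None:          # close any span still open
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O" and start is not None:
            spans.append((label, start, i))
            start = label = None
        # an I- tag matching the open label simply extends the span
    if start is not None:                  # flush a span ending at the sequence end
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-Chemical", "O", "O", "O", "B-Disease", "I-Disease"]
print(decode_entities(tags))
# → [('Chemical', 0, 1), ('Disease', 4, 6)]
```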
## Error Analysis

A dedicated script performs four types of analysis on model predictions:
- False positives — predicted entities with no matching gold span
- False negatives — gold entities missed by the model
- Boundary errors — partial overlap between predicted and gold spans
- Negation errors — entities incorrectly tagged in negated contexts (e.g. "no evidence of diabetes")
```bash
python scripts/analyze_errors.py --checkpoint outputs/models/best --top-n 20
```

Results are written to `outputs/results/error_analysis.json`.
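The distinction between a boundary error and a plain false positive comes down to span overlap. A simplified sketch of that check (hypothetical code, not the project's `error_analysis.py`):

```python
def classify_prediction(pred, gold_spans):
    """Classify a predicted (label, start, end) span against gold spans.

    Exact match -> "TP"; partial overlap with a same-label gold span ->
    "boundary"; no same-label overlap at all -> "FP".
    """
    for gold in gold_spans:
        if pred == gold:
            return "TP"
        same_label = pred[0] == gold[0]
        overlaps = pred[1] < gold[2] and gold[1] < pred[2]
        if same_label and overlaps:
            return "boundary"
    return "FP"

gold = [("Disease", 31, 44)]
print(classify_prediction(("Disease", 31, 36), gold))   # → boundary
print(classify_prediction(("Chemical", 0, 7), gold))    # → FP
```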
## ONNX Export & Benchmarking

The fine-tuned model can be exported to ONNX for faster CPU inference with ONNX Runtime:

```bash
python -m export.onnx_export --checkpoint outputs/models/best
```

This exports the model to `outputs/onnx/model.onnx` and runs a latency benchmark (100 samples, CPU):
| Metric | PyTorch | ONNX Runtime | Speedup |
|---|---|---|---|
| Mean (ms) | 38.53 | 15.08 | 2.56× |
| Median (ms) | 38.62 | 14.88 | 2.59× |
| P90 (ms) | 42.72 | 16.32 | 2.62× |
| P99 (ms) | 46.81 | 17.15 | 2.73× |
Full results are saved to outputs/onnx/benchmark_results.json.
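The percentile rows can be reproduced from raw latency samples; a nearest-rank sketch (the benchmark script may use a different percentile definition):

```python
import statistics

def summarize_latencies(samples_ms):
    """Mean/median/P90/P99 summary in the shape of the table above."""
    s = sorted(samples_ms)
    rank = lambda p: s[min(len(s) - 1, round(p / 100 * (len(s) - 1)))]
    return {
        "mean": statistics.mean(s),
        "median": statistics.median(s),
        "p90": rank(90),
        "p99": rank(99),
    }

print(summarize_latencies(list(range(1, 101))))
# → {'mean': 50.5, 'median': 50.5, 'p90': 90, 'p99': 99}
```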
## REST API

A production-ready Flask API serves the model behind a `/predict` endpoint.
- Rate limiting (60 req/min per client on `/predict`, 120 req/min globally)
- CORS enabled for cross-origin requests
- Pydantic input validation (max 10 000 characters)
- Request-ID and response-time headers
- Structured JSON error responses (400, 404, 413, 415, 422, 429, 500)
- Health check endpoint at `/health`
- 1 MB request body limit
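The per-client rule can be pictured as a sliding-window counter. A minimal sketch of the idea — not the API's actual implementation, which may rely on a rate-limiting library:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client key."""

    def __init__(self, limit=60, window=60.0):
        self.limit, self.window = limit, window
        self.hits = defaultdict(deque)

    def allow(self, client, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[client]
        while q and now - q[0] >= self.window:
            q.popleft()                 # drop requests outside the window
        if len(q) < self.limit:
            q.append(now)
            return True
        return False                    # would map to an HTTP 429 response

limiter = SlidingWindowLimiter(limit=2, window=60.0)
print(limiter.allow("1.2.3.4", now=0))   # → True
print(limiter.allow("1.2.3.4", now=1))   # → True
print(limiter.allow("1.2.3.4", now=2))   # → False
print(limiter.allow("1.2.3.4", now=61))  # → True
```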
Start the API:

```bash
python -m api.main --checkpoint outputs/models/best --port 8000
```

`GET /health` returns:

```json
{ "status": "ok", "model_loaded": true }
```

`POST /predict` request:

```json
{ "text": "Aspirin can reduce the risk of heart disease." }
```

Response:

```json
{
  "text": "Aspirin can reduce the risk of heart disease.",
  "entities": [
    { "text": "Aspirin", "label": "Chemical", "start": 0, "end": 7 },
    { "text": "heart disease", "label": "Disease", "start": 31, "end": 44 }
  ]
}
```

## Web Application

A lightweight static frontend lets users paste or upload clinical text (`.txt` / `.pdf`) and view annotated results with colour-coded entity highlights.
The app is deployed to GitHub Pages via the workflow in .github/workflows/pages.yml and communicates with the hosted API.
Local usage: open app/index.html in a browser (the API URL is configured in app/app.js).
## Docker

The Dockerfile builds a minimal CPU-only image that downloads the model from the HuggingFace Hub at build time.
```bash
docker compose up --build
```

The API will be available at `http://localhost:8000`.
| Environment variable | Default | Description |
|---|---|---|
| `CHECKPOINT_DIR` | `outputs/models/best` | Path to the model directory |
| `DEVICE` | `cpu` | Inference device (`cpu` / `cuda`) |
| `PORT` | `8000` | Port the API listens on |
The docker-compose.yml mounts ./outputs/models as a read-only volume so you can swap checkpoints without rebuilding the image. A health check pings /health every 30 seconds.
## Tests

The test suite covers the model architecture, dataset pipeline, and inference logic:
| Module | What it tests |
|---|---|
| `test_model.py` | Label constants, `build_model` output shape, config round-trip, save/load |
| `test_dataset.py` | Tokenization shape, label alignment (`-100` placement), no train/test leakage |
| `test_inference.py` | `Entity` dataclass, `_decode_entities` span merging, offset correctness |
```bash
# Run all tests
pytest tests/ -v

# With coverage
pytest tests/ -v --cov=src --cov-report=term-missing
```

## Configuration

All training and model parameters live in `config/config.yaml`. CLI flags in `scripts/train.py` override any YAML value. The configuration is loaded via `src/utils/helpers.load_config()`.
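The YAML-vs-CLI precedence can be sketched as a simple merge in which a flag left unset (`None`) falls back to the YAML value — an illustration of the override rule, not the actual `load_config` helper:

```python
def merge_config(yaml_cfg, cli_overrides):
    """CLI flags win over YAML values; None means the flag was not given."""
    merged = dict(yaml_cfg)
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged

cfg = merge_config(
    {"learning_rate": 3e-5, "epochs": 5, "batch_size": 16},
    {"learning_rate": 5e-5, "epochs": None, "batch_size": 32},
)
print(cfg)
# → {'learning_rate': 5e-05, 'epochs': 5, 'batch_size': 32}
```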
Full config reference:
```yaml
project:
  name: medical-ner
  seed: 42

data:
  raw_dir: data/raw
  processed_dir: data/processed
  max_seq_length: 512
  label_list: [O, B-Chemical, I-Chemical, B-Disease, I-Disease]

model:
  name: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  num_labels: 5
  dropout: 0.1

training:
  learning_rate: 3.0e-5
  epochs: 5
  batch_size: 16
  weight_decay: 0.01
  warmup_steps: 500
  fp16: true
  gradient_accumulation_steps: 1
  early_stopping_patience: 3
  output_dir: outputs/models
  logging_dir: outputs/logs
  save_total_limit: 2
  log_every_n_steps: 50

inference:
  device: auto
  batch_size: 32
```

## Contributing

Contributions are welcome, whether it's a bug fix, a new feature, better docs, or a fresh idea.
1. Fork the repository
2. Create a branch for your feature or fix: `git checkout -b feature/my-change`
3. Make your changes and run the test suite: `pytest tests/ -v`
4. Commit with a clear message and push to your fork
5. Open a pull request against `main` describing what you changed and why
## License

This project is licensed under the MIT License.
Note: the BC5CDR dataset used for training is distributed by the BioCreative organisers under its own research-use terms. The model weights are derived from that data. See the Dataset section for details.
## Acknowledgements

- BiomedBERT — Gu et al., Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, ACM CHIL 2021
- BC5CDR — Li et al., BioCreative V CDR task corpus, Database, 2016
- seqeval — entity-level evaluation for sequence labelling
- HuggingFace Transformers — model training and inference backbone
- ONNX Runtime — optimised inference on CPU

