Fine-tuning BERT models for clinical named entity recognition using the BC5CDR dataset.
- Clone the repository
- Install dependencies:
    pip install -r requirements.txt
- Copy and configure environment variables:
    cp .env.example .env # Edit .env with your settings

Train the model:
    python -m ner_trainer.train_cli train

Override specific parameters:
    python -m ner_trainer.train_cli train --epochs 3 --lr 2e-5

All training parameters can be configured via environment variables or CLI arguments. See .env.example for available options.

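One common way to let CLI arguments take precedence over environment variables is to use the environment values as argparse defaults. The sketch below illustrates that pattern only; it is not the actual code in ner_trainer.train_cli. The variable names NER_EPOCHS and NER_LR and the fallback values are assumptions, so check .env.example for the real option names.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults are read from environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="train_cli")
    sub = parser.add_subparsers(dest="command", required=True)
    train = sub.add_parser("train")
    # NER_EPOCHS and NER_LR are hypothetical names used for illustration.
    # The fallbacks (3 and 2e-5) are placeholders, not the project's defaults.
    train.add_argument("--epochs", type=int,
                       default=int(env.get("NER_EPOCHS", "3")))
    train.add_argument("--lr", type=float,
                       default=float(env.get("NER_LR", "2e-5")))
    return parser


# The environment supplies epochs; the explicit --lr flag overrides its default.
args = build_parser({"NER_EPOCHS": "5"}).parse_args(["train", "--lr", "3e-5"])
print(args.epochs, args.lr)
```

With this layout, an explicit flag always wins, the environment value applies when the flag is absent, and the hard-coded fallback applies only when neither is set.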
This project uses the BC5CDR (BioCreative V Chemical Disease Relation) dataset, a high-quality corpus for biomedical named entity recognition and relation extraction.
The BC5CDR dataset contains:
- 1,500 PubMed articles manually annotated with chemicals, diseases, and chemical-induced disease relationships
- Training Set: 500 articles for model training
- Development Set: 500 articles for validation and hyperparameter tuning
- Test Set: 500 articles for final evaluation
- Entity Types: Chemical compounds and Disease mentions
- Formats: Available in both BioC XML and PubTator text formats
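For token classification, the character-level Chemical and Disease spans are usually converted to per-token BIO tags. The following is a minimal sketch of that conversion, assuming a simple regex tokenizer and non-overlapping spans; the function name bio_tags is hypothetical, and the project's own preprocessing may differ. Word-level tags produced this way still need to be aligned to BERT subword tokens before training.

```python
import re


def bio_tags(text, entities):
    """Return (token, tag) pairs for a text.

    entities: list of (start, end, type) character spans, end exclusive.
    Tokens are runs of word characters or single punctuation marks.
    """
    pairs = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, etype in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                # The token that covers the span start opens the entity.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{etype}"
                break
        pairs.append((match.group(), tag))
    return pairs


# Made-up sentence for illustration, not taken from the corpus.
sentence = "Lithium carbonate can cause tremor."
spans = [(0, 17, "Chemical"), (28, 34, "Disease")]
for token, tag in bio_tags(sentence, spans):
    print(token, tag)
```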
The dataset can be downloaded from the official BioCreative V CDR task:
- Primary Source: BioCreative V CDR Task
- Alternative: NCBI BioCreative V
- ner_trainer/ - Main training package
- notebooks/ - Jupyter notebooks for experimentation
- data/ - Dataset files (BC5CDR corpus)

- Advanced checkpointing and resuming capabilities
- Multi-dataset support
- Hyperparameter optimization
- Model evaluation dashboards
data/CDR_Data/
├── CDR.Corpus.v010516/
│ ├── CDR_TrainingSet.BioC.xml # Training data (BioC format)
│ ├── CDR_TrainingSet.PubTator.txt # Training data (PubTator format)
│ ├── CDR_DevelopmentSet.BioC.xml # Validation data
│ ├── CDR_DevelopmentSet.PubTator.txt
│ ├── CDR_TestSet.BioC.xml # Test data
│ └── CDR_TestSet.PubTator.txt
├── DNorm.TestSet/ # Disease NER baseline results
├── tmChem.TestSet/ # Chemical NER baseline results
└── README.txt # Dataset documentation
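
The PubTator files listed above are plain text: each document has a `PMID|t|title` line and a `PMID|a|abstract` line, followed by tab-separated annotation lines (PMID, start, end, mention, type, concept ID) and relation lines marked with `CID`. The sketch below shows one way to read that layout. The Document class and parse_pubtator function are hypothetical helpers, not part of ner_trainer, and the sample record is made up with placeholder concept IDs. It assumes offsets index into the title and abstract joined by a single separator character.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    entities: list = field(default_factory=list)   # (start, end, mention, type, concept_id)
    relations: list = field(default_factory=list)  # (chemical_id, disease_id)

    @property
    def text(self):
        # Assumption: offsets index into title + one separator + abstract.
        return f"{self.title} {self.abstract}"


def parse_pubtator(raw):
    """Parse PubTator-formatted text into a list of Document objects."""
    docs = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        head = line.split("|", 2)
        if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], Document(pmid=head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        cols = line.split("\t")
        doc = docs.setdefault(cols[0], Document(pmid=cols[0]))
        if len(cols) >= 4 and cols[1] == "CID":
            doc.relations.append((cols[2], cols[3]))
        elif len(cols) >= 5:
            concept_id = cols[5] if len(cols) > 5 else ""
            doc.entities.append(
                (int(cols[1]), int(cols[2]), cols[3], cols[4], concept_id)
            )
    return list(docs.values())


# Made-up record; MESH_A and MESH_B are placeholder concept IDs.
SAMPLE = (
    "1|t|Lithium toxicity\n"
    "1|a|Lithium can cause tremor.\n"
    "1\t0\t7\tLithium\tChemical\tMESH_A\n"
    "1\t35\t41\ttremor\tDisease\tMESH_B\n"
    "1\tCID\tMESH_A\tMESH_B\n"
)

for doc in parse_pubtator(SAMPLE):
    for start, end, mention, etype, _ in doc.entities:
        print(etype, mention, doc.text[start:end])
```

To parse the real corpus, read a file such as data/CDR_Data/CDR.Corpus.v010516/CDR_TrainingSet.PubTator.txt as UTF-8 text and pass its contents to parse_pubtator.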
If you use this dataset, please cite the following papers:
- Wei CH, Peng Y, Leaman R, et al. Overview of the BioCreative V Chemical Disease Relation (CDR) Task. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, p154-166, 2015.
- Li J, Sun Y, Johnson RJ, et al. Annotating chemicals, diseases and their interactions in biomedical literature. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, p173-182, 2015.