ashwin2912/bert-entity-recognition-finetuning

BERT Entity Recognition

Fine-tuning BERT models for biomedical named entity recognition (chemical and disease mentions) using the BC5CDR dataset.

Setup

  1. Clone the repository.

  2. Install dependencies:

    pip install -r requirements.txt

  3. Copy and configure environment variables:

    cp .env.example .env
    # Edit .env with your settings

Usage

Train the model:

python -m ner_trainer.train_cli train

Override specific parameters:

python -m ner_trainer.train_cli train --epochs 3 --lr 2e-5

Configuration

All training parameters can be configured via environment variables or CLI arguments. See .env.example for available options.
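
As a rough illustration of how environment variables and CLI arguments can be layered, here is a minimal sketch. It is not the project's actual implementation: the variable names `EPOCHS` and `LR`, the fallback defaults, and the choice to let CLI arguments take precedence are all assumptions made for this example. Only the `--epochs` and `--lr` flags come from the usage shown above; see .env.example for the real option names.

```python
import argparse
import os


def resolve_config(argv=None, env=None):
    """Resolve settings so that CLI arguments override environment values.

    EPOCHS and LR are hypothetical variable names used for illustration;
    the fallback values 3 and 2e-5 are assumptions, not project defaults.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    # The environment value (or the assumed fallback) becomes the argparse
    # default, so an explicit flag on the command line always wins.
    parser.add_argument("--epochs", type=int, default=int(env.get("EPOCHS", 3)))
    parser.add_argument("--lr", type=float, default=float(env.get("LR", 2e-5)))
    return parser.parse_args(argv)


# --epochs comes from the command line, LR from the (simulated) environment.
config = resolve_config(["--epochs", "5"], env={"LR": "3e-5"})
print(config.epochs, config.lr)
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.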

Data

This project uses the BC5CDR (BioCreative V Chemical Disease Relation) dataset, a manually annotated corpus for biomedical named entity recognition and relation extraction.

Dataset Overview

The BC5CDR dataset contains:

  • 1,500 PubMed articles manually annotated with chemicals, diseases, and chemical-induced disease relationships
  • Training Set: 500 articles for model training
  • Development Set: 500 articles for validation and hyperparameter tuning
  • Test Set: 500 articles for final evaluation
  • Entity Types: Chemical compounds and Disease mentions
  • Formats: Available in both BioC XML and PubTator text formats
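
For token classification, character-level entity annotations are usually converted into per-token labels. The sketch below shows one common way to do that with BIO tags for the two entity types listed above. It assumes a simple regex tokenizer and the BIO scheme purely for illustration; the project may use a different labeling scheme, and a BERT pipeline would additionally need to align these labels with WordPiece sub-tokens.

```python
import re


def bio_tags(text, spans):
    """Convert character spans to token-level BIO tags.

    spans is a list of (start, end, entity_type) tuples with an exclusive
    end offset. Tokenization here is a plain regex split, an assumption
    made for illustration only.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        label = "O"
        for start, end, entity_type in spans:
            # A token is tagged only if it lies fully inside an entity span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                label = prefix + entity_type
                break
        tokens.append(match.group())
        tags.append(label)
    return tokens, tags


# Invented example sentence, not taken from the corpus.
tokens, tags = bio_tags(
    "Aspirin induced bronchial asthma.",
    [(0, 7, "Chemical"), (16, 32, "Disease")],
)
for token, tag in zip(tokens, tags):
    print(token, tag)
```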

Data Source

The dataset can be downloaded from the official BioCreative V CDR task page.

Project Structure

  • ner_trainer/ - Main training package
  • notebooks/ - Jupyter notebooks for experimentation
  • data/ - Dataset files (BC5CDR corpus)

Future Scope

  • Advanced checkpointing and resuming capabilities
  • Multi-dataset support
  • Hyperparameter optimization
  • Model evaluation dashboards

Data Structure

data/CDR_Data/
├── CDR.Corpus.v010516/
│   ├── CDR_TrainingSet.BioC.xml      # Training data (BioC format)
│   ├── CDR_TrainingSet.PubTator.txt   # Training data (PubTator format)
│   ├── CDR_DevelopmentSet.BioC.xml    # Validation data
│   ├── CDR_DevelopmentSet.PubTator.txt
│   ├── CDR_TestSet.BioC.xml          # Test data
│   └── CDR_TestSet.PubTator.txt
├── DNorm.TestSet/                    # Disease NER baseline results
├── tmChem.TestSet/                   # Chemical NER baseline results
└── README.txt                        # Dataset documentation
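
The PubTator text files can be read without an XML parser. The following is a minimal sketch of such a reader, assuming the usual PubTator layout: a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated entity lines (PMID, start, end, mention, type, identifier) and four-field relation lines, which this sketch skips. The function name `parse_pubtator` and the sample record are invented for illustration and are not part of this repository.

```python
from collections import namedtuple

Mention = namedtuple("Mention", "start end text type concept_id")


def parse_pubtator(text):
    """Parse PubTator text into {pmid: {"title", "abstract", "mentions"}}."""
    docs = {}

    def get_doc(pmid):
        return docs.setdefault(pmid, {"title": "", "abstract": "", "mentions": []})

    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        parts = line.split("|", 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1] in ("t", "a"):
            key = "title" if parts[1] == "t" else "abstract"
            get_doc(parts[0])[key] = parts[2]
            continue
        fields = line.split("\t")
        # Entity lines carry six fields; shorter relation lines are skipped.
        if len(fields) >= 6:
            pmid, start, end, mention, entity_type, concept_id = fields[:6]
            get_doc(pmid)["mentions"].append(
                Mention(int(start), int(end), mention, entity_type, concept_id)
            )
    return docs


# Invented sample record in the PubTator layout, not copied from the corpus.
SAMPLE = (
    "1|t|Aspirin induced asthma.\n"
    "1|a|We report asthma after aspirin use.\n"
    "1\t0\t7\tAspirin\tChemical\tD001241\n"
    "1\t16\t22\tasthma\tDisease\tD001249\n"
    "1\tCID\tD001241\tD001249\n"
)

docs = parse_pubtator(SAMPLE)
doc = docs["1"]
# Offsets index into the title and abstract joined by a single character.
full_text = doc["title"] + " " + doc["abstract"]
for m in doc["mentions"]:
    print(m.type, m.text, full_text[m.start:m.end])
```

Checking that `full_text[m.start:m.end]` equals the mention text is a cheap sanity test to run on the real files before training.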

Citation

If you use this dataset, please cite the following papers:

  1. Wei CH, Peng Y, Leaman R, et al. Overview of the BioCreative V Chemical Disease Relation (CDR) Task. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 154-166, 2015.

  2. Li J, Sun Y, Johnson RJ, et al. Annotating chemicals, diseases and their interactions in biomedical literature. Proceedings of the Fifth BioCreative Challenge Evaluation Workshop, pp. 173-182, 2015.
