This repository contains an implementation of CurriculumDocRE, a curriculum learning approach to document-level relation extraction. The model progressively learns from:
- Simple relations (entities in same/adjacent sentences)
- Multi-hop relations (entities in different sentences requiring connecting information)
- Complex relations (entities across paragraphs requiring advanced coreference resolution)
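A simple way to realise these three stages is to bucket each entity pair by the minimum sentence distance between its mentions. The function below is an illustrative sketch; the function name and the stage boundaries (1 and 4 sentences) are assumptions, not values taken from this repository.

```python
def curriculum_stage(head_sent_ids, tail_sent_ids):
    """Assign a curriculum stage to an entity pair from the minimum
    sentence distance between any head mention and any tail mention.
    Stage boundaries here are illustrative assumptions."""
    min_dist = min(abs(h - t) for h in head_sent_ids for t in tail_sent_ids)
    if min_dist <= 1:   # same or adjacent sentence -> simple relation
        return 1
    if min_dist <= 4:   # nearby sentences -> multi-hop relation
        return 2
    return 3            # far apart -> complex relation
```

Each training instance can then be scheduled so that stage-1 pairs are seen first and stage-3 pairs only once the model has converged on the easier ones.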
Curriculum-Aware Attention:
- Dynamic attention patterns based on curriculum stage
- Entity-focused attention mechanisms
- Stage-specific adapters for different relation complexities
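One way to make attention curriculum-aware is to let each entity token attend over a window that widens with the curriculum stage, so early stages see only local context and later stages see long-range context. The sketch below is a hypothetical illustration of that idea (the window schedule and function names are assumptions), not the repository's actual implementation.

```python
def stage_attention_window(stage, base_window=16):
    """Hypothetical schedule: later curriculum stages attend over wider spans."""
    return base_window * stage

def build_entity_mask(seq_len, entity_positions, stage):
    """Boolean attention mask: each entity token may attend to tokens
    within a stage-dependent window around itself."""
    window = stage_attention_window(stage)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for pos in entity_positions:
        lo, hi = max(0, pos - window), min(seq_len, pos + window + 1)
        for j in range(lo, hi):
            mask[pos][j] = True
    return mask
```

In a transformer, such a mask would be added (as a large negative bias on masked positions) to the attention logits before the softmax.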
Data Quality Enhancement:
- Annotation consistency checking
- Confidence scoring based on multiple factors
- Active learning for uncertain instances
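The confidence-scoring and active-learning steps can be sketched as follows. All names, signal choices, weights, and thresholds below are illustrative assumptions; the point is the shape of the computation: several normalised signals are combined into one score, and mid-confidence instances are routed to annotators.

```python
def annotation_confidence(annotator_agreement, model_agreement, evidence_overlap,
                          weights=(0.5, 0.3, 0.2)):
    """Combine several signals (each in [0, 1]) into one confidence score
    via a weighted average. The weights are illustrative, not tuned."""
    signals = (annotator_agreement, model_agreement, evidence_overlap)
    return sum(w * s for w, s in zip(weights, signals))

def select_for_active_learning(instances, low=0.3, high=0.6):
    """Mid-confidence instances are the most informative to re-annotate:
    high-confidence ones are likely correct, very low ones likely noise."""
    return [i for i in instances if low <= i["confidence"] < high]
```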
Data Augmentation:
- Entity substitution with contextual awareness
- Evidence masking for improved robustness
- Cross-document relation transfer
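Of these, evidence masking is the simplest to illustrate: tokens inside the annotated evidence spans are randomly replaced with a mask token, so the model cannot overfit to a single surface cue. The sketch below is a minimal, hypothetical version; the masking probability and mask token are assumptions.

```python
import random

def mask_evidence(tokens, evidence_spans, mask_token="[MASK]", p=0.3, seed=0):
    """Randomly replace tokens inside evidence spans (start, end) with a
    mask token. p=0.3 and "[MASK]" are illustrative choices."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    out = list(tokens)
    for start, end in evidence_spans:
        for i in range(start, end):
            if rng.random() < p:
                out[i] = mask_token
    return out
```

Applied during training only, this yields augmented copies of each document while the original evaluation data stays untouched.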
Optimized Training Pipeline:
- Integrated components in a cohesive flow
- Curriculum-aware loss functions
- Confidence-weighted evaluation metrics
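A curriculum-aware loss can be as simple as down-weighting the losses of stages the curriculum has not yet reached. The function below is a hedged sketch of that idea; the 0.1 warm-up weight is an assumption.

```python
def curriculum_loss(per_stage_losses, current_stage, warmup_weight=0.1):
    """Sum per-stage losses, giving full weight to stages at or below the
    current curriculum stage and a small warm-up weight to later stages
    so their examples are not ignored entirely."""
    total = 0.0
    for stage, loss in per_stage_losses.items():
        weight = 1.0 if stage <= current_stage else warmup_weight
        total += weight * loss
    return total
```

As training advances through the three stages, `current_stage` is incremented and the harder relations gradually receive full weight.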
Optimized Evaluation:
- Stage-specific thresholds for improved Ign F1 score
- Entity type compatibility checking
- Confidence-based filtering and ranking
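These three filters compose naturally into one prediction gate: a score threshold that depends on the curriculum stage, followed by an entity-type compatibility check. The threshold values and the compatibility table below are hypothetical examples, not the repository's configuration.

```python
STAGE_THRESHOLDS = {1: 0.5, 2: 0.6, 3: 0.7}  # illustrative values

# Hypothetical table: relation -> allowed (head_type, tail_type) pairs
TYPE_COMPAT = {"place_of_birth": {("PER", "LOC")}}

def keep_prediction(pred):
    """Keep a prediction only if it clears the stage-specific score
    threshold and its entity types can plausibly hold the relation."""
    if pred["score"] < STAGE_THRESHOLDS[pred["stage"]]:
        return False
    allowed = TYPE_COMPAT.get(pred["relation"])
    if allowed is not None and (pred["head_type"], pred["tail_type"]) not in allowed:
        return False
    return True
```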
Requirements:
torch>=1.10.0
transformers>=4.18.0
numpy>=1.20.0
scikit-learn>=1.0.0
tqdm>=4.62.0

Install with: pip install -r requirements.txt
src/: Source code
- model.py: Enhanced model with curriculum-aware attention
- data_quality.py: Data quality enhancement module
- data_augmentation.py: Data augmentation techniques
- data_loader.py: Data loading and preprocessing
- train.py: Training pipeline with curriculum learning
- evaluate.py: Memory-efficient evaluation script
- config.py: Configuration parameters
- utils.py: Utility functions
data/: DocRED dataset files
- train_annotated.json: Training data
- dev.json: Development data
- test.json: Test data
- rel_info.json: Relation type information
- Download the data from the following link
- Save it in the data/ folder.
python -m src.train

This will:
- Enhance data quality (annotation consistency, confidence scoring)
- Apply data augmentation (entity substitution, evidence masking, relation transfer)
- Train the model using curriculum learning (3 stages)
python -m src.evaluate

This will generate predictions in output/results/predictions.json.
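For reference, entries in predictions.json typically follow the DocRED submission convention: one object per predicted relation, holding the document title, head and tail entity indices into that document's entity list, and a Wikidata relation id. The values below are purely illustrative.

```python
import json

# One predicted relation per entry; field names follow the DocRED
# convention (title, h_idx, t_idx, r). Values are illustrative.
predictions = [
    {"title": "Some Document", "h_idx": 0, "t_idx": 3, "r": "P17"},
]
print(json.dumps(predictions))
```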
To evaluate using the official DocRED metrics:
- Clone the DocRED repository:
git clone https://github.com/thunlp/DocRED.git
- Run the evaluation script:
cd DocRED/code
python eval.py -g ../../data/dev.json -p ../../output/results/predictions.json