This repository documents the key tasks and design decisions involved in building and optimizing NLP models, covering sentence transformers, multi-task learning, training considerations, transfer learning, and layer-wise learning rates.
Key Decisions and Insights (sentence transformer):
- Choice of BERT as the Base Model: BERT's ability to capture rich contextual information makes it an excellent choice for sentence embedding.
- Additional Linear Layer: Enhances fine-tuning capabilities while leveraging BERT's pre-trained knowledge.
- Tanh Activation: Introduces non-linearity and bounds each output dimension to the range (-1, 1), potentially improving downstream tasks.
- Use of [CLS] Token: The final hidden state of the [CLS] token serves as a sentence-level summary, a common choice for sentence embeddings.
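The decisions above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's implementation: `TinyEncoder` is a hypothetical stand-in for BERT so the example runs without downloading weights, and the sizes (`hidden=32`, `embed_dim=16`) are arbitrary assumptions. With Hugging Face Transformers, the encoder would typically be a `BertModel` and the [CLS] vector would be `last_hidden_state[:, 0]`.

```python
import torch
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: token ids -> per-token hidden states."""

    def __init__(self, vocab_size=100, hidden=32, num_layers=2, num_heads=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(
            hidden, num_heads, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, input_ids):
        # Output shape: (batch, seq_len, hidden)
        return self.encoder(self.embed(input_ids))


class SentenceEmbedder(nn.Module):
    """Encoder + extra linear layer + tanh, applied to the first token."""

    def __init__(self, encoder, hidden=32, embed_dim=16):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(hidden, embed_dim)  # the additional linear layer

    def forward(self, input_ids):
        hidden_states = self.encoder(input_ids)
        cls_state = hidden_states[:, 0]  # position 0 plays the role of [CLS]
        return torch.tanh(self.proj(cls_state))  # each value lies in (-1, 1)


torch.manual_seed(0)
model = SentenceEmbedder(TinyEncoder()).eval()
input_ids = torch.randint(0, 100, (4, 10))  # 4 fake sentences, 10 tokens each
with torch.no_grad():
    embeddings = model(input_ids)
print(embeddings.shape)
```

Each row of `embeddings` is one fixed-size sentence vector, regardless of the input sequence length.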
Key Decisions and Insights (multi-task learning):
- Shared Base Model: A single BERT model is used for multiple tasks, allowing efficient parameter sharing and potential synergies.
- Task-Specific Classifiers: Separate linear layers for each task enable learning of task-specific features while sharing a common representation.
- Choice of Tasks: Sentence classification and sentiment analysis are distinct yet related, potentially allowing the model to learn complementary features.
- Flexible Architecture: The design allows for easy addition of more tasks or modification of existing ones.
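A shared encoder with one linear head per task might look like the sketch below. The small transformer is a hypothetical stand-in for BERT, and the task names and class counts (`classification` with 4 classes, `sentiment` with 3) are assumptions chosen for illustration. Summing the per-task losses is one simple way to train jointly; weighted sums are another option.

```python
import torch
import torch.nn as nn


class MultiTaskModel(nn.Module):
    """Shared encoder with one linear classifier per task."""

    def __init__(self, task_classes, vocab_size=100, hidden=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(
            hidden, nhead=2, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Adding a task only requires a new entry in this dictionary.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, n) for name, n in task_classes.items()}
        )

    def forward(self, input_ids):
        pooled = self.encoder(self.embed(input_ids))[:, 0]  # first-token state
        return {name: head(pooled) for name, head in self.heads.items()}


torch.manual_seed(0)
model = MultiTaskModel({"classification": 4, "sentiment": 3})
input_ids = torch.randint(0, 100, (8, 12))
labels = {
    "classification": torch.randint(0, 4, (8,)),
    "sentiment": torch.randint(0, 3, (8,)),
}

logits = model(input_ids)
# Joint objective: unweighted sum of the per-task cross-entropy losses.
loss = sum(nn.functional.cross_entropy(logits[t], labels[t]) for t in logits)
loss.backward()
```

Because both losses flow back through the same encoder, its parameters receive gradient from every task, while each head only receives gradient from its own task.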
Implications and Advantages (entire network frozen):
- Very fast, since no parameters are updated: only forward passes are needed.
- Preserves all pre-trained knowledge.
- Useful for feature extraction or when computational resources are limited.
Rationale:
This approach is beneficial when the pre-trained model already captures the information your tasks need. It suits small datasets or tasks very similar to the original pre-training objectives.
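Freezing every parameter can be sketched as below. The model here is a hypothetical toy (an embedding plus one linear layer standing in for the backbone, with two assumed task heads); the same loop over `parameters()` applies to any PyTorch module.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model: a stand-in backbone and two task heads.
model = nn.ModuleDict({
    "encoder": nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32)),
    "classification_head": nn.Linear(32, 4),
    "sentiment_head": nn.Linear(32, 3),
})

# Freeze everything: no parameter will receive gradients.
for param in model.parameters():
    param.requires_grad = False
model.eval()  # also disables dropout and similar training-only behavior

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

# The frozen model is then used purely as a feature extractor.
with torch.no_grad():
    features = model["encoder"](torch.randint(0, 100, (4, 10)))[:, 0]
print(trainable, features.shape)
```

The extracted `features` can be cached and fed to a separate lightweight classifier, which keeps compute requirements low.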
Implications and Advantages (transformer backbone frozen):
- Faster training compared to fine-tuning the entire model.
- Preserves general language understanding capabilities.
- Allows task-specific adaptation.
Rationale:
This approach strikes a balance between leveraging pre-trained knowledge and adapting to new tasks. It is recommended when a moderate amount of task-specific data is available and the tasks differ somewhat from the pre-training objectives.
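One way to freeze the backbone while training only the heads is sketched below. The encoder is a hypothetical toy stand-in, and the task names, class counts, and learning rate are assumptions. Only the head parameters are handed to the optimizer, so the backbone cannot change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in backbone and two assumed task heads.
encoder = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32), nn.Tanh())
heads = nn.ModuleDict({
    "classification": nn.Linear(32, 4),
    "sentiment": nn.Linear(32, 3),
})

# Freeze the backbone; the heads stay trainable.
for param in encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(heads.parameters(), lr=1e-3)

input_ids = torch.randint(0, 100, (8, 10))
labels = {
    "classification": torch.randint(0, 4, (8,)),
    "sentiment": torch.randint(0, 3, (8,)),
}

# Snapshots, used only to show which weights move after one step.
encoder_before = encoder[1].weight.detach().clone()
head_before = heads["sentiment"].weight.detach().clone()

pooled = encoder(input_ids)[:, 0]
loss = sum(nn.functional.cross_entropy(heads[t](pooled), labels[t]) for t in heads)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

After the step, the encoder weights are identical to their snapshot while the head weights have moved.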
Implications and Advantages (one task-specific head frozen):
- Allows for task-specific fine-tuning.
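- Protects the frozen head's weights, which limits catastrophic forgetting for that task (updates to a shared, unfrozen backbone can still shift the head's inputs).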
- Useful when one task is well-calibrated and the other needs adaptation.
Rationale:
This approach is ideal when there's imbalanced performance across tasks or when introducing a new task to an already well-performing model.
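Freezing a single head while the rest of the model keeps training can be sketched like this. The toy encoder, the task names, and the choice of `classification` as the frozen task are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in backbone and two assumed task heads.
encoder = nn.Sequential(nn.Embedding(100, 32), nn.Linear(32, 32), nn.Tanh())
heads = nn.ModuleDict({
    "classification": nn.Linear(32, 4),
    "sentiment": nn.Linear(32, 3),
})

frozen_task = "classification"  # assumed to be the already well-performing task
for param in heads[frozen_task].parameters():
    param.requires_grad = False

# Optimize the backbone and the remaining head only.
all_params = list(encoder.parameters()) + list(heads.parameters())
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

input_ids = torch.randint(0, 100, (8, 10))
sentiment_labels = torch.randint(0, 3, (8,))

# Snapshots, used only to show which weights move after one step.
frozen_before = heads[frozen_task].weight.detach().clone()
sentiment_before = heads["sentiment"].weight.detach().clone()

pooled = encoder(input_ids)[:, 0]
loss = nn.functional.cross_entropy(heads["sentiment"](pooled), sentiment_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the backbone is still updated in this sketch, so the frozen head sees slightly different inputs over time; freezing the backbone as well would avoid that at the cost of less adaptation.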
Choice of Pre-trained Model:
Consider using SciBERT or BioBERT, which are variants of BERT pre-trained on scientific and biomedical literature, respectively.
Layers to Freeze/Unfreeze:
- Freeze Lower Layers: Preserves general linguistic features.
- Unfreeze Upper Layers: Allows adaptation to the specific language patterns of the new domain.
- Unfreeze Task-Specific Heads: Enables task-specific adaptation.
- Optional Domain-Specific Adaptation Layer: Bridges the gap between general language understanding and domain-specific nuances.
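The freeze/unfreeze split above could be expressed as a small helper like the one below. The stack of linear layers is a hypothetical stand-in for the transformer blocks, and the numbers (6 layers, 4 frozen) are assumptions. In Hugging Face's `BertModel`, the equivalent list of blocks is `model.encoder.layer`.

```python
import torch.nn as nn

# Hypothetical stand-ins: 6 "transformer blocks", an optional adapter, one head.
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(6)])
domain_adapter = nn.Linear(32, 32)  # optional domain-specific adaptation layer
task_head = nn.Linear(32, 3)


def freeze_lower_layers(layers, num_frozen):
    """Freeze the first `num_frozen` layers and unfreeze the rest."""
    for index, layer in enumerate(layers):
        for param in layer.parameters():
            param.requires_grad = index >= num_frozen


freeze_lower_layers(layers, num_frozen=4)

# The adapter and head are left trainable (the PyTorch default).
status = [all(p.requires_grad for p in layer.parameters()) for layer in layers]
print(status)
```

How many lower layers to freeze is a tuning choice; the value here is only an example.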
Training Process:
- Start with a pre-trained SciBERT/BioBERT model.
- Replace task-specific heads with a multi-task architecture.
- Freeze lower layers and train with a small learning rate for a few epochs.
- Gradually unfreeze layers from top to bottom.
- Train the entire model end-to-end with a very small learning rate.
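The gradual unfreezing steps can be planned as a simple schedule. The helper below is a hypothetical sketch: the function name, the number of layers released per stage, and both learning rates are assumptions, not values from this repository.

```python
def unfreezing_schedule(start_frozen, layers_per_stage, stage_lr, final_lr):
    """Plan stages as (stage, frozen_layer_count, learning_rate) tuples.

    Layers are released from the top of the stack downwards, so the frozen
    count shrinks each stage. The last stage trains everything end-to-end
    with the smaller `final_lr`.
    """
    schedule = []
    frozen, stage = start_frozen, 0
    while frozen > 0:
        schedule.append((stage, frozen, stage_lr))
        frozen = max(0, frozen - layers_per_stage)
        stage += 1
    schedule.append((stage, 0, final_lr))
    return schedule


# Example: a 12-layer encoder with the bottom 8 layers frozen at first.
schedule = unfreezing_schedule(
    start_frozen=8, layers_per_stage=4, stage_lr=2e-5, final_lr=5e-6
)
for stage, frozen, lr in schedule:
    print(f"stage {stage}: freeze bottom {frozen} layers, lr={lr}")
```

Each stage would then set `requires_grad` on the layers accordingly and train for a few epochs before moving on.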
Key Decisions and Insights (layer-wise learning rates):
- Decreasing Learning Rates for Lower Layers: Preserves general language understanding in lower BERT layers while allowing more adaptation in upper layers.
- Higher Learning Rate for Task-Specific Heads: Facilitates faster adaptation to specific tasks in a multi-task setting.
- Learning Rate Decay Factor: Provides smooth transition across layers, balancing stability and adaptability.
- Separate Treatment of BERT Layers and Task Heads: Allows fine-grained control over model adaptation during training.
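A common way to realize these decisions is with optimizer parameter groups, where layer `i` of `L` gets the rate `base_lr * decay ** (L - 1 - i)`. The sketch below assumes this formula along with hypothetical values (`base_lr=2e-5`, `decay=0.9`, `head_lr=1e-4`); the linear layers are stand-ins for BERT's transformer blocks, and the embedding layer is omitted for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for 4 transformer blocks and two task heads.
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])
heads = nn.ModuleDict({
    "classification": nn.Linear(32, 4),
    "sentiment": nn.Linear(32, 3),
})


def layerwise_param_groups(layers, heads, base_lr=2e-5, decay=0.9, head_lr=1e-4):
    """Build one optimizer group per layer, plus one group for all heads.

    The top layer trains at `base_lr`; each layer below it is scaled by a
    further factor of `decay`, so lower layers change more slowly.
    """
    top = len(layers) - 1
    groups = [
        {"params": list(layer.parameters()), "lr": base_lr * decay ** (top - i)}
        for i, layer in enumerate(layers)
    ]
    groups.append({"params": list(heads.parameters()), "lr": head_lr})
    return groups


optimizer = torch.optim.AdamW(layerwise_param_groups(layers, heads))
lrs = [group["lr"] for group in optimizer.param_groups]
print(lrs)
```

A decay factor closer to 1 makes the layers train at nearly the same rate, while a smaller factor keeps the lower layers closer to their pre-trained values.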