HiCur-NPC: Hierarchical Feature Fusion Curriculum Learning for Multi-Modal Foundation Model in Nasopharyngeal Carcinoma
🎉 October 29: Updated the complete data collection and organization process, providing detailed download and usage plans for our dataset! 📊 For more information, check out the Data Update Documentation.
🎥 October 30: Added a video demo showcasing model inference.
🚀 January 5: Added a guide for quickly migrating HiCur-NPC to other data tasks. Using CXR data as an example, we developed HiCur-CXR.
📽️ Demonstration Video: Check out our demonstration video to see the HiCur-NPC model in action! This video showcases the model's capabilities and how it processes nasopharyngeal carcinoma data.
HiCur-with-CC.mp4
Providing precise and comprehensive diagnostic information to clinicians is crucial for improving the treatment and prognosis of nasopharyngeal carcinoma. Multi-modal foundation models, which can integrate data from various sources, have the potential to significantly enhance clinical assistance. However, several challenges remain:
- The lack of large-scale visual-language datasets for nasopharyngeal carcinoma.
- Existing pre-training and fine-tuning methods that fail to learn the hierarchical features required by complex clinical tasks.
- Current foundation models that have limited visual perception due to inadequate integration of multi-modal information.
While curriculum learning can improve a model's ability to handle multiple tasks through systematic knowledge accumulation, it does not account for hierarchical features and their dependencies, which limits the knowledge the model can acquire. To address these issues, we propose the Hierarchical Feature Fusion Curriculum Learning (HFFCL) method, which consists of three stages:
- Visual Knowledge Learning (Stage I): We introduce the Hybrid Contrastive Masked Autoencoder (HCMAE) to pre-train visual encoders on 755K multi-modal nasopharyngeal carcinoma images (CT, MRI, and endoscopy), fully extracting deep visual information (a sketch of this objective is given below).
- Coarse-Grained Alignment (Stage II): We construct a 65K-sample visual instruction fine-tuning dataset from open-source data and clinicians' diagnostic reports, achieving coarse-grained alignment between visual information and a large language model.
- Fine-Grained Fusion (Stage III): We design a Mixture of Experts Cross Attention structure for deep, fine-grained fusion of global multi-modal information (see the sketch below).
Our model outperforms previously developed specialized models in all key clinical tasks for nasopharyngeal carcinoma, including diagnosis, report generation, tumor segmentation, and prognosis.
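To make the Stage I objective concrete, the following is a minimal sketch of how a hybrid contrastive + masked-autoencoder loss could be combined. It is illustrative only, not the released HCMAE code: the encoder methods (`patchify`, `forward_masked`, `global_embed`), the projection head, and the loss weights are assumptions.

```python
# Illustrative sketch (not the released implementation) of a hybrid
# contrastive + masked-autoencoder objective, assuming a ViT-style
# encoder/decoder pair and an InfoNCE loss over two augmented views.
import torch
import torch.nn.functional as F


def hybrid_cmae_loss(encoder, decoder, proj_head, view_a, view_b,
                     mask_ratio=0.75, temperature=0.07, alpha=0.5):
    """Combine MAE-style reconstruction on masked patches of view_a with an
    InfoNCE contrastive term between global embeddings of both views."""
    # --- masked reconstruction branch (MAE-style) ---
    patches = encoder.patchify(view_a)                      # (B, N, D_patch)
    latent, mask, ids_restore = encoder.forward_masked(patches, mask_ratio)
    recon = decoder(latent, ids_restore)                    # (B, N, D_patch)
    rec_loss = ((recon - patches) ** 2).mean(dim=-1)        # per-patch MSE
    rec_loss = (rec_loss * mask).sum() / mask.sum()         # masked patches only

    # --- contrastive branch (InfoNCE over two augmented views) ---
    z_a = F.normalize(proj_head(encoder.global_embed(view_a)), dim=-1)
    z_b = F.normalize(proj_head(encoder.global_embed(view_b)), dim=-1)
    logits = z_a @ z_b.t() / temperature                    # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)  # matched pairs on diagonal
    con_loss = F.cross_entropy(logits, targets)

    return alpha * rec_loss + (1 - alpha) * con_loss
```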
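Similarly, the Stage III Mixture of Experts Cross Attention can be pictured as a router that weights several cross-attention "experts" through which text tokens attend to visual tokens. The module below is a hedged sketch under that assumption, not the repository's implementation; all names and hyperparameters are illustrative, and for clarity it evaluates every expert densely instead of dispatching only to the selected ones.

```python
# Illustrative sketch of a mixture-of-experts cross-attention block:
# a token-wise router assigns gate weights to several cross-attention
# experts that let text tokens attend to visual features.
import torch
import torch.nn as nn


class MoECrossAttention(nn.Module):
    def __init__(self, dim, num_heads=8, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_experts)
        ])
        self.router = nn.Linear(dim, num_experts)  # token-wise gating
        self.top_k = top_k

    def forward(self, text_tokens, visual_tokens):
        # Route each text token to its top-k experts.
        gate_logits = self.router(text_tokens)               # (B, T, E)
        weights = gate_logits.softmax(dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, T, k)

        out = torch.zeros_like(text_tokens)
        for e, expert in enumerate(self.experts):
            # Every expert is evaluated here for simplicity; a real
            # implementation would dispatch only to the top-k experts.
            attn_out, _ = expert(text_tokens, visual_tokens, visual_tokens)
            w_e = (topk_w * (topk_idx == e)).sum(dim=-1, keepdim=True)  # (B, T, 1)
            out = out + w_e * attn_out
        return out + text_tokens  # residual connection
```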
- `StageI-HCMAE/`: Contains code and resources for visual knowledge learning using the Hybrid Contrastive Masked Autoencoder.
- `StageII-CGA/`: Includes scripts and datasets for coarse-grained alignment.
- `StageIII-FGF/`: Hosts the implementation of fine-grained fusion using the Mixture of Experts Cross Attention structure.
- `test/`: Provides the complete model architecture and inference examples.
This is not the full version of the repository. Some code is currently being refined and will be released once it has been validated and reconstructed to ensure usability.
To install the necessary dependencies, run:
pip install -r requirements.txt
Detailed instructions for each stage can be found in its respective folder. To test the complete model, navigate to the `test` directory and follow the instructions in the README provided there.
We welcome contributions from the community. Please fork the repository and submit a pull request with your changes. Ensure your code adheres to our style guidelines and includes appropriate tests.
This project is licensed under the Apache License. See the LICENSE file for more details.
For any questions or inquiries, please contact us at [email protected].