This project builds an integrated knowledge graph from diagnostic data collected at multiple healthcare centers, giving a comprehensive view of an individual's health trajectory. It focuses on entities related to Genes, Diseases, Chemicals, Species, Variants, and Cell Types (DNA or RNA), which are especially significant for rare and/or chronic diseases. Named Entity Recognition (NER), Entity Normalization, and Relationship Extraction (RE) are applied to raw medical texts to create individual knowledge graphs, which are then merged into a unified graph. The resulting visualization supports healthcare professionals in making well-informed decisions by ensuring that no detail from any diagnostic source is overlooked, especially details pivotal to understanding and managing genetic information and rare diseases.
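As a rough illustration of the merge step described above, the sketch below represents each per-report graph as a set of (subject, relation, object) triples and combines them while recording which reports support each triple. It is not the repository's implementation: the function name `merge_graphs`, the identifier format, and the report names are hypothetical, and it assumes Python 3.9+ and that entity normalization has already mapped mentions to shared identifiers.

```python
from collections import defaultdict

# A triple is (subject_id, relation, object_id). The identifiers are assumed
# to be already normalized, so the same entity has the same ID in every report.
Triple = tuple[str, str, str]


def merge_graphs(graphs: dict[str, set[Triple]]) -> dict[Triple, set[str]]:
    """Merge per-report triple sets into one graph, keeping provenance.

    Hypothetical helper: maps each distinct triple to the set of report
    names in which it was found.
    """
    merged: dict[Triple, set[str]] = defaultdict(set)
    for source, triples in graphs.items():
        for triple in triples:
            merged[triple].add(source)
    return dict(merged)


# Synthetic example: two reports from two hypothetical centers.
report_graphs = {
    "center_a_report": {
        ("GENE:CFTR", "associated_with", "DISEASE:cystic_fibrosis"),
    },
    "center_b_report": {
        ("GENE:CFTR", "associated_with", "DISEASE:cystic_fibrosis"),
        ("CHEMICAL:ivacaftor", "treats", "DISEASE:cystic_fibrosis"),
    },
}

unified = merge_graphs(report_graphs)
for (subj, rel, obj), sources in sorted(unified.items()):
    print(subj, rel, obj, sorted(sources))
```

Keeping the source names next to each triple lets a reader of the unified graph trace any relation back to the report it came from.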
- Clone the repository:
git clone https://github.com/anbianchi/knowledge_frombio
cd knowledge_frombio
- Create a Conda environment:
conda env create -f environment.yml
- Activate the Conda environment:
conda activate [Your Environment Name]
You can use the tool in two primary ways: by processing the dataset used in the experiments, or by manually inserting and processing diagnostic reports. The steps for both approaches are detailed below.
To process the dataset utilized in the experiments, use the following command:
python main.py --dataset "dataset.csv"
Replace "dataset.csv" with your dataset filename. The script processes the dataset and generates the corresponding knowledge graphs.
If you prefer to manually input diagnostic reports, place your report files in the diagnostic_reports folder. Ensure that all reports in the folder relate to the same patient, so that the generated knowledge graph stays consistent and accurate. Then run:
python main.py --manual
This command instructs the tool to process the reports present in the diagnostic_reports folder.
The experiments use the MIMIC-IV-Note: Deidentified free-text clinical notes dataset, a freely accessible critical care database that holds de-identified health-related data associated with over one thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2008 and 2019.
- De-identification: The dataset adheres to stringent data security and privacy protocols. All patient records are thoroughly de-identified, maintaining the privacy and anonymity of the individuals involved.
- Accessibility: The dataset is publicly available to researchers across the world, fostering a collaborative and open research environment.
In this project, the "discharge.csv" file in the notes folder is used to extract and analyze diagnostic texts. The raw text of the patient reports is processed by our system to generate individual and merged knowledge graphs, which offer a panoramic view of a patient's medical history and interactions.
To access and use the MIMIC-IV dataset for replicating our experiments or for your research, please follow the steps below:
- Requesting Access: Visit the MIMIC website and follow their guidelines for requesting access to the dataset.
- Downloading the Data: Once approved, download the dataset, specifically the "discharge.csv" file found in the notes folder.
- Data Processing: Use the script generate.py from our repository to preprocess the data, converting the notes into a format suitable for our system.
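To give a sense of what such preprocessing might involve, the sketch below groups note texts by patient using only the standard library. It is not the repository's generate.py: the function name `group_notes_by_patient` is hypothetical, the column names `subject_id` and `text` are assumptions about the CSV layout, and the sample rows are synthetic rather than taken from MIMIC.

```python
import csv
import io
from collections import defaultdict


def group_notes_by_patient(csv_file, id_column="subject_id", text_column="text"):
    """Collect note texts per patient from an open CSV file object.

    Hypothetical helper; the default column names are assumptions and may
    need adjusting to match the actual file.
    """
    notes = defaultdict(list)
    for row in csv.DictReader(csv_file):
        text = row[text_column].strip()
        if text:  # skip rows with no note text
            notes[row[id_column]].append(text)
    return dict(notes)


# Synthetic stand-in for the real file, so the sketch runs on its own.
sample = io.StringIO(
    "subject_id,text\n"
    "p1,First synthetic note.\n"
    "p2,Another synthetic note.\n"
    "p1,Second synthetic note.\n"
)

grouped = group_notes_by_patient(sample)
for patient_id, texts in grouped.items():
    print(patient_id, len(texts), "note(s)")
```

With the real data, the same function could be called on a file opened with `open("discharge.csv", newline="", encoding="utf-8")`. Grouping by patient matches the requirement that all reports processed together belong to the same individual.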
For comprehensive details about the dataset and how to use it, kindly refer to the official documentation.
Note: Even though the dataset is publicly available, we strictly adhere to the usage guidelines provided by MIMIC-IV, ensuring ethical use of the data in our research.
- demo_example/: Folder containing a subset of results.
- modules/: Folder containing the main script and utility functions.
- merged_outputs/ and temp_outputs/: Folders where the output graphs and results will be saved.
- requirements.txt: File listing all necessary Python packages.
- main_script.py: Main script to run the program.