
Datasets

João Almeida edited this page Aug 6, 2020 · 1 revision


Public datasets

This system was validated using two public datasets:

  • 2018 n2c2 Track on ADE and Medication Extraction Challenge: Contains clinical notes and annotation files with the identification of medications and their respective dosage, strength, route of administration, duration and reason, as well as adverse drug events (ADE).

    • train/: Contains discharge summaries (.txt) and annotation files (.ann)
    • test/: Contains discharge summaries (.txt) and annotation files (.ann)
    • test_data_Task2/: Not used
    • test_data_Tasks1&3/: Not used
  • 2009 i2b2 Track on Medication Extraction: Contains discharge summaries and annotation files with the identification of medications and their respective dosage, route of administration, frequency and duration.

    • train.test.released.8.17.09/: Contains discharge summary files
    • converted.noduplicates.sorted/: Contains annotation files

These datasets are public, but access to them is controlled. To use them, please request access here.

Example dataset

To test the pipeline without the controlled-access data, three dummy discharge summaries are provided in Examples/.

Adding new datasets

Each new dataset added to the system needs its own dataset reader. Implementing one involves the following steps:

  1. In src/DatasetReader.py, create a reader for the new dataset, ensuring that the output dict follows the format:
{
"train":
	{"file name":{
		"cn": "clinical note",
		"annotation":{"id":("concept","type",[(span,span), ...])},
		"relation":{"id": (annId1, ("concept","type",[(span,span), ...]))}
		}
	},
"test":{...}
}

Although the pipeline does not currently use this option, the reader supports splitting the data into "train" and "test" partitions. This may be useful for training and developing annotation systems on the "train" partition while reserving the "test" partition solely for testing.

  2. Add the new reader to readClinicalNotes.
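As a rough illustration of step 1, the sketch below builds a dict in the format shown above from a folder of .txt notes and brat-style .ann files. It is not the code in src/DatasetReader.py: the function names (parse_ann, read_partition, read_new_dataset), the train/ and test/ folder layout, and the way each relation is mapped to (ID of the first argument, tuple of the second argument) are all assumptions made for this example.

```python
from pathlib import Path


def parse_ann(ann_text):
    """Parse brat-style annotation text into (annotations, relations).

    Hypothetical helper: assumes lines such as
      "T1<TAB>Drug 10 17<TAB>aspirin"
      "R1<TAB>Strength-Drug Arg1:T2 Arg2:T1"
    """
    annotations, relations = {}, {}
    lines = [line.split("\t") for line in ann_text.splitlines() if line.strip()]

    # First pass: entity annotations -> (concept, type, [(start, end), ...])
    for fields in lines:
        if fields[0].startswith("T") and len(fields) >= 3:
            ann_type, span_text = fields[1].split(" ", 1)
            # Discontinuous spans are separated by ";" in brat files
            spans = [tuple(int(i) for i in part.split())
                     for part in span_text.split(";")]
            annotations[fields[0]] = (fields[2], ann_type, spans)

    # Second pass: relations -> (first annotation ID, second annotation tuple)
    # This reading of the "relation" entry is an assumption.
    for fields in lines:
        if fields[0].startswith("R") and len(fields) >= 2:
            _, arg1, arg2 = fields[1].split()
            first_id = arg1.split(":", 1)[1]
            second_id = arg2.split(":", 1)[1]
            if second_id in annotations:
                relations[fields[0]] = (first_id, annotations[second_id])
    return annotations, relations


def read_partition(directory):
    """Read every .txt note in a folder, with its .ann file if one exists."""
    partition = {}
    for txt_path in sorted(Path(directory).glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        ann_text = ann_path.read_text(encoding="utf-8") if ann_path.exists() else ""
        annotations, relations = parse_ann(ann_text)
        partition[txt_path.stem] = {
            "cn": txt_path.read_text(encoding="utf-8"),
            "annotation": annotations,
            "relation": relations,
        }
    return partition


def read_new_dataset(root):
    """Return {"train": {...}, "test": {...}} for a hypothetical dataset root."""
    root = Path(root)
    return {
        "train": read_partition(root / "train"),
        "test": read_partition(root / "test"),
    }
```

A dataset in a different annotation format would only need its own parse_ann equivalent, since the partition and file-name handling stay the same.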

Running the pipeline

Before running the system, it is necessary to:

  • Add the datasets to this directory AND set the variable name in Settings.ini to select the dataset to use

   OR

  • Set the new dataset directory in Settings.ini AND set the variable name in Settings.ini to select the dataset to use
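To make the two options more concrete, here is a minimal sketch of reading such settings with Python's standard configparser module. The section name [dataset] and the keys name and directory are hypothetical placeholders, as are the example values; the actual names used in Settings.ini may differ and should be checked in the file itself.

```python
import configparser

# Hypothetical Settings.ini content; the real section and key names may differ.
EXAMPLE_SETTINGS = """
[dataset]
name = n2c2
directory = ../dataset/
"""


def load_dataset_settings(ini_text):
    """Return (dataset name, dataset directory) from INI-formatted text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["dataset"]
    return section["name"], section["directory"]


name, directory = load_dataset_settings(EXAMPLE_SETTINGS)
```

Under this assumed layout, the first option leaves directory at its default and changes only name, while the second changes both.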
