
CohortCollection

Youcef Sebiat edited this page Mar 5, 2020 · 1 revision

SCALPEL-Analysis: The CohortCollection

In this tutorial, we are going to explore how to use the CohortCollection to conduct analysis over a collection of Cohorts.

The tutorial assumes that you have a valid metadata JSON file, that is, a file produced by SCALPEL-Extraction.

%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'

import pandas as pd
pd.set_option('display.max_rows', 500)

1. Add SCALPEL-Analysis to the Python path

SCALPEL-Analysis is not yet available for installation through pip or Conda channels. However, as with any Python library, it is straightforward to add and use.

Before proceeding, you will need to download SCALPEL-Analysis as a zip from here. Put the zip wherever suits you. Please make sure that you meet all the requirements listed here.

I have downloaded the zip file, and put it under the path /home/sebiat/builds/cnam-model-001-interactions/dist/scalpel.zip.

There are two ways of doing it:

  1. Permanently add the path to the PYTHONPATH environment variable. This lets you add it once and for all.
  2. Add it at runtime through sys.path, as shown below.
import sys
project_path = '/home/sebiat/builds/cnam-model-001-interactions/dist/scalpel.zip'
sys.path.append(project_path)
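
To see why appending a zip file to sys.path is enough, here is a minimal sketch that builds a throwaway archive and imports from it. The archive name demo_lib.zip and the package demo_pkg are hypothetical stand-ins, not part of SCALPEL-Analysis; the mechanism (Python's built-in support for importing from zip archives) is the one the snippet above relies on.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path

# Build a throwaway zip archive containing a tiny package (hypothetical names).
tmp_dir = Path(tempfile.mkdtemp())
zip_path = tmp_dir / "demo_lib.zip"
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("demo_pkg/__init__.py", "GREETING = 'hello from the zip'\n")

# Same pattern as above: put the archive itself on sys.path.
if str(zip_path) not in sys.path:
    sys.path.append(str(zip_path))
importlib.invalidate_caches()

# The package is now importable straight from the archive.
import demo_pkg

print(demo_pkg.GREETING)
print(demo_pkg.__file__)  # the path points inside demo_lib.zip
```

If the import of scalpel fails after appending the real archive, printing sys.path and checking that the zip path is present and spelled correctly is a reasonable first check.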

2. Creating a CohortCollection

You can create a CohortCollection directly by hand, but this is neither recommended nor the best way of doing it.

The canonical way is to load a metadata file that has been created through SCALPEL-Extraction.

from scalpel.core.cohort_collection import CohortCollection

cc = CohortCollection.from_json("metadata_fall_2020_01_27_16_46_39.json")
cc.cohorts_names
{'Cardiac',
 'HTA',
 'IPP',
 'Opioids',
 'acts',
 'control_drugs_exposures',
 'control_drugs_purchases',
 'diagnoses',
 'drug_purchases',
 'epileptics',
 'exposures',
 'extract_patients',
 'filter_patients',
 'follow_up',
 'fractures',
 'interactions',
 'liberal_acts',
 'prescriptions',
 'prescriptions_exposures'}

Right, we officially have a CohortCollection with a number of Cohorts. But wait, what is a CohortCollection? No rush, here is the definition from the SCALPEL paper, section 2.5:

The CohortCollection abstraction is a collection of Cohorts on which operations can be jointly performed. The CohortCollection has metadata that keeps information about each Cohort, such as the successive operations performed on it, the Parquet files they are stored in and a git commit hash of the code producing the extraction from the Source.

A CohortCollection can be seen in the same way as a Python list or dict. It is a bag full of Cohorts that you can iterate over, and it also lets you apply operations to all the Cohorts at once. Finally, it supports operations specific to the CohortCollection, such as finding the basic subjects Cohort, which is the largest Cohort in the CohortCollection. We have a dedicated tutorial for CohortCollection.
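
As a rough mental model of this "bag of Cohorts" idea, the sketch below uses a plain dict whose values are sets of patient IDs. This is a toy stand-in with made-up data, not the SCALPEL-Analysis implementation: real Cohorts carry much more than a set of IDs.

```python
from typing import Dict, Set

# Toy stand-in: each "cohort" is just a set of patient IDs (hypothetical data).
toy_cohorts: Dict[str, Set[str]] = {
    "extract_patients": {"p1", "p2", "p3", "p4"},
    "fractures": {"p2", "p4"},
    "exposures": {"p1", "p2", "p3"},
}

# Iterate over the collection like a dict.
for name, subjects in toy_cohorts.items():
    print(f"{name}: {len(subjects)} subjects")

# Apply one operation to every cohort at once.
sizes = {name: len(subjects) for name, subjects in toy_cohorts.items()}

# A collection-level operation: the cohort with the most subjects.
base_name = max(sizes, key=sizes.get)
print("largest cohort:", base_name)
```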

3. First view of the CohortCollection: a bag of Cohorts

We can look at a CohortCollection from different viewpoints. The first is to see it simply as a Python dict, where the keys are the unique names of the Cohorts and the values are Cohort objects. To access this view, use the cohorts property of the CohortCollection.

for (name, cohort) in cc.cohorts.items():
    print(cohort.describe())
Events are Opioids. Events contain only subjects with event Opioids.
Events are IPP. Events contain only subjects with event IPP.
Events are Cardiac. Events contain only subjects with event Cardiac.
Events are epileptics. Events contain only subjects with event epileptics.
Events are HTA. Events contain only subjects with event HTA.
Events are drug_purchases. Events contain only subjects with event drug_purchases.
This a subject cohort, no event needed. Subjects are from operation extract_patients.
This a subject cohort, no event needed. Subjects are from operation filter_patients.
Events are follow_up. Events contain only subjects with event follow_up.
Events are control_drugs_purchases. Events contain only subjects with event control_drugs_purchases.
Events are control_drugs_exposures. Events contain only subjects with event control_drugs_exposures.
Events are prescriptions. Events contain only subjects with event prescriptions.
Events are prescriptions_exposures. Events contain only subjects with event prescriptions_exposures.
Events are exposures. Events contain only subjects with event exposures.
Events are interactions. Events contain only subjects with event interactions.
Events are diagnoses. Events contain only subjects with event diagnoses.
Events are acts. Events contain only subjects with event acts.
Events are liberal_acts. Events contain only subjects with event liberal_acts.
Events are fractures. Events contain only subjects with event fractures.

You can thus perform all the operations associated with classic collection manipulation. For instance, if you have a date filter that you want to apply to all your Cohorts at once, you can iterate over the CohortCollection and build a new CohortCollection.

Our aim is to make CohortCollection a functor, but for now you can do it the old-fashioned way.
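
The "old-fashioned way" can be sketched as a small helper that maps a function over every entry and returns a new collection. Everything here is hypothetical and for illustration only: cohorts are modelled as lists of (patient ID, date) tuples, and map_cohorts and keep_events_from are names invented for this sketch, not SCALPEL-Analysis functions.

```python
from datetime import date
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, date]            # (patient_id, event_date), toy model
ToyCohort = List[Event]
ToyCollection = Dict[str, ToyCohort]


def map_cohorts(collection: ToyCollection,
                fn: Callable[[ToyCohort], ToyCohort]) -> ToyCollection:
    """Apply fn to every cohort and return a new collection (input untouched)."""
    return {name: fn(cohort) for name, cohort in collection.items()}


def keep_events_from(start: date) -> Callable[[ToyCohort], ToyCohort]:
    """Build a filter that keeps events dated on or after start."""
    def _filter(cohort: ToyCohort) -> ToyCohort:
        return [event for event in cohort if event[1] >= start]
    return _filter


toy_collection: ToyCollection = {
    "fractures": [("p1", date(2014, 6, 1)), ("p2", date(2015, 3, 9))],
    "exposures": [("p1", date(2015, 1, 1)), ("p3", date(2013, 12, 31))],
}

# One date filter, applied to all cohorts at once.
filtered = map_cohorts(toy_collection, keep_events_from(date(2015, 1, 1)))
for name, events in filtered.items():
    print(name, events)
```

With the real library, the same pattern would iterate over cc.cohorts.items() and apply whatever filter you need to each Cohort before building the new CohortCollection.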

Nevertheless, other operations are already implemented, and can be called directly from the CohortCollection object:

  1. Intersection.
  2. Union.
  3. Difference.
  4. Existence: checks if a Cohort is in the CohortCollection, only by name.
  5. Adding.
  6. Accessing by name.

You can take a closer look at each of these operations separately.
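
To give a feel for the six operations listed above, here is a toy sketch on plain dicts. It assumes name-based semantics (as the "only by name" remark for Existence suggests) and that the second collection wins on a name clash in a union; both are assumptions of this sketch, and the actual SCALPEL-Analysis behaviour may differ. The helper names are hypothetical.

```python
from typing import Dict, Set

ToyCollection = Dict[str, Set[str]]  # cohort name -> patient IDs (toy model)


def union(a: ToyCollection, b: ToyCollection) -> ToyCollection:
    """All cohorts from both collections; b wins on a name clash (assumption)."""
    merged = dict(a)
    merged.update(b)
    return merged


def intersection(a: ToyCollection, b: ToyCollection) -> ToyCollection:
    """Cohorts of a whose names also appear in b."""
    return {name: cohort for name, cohort in a.items() if name in b}


def difference(a: ToyCollection, b: ToyCollection) -> ToyCollection:
    """Cohorts of a whose names do not appear in b."""
    return {name: cohort for name, cohort in a.items() if name not in b}


first = {"fractures": {"p1", "p2"}, "exposures": {"p1"}}
second = {"exposures": {"p3"}, "follow_up": {"p1", "p3"}}

print(sorted(union(first, second)))
print(sorted(intersection(first, second)))
print(sorted(difference(first, second)))

# Existence, adding, and access by name map directly onto dict operations.
has_fractures = "fractures" in first
first["acts"] = {"p2"}
acts = first["acts"]
```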

4. Second view: storage of Cohorts

A CohortCollection can be used to save and load Cohorts, and since it behaves like a dict, it is easy to translate to JSON format.

# You can easily save the CohortCollection

save_dict = cc.save('/user/sebiat/cohortcollection')
# Load it back 
loaded_cc = CohortCollection.load(save_dict)

loaded_cc.cohorts_names
{'Cardiac',
 'HTA',
 'IPP',
 'Opioids',
 'acts',
 'control_drugs_exposures',
 'control_drugs_purchases',
 'diagnoses',
 'drug_purchases',
 'epileptics',
 'exposures',
 'extract_patients',
 'filter_patients',
 'follow_up',
 'fractures',
 'interactions',
 'liberal_acts',
 'prescriptions',
 'prescriptions_exposures'}
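
Because the saved description is a dict, persisting it as JSON is a short round trip. The sketch below uses a made-up dict; the keys output_path and kind are hypothetical and do not reflect the actual structure returned by cc.save.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for a dict describing where each cohort was saved.
toy_save_dict = {
    "fractures": {"output_path": "/hypothetical/fractures.parquet", "kind": "events"},
    "extract_patients": {"output_path": "/hypothetical/patients.parquet", "kind": "subjects"},
}

# Write the dict to disk as JSON.
json_path = Path(tempfile.mkdtemp()) / "toy_collection.json"
json_path.write_text(json.dumps(toy_save_dict, indent=2), encoding="utf-8")

# Read it back; the result is an equal dict.
restored = json.loads(json_path.read_text(encoding="utf-8"))
print(sorted(restored))
```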

5. Add subject information

The majority of studies do not provide demographics about the subjects of each Cohort. Generally, there is one big Cohort that contains the largest number of subjects and has all the demographics. CohortCollection lets you get that specific Cohort and spread its demographics through the whole collection.
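
Conceptually, spreading demographics amounts to joining each Cohort's subjects against the big Cohort on the patient ID. The toy sketch below illustrates that idea with plain dicts and made-up data. It assumes an inner join, so subjects missing from the base Cohort are dropped; that choice and the helper name add_demographics are assumptions of this sketch, not a description of the library's internals.

```python
from typing import Dict, List

# Base cohort: patient ID -> demographics (hypothetical values).
base_subjects: Dict[str, dict] = {
    "p1": {"gender": 1, "birth_year": 1939},
    "p2": {"gender": 2, "birth_year": 1940},
    "p3": {"gender": 1, "birth_year": 1924},
}

# Other cohorts only know their patient IDs.
cohort_subject_ids: Dict[str, List[str]] = {
    "fractures": ["p2", "p3", "p9"],   # p9 is absent from the base cohort
    "exposures": ["p1"],
}


def add_demographics(subject_ids: List[str], base: Dict[str, dict]) -> Dict[str, dict]:
    """Inner join on patient ID: keep only subjects known to the base cohort."""
    return {pid: base[pid] for pid in subject_ids if pid in base}


enriched = {
    name: add_demographics(ids, base_subjects)
    for name, ids in cohort_subject_ids.items()
}
for name, subjects in enriched.items():
    print(name, subjects)
```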

# First check that the demographics are not present 
cc.get('fractures').subjects.show(5)
+-----------+
|  patientID|
+-----------+
|XXXXXXXXXXX|
|XXXXXXXXXXX|
|XXXXXXXXXXX|
|XXXXXXXXXXX|
|XXXXXXXXXXX|
+-----------+
only showing top 5 rows
# Add subject information

from datetime import datetime

import pytz

# Reference date passed to add_subjects_information
AGE_REFERENCE_DATE = datetime(2015, 1, 1, tzinfo=pytz.UTC)
cc.add_subjects_information('omit_all', reference_date=AGE_REFERENCE_DATE)

cc.get('fractures').subjects.show(5)
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
2020-03-05 16:27:31 WARNING : Some patients and their events might be ommited
+-----------+------+-------------------+-------------------+---+---------+
|  patientID|gender|          birthDate|          deathDate|age|ageBucket|
+-----------+------+-------------------+-------------------+---+---------+
|XXXXXXXXXXX|     1|1939-03-01 00:00:00|               null| 75|       15|
|XXXXXXXXXXX|     1|1924-10-01 01:00:00|               null| 90|       18|
|XXXXXXXXXXX|     2|1940-02-01 00:00:00|               null| 74|       14|
|XXXXXXXXXXX|     1|1920-08-01 01:00:00|2013-12-21 01:00:00| 94|       18|
|XXXXXXXXXXX|     1|1924-04-01 01:00:00|               null| 90|       18|
+-----------+------+-------------------+-------------------+---+---------+
only showing top 5 rows

As you can see, the demographics have been added.
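
The age and ageBucket values shown above are consistent with an age in completed years at the reference date and a 5-year bucket obtained by integer division. The sketch below reproduces the five displayed rows under that reading. It is inferred from the output only, not taken from the library source, and it ignores time zones and the time-of-day part of birthDate.

```python
from datetime import datetime


def age_in_years(birth: datetime, reference: datetime) -> int:
    """Completed years between birth and reference."""
    years = reference.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_bucket(age: int, width: int = 5) -> int:
    """Index of the width-year bucket (assumed bucketing: integer division)."""
    return age // width


REFERENCE = datetime(2015, 1, 1)

# Birth dates taken from the five rows displayed above.
birth_dates = [
    datetime(1939, 3, 1),
    datetime(1924, 10, 1),
    datetime(1940, 2, 1),
    datetime(1920, 8, 1),
    datetime(1924, 4, 1),
]

ages = [age_in_years(b, REFERENCE) for b in birth_dates]
buckets = [age_bucket(a) for a in ages]
print(ages)     # [75, 90, 74, 94, 90]
print(buckets)  # [15, 18, 14, 18, 18]
```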
