[CircleCI](https://circleci.com/gh/X-DataInitiative/SCALPEL-Analysis)
[codecov](https://codecov.io/gh/X-DataInitiative/SCALPEL-Analysis)
[License: BSD 3-Clause](https://opensource.org/licenses/BSD-3-Clause)

# SCALPEL-Analysis

SCALPEL-Analysis is a library of the SCALPEL3 framework, which results from a research
partnership between [École Polytechnique](https://www.polytechnique.edu/en) and
[Caisse Nationale d'Assurance Maladie](https://assurance-maladie.ameli.fr/qui-sommes-nous/fonctionnement/organisation/cnam-tete-reseau),
started in 2015 by [Emmanuel Bacry](http://www.cmap.polytechnique.fr/~bacry/) and
[Stéphane Gaïffas](https://stephanegaiffas.github.io/). Since then, many research
engineers and PhD students have developed and used this framework to do research on
SNDS data; the full list of contributors is available in [CONTRIBUTORS.md](CONTRIBUTORS.md).
This library is based on [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html).
It provides useful abstractions that ease the analysis and manipulation of cohort data.
While it can be used as a standalone, it expects inputs formatted like the data
resulting from SCALPEL-Extraction concept extraction, that is, a `metadata.json` file
tracking the cohort data on disk or on HDFS:

```json
{
  "operations" : [ {
    "name" : "base_population",
    "inputs" : [ "DCIR", "MCO", "IR_BEN_R", "MCO_CE" ],
    "output_type" : "patients",
    "output_path" : "/some/path/to/base_population/data",
    "population_path" : ""
  }, {
    "name" : "drug_dispenses",
    "inputs" : [ "DCIR", "MCO", "MCO_CE" ],
    "output_type" : "acts",
    "output_path" : "/some/path/to/drug_dispenses/data",
    "population_path" : "/some/path/to/drug_dispenses/patients"
  }, ... ]
}
```

where:

- `name` contains the cohort name
- `inputs` indicates the data sources used to compute this cohort
- `output_type` indicates whether the cohort contains only `patients` or some event type (which can be custom)
- `output_path` contains the path to a parquet file containing the data
- When `output_type` is not `patients`, `output_path` is used to store events. In this case,
  `population_path` points to a parquet file containing data on the population.
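
For instance, here is a minimal sketch, using only the standard library, that lists the cohorts tracked by such a file (the file path is an assumption for illustration):

```python
import json

# Hypothetical path, for illustration only.
with open('/path/to/metadata.json') as f:
    metadata = json.load(f)

# Print each cohort's name, output type, and data location.
for operation in metadata['operations']:
    print(operation['name'], operation['output_type'], operation['output_path'])
```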

In our example, the cohort data is stored in parquet format. If we load this data
with PySpark and print it as strings (see the sketch below), it should look like
the following tables:
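
A minimal sketch, assuming a default `SparkSession` and the paths from the metadata example above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print each cohort DataFrame as an ASCII table.
spark.read.parquet('/some/path/to/base_population/data').show()
spark.read.parquet('/some/path/to/drug_dispenses/data').show()
spark.read.parquet('/some/path/to/drug_dispenses/patients').show()
```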

```
base_population/data
+---------+------+-------------------+-------------------+
|patientID|gender|          birthDate|          deathDate|
+---------+------+-------------------+-------------------+
|    Alice|     2|1934-07-27 00:00:00|               null|
|      Bob|     1|1951-05-01 00:00:00|               null|
|   Carole|     2|1942-01-12 00:00:00|               null|
|    Chuck|     1|1933-10-03 00:00:00|2011-06-20 00:00:00|
|    Craig|     1|1943-07-27 00:00:00|2012-12-10 00:00:00|
|      Dan|     1|1971-10-07 00:00:00|               null|
|     Erin|     2|1924-01-12 00:00:00|               null|
+---------+------+-------------------+-------------------+
```

```
drug_dispenses/data
+---------+--------+-------+-----+------+-------------------+-------------------+
|patientID|category|groupID|value|weight|              start|                end|
+---------+--------+-------+-----+------+-------------------+-------------------+
|    Alice|exposure|   null|DrugA|   1.0|2013-08-08 00:00:00|2013-10-07 00:00:00|
|    Alice|exposure|   null|DrugB|   1.0|2012-09-11 00:00:00|2012-12-30 00:00:00|
|    Alice|exposure|   null|DrugC|   1.0|2013-01-23 00:00:00|2013-03-24 00:00:00|
|   Carole|exposure|   null|DrugB|   1.0|2010-01-25 00:00:00|2010-12-13 00:00:00|
|      Dan|exposure|   null|DrugA|   1.0|2012-11-29 00:00:00|2013-01-28 00:00:00|
|     Erin|exposure|   null|DrugC|   1.0|2010-09-09 00:00:00|2011-01-17 00:00:00|
|      Eve|exposure|   null|DrugA|   1.0|2010-04-30 00:00:00|2010-08-02 00:00:00|
+---------+--------+-------+-----+------+-------------------+-------------------+
```

```
drug_dispenses/patients
+---------+
|patientID|
+---------+
|    Alice|
|   Carole|
|      Dan|
|     Erin|
|      Eve|
+---------+
```

In these tables,

* `patientID` is a string identifying patients
* `gender` is an int indicating gender (1 for male, 2 for female; we use the same coding as the SNDS)
* `birthDate` and `deathDate` are datetimes; `deathDate` can be null
* `category` is a string used to indicate event types (drug purchase, act, drug exposure, etc.). It can be custom.
* `groupID` is a string. It is a "free" field, which is often used to perform aggregations. For example, you can use it to
indicate drug ATC classes.
* `value` is a string used to indicate the precise nature of the event. For example, it can
contain the CIP13 code of a drug or an ICD-10 code of a disease.
* `weight` is a float; it can be used to represent quantitative information tied to the event,
such as the number of purchased boxes for drug purchase events.
* `start` and `end` are datetimes marking the beginning and the end of the event

An event is defined by the tuple `(patientID, category, groupID, value, weight, start, end)`.
`category`, `groupID`, `value` and `weight` are flexible fields: fill them with
the data which best suits your needs.
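
As an illustration, here is a minimal sketch building an events DataFrame with this schema by hand (the column types are inferred from the tables above):

```python
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.getOrCreate()

# Schema following the (patientID, category, groupID, value, weight,
# start, end) convention described above.
schema = StructType([
    StructField('patientID', StringType()),
    StructField('category', StringType()),
    StructField('groupID', StringType()),
    StructField('value', StringType()),
    StructField('weight', DoubleType()),
    StructField('start', TimestampType()),
    StructField('end', TimestampType()),
])

# A single drug exposure event.
events = spark.createDataFrame(
    [('Alice', 'exposure', None, 'DrugA', 1.0,
      datetime(2013, 8, 8), datetime(2013, 10, 7))],
    schema,
)
events.show()
```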

Note that the sets of subjects present in `base_population` and `drug_dispenses` do not need to be exactly the same.

### Loading data into Cohorts
One can either create cohorts manually:

```python
from pyspark.sql import SparkSession
from scalpel.core.cohort import Cohort

spark = SparkSession.builder.appName('SCALPEL-Analysis-example').getOrCreate()
events = spark.read.parquet('/some/path/to/drug_dispenses/data')
subjects = spark.read.parquet('/some/path/to/drug_dispenses/patients')
drug_dispense_cohort = Cohort('drug_dispenses',
                              'Cohort of subjects having drug dispenses events',
                              subjects,
                              events)
```

or import all the cohorts from a `metadata.json` file:

```python
from scalpel.core.cohort_collection import CohortCollection

cc = CohortCollection.from_json('/path/to/metadata.json')
print(cc.cohorts_names)  # Should print ['base_population', 'drug_dispenses']
drug_dispenses_cohort = cc.get('drug_dispenses')
base_population_cohort = cc.get('base_population')
# To access cohort data:
drug_dispenses_cohort.subjects
drug_dispenses_cohort.events
```

## Cohort manipulation

Cohorts can be manipulated easily, thanks to algebraic operations:

```python
# Subjects in base population who have drug dispenses
study_cohort = base_population_cohort.intersection(drug_dispenses_cohort)
# Subjects in base population who have no drug dispenses
study_cohort = base_population_cohort.difference(drug_dispenses_cohort)
# All the subjects either in base population or who have drug dispenses
study_cohort = base_population_cohort.union(drug_dispenses_cohort)
```

Note that these operations are not commutative:
`base_population_cohort.union(drug_dispenses_cohort)` is not equivalent to
`drug_dispenses_cohort.union(base_population_cohort)`. Indeed, for now, these
operations are based on `cohort.subjects`, and the resulting events come from
the cohort on which the method is called. Consider the following sketch, where
`foo` and `bar` are hypothetical names:
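
```python
foo = base_population_cohort.union(drug_dispenses_cohort)
bar = drug_dispenses_cohort.union(base_population_cohort)
```

Here, `foo` will not contain events, as there are no events in
`base_population_cohort`, while `bar` will contain the events of
`drug_dispenses_cohort`.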

We plan to extend these manipulations in the near future to allow performing operations on
subjects and events in a single line of code.

## CohortFlow
`CohortFlow` objects can be used to track the evolution of a study population during the
cohort design process. Let us assume that you have a `CohortCollection` containing
`base_population`, `exposed`, and `cases`, respectively containing the base population of
your study, the subjects exposed to some drugs along with their exposure events, and the
subjects having some disease along with their disease events.

`CohortFlow` allows you to check changes in your population structure while working
on your cohort:

```python
import matplotlib.pyplot as plt
from scalpel.stats.patients import distribution_by_gender_age_bucket
from scalpel.core.cohort_flow import CohortFlow

ordered_cohorts = [exposed, cases]

flow = CohortFlow(ordered_cohorts)
# We use 'base_population' as the base population
steps = flow.compute_steps(base_population)

for cohort in steps:
    figure = plt.figure(figsize=(8, 4.5))
    distribution_by_gender_age_bucket(cohort=cohort, figure=figure)
    plt.show()
```

In this example, `CohortFlow` iteratively computes the intersection between the base
cohort (`base_population`) and the cohorts in `ordered_cohorts`, resulting in three
steps (sketched in code below):

* `base_population`: all subjects
* `base_population.intersection(exposed)`: exposed subjects
* `base_population.intersection(exposed).intersection(cases)`: exposed subjects who
are cases
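
Equivalently, these steps could be written by hand as follows (a sketch relying on the `intersection` operation described above; the `step_*` names are illustrative):

```python
# Hand-written equivalent of the three CohortFlow steps.
step_1 = base_population
step_2 = step_1.intersection(exposed)
step_3 = step_2.intersection(cases)
```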

Calling `distribution_by_gender_age_bucket` at each step allows us to track any change
in demographics induced by restricting the subjects to the exposed cases.

Many more plotting and statistical logging utilities available in `scalpel.stats` can be
used in the same way.

## Installation
Clone this repo and add it to the `PYTHONPATH` to use it in scripts or notebooks. To add
the library temporarily to your `PYTHONPATH`, just add

    import sys
    sys.path.append('/path/to/the/SCALPEL-Analysis')

at the beginning of your scripts.

> **Important remark**: This software is currently in alpha stage. It should be fairly stable,
> but the API might still change and the documentation is partial. We are currently doing our best
> to improve documentation coverage as quickly as possible.

### Requirements

Python 3.6.5 or above and the libraries listed in
[requirements.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements.txt).

To create a virtual environment with `conda` and install the requirements, just run

    conda create -n <env name> python=3.6.5
    source activate <env name>
    pip install -r requirements.txt

## Citation

If you use a library part of _SCALPEL3_ in a scientific publication, we would appreciate citations. You can use the following bibtex entry:

    @article{2019arXiv191007045,
      author = {{Bacry}, E. and {Ga{\"{i}}ffas}, S. and {Leroy}, F. and {Morel}, M. and {Nguyen}, D. P. and {Sebiat}, Y. and {Sun}, D.},
      title = {{SCALPEL3: a scalable open-source library for healthcare claims databases}},
      journal = {ArXiv e-prints},
      eprint = {1910.07045},
      url = {http://arxiv.org/abs/1910.07045},
      year = 2019,
      month = oct
    }

## Contributing

The development cycle is opinionated. Each time you commit, git will
launch four checks before it allows you to finish your commit:
1. We use [black](https://github.com/ambv/black) to format the code.
We encourage you to install it and integrate it into your code editor or IDE.
2. Flake8 enforces some extra checks.
3. Tests are run with Nosetests.
4. Coverage checks that the minimum test coverage is reached.

To activate the pre-commit hook, you just have to install the
[requirements-dev.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements-dev.txt)
dependencies and run:

    source activate <env name>
    cd SCALPEL-Analysis
    pre-commit install

To launch the tests, just run

    cd SCALPEL-Analysis
    nosetests