
Commit 8ad569e

Merge pull request #21 from X-DataInitiative/rename-library
Naming & doc prior to release
2 parents f26cb5c + 4f26c28 · commit 8ad569e


65 files changed: +1214 / -747 lines (only a subset of the changed files is shown here)

.circleci/config.yml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
           command: |
             eval "$(pyenv init -)"
             pyenv local 3.5.3
-            cat /dev/null | python -m nose --with-coverage --cover-package=src/exploration/core --cover-package=src/exploration/loaders --cover-package=src/exploration/flattening
+            cat /dev/null | python -m nose --with-coverage --cover-package=scalpel/core --cover-package=scalpel/loaders --cover-package=scalpel/flattening
 
     - run:
           name: Run coverage

.coveragerc

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
 [run]
 branch = True
-source = src/exploration
+source = scalpel
 
 [report]
 exclude_lines =
@@ -11,6 +11,6 @@ exclude_lines =
 ignore_errors = True
 omit =
     tests/*
-    src/libs/*
-    src/stats/*
-    src/study/*
+    scalpel/libs/*
+    scalpel/stats/*
+    scalpel/study/*

.gitignore

Lines changed: 7 additions & 5 deletions
@@ -1,13 +1,15 @@
+.idea/
 
-
-\.idea/
-
-src/libs/
+scalpel/libs/
 
 test\.py
 
 *.pyc
 
 *.log
 
-\.coverage
+.coverage
+
+spark-warehouse
+scalpel/__pycache__/
+scalpel/core/__pycache__/

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ repos:
     hooks:
       - id: nosetests
         name: nosetests
-        entry: bash -ec 'nosetests --with-coverage --cover-package=src/exploration/core --cover-package=src/exploration/loaders --cover-package=src/exploration/flattening'
+        entry: bash -ec 'nosetests --with-coverage --cover-package=scalpel/core --cover-package=scalpel/drivers --cover-package=scalpel/flattening'
         language: system
         files: \.py$
   - repo: local

CONTRIBUTORS.md

Lines changed: 3 additions & 1 deletion
@@ -1,7 +1,9 @@
 # Contributors
 
-The _SNIIRAM-exploration_ package was initially implemented by researchers, developers, and PhD students at [CMAP](http://www.cmap.polytechnique.fr/?lang=en).
+The _SCALPEL-Analysis_ package was initially implemented by researchers, developers, and PhD students at [CMAP](http://www.cmap.polytechnique.fr/?lang=en).
 
 ## List of Contributors
 
 - Youcef Sebiat
+- Maryan Morel
+- Dinh Phong Nguyen

LICENSE.txt

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 BSD 3-Clause License
 
-Copyright (c) 2018, The SNIIRAM-exploration developers
+Copyright (c) 2019, The SCALPEL-Analysis developers
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without

Makefile

Lines changed: 3 additions & 1 deletion
@@ -1,3 +1,5 @@
+# License: BSD 3 clause
+
 help:
 	@echo "clean - remove all build, test, coverage and Python artifacts"
 	@echo "clean-pyc - remove Python file artifacts"
@@ -50,5 +52,5 @@ test:
 
 build: clean
 	mkdir ./dist
-	zip -x main.py -x \*libs\* -r ./dist/exploration.zip .
+	zip -x main.py -x \*libs\* -r ./dist/scalpel.zip .
 	cd ./src/libs && zip -r ../../dist/libs.zip .

README.md

Lines changed: 249 additions & 22 deletions
@@ -1,35 +1,262 @@
-# SNIIRAM-exploration
+[![CircleCI](https://circleci.com/gh/X-DataInitiative/SCALPEL-Analysis/tree/master.svg?style=shield&circle-token=77551e927f0d9f66b6c4755743d2cb7f5753395c)](https://circleci.com/gh/X-DataInitiative/SCALPEL-Analysis)
+[![codecov](https://codecov.io/gh/X-DataInitiative/SCALPEL-Analysis/branch/master/graph/badge.svg?token=f78o8HzmAl)](https://codecov.io/gh/X-DataInitiative/SCALPEL-Analysis)
+[![License](https://img.shields.io/badge/License-BSD%203--Clause-blue.svg)](https://opensource.org/licenses/BSD-3-Clause)
+![Version](https://img.shields.io/github/v/release/X-DataInitiative/SCALPEL-Analysis?include_prereleases)
 
-Library that offers util abstractions to explore data extracted
-using SNIIRAM-featuring.
+# SCALPEL-Analysis
 
-Clone this repo and add it to the path to use it in notebooks.
+SCALPEL-Analysis is a library, part of the SCALPEL3 framework, resulting from a research partnership between [École Polytechnique](https://www.polytechnique.edu/en) &
+[Caisse Nationale d'Assurance Maladie](https://assurance-maladie.ameli.fr/qui-sommes-nous/fonctionnement/organisation/cnam-tete-reseau)
+started in 2015 by [Emmanuel Bacry](http://www.cmap.polytechnique.fr/~bacry/) and [Stéphane Gaïffas](https://stephanegaiffas.github.io/).
+Since then, many research engineers and PhD students have developed and used this framework
+to do research on SNDS data; the full list of contributors is available in [CONTRIBUTORS.md](CONTRIBUTORS.md).
+This library is based on [PySpark](https://spark.apache.org/docs/latest/api/python/pyspark.html). It provides
+useful abstractions easing cohort data analysis and manipulation. While it can be used
+as a standalone library, it expects inputs formatted like the data resulting from
+SCALPEL-Extraction concept extraction, that is, a metadata.json file tracking the
+cohort data on disk or on HDFS:
 
-## Requirements
+```json
+{
+  "operations" : [ {
+    "name" : "base_population",
+    "inputs" : [ "DCIR", "MCO", "IR_BEN_R", "MCO_CE" ],
+    "output_type" : "patients",
+    "output_path" : "/some/path/to/base_population/data",
+    "population_path" : ""
+  }, {
+    "name" : "drug_dispenses",
+    "inputs" : [ "DCIR", "MCO", "MCO_CE" ],
+    "output_type" : "acts",
+    "output_path" : "/some/path/to/drug_dispenses/data",
+    "population_path" : "/some/path/to/drug_dispenses/patients"
+  }, ... ]
+}
+```
 
-This needs python 3.5.3 or above.
+where:
 
-Make sure that you have a requierments-dev based active environnement.
+- `name` contains the cohort name
+- `inputs` indicates the data sources used to compute this cohort
+- `output_type` indicates whether the cohort contains only `patients` or some event type (which can be custom)
+- `output_path` contains the path to a parquet file containing the data
+- When `output_type` is not `patients`, `output_path` is used to store events. In this case,
+  `population_path` points to a parquet file containing data on the population.
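For orientation, such a metadata file can be inspected with nothing more than the standard library. The snippet below is a minimal sketch assuming the example layout shown above; the `CohortCollection` loader described further down is the supported way to read it.

```python
import json

# Minimal sketch: list the cohorts tracked by a metadata.json file.
# Assumes the example layout shown above; CohortCollection.from_json is
# the supported way to load this file into cohorts.
with open('/path/to/metadata.json') as f:
    metadata = json.load(f)

for operation in metadata['operations']:
    print(operation['name'], operation['output_type'], operation['output_path'])
```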
 
-    conda create -n exploration python=3.5.3
-    pip install -r requirements-dev.txt
+In our example, the input DataFrames contain data in parquet format. If we import this
+data with PySpark and output it as strings, it should look like this:
 
-## Running tests
-On your dev environnement, just launch the following command in the root of the project:
+```
+base_population/data
++---------+------+-------------------+-------------------+
+|patientID|gender|          birthDate|          deathDate|
++---------+------+-------------------+-------------------+
+|    Alice|     2|1934-07-27 00:00:00|               null|
+|      Bob|     1|1951-05-01 00:00:00|               null|
+|   Carole|     2|1942-01-12 00:00:00|               null|
+|    Chuck|     1|1933-10-03 00:00:00|2011-06-20 00:00:00|
+|    Craig|     1|1943-07-27 00:00:00|2012-12-10 00:00:00|
+|      Dan|     1|1971-10-07 00:00:00|               null|
+|     Erin|     2|1924-01-12 00:00:00|               null|
++---------+------+-------------------+-------------------+
+```
 
-    nosetests
-
-## Development
+```
+drug_dispenses/data
++---------+--------+-------+-----+------+-------------------+-------------------+
+|patientID|category|groupID|value|weight|              start|                end|
++---------+--------+-------+-----+------+-------------------+-------------------+
+|    Alice|exposure|   null|DrugA|   1.0|2013-08-08 00:00:00|2013-10-07 00:00:00|
+|    Alice|exposure|   null|DrugB|   1.0|2012-09-11 00:00:00|2012-12-30 00:00:00|
+|    Alice|exposure|   null|DrugC|   1.0|2013-01-23 00:00:00|2013-03-24 00:00:00|
+|   Carole|exposure|   null|DrugB|   1.0|2010-01-25 00:00:00|2010-12-13 00:00:00|
+|      Dan|exposure|   null|DrugA|   1.0|2012-11-29 00:00:00|2013-01-28 00:00:00|
+|     Erin|exposure|   null|DrugC|   1.0|2010-09-09 00:00:00|2011-01-17 00:00:00|
+|      Eve|exposure|   null|DrugA|   1.0|2010-04-30 00:00:00|2010-08-02 00:00:00|
++---------+--------+-------+-----+------+-------------------+-------------------+
+```
+
+```
+drug_dispenses/patients
++---------+
+|patientID|
++---------+
+|    Alice|
+|   Carole|
+|      Dan|
+|     Erin|
+|      Eve|
++---------+
+```
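The example outputs above can be reproduced with a few lines of PySpark; this is a sketch assuming a local Spark installation and the placeholder paths used in the metadata example.

```python
from pyspark.sql import SparkSession

# Sketch: read the example cohort files and print them, producing output
# similar to the tables above (paths are the placeholders from metadata.json).
spark = SparkSession.builder.appName('scalpel-readme-preview').getOrCreate()

spark.read.parquet('/some/path/to/base_population/data').show()
spark.read.parquet('/some/path/to/drug_dispenses/data').show()
spark.read.parquet('/some/path/to/drug_dispenses/patients').show()
```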
+
+In these tables,
+
+* `patientID` is a string identifying patients
+* `gender` is an int indicating gender (1 for male, 2 for female; we use the same coding as the SNDS)
+* `birthDate` and `deathDate` are datetimes; `deathDate` can be null
+* `category` is a string used to indicate event types (drug purchase, act, drug exposure, etc.). It can be custom.
+* `groupID` is a string. It is a "free" field, which is often used to perform aggregations. For example, you can use it to
+  indicate drug ATC classes.
+* `value` is a string used to indicate the precise nature of the event. For example, it can
+  contain the CIP13 code of a drug or an ICD10 code of a disease.
+* `weight` is a float; it can be used to represent quantitative information tied to the event,
+  such as the number of purchased boxes for drug purchase events.
+
+An event is defined by the tuple `(patientID, category, groupID, value, weight, start, end)`.
+`category`, `groupID`, `value` and `weight` are flexible fields: you can fill them with
+the data which best suits your needs.
+
+Note that the sets of subjects present in `population` and `drug_dispenses` do not need to be exactly the same.
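To make the schema concrete, here is a minimal sketch that builds a tiny events DataFrame by hand and wraps it in a `Cohort` (the class is introduced in the next section); the rows, the ATC-like `groupID`, and the cohort name are hypothetical.

```python
from datetime import datetime

from pyspark.sql import SparkSession
from scalpel.core.cohort import Cohort

spark = SparkSession.builder.appName('scalpel-event-schema-example').getOrCreate()

# Hypothetical rows following the event schema
# (patientID, category, groupID, value, weight, start, end).
events = spark.createDataFrame(
    [('Alice', 'exposure', 'N02BE', 'DrugA', 1.0,
      datetime(2013, 8, 8), datetime(2013, 10, 7)),
     ('Carole', 'exposure', 'N02BE', 'DrugB', 1.0,
      datetime(2010, 1, 25), datetime(2010, 12, 13))],
    ['patientID', 'category', 'groupID', 'value', 'weight', 'start', 'end'])
subjects = events.select('patientID').distinct()

toy_cohort = Cohort('toy_exposures', 'Hand-built example cohort', subjects, events)
```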
+
+### Loading data into Cohorts
+One can either create cohorts manually:
+
+```python
+from pyspark.sql import SparkSession
+from scalpel.core.cohort import Cohort
+
+spark = SparkSession.builder.appName('SCALPEL-Analysis-example').getOrCreate()
+events = spark.read.parquet('/some/path/to/drug_dispenses/data')
+subjects = spark.read.parquet('/some/path/to/drug_dispenses/patients')
+drug_dispense_cohort = Cohort('drug_dispenses',
+                              'Cohort of subjects having drug dispenses events',
+                              subjects,
+                              events)
+```
+
+or import all the cohorts from a metadata.json file:
+
+```python
+from scalpel.core.cohort_collection import CohortCollection
+cc = CohortCollection.from_json('/path/to/metadata.json')
+print(cc.cohorts_names)  # Should print ['base_population', 'drug_dispenses']
+drug_dispenses_cohort = cc.get('drug_dispenses')
+base_population_cohort = cc.get('base_population')
+# To access cohort data:
+drug_dispenses_cohort.subjects
+drug_dispenses_cohort.events
+```
+
+## Cohort manipulation
+
+Cohorts can be manipulated easily, thanks to algebraic operations:
+
+```python
+# Subjects in base population who have drug dispenses
+study_cohort = base_population_cohort.intersection(drug_dispenses_cohort)
+# Subjects in base population who have no drug dispenses
+study_cohort = base_population_cohort.difference(drug_dispenses_cohort)
+# All the subjects either in base population or who have drug dispenses
+study_cohort = base_population_cohort.union(drug_dispenses_cohort)
+```
+
+Note that these operations are not commutative:
+`base_population_cohort.union(drug_dispenses_cohort)` is not equivalent to
+`drug_dispenses_cohort.union(base_population_cohort)`. Indeed, for now, these
+operations are based on `cohort.subjects`. It means that the first result will not contain events,
+as there are no events in `base_population_cohort`, while the second will contain the events
+derived from `drug_dispenses_cohort`.
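A minimal sketch of this asymmetry, assuming the two cohorts loaded above:

```python
# Same subjects in both results, but only the second keeps events, since
# (as noted above) the result is currently built from the left-hand cohort.
subjects_only = base_population_cohort.union(drug_dispenses_cohort)
with_events = drug_dispenses_cohort.union(base_population_cohort)
```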
+
+We plan to extend these manipulations in the near future to allow performing operations on
+subjects and events in a single line of code.
+
+## CohortFlow
+`CohortFlow` objects can be used to track the evolution of a study population during the
+cohort design process. Let us assume that you have a `CohortCollection` containing
+`base_population`, `exposed`, and `cases`, containing respectively the base population of
+your study, the subjects exposed to some drugs and their exposure events, and the subjects
+having some disease and their disease events.
+
+`CohortFlow` allows you to check changes in your population structure while working
+on your cohort:
+
+```python
+import matplotlib.pyplot as plt
+from scalpel.stats.patients import distribution_by_gender_age_bucket
+from scalpel.core.cohort_flow import CohortFlow
+
+ordered_cohorts = [exposed, cases]
+
+flow = CohortFlow(ordered_cohorts)
+# We use base_population as the base cohort of the flow
+steps = flow.compute_steps(base_population)
+
+for cohort in flow.steps:
+    figure = plt.figure(figsize=(8, 4.5))
+    distribution_by_gender_age_bucket(cohort=cohort, figure=figure)
+    plt.show()
+```
+
+In this example, `CohortFlow` iteratively computes the intersection between the base
+cohort (`base_population`) and the cohorts in `ordered_cohorts`, resulting in three
+steps:
+
+* `base_population`: all subjects
+* `base_population.intersection(exposed)`: exposed subjects
+* `base_population.intersection(exposed).intersection(cases)`: exposed subjects who
+  are cases
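Expressed with the cohort operations shown earlier, the three steps correspond to (a sketch assuming the cohorts above):

```python
# Explicit construction of the three CohortFlow steps listed above.
step_1 = base_population
step_2 = base_population.intersection(exposed)
step_3 = base_population.intersection(exposed).intersection(cases)
```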
+
+Calling `distribution_by_gender_age_bucket` at each step allows us to track any change
+in demographics induced by restricting the subjects to the exposed cases.
+
+Many more plotting and statistical logging utilities available in `scalpel.stats` can be used the
+same way.
+
+## Installation
+Clone this repo and add it to the `PYTHONPATH` to use it in scripts or notebooks. To add
+the library temporarily to your `PYTHONPATH`, just add
+
+    import sys
+    sys.path.append('/path/to/the/SCALPEL-Analysis')
+
+at the beginning of your scripts.
+
+> **Important remark**: This software is currently in alpha stage. It should be fairly stable,
+> but the API might still change and the documentation is partial. We are currently doing our best
+> to improve documentation coverage as quickly as possible.
+
+### Requirements
+
+Python 3.6.5 or above and the libraries listed in
+[requirements.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements.txt).
+
+To create a virtual environment with `conda` and install the requirements, just run
+
+    conda create -n <env name> python=3.5.3
+    pip install -r requirements.txt
+
+## Citation
+
+If you use a library that is part of _SCALPEL3_ in a scientific publication, we would appreciate citations. You can use the following BibTeX entry:
+
+    @article{2019arXiv191007045,
+      author = {{Bacry}, E. and {Ga{\"{i}}ffas}, S. and {Leroy}, F. and {Morel}, M. and {Nguyen}, D. P. and {Sebiat}, Y. and {Sun}, D.},
+      title = {{SCALPEL3: a scalable open-source library for healthcare claims databases}},
+      journal = {ArXiv e-prints},
+      eprint = {1910.07045},
+      url = {http://arxiv.org/abs/1910.07045},
+      year = 2019,
+      month = oct
+    }
+
+## Contributing
 The development cycle is opinionated. Each time you commit, git will
 launch four checks before it allows you to finish your commit:
-1. Black: we encourage you to install it and integrate to your dev
-tool such as Pycharm. Check this [link](https://github.com/ambv/black). We massively encourage
-to use it with Pycharm as it will automatically
-2. Flake8: enforces some extra checks.
-3. Testing using Nosetests.
+1. We use [black](https://github.com/ambv/black) to format the code.
+   We encourage you to install it and integrate it into your code editor or IDE.
+2. Some extra checks are done using Flake8.
+3. Testing with Nosetests.
 4. Coverage checks if the minimum coverage is ensured.
 
-After cloning, you have to run in the root of the repo:
+To activate the pre-commit hook, you just have to install the
+[requirements-dev.txt](https://github.com/X-DataInitiative/SCALPEL-Analysis/blob/master/requirements-dev.txt)
+dependencies and run:
 
-    source activate exploration
-    pre-commit install
+    source activate <env name>
+    cd SCALPEL-Analysis
+    pre-commit install
+
+To launch the tests, just run
+
+    cd SCALPEL-Analysis
+    nosetests

scalpel/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# License: BSD 3 clause

scalpel/core/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# License: BSD 3 clause
