Commit 7f3f6d4

481 document new data ingestion model based on fhir documentation and generate sandbox data (#486)
* New data ingestion model documentation added
* sandbox generator examples added to data ingestion
* updating comparator values to rates
* sandbox generator added. preference and history read from ingestion data folder. config added for ingestion data folder. PractitionerRole.identifier used instead of PractitionerRole.practitioner.
1 parent 9a14795 commit 7f3f6d4

28 files changed

+16859
-1678
lines changed

.env.local

Lines changed: 18 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,19 @@
1-
preferences=file:///path/to/knowledge-base/preferences.json
1+
default_preferences=file:///path/to/knowledge-base/preferences.json
22
mpm=/path/to/knowledge-base/prioritization_algorithms/motivational_potential_model.csv
3-
manifest=file:////path/to/knowledge-base/mpog_local_manifest.yaml
4-
log_level=DEBUG
5-
generate_image=0
6-
cache_image=0
7-
outputs=0
8-
plot_goal_line=1
9-
display_window=6
3+
manifest=file:///path/to/knowledge-base/manifest.yaml
4+
5+
log_level=INFO
6+
7+
# defaults
8+
# meas_period=1
9+
# log_level=WARNING
10+
# generate_image=1
11+
# cache_image=0
12+
# outputs=0
13+
# performance_month=None
14+
# use_mi=1
15+
# use_preferences=1
16+
# use_history=1
17+
# use_coachiness=1
18+
# plot_goal_line=1
19+
# display_window=6

.env.remote

Lines changed: 6 additions & 3 deletions
@@ -1,9 +1,12 @@
11
# required knowledgebase paths
2-
mpm=https://raw.githubusercontent.com/Display-Lab/knowledge-base/1.7/prioritization_algorithms/motivational_potential_model.csv
3-
preferences=https://raw.githubusercontent.com/Display-Lab/knowledge-base/1.7/preferences.json
4-
manifest=https://raw.githubusercontent.com/Display-Lab/knowledge-base/refs/tags/1.7/mpog_manifest.yaml
2+
mpm=https://raw.githubusercontent.com/Display-Lab/knowledge-base-sandbox/1.0/prioritization_algorithms/motivational_potential_model.csv
3+
default_preferences=https://raw.githubusercontent.com/Display-Lab/knowledge-base-sandbox/1.0/preferences.json
4+
manifest=https://raw.githubusercontent.com/Display-Lab/knowledge-base-sandbox/refs/tags/1.0/manifest.yaml
5+
6+
log_level=INFO
57

68
# defaults
9+
# meas_period=1
710
# log_level=WARNING
811
# generate_image=1
912
# cache_image=0

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ outputs/*
1616
*.DS_Store
1717

1818

19-
*.csv
19+
2020
**/bin/__pycache__/
2121
python/.vscode/settings.json
2222
**/dist/

README.md

Lines changed: 32 additions & 13 deletions
@@ -27,7 +27,7 @@ cd scaffold
2727
**Using `venv` and `pip`**
2828

2929
```zsh
30-
python --version # make sure you have python 3.11
30+
python --version # make sure you have python 3.12
3131
python -m venv .venv
3232
```
3333

@@ -53,7 +53,7 @@ pip install uvicorn # not installed by default (needed for running locally)
5353
**Alternative: Using [Poetry](https://python-poetry.org/) (for developers)**
5454

5555
```zsh
56-
poetry env use 3.11 # optional, but makes sure you have python 3.11 available
56+
poetry env use 3.12 # optional, but makes sure you have python 3.12 available
5757
poetry install # creates a virtual environment and installs dependencies
5858
poetry shell # activates the environment
5959
```
@@ -64,7 +64,7 @@ Clone the knowledge base repository in a separate location
6464

6565
```zsh
6666
cd ..
67-
git clone https://github.com/Display-Lab/knowledge-base.git
67+
git clone https://github.com/Display-Lab/knowledge-base-sandbox.git
6868
```
6969

7070
#### Running SCAFFOLD
@@ -75,10 +75,10 @@ Change back to the root of scaffold
7575
cd scaffold
7676
```
7777

78-
Update the `.env.local` file and change `path/to/knowledge-base` to point to the local knowledge base that you just checked out. (Don't remove the `file://` for default_preferences and manifest.)
78+
Create a copy of the `.env.local` file and call it `.env.dev` and update it by changing `path/to/knowledge-base` to point to the local knowledge base that you just checked out. (Don't remove the `file://` for default_preferences and manifest.)
7979

8080
```properties
81-
# .env.local
81+
# .env.dev
8282
default_preferences=file:///Users/bob/knowledge-base/preferences.json
8383
mpm=/Users/bob/knowledge-base/prioritization_algorithms/motivational_potential_model.csv
8484
manifest=file:///Users/bob/knowledge-base/mpog_local_manifest.yaml
@@ -89,7 +89,7 @@ manifest=file:///Users/bob/knowledge-base/mpog_local_manifest.yaml
8989
There are two different ways to run SCAFFOLD API:
9090
1. Run SCAFFOLD API using uvicorn
9191
```zsh
92-
ENV_PATH=.env.local uvicorn scaffold.api:app
92+
ENV_PATH=.env.dev uvicorn scaffold.api:app
9393
# Expect to see a server start message like this "INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)"
9494
```
9595

@@ -118,24 +118,43 @@ ENV_PATH=/user/.../dev.env pipeline batch '/path/to/input/folder/' --max-files 5
118118
Use --max-files if you need to limit the number of files to process.
119119

120120
##### Run SCAFFOLD CLI with CSV inputs
121-
First install the python app. Then update the `.env.local` file and add links to history and preferences csv files along with other parameters mentioned earlier (manifest, default_preferences and mpm).
121+
First install the python app. Then create the `.env.dev` file as mentioned above.
122122

123123
```properties
124-
# .env.local
124+
# .env.dev
125125
default_preferences=file:///Users/bob/knowledge-base/preferences.json
126126
mpm=/Users/bob/knowledge-base/prioritization_algorithms/motivational_potential_model.csv
127127
manifest=file:///Users/bob/knowledge-base/mpog_local_manifest.yaml
128-
preferences=/Users/bob/data/preferences.csv
129-
history=/Users/bob/data/history.csv
130128
...
131129
```
132130
Then use the following command to run the pipeline passing performance data csv file
133131

134132
```zsh
135-
ENV_PATH=/user/.../dev.env pipeline batch_csv '/path/to/performance/data/file.csv' --performance-month {performance month i.e. 2024-05-01} --max-files 500
133+
ENV_PATH=/user/.../dev.env python -m scaffold.cli batch-csv '/path/to/performance/data/folder' --performance-month {performance month i.e. 2025-05-01} --max-files 500
136134
```
137-
Use --performance-month to set the performance month for batch_csv command and optional --max-files to limit the cases to process for development .
138135

136+
Alternatively, you can use pip to install the pipeline command and use it to run the pipeline. Use the following command in the root of the repository to install SCAFFOLD:
137+
138+
```zsh
139+
pip install .
140+
```
141+
142+
Then you can use the following command to run the pipeline
143+
```zsh
144+
ENV_PATH=/user/.../dev.env pipeline batch-csv '/path/to/performance/data/folder' --performance-month {performance month i.e. 2025-05-01} --max-files 500
145+
```
146+
147+
Alternatively, if you have poetry installed, you can run
148+
```zsh
149+
poetry install
150+
```
151+
152+
and then you should be able to use the following command to run the pipeline:
153+
154+
```zsh
155+
ENV_PATH=/user/.../dev.env pipeline batch-csv '/path/to/performance/data/folder' --performance-month {performance month i.e. 2025-05-01} --max-files 500
156+
```
157+
Use `--performance-month` to set the performance month for the `batch-csv` command, and the optional `--max-files` to limit the number of cases processed during development.
139158

140159
## Environment variables
141160

@@ -145,7 +164,7 @@ Local file path or URL (see .env.remote for github URL formats). All are require
145164

146165
#### mpm: Path to the mpm csv file
147166

148-
#### preferences: Path to the preferences json file
167+
#### default_preferences: Path to the default preferences json file
149168

150169
#### manifest: Path to the manifest file that includes the different pieces of the base graph (causal pathways, message templates, measures and comparators). See [manifest configuration](#manifest-configuration) for more detail
151170

data ingestion model/README.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
1+
# SCAFFOLD [Data Ingestion Model](https://docs.google.com/spreadsheets/d/1qDjS2-a7F1El53jUx0fippL3m28pQilLcYNAv4pQkxI/edit?gid=1258033503#gid=1258033503)
2+
We adopted several resources from the [FHIR standard](https://hl7.org/fhir/index.html) to model data ingestion for SCAFFOLD. The input data include:
3+
- Provider information
4+
- Performance data
5+
- Comparator data
6+
- Message history
7+
- Preference
8+
9+
The data was structured using FHIR resources to the extent possible. Since no suitable resources exist for history and preferences, those were represented using our own format.
10+
11+
## Data Structure
12+
### Provider information (`PractitionerRole`)
13+
The [`PractitionerRole`](https://build.fhir.org/practitionerrole.html) resource is used to represent message recipients (individuals or organizations), their relationships, and their roles. The input data include a table (PractitionerRole.csv) with the following columns:
14+
- **[PractitionerRole.identifier](https://build.fhir.org/practitionerrole-definitions.html#PractitionerRole.identifier)**: Unique identifier for each row in the PractitionerRole table. This identifier links performance data, history and preferences to each recipient. In those datasets, PractitionerRole.identifier is referred to as subject.
15+
- **[PractitionerRole.practitioner](https://build.fhir.org/practitionerrole-definitions.html#PractitionerRole.practitioner)**: Contains the practitioner identifier. If this column has a value, the row represents an individual practitioner; otherwise, it represents aggregate data for a group, for example a hospital.
16+
- **[PractitionerRole.organization](https://build.fhir.org/practitionerrole-definitions.html#PractitionerRole.organization)**: Contains the identifier of the institution where the recipient serves. This field, together with `PractitionerRole.code`, is used to identify the comparator data associated with each recipient.
17+
18+
- **[PractitionerRole.code](https://build.fhir.org/practitionerrole-definitions.html#PractitionerRole.code)**: Contains the role of the recipient in the institution. Example values for this field could be `Resident`, `Attending` or `CRNA`.
19+
- **type**: Indicates whether the performance data belong to an individual provider or to a group of providers. Accordingly, a `PractitionerRole` may represent either a single provider or a group. This field is not part of the FHIR `PractitionerRole` resource; in our model, it is introduced to classify `PractitionerRole` as either individual or group, allowing us to distinguish between the two types of performance data. Example values include `Practitioner` and `Organization`.
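The individual-versus-group rule described for `PractitionerRole.practitioner` can be illustrated with a minimal sketch. It assumes the table is a plain CSV whose header uses the column names listed above; the sample identifiers (`role-1`, `prac-1`, `org-1`) and the helper `load_roles` are invented for illustration and are not part of SCAFFOLD.

```python
import csv
import io

# Invented sample rows; only the column names come from the list above.
SAMPLE = (
    "PractitionerRole.identifier,PractitionerRole.practitioner,"
    "PractitionerRole.organization,PractitionerRole.code,type\n"
    "role-1,prac-1,org-1,Resident,Practitioner\n"
    "role-2,,org-1,,Organization\n"
)


def load_roles(text):
    """Parse PractitionerRole rows and flag whether each one is an individual."""
    roles = []
    for row in csv.DictReader(io.StringIO(text)):
        # A non-empty practitioner column marks an individual;
        # an empty one marks aggregate data for a group.
        row["is_individual"] = bool(row["PractitionerRole.practitioner"].strip())
        roles.append(row)
    return roles


roles = load_roles(SAMPLE)
for role in roles:
    print(role["PractitionerRole.identifier"], role["type"], role["is_individual"])
```

In this sketch the derived `is_individual` flag agrees with the `type` column, which is how the two fields are described as relating to each other.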
20+
21+
### Performance data (`MeasureReport`)
22+
Performance data are modeled using the [`MeasureReport`](https://build.fhir.org/measurereport.html) resource, which represents the results of a measure evaluation. In SCAFFOLD, each row of performance data is modeled as a measure report. Accordingly, the input data include a table (PerformanceMeasureReport.csv) with the following columns:
23+
- **[identifier](https://build.fhir.org/measurereport-definitions.html#MeasureReport.identifier)**: Uniquely identifies a specific performance data record.
24+
- **[measure](https://build.fhir.org/measurereport-definitions.html#MeasureReport.measure)**:
25+
A reference to the measure with which the performance record is associated.
26+
- **[subject](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.subject)**: Contains the recipient's unique identifier.
27+
- **[period.start](https://hl7.org/fhir/datatypes-definitions.html#Period.start)**: The start date of the period for which the performance record was collected.
28+
- **[period.end](https://hl7.org/fhir/datatypes-definitions.html#Period.end)**: The end date of the period for which the performance record was collected.
29+
- **[measureScore](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.measureScore_x_).rate**: The calculated success rate for the performance record.
30+
- **[measureScore](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.measureScore_x_).denominator**: The total number of cases on which the performance record is based.
31+
- **[measureScore](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.measureScore_x_).range**: Used for categorical values.
32+
33+
### Comparator data (`MeasureReport`)
34+
Comparator data, which represent aggregated performance for a selected group of recipients, are modeled using the [`MeasureReport`](https://build.fhir.org/measurereport.html) resource. In SCAFFOLD, each row of comparator data is modeled as a measure report. Accordingly, the input data include a table (ComparatorMeasureReport.csv) with the following columns:
35+
- **[identifier](https://build.fhir.org/measurereport-definitions.html#MeasureReport.identifier)**: Uniquely identifies a specific comparator data record.
36+
- **[measure](https://build.fhir.org/measurereport-definitions.html#MeasureReport.measure)**: A reference to the measure with which the comparator record is associated.
37+
- **[group.subject](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.subject)**: The identifier of the organization for which the comparator record is calculated. This column is equivalent to the `PractitionerRole.organization` column in the Provider Information data set.
38+
- **[period.start](https://hl7.org/fhir/datatypes-definitions.html#Period.start)**: The start date of the period for which the comparator record was collected.
39+
- **[period.end](https://hl7.org/fhir/datatypes-definitions.html#Period.end)**: The end date of the period for which the comparator record was collected.
40+
- **[measureScore](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.measureScore_x_).rate**: The average calculated success rate for the performance records of the selected group of providers.
41+
- **[measureScore](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.measureScore_x_).denominator**: The total number of cases on which the comparator record is based.
42+
- **[group.code](https://hl7.org/fhir/measurereport-definitions.html#MeasureReport.group.code)**: Specifies the type of comparator represented in each comparator record. Example values for this field include `peer average`, `Peer Top 10%` or `Goal Value`.
43+
- **[PractitionerRole.code](https://build.fhir.org/practitionerrole-definitions.html#PractitionerRole.code)**: Indicates the role of the providers for whom the comparator record is calculated. Example values for this field include `Resident`, `Attending` or `CRNA`.
44+
45+
### Message history
46+
Message history captures previously generated messages over defined time periods. SCAFFOLD's input data include a table (MessageHistory.csv) with the following columns:
47+
- **subject**: The identifier of the provider (practitioner) to whom the message history record belongs.
48+
- **period.start**: The start date of the period for which the message was created.
49+
- **period.end**: The end date of the period for which the message was created.
50+
- **history.json**: A JSON dictionary summarizing the generated message, with the following keys:
51+
- **message_template**: The identifier of the message template used to generate the message.
52+
- **message_template_name**: The name of the message template used to generate the message.
53+
- **message_generated_datetime**: The date and time the message was generated.
54+
- **measure**: The measure associated with the generated message.
55+
- **acceptable_by**: The causal pathway associated with the generated message.
56+
57+
### Preference
58+
Preferences capture providers' choices, priorities, and settings for the messages that are generated for them. SCAFFOLD's input data include a table (Preferences.csv) with the following columns:
59+
- **subject**: The identifier of the provider (practitioner) to whom the preferences record belongs.
60+
- **preferences.json**: A JSON dictionary with the preference details. Here is an example of a preferences JSON that SCAFFOLD can currently use:
61+
```json
62+
{
63+
"Utilities": {
64+
"Message_Format": {
65+
"Social gain": "0.04",
66+
"Social stayed better":"-0.08",
67+
"Worsening": "-0.1",
68+
"Improving": "-0.11",
69+
"Social loss": "0.69",
70+
"Social stayed worse": "-0.54",
71+
"Social better": "-1.23",
72+
"Social worse": "0.54",
73+
"Social approach": "1.0",
74+
"Goal gain": "0.07",
75+
"Goal approach": "1.1"
76+
},
77+
"Display_Format": {
78+
"Bar chart": 1,
79+
"Line chart": 0,
80+
"Text-only": 0,
81+
"System-generated": 0
82+
}
83+
}
84+
}
85+
```
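The preferences JSON above stores `Message_Format` utilities as strings and `Display_Format` entries as 0/1 integers. The sketch below shows one way to read it, using a shortened copy of the example. Treating a value of `1` as "selected" is an assumption made here for illustration, and `parse_preferences` is a hypothetical helper, not SCAFFOLD's own parser.

```python
import json

# Shortened copy of the example above.
PREFERENCES_JSON = """
{
  "Utilities": {
    "Message_Format": {
      "Social gain": "0.04",
      "Social loss": "0.69",
      "Goal approach": "1.1"
    },
    "Display_Format": {
      "Bar chart": 1,
      "Line chart": 0,
      "Text-only": 0,
      "System-generated": 0
    }
  }
}
"""


def parse_preferences(text):
    """Return numeric message-format utilities and the flagged display formats."""
    utilities = json.loads(text)["Utilities"]
    # Utilities are stored as strings, so convert them to floats.
    message_format = {
        name: float(value) for name, value in utilities["Message_Format"].items()
    }
    # Assumption: a value of 1 marks a display format as selected.
    selected_displays = [
        name for name, flag in utilities["Display_Format"].items() if flag == 1
    ]
    return message_format, selected_displays


message_format, selected_displays = parse_preferences(PREFERENCES_JSON)
print(message_format)
print(selected_displays)
```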
86+
## Data Generator
87+
First, create a folder for the new data (e.g. `new_data`).
88+
If the data is going to be generated for individual recipients, create a `config.json` inside the new data folder containing:
89+
```json
90+
{
91+
"ComparatorMergeColumns":["group.subject", "PractitionerRole.code"]
92+
}
93+
```
94+
95+
For hospital-level data, use:
96+
```json
97+
{
98+
"ComparatorMergeColumns":["PractitionerRole.code"]
99+
}
100+
```
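The effect of the two `ComparatorMergeColumns` settings can be sketched as follows. This is not SCAFFOLD's implementation: the `matching_comparators` helper and the sample rows are invented, and the mapping of the comparator column `group.subject` to the recipient column `PractitionerRole.organization` is assumed from the description of the comparator data set.

```python
import json

INDIVIDUAL_CONFIG = '{"ComparatorMergeColumns": ["group.subject", "PractitionerRole.code"]}'
HOSPITAL_CONFIG = '{"ComparatorMergeColumns": ["PractitionerRole.code"]}'

# Assumed mapping from comparator columns to recipient (PractitionerRole) columns.
RECIPIENT_COLUMN = {
    "group.subject": "PractitionerRole.organization",
    "PractitionerRole.code": "PractitionerRole.code",
}


def matching_comparators(recipient, comparators, config_text):
    """Keep comparator rows that agree with the recipient on every merge column."""
    merge_columns = json.loads(config_text)["ComparatorMergeColumns"]
    return [
        row
        for row in comparators
        if all(row[col] == recipient[RECIPIENT_COLUMN[col]] for col in merge_columns)
    ]


# Invented sample data.
recipient = {
    "PractitionerRole.organization": "org-1",
    "PractitionerRole.code": "Resident",
}
comparators = [
    {"identifier": "c1", "group.subject": "org-1", "PractitionerRole.code": "Resident"},
    {"identifier": "c2", "group.subject": "org-2", "PractitionerRole.code": "Resident"},
    {"identifier": "c3", "group.subject": "org-1", "PractitionerRole.code": "Attending"},
]

individual_matches = matching_comparators(recipient, comparators, INDIVIDUAL_CONFIG)
hospital_matches = matching_comparators(recipient, comparators, HOSPITAL_CONFIG)
print([row["identifier"] for row in individual_matches])
print([row["identifier"] for row in hospital_matches])
```

With the individual-level setting, only comparators from the recipient's own organization and role match. With the hospital-level setting, the organization is ignored and every comparator with the same role matches.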
101+
102+
Now you can run the scripts sequentially to generate data.
103+
104+
For example, to generate hospital-level data for 100 hospitals, run the following commands:
105+
106+
```zsh
107+
python data\ ingestion\ model/sandbox\ generator/PractitionerRole_hospital_level.py --num_orgs 100 --path new_data
108+
109+
python data\ ingestion\ model/sandbox\ generator/PerformanceMeasureReport.py --path new_data
110+
111+
python data\ ingestion\ model/sandbox\ generator/ComparatorMeasureReport.py --path new_data
112+
113+
python data\ ingestion\ model/sandbox\ generator/Preference.py --path new_data
114+
115+
ENV_PATH=/Path/to/your/environment/file/dev.env python data\ ingestion\ model/sandbox\ generator/MessageHistory.py --path new_data
116+
```
117+
118+
This starts by creating the list of hospitals in PractitionerRole.csv, then generates performance data in PerformanceMeasureReport.csv. The next step creates the comparator data in ComparatorMeasureReport.csv, after which the preferences are added to Preferences.csv. The final step uses SCAFFOLD to generate the history of messages produced by the pipeline for the months before the performance month.
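If you prefer to drive the generator sequence from Python rather than typing each command, the commands can be assembled as in the sketch below. It only builds and prints the command lines. The script names and the `--num_orgs` and `--path` flags are taken from the commands shown earlier, while `generator_commands` is a hypothetical helper, not part of the repository.

```python
import shlex
from pathlib import Path

# Location of the generator scripts, relative to the repository root.
GENERATOR_DIR = Path("data ingestion model") / "sandbox generator"


def generator_commands(data_dir, num_orgs):
    """Return the generator commands in the order they need to run."""
    steps = [
        ("PractitionerRole_hospital_level.py", ["--num_orgs", str(num_orgs)]),
        ("PerformanceMeasureReport.py", []),
        ("ComparatorMeasureReport.py", []),
        ("Preference.py", []),
        # MessageHistory.py also expects ENV_PATH to be set in the environment.
        ("MessageHistory.py", []),
    ]
    return [
        ["python", (GENERATOR_DIR / script).as_posix(), *extra, "--path", data_dir]
        for script, extra in steps
    ]


commands = generator_commands("new_data", 100)
for command in commands:
    print(shlex.join(command))
```

Each command list could then be passed to `subprocess.run(command, check=True)`, with `ENV_PATH` supplied through the `env` argument for the final step.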
119+
120+
## Example Data
121+
Sandbox hospital-level example data has been generated for 100 hospitals and is included in the 'sandbox examples' folder. This folder includes:
122+
- PractitionerRole.csv, which contains hospital definitions
123+
- PerformanceMeasureReport.csv, which contains performance data for each hospital on the 12 measures defined in the sandbox knowledge base, for 12 months.
124+
- config.json, which is required to find the right comparator for each recipient
125+
- ComparatorMeasureReport.csv, which contains the comparator data based on the entire network, for each measure and each month.
126+
- Preferences.csv, which includes preferences for a small subgroup of recipients.
127+
- MessageHistory.csv, which includes the history of generated messages for the 11 months before the performance month.
128+
129+
# Run SCAFFOLD
130+
To run SCAFFOLD on sandbox data, you need to prepare the environment and install SCAFFOLD. For more detail, follow the `Quick start` section of the [main SCAFFOLD documentation page](../README.md). Skip the `Run SCAFFOLD API` and `Run SCAFFOLD CLI with JSON inputs` sections and continue with `Run SCAFFOLD CLI with CSV inputs`.
131+
132+
